Pandas DataFrame – Different ways to Create and Edit

By Debasis Das (17-Feb-2021)

In this post we will look at different ways of creating a pandas DataFrame and editing it.

Let's first import the pandas and NumPy modules:

import pandas as pd
import numpy as np
import random
pd.set_option('display.width', 1000)

Creating a dataframe by reading from a CSV

salesDataDf = pd.read_csv("SalesData.csv",low_memory=False)
print(salesDataDf)

Output:

    Region  Country Product  WK_1  WK_2  WK_3  Wk_4
0  America      USA  Laptop  1241  1160  1929  1174
1  America      USA   Phone  1098  1092  1089  1819
2  America   Canada  Laptop  1441  1099  1950  1394
3  America   Canada   Phone  1990  1057  1656  1060
4   Europe  Belgium  Laptop  1084  1116  1002  1566
5   Europe  Belgium   Phone  1574  1958  1793  1213
6   Europe  Finland  Laptop  1325  1374  1300  1579
7   Europe  Finland   Phone  1347  1736  1782  1921
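
If SalesData.csv is not at hand, the same call can be tried against an in-memory file. The sketch below is only an illustration: the CSV text is a two-row sample copied from the first rows and columns of the output above, and io.StringIO stands in for the file on disk. The names csv_text, sample_df and subset_df are made up for this example.

```python
import io
import pandas as pd

# Sample text that mimics the layout of SalesData.csv (assumed, not the real file)
csv_text = """Region,Country,Product,WK_1,WK_2
America,USA,Laptop,1241,1160
America,USA,Phone,1098,1092
"""

# read_csv accepts any file-like object, not only a path
sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols limits parsing to the listed columns
subset_df = pd.read_csv(io.StringIO(csv_text), usecols=["Country", "Product", "WK_1"])

print(sample_df.dtypes)
print(subset_df)
```

Reading only the needed columns with usecols can help when the real file is wide.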

Create a DataFrame from a List

seasons = ['Winter','Spring','Summer','Fall']
df = pd.DataFrame(seasons)
print(df)

Output:

        0
0  Winter
1  Spring
2  Summer
3    Fall
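
The column above is labelled 0 because no name was given. A minimal sketch of naming it at creation time, assuming the hypothetical column name "Season":

```python
import pandas as pd

seasons = ['Winter', 'Spring', 'Summer', 'Fall']

# columns= gives the single column a name instead of the default 0
season_df = pd.DataFrame(seasons, columns=["Season"])
print(season_df)
```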

Creating a DataFrame from a List of Lists (each inner list becomes a row, and shorter rows are padded with None):

data = [[100,200,300],[400,500,600],['John','Jane','Mary','Jin']]
df = pd.DataFrame(data)
print(df)

Output:

      0     1     2     3
0   100   200   300  None
1   400   500   600  None
2  John  Jane  Mary   Jin

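Creating a DataFrame by passing a list of rows and a list of column header names:

# dtype=float on the constructor would also apply to the string "name" column (an error in recent pandas versions), so cast only "score"
df = pd.DataFrame([['Debasis Das',100],['John Doe',98],['Jane Doe',93]], columns = ["name","score"]).astype({"score": float})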
print(df)

Output:

          name  score
0  Debasis Das  100.0
1     John Doe   98.0
2     Jane Doe   93.0

Create a DataFrame from a Dictionary:

dictionary = {"names":["John Doe","Jane Doe","Mary Jane"], "score":[90,91,93]}
df = pd.DataFrame(dictionary)
print(df)

Output:

       names  score
0   John Doe     90
1   Jane Doe     91
2  Mary Jane     93
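
In the dictionary above, each key becomes a column. When the data is keyed by row instead, DataFrame.from_dict with orient="index" may be a better fit. A small sketch with made-up values (scores_by_name and the age numbers are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data keyed by row label instead of by column name
scores_by_name = {"John Doe": [90, 30], "Jane Doe": [91, 10]}

# orient="index" turns each key into a row label; columns= names the values
df = pd.DataFrame.from_dict(scores_by_name, orient="index", columns=["score", "age"])
print(df)
```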

Create a DataFrame from a List of Dictionaries:

listOfDict = [{"names":"John Doe", "age":30},{"names":"Jane Doe", "age":10, "score":98.0}]
df = pd.DataFrame(listOfDict)
print(df)

Output:

      names  age  score
0  John Doe   30    NaN
1  Jane Doe   10   98.0
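
The NaN appears because the first dictionary has no "score" key. One possible way to detect and fill such gaps is sketched below; the fill value 0 and the column name score_filled are arbitrary choices for this example.

```python
import pandas as pd

listOfDict = [{"names": "John Doe", "age": 30},
              {"names": "Jane Doe", "age": 10, "score": 98.0}]
df = pd.DataFrame(listOfDict)

# Keys missing from a dictionary show up as NaN in that row
print(df["score"].isna().tolist())

# fillna returns a new Series with the gaps replaced
df["score_filled"] = df["score"].fillna(0)
print(df)
```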

Add a column to the DataFrame based on a condition:

df = pd.DataFrame({'Score':[100,20,30,80,90]})
print(df)
df.loc[df.Score <= 80, "Grade"] = "B"
df.loc[df.Score > 80, "Grade"] = "A"
df.loc[df.Score < 35, "Grade"] = "F"
print(df)

Output:

   Score
0    100
1     20
2     30
3     80
4     90

After adding a new column

   Score Grade
0    100     A
1     20     F
2     30     F
3     80     B
4     90     A
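
The three .loc assignments above depend on their order: the "F" assignment runs last and overwrites the "B" given to scores below 35. A possible alternative, not the approach used in this post, is numpy.select, which takes all conditions at once. This is a sketch under the same grade boundaries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Score': [100, 20, 30, 80, 90]})

# Conditions are checked in order; the first one that matches wins
conditions = [df['Score'] < 35, df['Score'] <= 80, df['Score'] > 80]
grades = ['F', 'B', 'A']

# default is used for rows that match no condition (none here)
df['Grade'] = np.select(conditions, grades, default='')
print(df)
```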

Creating a DataFrame from a numpy array:

df = pd.DataFrame(np.random.randint(low=80, high=100, size=(3, 5)), columns=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],index=['Temp Morning', 'Temp Afternoon', 'Temp Evening'])
print(df)

Output:

                Monday  Tuesday  Wednesday  Thursday  Friday
Temp Morning        98       88         91        87      91
Temp Afternoon      92       83         80        83      81
Temp Evening        84       94         92        96      84
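
Because the values are random, the numbers above change on every run. If repeatable output is wanted, one option is a seeded NumPy Generator. This is a sketch; the seed 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# A seeded Generator makes the "random" temperatures repeatable
rng = np.random.default_rng(seed=42)
temps = rng.integers(low=80, high=100, size=(3, 5))  # high is exclusive

df = pd.DataFrame(temps,
                  columns=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
                  index=['Temp Morning', 'Temp Afternoon', 'Temp Evening'])
print(df)
```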

Creating a DataFrame with Random Values in a Few Columns

salesData = {"Region":["Americas","Americas","Americas","Americas","Europe","Europe","Europe","Europe"],
 "Country":["USA","USA","Mexico","Mexico","Belgium","Belgium","Finland","Finland"],
 "Product":["Phone","TV","Phone","TV","Phone","TV","Phone","TV"],
 "wk1":np.random.uniform(1000,2000,8),
 "wk2":np.random.uniform(1000,2000,8),
 "wk3":np.random.uniform(1000,2000,8),
 "wk4":np.random.uniform(1000,2000,8)
 }
salesDF = pd.DataFrame(salesData)
print(salesDF)

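Output: the wk1 to wk4 columns hold random floats between 1000 and 2000, so the printed values differ on every run.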

Adding New Columns to an Existing DataFrame:

salesDF["Total"] = salesDF['wk1'] + salesDF['wk2'] + salesDF['wk3'] + salesDF['wk4']
salesDF["Total (K)"] = salesDF["Total"] /1000
print(salesDF)

Output: salesDF now has two new columns, Total (the sum of the four week columns) and Total (K) (the total divided by 1000).
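
Adding the week columns one by one gets long as the number of weeks grows. A possible alternative is a row-wise sum over a list of column names. The sketch below uses a small stand-in frame with fixed, made-up numbers so the result can be checked by hand.

```python
import pandas as pd

# Small stand-in for salesDF with fixed (made-up) numbers
sales = pd.DataFrame({"Country": ["USA", "Mexico"],
                      "wk1": [1000.0, 1500.0],
                      "wk2": [1200.0, 1100.0],
                      "wk3": [1300.0, 1900.0],
                      "wk4": [1400.0, 1250.0]})

week_cols = ["wk1", "wk2", "wk3", "wk4"]
sales["Total"] = sales[week_cols].sum(axis=1)   # axis=1 sums across each row
sales["Total (K)"] = sales["Total"] / 1000
print(sales)
```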

Creating a DataFrame using the assign Function

# assign returns a new DataFrame with the extra columns added; the original DataFrame is left unchanged.
salesDF1 = salesDF[["Region","Country","Product","wk1"]]
# The lambda receives the DataFrame that assign was called on, so use its argument
salesDF2 = salesDF1.assign(week1_k=lambda x: x['wk1']/1000)
print(salesDF2)

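Output: salesDF2 contains Region, Country, Product and wk1 plus the new week1_k column; salesDF1 itself is unchanged.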
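
assign can also add several columns in one call, and a later column may build on an earlier one. A minimal sketch with a made-up two-row frame (sales, wk1_k and wk1_k_rounded are names invented for this example):

```python
import pandas as pd

sales = pd.DataFrame({"Country": ["USA", "Mexico"], "wk1": [1200.0, 1800.0]})

# Later keyword arguments can refer to columns created earlier in the same call
result = sales.assign(
    wk1_k=lambda d: d["wk1"] / 1000,
    wk1_k_rounded=lambda d: d["wk1_k"].round(1),
)

print(result)
print(sales.columns.tolist())  # the original frame keeps its two columns
```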

Transpose a DataFrame:

data = {'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data, index=['Row1', 'Row2', 'Row3'])
df_transpose = df.T

print("\nBefore Transpose")
print(df)

print("\nAfter Transpose")
print(df_transpose)
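
To see what the transpose did without reading the full printout, the labels and a single cell can be compared before and after. This is a small sketch on the same data; df_t is just a shorter name chosen here.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7, 8, 9]},
                  index=['Row1', 'Row2', 'Row3'])
df_t = df.T

# Row labels and column labels trade places
print(df_t.index.tolist())
print(df_t.columns.tolist())

# The value at (row, col) moves to (col, row)
print(df.loc['Row1', 'Col3'], df_t.loc['Col3', 'Row1'])
```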

Selecting a Subset of a DataFrame using loc

data = {'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7, 8, 9]}
df = pd.DataFrame(data, index=['Row1', 'Row2', 'Row3'])
print(df)
df1 = df.loc['Row1':'Row2','Col1':'Col2']
print("\nSubset of the Dataframe")
print(df1)
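
Note that a label slice in loc includes its end label, so 'Row1':'Row2' returns two rows. The position-based iloc excludes the end, as a normal Python slice does. A short sketch comparing the two on the same data:

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6], 'Col3': [7, 8, 9]},
                  index=['Row1', 'Row2', 'Row3'])

by_label = df.loc['Row1':'Row2', 'Col1':'Col2']   # label slices include the end label
by_position = df.iloc[0:2, 0:2]                   # position slices exclude the end

print(by_label.equals(by_position))
```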

Merge Dataframes

data1 = {"Country":["USA","Mexico","Brazil"],"Jan_Sales":[1000,2000,3000]}
data2 = {"Country":["Canada","Mexico","Brazil","Belgium"],"Feb_Sales":[4000,5000,6000,7000]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
# When no "on" argument is given, merge joins on the columns the two dataframes have in common (here, Country)
# The default join type is inner, so only countries present in both dataframes are kept
df1
  Country  Jan_Sales
0     USA       1000
1  Mexico       2000
2  Brazil       3000

df2
   Country  Feb_Sales
0   Canada       4000
1   Mexico       5000
2   Brazil       6000
3  Belgium       7000

df3 = df1.merge(df2)
df3
  Country  Jan_Sales  Feb_Sales
0  Mexico       2000       5000
1  Brazil       3000       6000

Inner Merge

df4 = df1.merge(df2,how='inner')
df4
  Country  Jan_Sales  Feb_Sales
0  Mexico       2000       5000
1  Brazil       3000       6000

Outer Merge

df5 = df1.merge(df2,how='outer')
df5
   Country  Jan_Sales  Feb_Sales
0      USA     1000.0        NaN
1   Mexico     2000.0     5000.0
2   Brazil     3000.0     6000.0
3   Canada        NaN     4000.0
4  Belgium        NaN     7000.0

Left Merge

df6 = df1.merge(df2,how='left')
df6
  Country  Jan_Sales  Feb_Sales
0     USA       1000        NaN
1  Mexico       2000     5000.0
2  Brazil       3000     6000.0

Right Merge

df7 = df1.merge(df2,how='right')
df7
   Country  Jan_Sales  Feb_Sales
0   Canada        NaN       4000
1   Mexico     2000.0       5000
2   Brazil     3000.0       6000
3  Belgium        NaN       7000
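
When it matters which dataframe each row came from, merge can report it. The sketch below names the join key explicitly with on= and adds indicator=True, which appends a _merge column holding left_only, right_only or both. The variable names merged and unmatched are chosen for this example, and the row order of an outer merge may vary between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({"Country": ["USA", "Mexico", "Brazil"],
                    "Jan_Sales": [1000, 2000, 3000]})
df2 = pd.DataFrame({"Country": ["Canada", "Mexico", "Brazil", "Belgium"],
                    "Feb_Sales": [4000, 5000, 6000, 7000]})

# on= names the join key; indicator=True adds the _merge column
merged = df1.merge(df2, on="Country", how="outer", indicator=True)
print(merged)

# Countries that appear in only one of the two frames
unmatched = merged.loc[merged["_merge"] != "both", "Country"]
print(sorted(unmatched))
```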

 
