Pandas DataFrame – Pivot, Groupby, Filter, Query

Created By: Debasis Das (18-Feb-2021)

http://www.knowstack.com/notebooks/DataFrame_Pivot_GroupBy_Filter.html

In this post we will explore the following DataFrame functions

  • pivot_table
  • groupby
  • filter
  • query
import pandas as pd
import numpy as np

df = pd.read_csv("SalesData.csv",low_memory=False)
df

The working DataFrame is as below and we will use different DataFrame functions to slice and dice the data.

Pivoted view of the above Data 

df1 = pd.pivot_table(df,index=["Region","Country","Product"],aggfunc=np.sum)
df1

Pivot using only 2 columns, the data corresponding to the Products will be rolled up.

df2 = pd.pivot_table(df,index=["Region","Country"],aggfunc=np.sum)
df2

 

Multi-Level Column Pivot 

DataFrame pivot with multi level in columns (Week followed by product)

df3 = pd.pivot_table(df,index=["Region","Country"],columns=["Product"],aggfunc=np.sum)
df3

df4 = pd.pivot_table(df,index=["Region","Country"],columns=["Product"],values = ["WK_1","WK_2"],aggfunc=np.sum)
df4

 

Adding a new column to the pivot table 

df5 = pd.pivot_table(df,index=["Region","Country"], aggfunc=np.sum)
df5['Total'] = df5['WK_1'] + df5['WK_2']+ df5['WK_3'] + df5['WK_4']
df5

 

DataFrame Filter

fdf1 = df[df["Region"] == "America"]
fdf1

Using the DataFrame Filter function

fdf3 = df.filter(["Region", "Country", "WK_1"]) 
fdf3

DataFrame groupby function

gdf1 = df.groupby(['Region']).sum()
gdf1

gdf3 = df.groupby(['Region',"Country","Product"],sort=True).sum()
gdf3

groupby and get_group

gdf4 = df.groupby(["Country","Product"]).get_group(('Canada','Laptop'))
gdf4

gdf5 = df.groupby(["Region","Product"]).get_group(('America','Phone'))
gdf5

Iterating through a grouped data frame

gdf6 = df.groupby('Country')
for name,group in gdf6:
    print(name)
    print(group)

Using Numpy functions for groupby aggregates

gdf7 = df.groupby(['Region','Product']).aggregate(np.sum)
gdf7

Using Multiple Aggregate functions in DataFrame groupby

gdf8 = df.groupby(['Region','Country']).aggregate([np.sum,np.mean])
gdf8

Using a different aggregate function for columns and renaming the column to reflect the function used

gdf10 = df.groupby(['Region']).agg({'WK_1': np.sum, 'WK_2':np.mean}).rename(columns = {'WK_2':'WK_2 (Mean)'})
gdf10

DataFrame Query

Query to see if one column value is greater than another column

qdf1 = df[df.WK_2 > df.WK_1]
qdf1

In the below sample week 2 values are greater than week 1 values

qdf2 = df[df.WK_1 > 1500]
qdf2

Multi Column Query

qdf3 = df[(df.WK_1 > 1500) & (df.WK_2 > 1900)]
qdf3

Using the DataFrame.query function

qdf4 = df.query('WK_1 > 1500 & WK_2 > 1900')
qdf4

Posted in Data Mining, Python Tagged with: , , ,

Leave a Reply

Your email address will not be published. Required fields are marked *

*