R Data Frame Tutorial
By Debasis Das (29-Aug-2016)
In this R Data Frame tutorial we will load a data frame from a csv file and work on the data frame
Our csv has the following columns
- geo
- country
- product
- price
- unitsales
- revenue
You can download the csv used in this tutorial here UnitSales
Check the current working directory
> getwd()
[1] "/Users/debasisdas"
Change the working directory
#set the working directory to the place where the csv is located > setwd("/Users/debasisdas/Technology/R/") > getwd() [1] "/Users/debasisdas/Technology/R"
Load a data frame from a csv
> unitsales <- read.csv("UnitSales.csv")
> unitsales
geo country product price salesunit revenue
1 americas usa computer 800 100 80000
2 americas usa speaker 50 20 1000
3 americas usa monitor 150 200 30000
4 americas usa accessories 30 100 3000
5 americas mexico computer 800 200 160000
6 americas mexico speaker 50 32 1600
7 americas mexico monitor 150 212 31800
8 americas mexico accessories 30 432 12960
9 americas brazil computer 800 456 364800
10 americas brazil speaker 50 777 38850
11 americas brazil monitor 150 876 131400
12 americas brazil accessories 30 987 29610
13 asia india computer 800 654 523200
14 asia india speaker 50 456 22800
15 asia india monitor 150 678 101700
16 asia india accessories 30 102 3060
17 asia china computer 800 111 88800
18 asia china speaker 50 21 1050
19 asia china monitor 150 32 4800
20 asia china accessories 30 32 960
21 asia rest of asia computer 800 32 25600
22 asia rest of asia speaker 50 11 550
23 asia rest of asia monitor 150 11 1650
24 asia rest of asia accessories 30 11 330
Check the class of the object created by reading from csv
> class(unitsales)
[1] "data.frame"
Check the dimensions of the unitsales object
> dim(unitsales) [1] 24 6 > length(unitsales) [1] 6 > attributes(unitsales) $names [1] "geo" "country" "product" "price" "salesunit" "revenue" $class [1] "data.frame" $row.names [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
Check the number of rows and number of columns in a data frame
> nrow(unitsales) [1] 24 > ncol(unitsales) [1] 6
Data frame head function
> #Instead of printing out the entire data frame, it is often desirable to preview it with the head function beforehand.
> head(unitsales)
geo country product price salesunit revenue
1 americas usa computer 800 100 80000
2 americas usa speaker 50 20 1000
3 americas usa monitor 150 200 30000
4 americas usa accessories 30 100 3000
5 americas mexico computer 800 200 160000
6 americas mexico speaker 50 32 1600
Get specific columns from the data frame
> #print only specific columns
> unitsales[c("geo","country","product","salesunit")]
geo country product salesunit
1 americas usa computer 100
2 americas usa speaker 20
3 americas usa monitor 200
4 americas usa accessories 100
5 americas mexico computer 200
6 americas mexico speaker 32
7 americas mexico monitor 212
8 americas mexico accessories 432
9 americas brazil computer 456
10 americas brazil speaker 777
11 americas brazil monitor 876
12 americas brazil accessories 987
13 asia india computer 654
14 asia india speaker 456
15 asia india monitor 678
16 asia india accessories 102
17 asia china computer 111
18 asia china speaker 21
19 asia china monitor 32
20 asia china accessories 32
21 asia rest of asia computer 32
22 asia rest of asia speaker 11
23 asia rest of asia monitor 11
24 asia rest of asia accessories 11
Summary of data frame
> #print the summary of the unitsales data frame
> summary(unitsales)
geo country product price salesunit revenue
americas:12 brazil :4 accessories:6 Min. : 30.0 Min. : 11.0 Min. : 330
asia :12 china :4 computer :6 1st Qu.: 45.0 1st Qu.: 32.0 1st Qu.: 1638
india :4 monitor :6 Median :100.0 Median :106.5 Median : 24200
mexico :4 speaker :6 Mean :257.5 Mean :272.6 Mean : 69147
rest of asia:4 3rd Qu.:312.5 3rd Qu.:456.0 3rd Qu.: 82200
usa :4 Max. :800.0 Max. :987.0 Max. :523200
Subset of data frame
> #subset
> #get the data only for usa
> subset(unitsales, country=="usa")
geo country product price salesunit revenue
1 americas usa computer 800 100 80000
2 americas usa speaker 50 20 1000
3 americas usa monitor 150 200 30000
4 americas usa accessories 30 100 3000
Subset a data frame based on a condition
> #get the data only for usa and computers > subset(unitsales, country=="usa" & product == "computer") geo country product price salesunit revenue 1 americas usa computer 800 100 80000 > #get the total revenue for all geos > sum(unitsales["revenue"]) [1] 1659520 > #get the total revenue for geo = americas > sum(subset(unitsales,geo=="americas")["revenue"]) [1] 885020
Aggregate a data frame
> #aggregate by country and revenue > aggregate(unitsales$revenue, by = list(country=unitsales$country), FUN=sum) country x 1 brazil 564660 2 china 95610 3 india 650760 4 mexico 206360 5 rest of asia 28130 6 usa 114000 > #aggregate by geo and revenue > aggregate(unitsales$revenue, by = list(geo=unitsales$geo), FUN=sum) geo x 1 americas 885020 2 asia 774500
Very informative blog.
I would like to thank you for the information.