Chapter 16 Tidyverse
For this chapter you will need to install and load the tidyverse family of packages. To install the package, refer to Section 10.
The tidyverse is a collection of packages that share an underlying philosophy, framework, and syntax. There are approximately 20 tidyverse packages, but the core ones are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. You can install these packages individually, or all at once with the command install.packages("tidyverse").
In general, the tidyverse syntax is structured in a way where we think about “actions” instead of “objects”. In other words, we think about coding in terms of verbs instead of nouns.
The overall tidyverse structure and syntax is distinctive. Some believe that this method of coding is more user friendly for beginners, who can do more complex things faster. The major criticisms of the tidyverse are that its help files, structure, and syntax deviate too much from base R, and that it is sometimes not flexible enough for specialized, high-level tasks. Base R (or traditional R) is closer in style to a variety of other languages such as Python or C, so techniques learned using base R can be more versatile, depending on your needs.
16.1 Piping Operator
Tidyverse functions can sometimes be called like traditional base R functions, but they were designed to be used with a “piping” operator, %>%. This operator is not part of base R: it comes from the magrittr package and becomes available when you load the tidyverse (R 4.1.0 and later also provide a separate native pipe, |>). The piping operator feeds whatever is on its left in as the first argument of the function on its right. For example, here we feed the vector vec into the first argument of the base R function mean().
## [1] 5.5
This operator was designed to be used when we have a sequence of multiple operations. With this operator we “pipe” the output of one function into the next using ‘%>%’. The idea is to focus on actions and not objects.
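The code that produced the 5.5 above is not shown. A minimal sketch of the idea follows, assuming vec holds the integers 1 through 10 (an assumption, chosen only because its mean matches that output).

```r
library(magrittr)  # supplies %>%; loading the tidyverse also makes it available

# Assumed example vector (not given in the original text)
vec <- 1:10

# The pipe passes vec in as the first argument of mean()
piped  <- vec %>% mean()
direct <- mean(vec)

piped                     # 5.5
identical(piped, direct)  # TRUE

# A sequence of operations reads left to right, one action per step
chained <- vec %>% sqrt() %>% sum() %>% round(2)
nested  <- round(sum(sqrt(vec)), 2)  # the same computation, written inside out
```

The chained and nested forms compute the same value; the piped version simply lists the actions in the order they happen.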
16.2 Tibbles vs Data Frames
We also have a new type of object with tidyverse called a tibble. A tibble is a 2D object that is very similar to a data frame. We have actually already used tibbles and the tidyverse a little when we were loading data. In Section 6.1 we discussed how to load a csv file using the read_csv() function, which requires the readr package, a package in the tidyverse suite. When we load a csv file using read_csv() we are actually loading in a tibble object, not a data frame.
Tibbles and data frames are very similar. The main difference is in how they are printed. Consider the diamonds data set below. This data set is part of the ggplot2 package, one of the core tidyverse packages. Once the packages are loaded we can call this data set at any time, just as we do for a built-in base R data set. The diamonds data set is a tibble, not a plain data frame. When we print it, or type its name to display it, only the first 10 rows are displayed, along with as many columns as fit on the screen or output space. The other thing we notice is that the column type is displayed as <type>. Below each column name we can see whether the column holds doubles <dbl>, ordered factors <ord>, integers <int>, characters <chr>, logical values <lgl>, etc.
## [1] "tbl_df" "tbl" "data.frame"
## [1] 53940 10
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
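The calls behind the three outputs above are not shown. A sketch that is consistent with them, assuming the tidyverse has been loaded, might look like this:

```r
library(tidyverse)  # attaches ggplot2, which ships the diamonds data set

class(diamonds)  # the class vector includes "tbl_df", which marks a tibble
dim(diamonds)    # number of rows, then number of columns
diamonds         # prints the first 10 rows and the type of each column
```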
In contrast, if we type the name of the same data set stored as a data frame, R tries to display all of the rows and all of the columns (up to the max.print limit). If the data set does not fit in the print space, the output is simply wrapped across several lines. A data frame also does not display column types below the column names.
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
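To try this comparison yourself, one option is to convert the tibble with as.data.frame(). The sketch below is an illustration only: the name diamonds_df is chosen here, and head() is used so that the display stops after 10 rows.

```r
library(tidyverse)

# Convert the tibble to a plain data frame (diamonds_df is a hypothetical name)
diamonds_df <- as.data.frame(diamonds)
class(diamonds_df)  # only "data.frame"; the tibble classes are gone

# Typing diamonds_df on its own would try to print every row,
# so head() limits the display to the first 10
first_ten <- head(diamonds_df, 10)
first_ten

# as_tibble() converts a data frame back into a tibble
diamonds_tbl <- as_tibble(diamonds_df)
```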
16.3 Key Functions
There are a few key functions and operations that we will focus on in the tidyverse suite. Tidyverse is a gigantic collection of functions and objects, but these are a few of the main ones to help you get started.
Note that in tidyverse help files argument names typically start with a “.” (for example .data), in contrast to many base R help files, where some arguments are in all caps (for example X and FUN).
select(): Selects variables (columns) in a data frame.
filter(): Subsets a data frame, retaining all rows that satisfy your conditions.
arrange(): Orders the rows of a data frame by the values of selected columns.
rename(): Changes the names of individual variables using new_name = old_name syntax.
mutate(): Adds new variables and preserves existing ones.
group_by(): Takes an existing tibble and converts it into a grouped tibble where operations can then be performed “by group”.
summarize()/summarise(): Reduces each group to a single row, with one column for each summary statistic.
16.3.1 General properties
In general, all the functions above have the following properties:
The first argument is a data frame or a tibble.
The subsequent arguments are used to determine what to do with the data-frame/tibble in the first argument.
The returned value is a data frame or a tibble.
The inputted data-frames/tibbles should be well formatted to start off with. Each row should be an observation, and each column should be a variable.
When we refer to column names for the data frame or tibble in the first argument we do not need to use quotes around the column names.
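These properties can be checked with any of the functions listed above. The sketch below uses select() as the example; the object names direct, piped, and base_version are chosen here for illustration.

```r
library(tidyverse)

# The data set is the first argument; column names are unquoted
direct <- select(diamonds, price, cut)

# The same call with the data set piped in as the first argument
piped <- diamonds %>% select(price, cut)

identical(direct, piped)  # TRUE
class(direct)             # the result is again a tibble

# The base R indexing equivalent needs the names in quotes
base_version <- diamonds[, c("price", "cut")]
names(base_version)
```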
16.3.2 select()
We use this function to isolate the particular columns that we want to keep.
## # A tibble: 53,940 x 2
## price cut
## <int> <ord>
## 1 326 Ideal
## 2 326 Premium
## 3 327 Good
## 4 334 Premium
## 5 335 Good
## 6 336 Very Good
## 7 336 Very Good
## 8 337 Very Good
## 9 337 Fair
## 10 338 Very Good
## # … with 53,930 more rows
To store the output we need to use an assignment operator.
## # A tibble: 53,940 x 2
## price cut
## <int> <ord>
## 1 326 Ideal
## 2 326 Premium
## 3 327 Good
## 4 334 Premium
## 5 335 Good
## 6 336 Very Good
## 7 336 Very Good
## 8 337 Very Good
## 9 337 Fair
## 10 338 Very Good
## # … with 53,930 more rows
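Neither call is shown above. A sketch consistent with the two tables, where diamonds2 is a hypothetical object name, could be:

```r
library(tidyverse)

# Display the two selected columns without storing them
diamonds %>% select(price, cut)

# Store the result with the assignment operator, then display it
diamonds2 <- diamonds %>% select(price, cut)
diamonds2

# The original data set is unchanged and still has all of its columns
ncol(diamonds)
```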
You can also use the “:” operator and negative signs with the select() function. With “name1:name2” we select all columns from the column named “name1” through the column named “name2”. With negative signs we omit every variable that is preceded by a negative sign. These methods are typically not allowed in standard indexing by name, as covered in Section 5.1.
## # A tibble: 53,940 x 6
## cut color clarity depth table price
## <ord> <ord> <ord> <dbl> <dbl> <int>
## 1 Ideal E SI2 61.5 55 326
## 2 Premium E SI1 59.8 61 326
## 3 Good E VS1 56.9 65 327
## 4 Premium I VS2 62.4 58 334
## 5 Good J SI2 63.3 58 335
## 6 Very Good J VVS2 62.8 57 336
## 7 Very Good I VVS1 62.3 57 336
## 8 Very Good H SI1 61.9 55 337
## 9 Fair E VS2 65.1 61 337
## 10 Very Good H VS1 59.4 61 338
## # … with 53,930 more rows
## # A tibble: 53,940 x 8
## carat color clarity depth table x y z
## <dbl> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 E SI2 61.5 55 3.95 3.98 2.43
## 2 0.21 E SI1 59.8 61 3.89 3.84 2.31
## 3 0.23 E VS1 56.9 65 4.05 4.07 2.31
## 4 0.290 I VS2 62.4 58 4.2 4.23 2.63
## 5 0.31 J SI2 63.3 58 4.34 4.35 2.75
## 6 0.24 J VVS2 62.8 57 3.94 3.96 2.48
## 7 0.24 I VVS1 62.3 57 3.95 3.98 2.47
## 8 0.26 H SI1 61.9 55 4.07 4.11 2.53
## 9 0.22 E VS2 65.1 61 3.87 3.78 2.49
## 10 0.23 H VS1 59.4 61 4 4.05 2.39
## # … with 53,930 more rows
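The two tables above match the column sets produced by a range selection and by a pair of negative selections. A sketch of such calls, with object names chosen here for illustration:

```r
library(tidyverse)

# All columns from cut through price, in their original order
between_cols <- diamonds %>% select(cut:price)
names(between_cols)

# Every column except cut and price
without_cols <- diamonds %>% select(-cut, -price)
names(without_cols)
```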
16.3.3 filter()
The function filter() is like select(), but we focus on the rows we wish to keep instead of the columns. The arguments inside the filter() function are the conditions a row must satisfy to be kept. Again, when referring to columns inside tidyverse functions we do not need to put the column names in quotes.
## [1] 61.7494
## # A tibble: 28,909 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 2 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 3 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 4 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 5 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 6 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 7 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
## 8 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
## 9 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 10 0.3 Ideal I SI2 62 54 348 4.31 4.34 2.68
## # … with 28,899 more rows
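The output above is consistent with first computing the mean depth and then keeping only the rows above it. A sketch under that assumption (deep_diamonds is a hypothetical name):

```r
library(tidyverse)

# Overall mean of the depth column
mean(diamonds$depth)

# Keep only the rows whose depth is above that mean;
# inside filter(), depth refers to the column of diamonds
deep_diamonds <- diamonds %>% filter(depth > mean(depth))
deep_diamonds
```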
We can filter on multiple conditions.
# Conditions separated by commas must all be TRUE for a row to be kept
diamondsFiltered <- diamonds %>%
  filter(depth > mean(depth), cut == "Good", price > 350)
diamondsFiltered
## # A tibble: 3,548 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
## 2 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
## 3 0.3 Good I SI2 63.3 56 351 4.26 4.3 2.71
## 4 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
## 5 0.31 Good H SI1 64 54 402 4.29 4.31 2.75
## 6 0.26 Good D VS2 65.2 56 403 3.99 4.02 2.61
## 7 0.32 Good H SI2 63.1 56 403 4.34 4.37 2.75
## 8 0.32 Good H SI2 63.8 56 403 4.36 4.38 2.79
## 9 0.3 Good I SI1 63.2 55 405 4.25 4.29 2.7
## 10 0.3 Good H SI1 63.7 57 554 4.28 4.26 2.72
## # … with 3,538 more rows
16.3.5 rename()
The rename() function is used in place of the colnames() function. Every argument in the rename() function should have the structure NewName = OldName. That is, the new column name goes on the left and the original column name on the right. For example, let's say we want to rename the column cut to Cut.
## # A tibble: 53,940 x 10
## carat Cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
We can do this for as many columns as we would like. Now let's try renaming both the cut and the color columns.
## # A tibble: 53,940 x 10
## carat Cut Color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
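The rename() calls for the two tables above are not shown. Following the NewName = OldName structure described earlier, they could be sketched as follows (the object names are chosen here):

```r
library(tidyverse)

# Rename a single column: new name on the left, old name on the right
one_renamed <- diamonds %>% rename(Cut = cut)
names(one_renamed)

# Rename several columns in one call
two_renamed <- diamonds %>% rename(Cut = cut, Color = color)
names(two_renamed)
```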
Remember, to save your results you still need to use the assignment operator to store the output in an object.
16.3.6 mutate()
We use the mutate() function to add or change a variable. As with the functions before it, you do not need quotes around the column names to refer to them. Suppose we want to change the price column to be in hundreds of dollars (instead of dollars).
## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 3.26 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 3.26 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 3.27 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 3.34 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 3.35 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 3.36 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 3.36 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 3.37 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 3.37 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 3.38 4 4.05 2.39
## # … with 53,930 more rows
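A call that rescales price in this way could be sketched as below; diamonds_hundreds is a hypothetical name.

```r
library(tidyverse)

# Overwrite price with its value in hundreds of dollars
diamonds_hundreds <- diamonds %>% mutate(price = price / 100)
diamonds_hundreds

# Dividing turns the integer column into a double column
class(diamonds$price)
class(diamonds_hundreds$price)
```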
We can adjust multiple columns at once, and even add columns.
## # A tibble: 53,940 x 11
## carat cut color clarity depth table price x y z depthNEW
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 3.26 3.95 3.98 2.43 618.
## 2 0.21 Premium E SI1 59.8 61 3.26 3.89 3.84 2.31 601.
## 3 0.23 Good E VS1 56.9 65 3.27 4.05 4.07 2.31 572.
## 4 0.290 Premium I VS2 62.4 58 3.34 4.2 4.23 2.63 627.
## 5 0.31 Good J SI2 63.3 58 3.35 4.34 4.35 2.75 636.
## 6 0.24 Very Good J VVS2 62.8 57 3.36 3.94 3.96 2.48 631.
## 7 0.24 Very Good I VVS1 62.3 57 3.36 3.95 3.98 2.47 626.
## 8 0.26 Very Good H SI1 61.9 55 3.37 4.07 4.11 2.53 622.
## 9 0.22 Fair E VS2 65.1 61 3.37 3.87 3.78 2.49 654.
## 10 0.23 Very Good H VS1 59.4 61 3.38 4 4.05 2.39 597.
## # … with 53,930 more rows
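The code behind the table above, including how depthNEW was computed, is not shown, so the sketch below does not try to reproduce it. As a separate illustration of the same idea, it changes one column and adds another in a single mutate() call; the column volume and the object name diamonds_plus are hypothetical.

```r
library(tidyverse)

diamonds_plus <- diamonds %>%
  mutate(
    price  = price / 100,  # change an existing column
    volume = x * y * z     # add a new column (hypothetical, not from the text)
  )

# New columns are appended after the existing ones
names(diamonds_plus)
```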
In addition, there is also the transmute() function, which does the same thing as mutate() but drops all other variables.
## # A tibble: 53,940 x 2
## price depthNEW
## <dbl> <dbl>
## 1 3.26 618.
## 2 3.26 601.
## 3 3.27 572.
## 4 3.34 627.
## 5 3.35 636.
## 6 3.36 631.
## 7 3.36 626.
## 8 3.37 622.
## 9 3.37 654.
## 10 3.38 597.
## # … with 53,930 more rows
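As with the previous table, the original call is not shown. A small sketch of transmute() with a hypothetical volume column illustrates that only the columns named in the call survive:

```r
library(tidyverse)

# Only the columns created or modified here are kept
diamonds_small <- diamonds %>%
  transmute(
    price  = price / 100,
    volume = x * y * z  # hypothetical column, not from the text
  )

names(diamonds_small)
```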
16.3.7 group_by()
The group_by() function is typically used with the summarize()/summarise() function. We use group_by() to group sets of observations together. The arguments dictate the groups to create by specifying columns, which are typically factor or character columns.
16.3.8 summarize()/summarise()
The functions summarize() and summarise() are the same. The arguments inside this function specify the summary statistics to create, using NewColumnName = <statistic>. We use this function with group_by() so that we can create summary statistics for each group. When we go from one function to another we still use the piping operator, %>%.
Here is an example where we group by cut, and then calculate the mean price for each cut. This results in a new tibble with one row per cut and a new column called PriceMean.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## cut PriceMean
## <ord> <dbl>
## 1 Fair 4359.
## 2 Good 3929.
## 3 Very Good 3982.
## 4 Premium 4584.
## 5 Ideal 3458.
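The grouped summary described above can be sketched as follows; price_by_cut is a hypothetical object name.

```r
library(tidyverse)

price_by_cut <- diamonds %>%
  group_by(cut) %>%                    # one group per level of cut
  summarise(PriceMean = mean(price))   # one row per group

price_by_cut
```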
We can also do this by multiple groups and summary statistics.
diamonds %>%
  group_by(cut, color) %>%
  summarise(PriceMean = mean(price), PriceMedian = median(price))
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)
## # A tibble: 35 x 4
## # Groups: cut [5]
## cut color PriceMean PriceMedian
## <ord> <ord> <dbl> <dbl>
## 1 Fair D 4291. 3730
## 2 Fair E 3682. 2956
## 3 Fair F 3827. 3035
## 4 Fair G 4239. 3057
## 5 Fair H 5136. 3816
## 6 Fair I 4685. 3246
## 7 Fair J 4976. 3302
## 8 Good D 3405. 2728.
## 9 Good E 3424. 2420
## 10 Good F 3496. 2647
## # … with 25 more rows
We can do several summary statistics at once.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## cut PriceMean PriceMedian
## <ord> <dbl> <dbl>
## 1 Fair 4359. 3282
## 2 Good 3929. 3050.
## 3 Very Good 3982. 2648
## 4 Premium 4584. 3185
## 5 Ideal 3458. 1810
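A call consistent with the table above adds a second statistic to the same grouped summary (the object name price_stats is chosen here):

```r
library(tidyverse)

price_stats <- diamonds %>%
  group_by(cut) %>%
  summarise(
    PriceMean   = mean(price),
    PriceMedian = median(price)
  )

price_stats
```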
16.4 Examples
16.4.1 Example 1
Get a new column which is the product of depth and carat, and call it DxC. Calculate the (arithmetic) mean of this new variable, and the (arithmetic) mean of price, for each cut.
diamonds %>%
  mutate(DxC = depth * carat) %>%
  group_by(cut) %>%
  summarise(AvgDxC = mean(DxC), AvgCut = mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
## cut AvgDxC AvgCut
## <ord> <dbl> <dbl>
## 1 Fair 67.2 4359.
## 2 Good 52.9 3929.
## 3 Very Good 49.9 3982.
## 4 Premium 54.6 4584.
## 5 Ideal 43.4 3458.
16.4.2 Example 2
Isolate the observations that have cut as “Ideal”. Only keep the cut, carat, depth, and price columns.
## # A tibble: 21,551 x 4
## cut carat depth price
## <ord> <dbl> <dbl> <int>
## 1 Ideal 0.23 61.5 326
## 2 Ideal 0.23 62.8 340
## 3 Ideal 0.31 62.2 344
## 4 Ideal 0.3 62 348
## 5 Ideal 0.33 61.8 403
## 6 Ideal 0.33 61.2 403
## 7 Ideal 0.33 61.1 403
## 8 Ideal 0.23 61.9 404
## 9 Ideal 0.32 60.9 404
## 10 Ideal 0.3 61 405
## # … with 21,541 more rows
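The solution code for this example is not shown. One way to get a result like the table above is to filter first and then select; ideal_diamonds is a hypothetical name.

```r
library(tidyverse)

ideal_diamonds <- diamonds %>%
  filter(cut == "Ideal") %>%          # keep only the Ideal rows
  select(cut, carat, depth, price)    # keep only these four columns, in this order

ideal_diamonds
```

Filtering before selecting matters in general: a column has to still be present at the point where filter() uses it.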
16.4.3 Example 3
Consider only the observations where price is larger than the median price. Determine the (arithmetic) mean and minimum value of the depth variable by color. Sort the results from smallest to largest (arithmetic) mean depth. Display only the first 15 rows of the resulting tibble.
diamonds %>%
  filter(price > median(price)) %>%
  group_by(color) %>%
  summarize(mean_depth = mean(depth), min_depth = min(depth)) %>%
  arrange(mean_depth) %>%
  head(n = 15)  # at most 15 rows; there are only 7 colors, so all 7 are shown
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 3
## color mean_depth min_depth
## <ord> <dbl> <dbl>
## 1 D 61.7 55.5
## 2 E 61.7 53.1
## 3 F 61.8 55.4
## 4 G 61.8 43
## 5 I 61.8 50.8
## 6 H 61.8 54.7
## 7 J 61.9 43