Chapter 14 Tidyverse

For this document you will need to install and load the tidyverse family of packages. To install the packages, refer to Section 8.

library(tidyverse)

The tidyverse is a collection of packages that share an underlying philosophy, framework, and syntax. There are approximately 20 tidyverse packages, but the core ones are ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. You can install these packages individually, or all at once using the command install.packages("tidyverse").

In general, the tidyverse syntax is structured in a way where we think about “actions” instead of “objects”. In other words, we think about coding in terms of verbs instead of nouns.

The overall tidyverse structure and syntax is unique. Some believe that this style of coding is friendlier to beginners, who can do more complex things faster. The major criticisms of the tidyverse are that its help files, structure, and syntax deviate too much from base R, and that it is sometimes not flexible enough for unusual, specialized tasks. Base R (or traditional R) is more similar to other languages such as Python or C, so techniques learned using base R can be more versatile, depending on your needs.

14.1 Piping Operator

The tidyverse functions can sometimes be used like traditional base R functions, but they were designed to be used with a “piping” operator, %>%. This operator is not part of base R; it comes from the magrittr package and becomes available when you load the tidyverse. The piping operator feeds whatever is on its left side in as the first argument of the function on its right side. For example, here we feed the vector vec into the first argument of the base R function mean().

vec <- 1:10 
vec %>% mean()
## [1] 5.5

This operator was designed to be used when we have a sequence of multiple operations. With this operator we “pipe” the output of one function into the next using ‘%>%’. The idea is to focus on actions and not objects.
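
To make the idea of a sequence of operations concrete, here is a minimal sketch that compares a nested base R call with the equivalent pipeline. The object names nested and piped are illustrative only.

```r
library(tidyverse)  # makes the %>% operator available

vec <- c(4, 9, 16, 25)

# Nested style: read from the inside out
nested <- round(mean(sqrt(vec)), 1)

# Piped style: read from left to right, one action at a time
piped <- vec %>% sqrt() %>% mean() %>% round(1)

# Both approaches give the same answer
identical(nested, piped)
## [1] TRUE
```

Extra arguments, such as the 1 passed to round(), go after the piped value, which always fills the first argument.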

14.2 Tibbles vs Data Frames

We also have a new type of object with tidyverse called a tibble. A tibble is a new type of 2D object, and is very similar to a data frame. We have actually already used tibbles and tidyverse a little bit when we were loading data. In Section 6.1 we discussed how to load a csv file using read_csv(), which requires the readr package, a package in the tidyverse suite. When we load a csv file using read_csv() we are actually loading in a tibble object, not a data frame.

Tibbles and data frames are very similar. The main difference is in how they are printed. Consider the diamonds data set below. This is a data set that is part of the tidyverse packages. When the packages are loaded we can call this data set at any time, just as we do for a built-in base R data set. The diamonds data set is a tibble, and not a data frame. When we print it, or type its name to display it, only the first 10 rows are displayed, along with as many columns as fit on the screen or output space. The other thing that we notice is that the column type is displayed as <type>. Below each column name we can see whether the column contains doubles <dbl>, ordered factors <ord>, integers <int>, characters <chr>, logical values <lgl>, etc.

# The tidyverse data set diamonds is a tibble
class(diamonds) 
## [1] "tbl_df"     "tbl"        "data.frame"
dim(diamonds)  # 53940 rows, 10 columns 
## [1] 53940    10
# Tibbles only show first ten rows, and however many columns fill up the screen
diamonds
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

In contrast, if we type the name of a data set stored as a data frame, then all the rows and all the columns are displayed (the example below prints only the first 10 rows to save space). If the data set does not fit in the print space, the output is wrapped across several lines. We also do not have column types displayed below the column names for a data frame.

diamonds_df <- data.frame(diamonds)
diamonds_df[1:10,]
##    carat       cut color clarity depth table price    x    y    z
## 1   0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2   0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3   0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 8   0.26 Very Good     H     SI1  61.9    55   337 4.07 4.11 2.53
## 9   0.22      Fair     E     VS2  65.1    61   337 3.87 3.78 2.49
## 10  0.23 Very Good     H     VS1  59.4    61   338 4.00 4.05 2.39
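
Converting between the two object types is straightforward. The following is a minimal sketch that uses the built-in base R data set mtcars; the object names mtcars_tbl and mtcars_df are illustrative only.

```r
library(tidyverse)

# mtcars is a built-in base R data frame
class(mtcars)
## [1] "data.frame"

# Data frame -> tibble (note: as_tibble() drops the row names by default)
mtcars_tbl <- as_tibble(mtcars)
is_tibble(mtcars_tbl)
## [1] TRUE

# Tibble -> data frame
mtcars_df <- as.data.frame(mtcars_tbl)
is_tibble(mtcars_df)
## [1] FALSE

# A tibble is still a data frame, so base R functions accept it
is.data.frame(mtcars_tbl)
## [1] TRUE
```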

14.3 Key Functions

There are a few key functions and operations that we will focus on in the tidyverse suite. Tidyverse is a gigantic collection of functions and objects, but these are a few of the main ones to help you get started.

Note that in tidyverse help files, argument names typically start with a “.”, in contrast to many base R help files, where argument names are in all caps.

  • select(): Select variables in a data frame.

  • filter(): Subset a data frame, retaining all rows that satisfy your conditions.

  • arrange(): Orders the rows of a data frame by the values of selected columns.

  • rename(): Changes the names of individual variables using new_name = old_name syntax.

  • mutate(): Adds new variables and preserves existing ones.

  • group_by(): Takes an existing tibble and converts it into a grouped tibble where operations can then be performed “by group”.

  • summarize()/summarise(): Collapses each group into a single row, with one column for each summary statistic.

14.3.1 General properties

In general, all the functions above have the following properties:

  • The first argument is a data frame or a tibble.

  • The subsequent arguments are used to determine what to do with the data-frame/tibble in the first argument.

  • The returned value is a data frame or a tibble.

  • The inputted data-frames/tibbles should be well formatted to start off with. Each row should be an observation, and each column should be a variable.

  • When we refer to column names for the data frame or tibble in the first argument we do not need to use quotes around the column names.
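
Because the first argument is always the data frame or tibble, each of these functions can be called with or without the piping operator. A minimal sketch using select(), with illustrative object names:

```r
library(tidyverse)

# With the pipe: diamonds is fed in as the first argument
with_pipe <- diamonds %>% select(price, cut)

# Without the pipe: diamonds is written as the first argument
without_pipe <- select(diamonds, price, cut)

# Both calls return the same tibble
identical(with_pipe, without_pipe)
## [1] TRUE
```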

14.3.2 select()

We use this function to isolate the particular columns that we want to keep.

diamonds %>% select(price, cut)
## # A tibble: 53,940 x 2
##    price cut      
##    <int> <ord>    
##  1   326 Ideal    
##  2   326 Premium  
##  3   327 Good     
##  4   334 Premium  
##  5   335 Good     
##  6   336 Very Good
##  7   336 Very Good
##  8   337 Very Good
##  9   337 Fair     
## 10   338 Very Good
## # … with 53,930 more rows

To store the output we need to use an assignment operator.

PriceCut <- diamonds %>% select(price, cut)
PriceCut
## # A tibble: 53,940 x 2
##    price cut      
##    <int> <ord>    
##  1   326 Ideal    
##  2   326 Premium  
##  3   327 Good     
##  4   334 Premium  
##  5   335 Good     
##  6   336 Very Good
##  7   336 Very Good
##  8   337 Very Good
##  9   337 Fair     
## 10   338 Very Good
## # … with 53,930 more rows

You can also use the operator “:”, and negative signs with the select() function. With the “name1:name2” operator we can select all columns from the column named “name1” through the column named “name2”, inclusive. With negative signs we can omit all variables that are preceded by a negative sign. These methods are typically not allowed in standard indexing when using names, as covered in Section 5.1.

# Select all columns between cut and price. 
PriceCut <- diamonds %>% select(cut:price)
PriceCut
## # A tibble: 53,940 x 6
##    cut       color clarity depth table price
##    <ord>     <ord> <ord>   <dbl> <dbl> <int>
##  1 Ideal     E     SI2      61.5    55   326
##  2 Premium   E     SI1      59.8    61   326
##  3 Good      E     VS1      56.9    65   327
##  4 Premium   I     VS2      62.4    58   334
##  5 Good      J     SI2      63.3    58   335
##  6 Very Good J     VVS2     62.8    57   336
##  7 Very Good I     VVS1     62.3    57   336
##  8 Very Good H     SI1      61.9    55   337
##  9 Fair      E     VS2      65.1    61   337
## 10 Very Good H     VS1      59.4    61   338
## # … with 53,930 more rows
# Select all but price and cut
NotPriceCut <- diamonds %>% select(-price, -cut)
NotPriceCut
## # A tibble: 53,940 x 8
##    carat color clarity depth table     x     y     z
##    <dbl> <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 0.23  E     SI2      61.5    55  3.95  3.98  2.43
##  2 0.21  E     SI1      59.8    61  3.89  3.84  2.31
##  3 0.23  E     VS1      56.9    65  4.05  4.07  2.31
##  4 0.290 I     VS2      62.4    58  4.2   4.23  2.63
##  5 0.31  J     SI2      63.3    58  4.34  4.35  2.75
##  6 0.24  J     VVS2     62.8    57  3.94  3.96  2.48
##  7 0.24  I     VVS1     62.3    57  3.95  3.98  2.47
##  8 0.26  H     SI1      61.9    55  4.07  4.11  2.53
##  9 0.22  E     VS2      65.1    61  3.87  3.78  2.49
## 10 0.23  H     VS1      59.4    61  4     4.05  2.39
## # … with 53,930 more rows

14.3.3 filter()

The function filter() is like select(), but we focus on the rows we wish to keep instead of the columns. The arguments inside the filter() function are conditions that a row must satisfy in order to be kept. Again, when referring to columns inside of the tidyverse functions we do not need to put the column names in quotes.

# What is the mean value for the depth column?
mean(diamonds$depth)
## [1] 61.7494
diamondsFiltered <- diamonds %>% filter(depth > mean(depth))
diamondsFiltered
## # A tibble: 28,909 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  2 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  3 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  4 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  5 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  6 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
##  7 0.3   Good      J     SI1      64      55   339  4.25  4.28  2.73
##  8 0.23  Ideal     J     VS1      62.8    56   340  3.93  3.9   2.46
##  9 0.31  Ideal     J     SI2      62.2    54   344  4.35  4.37  2.71
## 10 0.3   Ideal     I     SI2      62      54   348  4.31  4.34  2.68
## # … with 28,899 more rows

We can filter on multiple conditions.

diamondsFiltered <- diamonds %>% filter(depth > mean(depth),
                                        cut == "Good",
                                        price > 350)
diamondsFiltered
## # A tibble: 3,548 x 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.3  Good  J     SI1      63.4    54   351  4.23  4.29  2.7 
##  2  0.3  Good  J     SI1      63.8    56   351  4.23  4.26  2.71
##  3  0.3  Good  I     SI2      63.3    56   351  4.26  4.3   2.71
##  4  0.23 Good  E     VS1      64.1    59   402  3.83  3.85  2.46
##  5  0.31 Good  H     SI1      64      54   402  4.29  4.31  2.75
##  6  0.26 Good  D     VS2      65.2    56   403  3.99  4.02  2.61
##  7  0.32 Good  H     SI2      63.1    56   403  4.34  4.37  2.75
##  8  0.32 Good  H     SI2      63.8    56   403  4.36  4.38  2.79
##  9  0.3  Good  I     SI1      63.2    55   405  4.25  4.29  2.7 
## 10  0.3  Good  H     SI1      63.7    57   554  4.28  4.26  2.72
## # … with 3,538 more rows
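
When several conditions are separated by commas, a row is kept only if it satisfies all of them. The following minimal sketch shows that this matches the & operator, and how | and %in% can be used when a row only needs to satisfy one of several conditions. The object names are illustrative only.

```r
library(tidyverse)

# Comma-separated conditions are combined with AND ...
and_comma <- diamonds %>% filter(cut == "Good", price > 350)
# ... which is the same as joining them with &
and_amp <- diamonds %>% filter(cut == "Good" & price > 350)
identical(and_comma, and_amp)
## [1] TRUE

# Use | for OR: keep rows where cut is "Good" or "Fair"
either <- diamonds %>% filter(cut == "Good" | cut == "Fair")
# %in% is a shorter way to match any value in a set
in_set <- diamonds %>% filter(cut %in% c("Good", "Fair"))
identical(either, in_set)
## [1] TRUE
```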

14.3.4 arrange()

The arrange() function is much like sort() or order() in base R.
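
As a minimal sketch of how arrange() can be used with the diamonds data set (the object names are illustrative only): rows are sorted from smallest to largest by default, and wrapping a column in desc() reverses the order.

```r
library(tidyverse)

# Sort rows by price, smallest first (the default)
cheapest_first <- diamonds %>% arrange(price)

# Sort rows by price, largest first, by wrapping the column in desc()
priciest_first <- diamonds %>% arrange(desc(price))

# Sort by cut, then by descending price within each cut
by_cut_price <- diamonds %>% arrange(cut, desc(price))

# Show the first three rows of two of the results
cheapest_first %>% head(n = 3)
priciest_first %>% head(n = 3)
```

In base R, the first of these roughly corresponds to diamonds[order(diamonds$price), ].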

14.3.5 rename()

The rename() function is used in place of the colnames() function. Every argument in the rename() function should have the structure NewName = OldName. That is, we should have the new column name on the left and the original column name on the right. For example, let's say we want to rename the column cut to Cut.

diamonds %>% rename(Cut = cut)
## # A tibble: 53,940 x 10
##    carat Cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

We can do this for as many columns as we would like. Now let's try renaming the cut and the color columns.

diamonds %>% rename(Cut = cut, 
                    Color = color)
## # A tibble: 53,940 x 10
##    carat Cut       Color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

Remember, to save your results, you still need to use the assignment operator and store the output in an object.

diamonds_new <- diamonds %>% rename(Cut = cut, 
                                    Color = color)
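
For comparison, here is a minimal sketch of the same kind of renaming done with colnames() in base R. The object names renamed and base_version are illustrative only.

```r
library(tidyverse)

# Tidyverse approach: NewName = OldName
renamed <- diamonds %>% rename(Cut = cut)

# Base R approach: find the old name and overwrite it
base_version <- diamonds
colnames(base_version)[colnames(base_version) == "cut"] <- "Cut"

# Both approaches produce the same column names
identical(colnames(renamed), colnames(base_version))
## [1] TRUE
```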

14.3.6 mutate()

We use the mutate() function to add or change a variable. Like the functions before it, you still do not need quotes around the column names to refer to them. Suppose we want to change the price column to be in hundreds of dollars (instead of dollars).

diamonds %>% mutate(price = price/100)
## # A tibble: 53,940 x 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55  3.26  3.95  3.98  2.43
##  2 0.21  Premium   E     SI1      59.8    61  3.26  3.89  3.84  2.31
##  3 0.23  Good      E     VS1      56.9    65  3.27  4.05  4.07  2.31
##  4 0.290 Premium   I     VS2      62.4    58  3.34  4.2   4.23  2.63
##  5 0.31  Good      J     SI2      63.3    58  3.35  4.34  4.35  2.75
##  6 0.24  Very Good J     VVS2     62.8    57  3.36  3.94  3.96  2.48
##  7 0.24  Very Good I     VVS1     62.3    57  3.36  3.95  3.98  2.47
##  8 0.26  Very Good H     SI1      61.9    55  3.37  4.07  4.11  2.53
##  9 0.22  Fair      E     VS2      65.1    61  3.37  3.87  3.78  2.49
## 10 0.23  Very Good H     VS1      59.4    61  3.38  4     4.05  2.39
## # … with 53,930 more rows

We can adjust multiple columns at once, and even add columns.

diamondsNEW <- diamonds %>% mutate(price = price/100,
                                   depthNEW = 10*depth + price)
diamondsNEW
## # A tibble: 53,940 x 11
##    carat cut       color clarity depth table price     x     y     z depthNEW
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55  3.26  3.95  3.98  2.43     618.
##  2 0.21  Premium   E     SI1      59.8    61  3.26  3.89  3.84  2.31     601.
##  3 0.23  Good      E     VS1      56.9    65  3.27  4.05  4.07  2.31     572.
##  4 0.290 Premium   I     VS2      62.4    58  3.34  4.2   4.23  2.63     627.
##  5 0.31  Good      J     SI2      63.3    58  3.35  4.34  4.35  2.75     636.
##  6 0.24  Very Good J     VVS2     62.8    57  3.36  3.94  3.96  2.48     631.
##  7 0.24  Very Good I     VVS1     62.3    57  3.36  3.95  3.98  2.47     626.
##  8 0.26  Very Good H     SI1      61.9    55  3.37  4.07  4.11  2.53     622.
##  9 0.22  Fair      E     VS2      65.1    61  3.37  3.87  3.78  2.49     654.
## 10 0.23  Very Good H     VS1      59.4    61  3.38  4     4.05  2.39     597.
## # … with 53,930 more rows
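
Note that the arguments of mutate() are evaluated in order, so depthNEW above is computed from the rescaled price, not the original one. A minimal sketch that checks this for the first row (the object name first_row is illustrative only):

```r
library(tidyverse)

first_row <- diamonds %>%
  mutate(price = price / 100,
         depthNEW = 10 * depth + price) %>%
  slice(1)  # keep only the first row

# The first diamond has depth 61.5 and an original price of 326,
# so depthNEW is 10 * 61.5 + 3.26, not 10 * 61.5 + 326
all.equal(first_row$depthNEW, 10 * 61.5 + 3.26)
## [1] TRUE
```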

In addition, there is also the transmute() function which does the same thing as mutate() but drops all other variables.

diamondsNEW <- diamonds %>% transmute(price = price/100,
                                      depthNEW = 10*depth + price)
diamondsNEW
## # A tibble: 53,940 x 2
##    price depthNEW
##    <dbl>    <dbl>
##  1  3.26     618.
##  2  3.26     601.
##  3  3.27     572.
##  4  3.34     627.
##  5  3.35     636.
##  6  3.36     631.
##  7  3.36     626.
##  8  3.37     622.
##  9  3.37     654.
## 10  3.38     597.
## # … with 53,930 more rows

14.3.7 group_by()

The group_by() function is typically used with the summarize()/summarise() function. We use group_by() to group sets of observations together. The arguments dictate the groups to create by specifying columns, which are typically factor or character columns.
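
On its own, group_by() does not change the data; it only records the grouping so that later functions operate by group. A minimal sketch, with the illustrative object name by_cut:

```r
library(tidyverse)

by_cut <- diamonds %>% group_by(cut)

# The rows and columns are unchanged
dim(by_cut)
## [1] 53940    10

# But the tibble now remembers its grouping
is_grouped_df(by_cut)
## [1] TRUE
group_vars(by_cut)
## [1] "cut"
n_groups(by_cut)
## [1] 5

# ungroup() removes the grouping again
is_grouped_df(ungroup(by_cut))
## [1] FALSE
```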

14.3.8 summarize()/summarise()

The functions summarize() and summarise() are the same. The arguments inside this function specify the summary statistics to create using NewColumnName = <statistic>. We use this function with group_by(), so that we can create summary statistics for each group. When we go from one function to another we still use the piping operator, %>%.

Here is an example where we group by cut, and then calculate the mean price for each cut. This results in a new data frame with a new column called PriceMean.

diamonds %>%
  group_by(cut) %>%
  summarise(PriceMean = mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   cut       PriceMean
##   <ord>         <dbl>
## 1 Fair          4359.
## 2 Good          3929.
## 3 Very Good     3982.
## 4 Premium       4584.
## 5 Ideal         3458.

We can also group by multiple columns and calculate multiple summary statistics.

diamonds %>%
  group_by(cut, color) %>%
  summarise(PriceMean = mean(price), 
            PriceMedian = median(price))
## `summarise()` regrouping output by 'cut' (override with `.groups` argument)
## # A tibble: 35 x 4
## # Groups:   cut [5]
##    cut   color PriceMean PriceMedian
##    <ord> <ord>     <dbl>       <dbl>
##  1 Fair  D         4291.       3730 
##  2 Fair  E         3682.       2956 
##  3 Fair  F         3827.       3035 
##  4 Fair  G         4239.       3057 
##  5 Fair  H         5136.       3816 
##  6 Fair  I         4685.       3246 
##  7 Fair  J         4976.       3302 
##  8 Good  D         3405.       2728.
##  9 Good  E         3424.       2420 
## 10 Good  F         3496.       2647 
## # … with 25 more rows
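
The messages printed above mention a `.groups` argument, which controls the grouping of the result. As a minimal sketch, setting `.groups = "drop"` returns an ungrouped tibble and silences the message. The object name result is illustrative only.

```r
library(tidyverse)

result <- diamonds %>%
  group_by(cut, color) %>%
  summarise(PriceMean = mean(price),
            PriceMedian = median(price),
            .groups = "drop")  # return an ungrouped tibble

# The result is no longer grouped by cut
is_grouped_df(result)
## [1] FALSE
```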

We can do several summary statistics at once.

diamonds %>%
  group_by(cut) %>%
  summarise(PriceMean = mean(price), 
            PriceMedian = median(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
##   cut       PriceMean PriceMedian
##   <ord>         <dbl>       <dbl>
## 1 Fair          4359.       3282 
## 2 Good          3929.       3050.
## 3 Very Good     3982.       2648 
## 4 Premium       4584.       3185 
## 5 Ideal         3458.       1810

14.4 Examples

14.4.1 Example 1

Create a new column that is the product of depth and carat, and call it DxC. Calculate the (arithmetic) mean of this new variable, and the (arithmetic) mean of price, for each cut.

diamonds %>%
  mutate(DxC = depth*carat) %>%
  group_by(cut) %>%
  summarise(AvgDxC = mean(DxC), 
            AvgCut = mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 3
##   cut       AvgDxC AvgCut
##   <ord>      <dbl>  <dbl>
## 1 Fair        67.2  4359.
## 2 Good        52.9  3929.
## 3 Very Good   49.9  3982.
## 4 Premium     54.6  4584.
## 5 Ideal       43.4  3458.

14.4.2 Example 2

Isolate the observations that have cut as “Ideal”. Only keep the cut, carat, depth, and price columns.

diamonds %>%
  filter(cut == "Ideal") %>%
  select(cut, carat, depth, price)
## # A tibble: 21,551 x 4
##    cut   carat depth price
##    <ord> <dbl> <dbl> <int>
##  1 Ideal  0.23  61.5   326
##  2 Ideal  0.23  62.8   340
##  3 Ideal  0.31  62.2   344
##  4 Ideal  0.3   62     348
##  5 Ideal  0.33  61.8   403
##  6 Ideal  0.33  61.2   403
##  7 Ideal  0.33  61.1   403
##  8 Ideal  0.23  61.9   404
##  9 Ideal  0.32  60.9   404
## 10 Ideal  0.3   61     405
## # … with 21,541 more rows

14.4.3 Example 3

Consider only the observations where price is larger than the median price. Determine the (arithmetic) mean and min value for the depth variable by color. Sort the results in order from smallest to largest value for (arithmetic) mean depth for each group. Display only the first 10 rows of the resulting tibble.

diamonds %>%
  filter(price > median(price)) %>%
  group_by(color) %>%
  summarize(mean_depth = mean(depth),
            min_depth = min(depth)) %>%
  arrange(mean_depth) %>%
  head(n = 10)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 3
##   color mean_depth min_depth
##   <ord>      <dbl>     <dbl>
## 1 D           61.7      55.5
## 2 E           61.7      53.1
## 3 F           61.8      55.4
## 4 G           61.8      43  
## 5 I           61.8      50.8
## 6 H           61.8      54.7
## 7 J           61.9      43

Additional Resources

To learn more about Tidyverse, check out the official website, a book on helpful information, and the official cheat sheets.