Chapter 6 Working with Data Sets

In this section we discuss different methods for loading data sets into our R session. There are many different files we can create and import. We will focus our attention on loading csv files because they tend to be easier to import, and they are one of the more typical file types that are used. In the second half of the this chapter we introduce some more basic data manipulation strategies and helpful functions when working with data sets beyond indexing.

6.1 Getting Data Sets in Our Working Environment

Built-In Data

As discussed last week, there are built in objects which are not loaded into the global environment, but can be called upon at any time. For example pi returns the value 3.1415927. Similarly, there are built in data sets that are ready to be used and loaded at a moments notice.

  1. To see a list of built in data sets type in the console:
data()
  1. These data sets can be used even if they are not listed in the global environment. For example, if you would like to load the data set in the global environment, run the following command:
data("cars")

Importing From Your Computer

Although built-in data sets are convient, most the time we need to load our own datasets. We load our own data sets by using a function specifically designed for the file type of interest. This function usually uses the file path location as an argument. This can be done in many different ways; however, we will only go over two.

Option 1

  1. Download the file InsectData.csv from ELearn. Save this file in a spot in your computer you will remember.

  2. In the Environment window (upper left window), click on the Import Dataset button. A drop down menu will appear. Select the From Text (base)… option. Find the file InsectData.csv and select it.

  3. A pop up menu will appear giving you options for loading in the file, and showing a preview of what the file will look like once loaded. Select the appropriate options and click Import.

  4. A new line of code has generated in the console which will read the data into your current environment. Copy and paste this into your R script document if you would like to save this line of code for later. You will have to reload this file into your environment each time you start a new R session and would like to use this file.

Option 2

  1. Download the file InsectData.csv from ELearn. Save this file in a spot in your computer you will remember.

  2. In the lower right hand window select the File tab. Now search for the file which you have saved InsectData.csv.

  3. Click on the file InsectData.csv in order to see a dropdown menu. Select Import Dataset…

  4. A window will appear which will give you options and a preview of your file. Select appropriate options if needed then click Import.

  5. A new line of code has generated in the console which will read the data into your current environment. Copy and paste this into your R script document if you would like to save this line of code for later. You will have to reload this file into your environment each time you start a new R session and would like to use this file.

Import From Online

We can also download data sets from online in a variety of different ways. Below is one option. With this method we are using the same InsectData.csv file, but it has been posted online. We feed the url of where the data set has been posted into the read.csv() function in order to open the file.

the_url <- "https://raw.githubusercontent.com/rpkgarcia/LearnRBook/main/data_sets/InsectData.csv"
the_data <- read.csv(the_url)

6.2 Basic Data Manipulation

Lets recall a few useful things about data frames. As we learned already, data sets are contained in an object called a data frame. One can view this as a specialized table or matrix of rows and columns, where each column is a data variable, such as height or age, and each row is a single observation. All of the values within a column must be the same data type (numeric,factor, logical, etc.). Data frames can be created or called within R, imported from text or spreadsheet files, or imported from the web.

group <- c("G1", "G2", "G1", "G1", "G2")
age <- c(35, 30, 31, 28, 40)
height <- c(65, 70, 60, 72, 68)
pets <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

mydata <- data.frame(group, age, height, pets)
mydata
##   group age height  pets
## 1    G1  35     65  TRUE
## 2    G2  30     70  TRUE
## 3    G1  31     60 FALSE
## 4    G1  28     72 FALSE
## 5    G2  40     68  TRUE

The summary() function is a powerful command that gives you some summary statistics about the variables in the data frame.

summary(mydata)
##     group                age           height      pets        
##  Length:5           Min.   :28.0   Min.   :60   Mode :logical  
##  Class :character   1st Qu.:30.0   1st Qu.:65   FALSE:2        
##  Mode  :character   Median :31.0   Median :68   TRUE :3        
##                     Mean   :32.8   Mean   :67                  
##                     3rd Qu.:35.0   3rd Qu.:70                  
##                     Max.   :40.0   Max.   :72

The summary statistics are listed below the names of the variables. Since pets is a logical variable, R gives you the frequencies of each unique value. In this example there are three values of TRUE and two values of FALSE. Since age and weight are numeric, R computes and returns the minimum, 1st quartile (25th percentile), median, mean, 3rd quartile (75th percentile), and maximum values. If you have many data values, this is a quick way to get a feel for how the data are distributed.

Just like we did for vectors, we can also use the table() to cross-tabulate categorical data. Let’s create a frequency table for the different groups.

table(mydata$group)
## 
## G1 G2 
##  3  2

We can also create a frequency table of pet status for both groups.

table(mydata$group, mydata$pets)
##     
##      FALSE TRUE
##   G1     2    1
##   G2     0    2

Subset

We already discussed how powerful indexing techniques can be, and various different ways to use indexing to subset a data set. We also have the subset() function which accomplishes much of the same tasks, and can be used as an alternative to many indexing operations. For example, we can subset a data frame by isolating all rows that belong to group “G1”.

group1 <- subset(mydata, mydata$group == "G1")
group1
##   group age height  pets
## 1    G1  35     65  TRUE
## 3    G1  31     60 FALSE
## 4    G1  28     72 FALSE

To subset by all values which are NOT equal to a condition we can use the logical operator !=.

group2 <- subset(mydata, mydata$group != "G1")
group2
##   group age height pets
## 2    G2  30     70 TRUE
## 5    G2  40     68 TRUE

Adding Columns

One can add a new variable (column) to a data frame by defining a new variable and assigning values to it. Below we add a weight variable to the data frame.

wghts <- c(169, 161, 149, 165, 155)
wghts
## [1] 169 161 149 165 155
mydata$weight <- wghts
mydata
##   group age height  pets weight
## 1    G1  35     65  TRUE    169
## 2    G2  30     70  TRUE    161
## 3    G1  31     60 FALSE    149
## 4    G1  28     72 FALSE    165
## 5    G2  40     68  TRUE    155

We can also add a new column using the cbind() function.

mydata <- cbind(mydata, wghts)
mydata
##   group age height  pets weight wghts
## 1    G1  35     65  TRUE    169   169
## 2    G2  30     70  TRUE    161   161
## 3    G1  31     60 FALSE    149   149
## 4    G1  28     72 FALSE    165   165
## 5    G2  40     68  TRUE    155   155

NA values

In addition, if we have a missing value, or a blank value, we can use the object NA to indicate the lack of a value.

fav_color <- c("Red", NA, "Purple", NA, "Red")

mydata <- cbind(mydata, fav_color)
mydata
##   group age height  pets weight wghts fav_color
## 1    G1  35     65  TRUE    169   169       Red
## 2    G2  30     70  TRUE    161   161      <NA>
## 3    G1  31     60 FALSE    149   149    Purple
## 4    G1  28     72 FALSE    165   165      <NA>
## 5    G2  40     68  TRUE    155   155       Red

We can drop check for NA values using the is.na() function.

is.na(mydata)
##      group   age height  pets weight wghts fav_color
## [1,] FALSE FALSE  FALSE FALSE  FALSE FALSE     FALSE
## [2,] FALSE FALSE  FALSE FALSE  FALSE FALSE      TRUE
## [3,] FALSE FALSE  FALSE FALSE  FALSE FALSE     FALSE
## [4,] FALSE FALSE  FALSE FALSE  FALSE FALSE      TRUE
## [5,] FALSE FALSE  FALSE FALSE  FALSE FALSE     FALSE

We can remove all rows with NA values using na.omit().

na.omit(mydata)
##   group age height  pets weight wghts fav_color
## 1    G1  35     65  TRUE    169   169       Red
## 3    G1  31     60 FALSE    149   149    Purple
## 5    G2  40     68  TRUE    155   155       Red

NULL

One can drop a variable (column) by setting it equal to the R value NULL.

mydata$wghts <- NULL
mydata
##   group age height  pets weight fav_color
## 1    G1  35     65  TRUE    169       Red
## 2    G2  30     70  TRUE    161      <NA>
## 3    G1  31     60 FALSE    149    Purple
## 4    G1  28     72 FALSE    165      <NA>
## 5    G2  40     68  TRUE    155       Red

Be careful using these methods. Once a variable or row is dropped, it’s gone.

Adding Rows

Rows can be added to a data frame using the rbind() (row bind) function. Because our columns have different data types, we will create a list object and then add it as a new row.

newobs <- list("G1", 23, 62, FALSE, 160, "Blue")
newdata <- rbind(mydata, newobs)
newdata
##   group age height  pets weight fav_color
## 1    G1  35     65  TRUE    169       Red
## 2    G2  30     70  TRUE    161      <NA>
## 3    G1  31     60 FALSE    149    Purple
## 4    G1  28     72 FALSE    165      <NA>
## 5    G2  40     68  TRUE    155       Red
## 6    G1  23     62 FALSE    160      Blue

We can also use rbind() to append one data frame to another. We can do this with the variables group1 and group2 created above still exist in your R environment.

group1
##   group age height  pets
## 1    G1  35     65  TRUE
## 3    G1  31     60 FALSE
## 4    G1  28     72 FALSE
group2
##   group age height pets
## 2    G2  30     70 TRUE
## 5    G2  40     68 TRUE
rbind(group1, group2)
##   group age height  pets
## 1    G1  35     65  TRUE
## 3    G1  31     60 FALSE
## 4    G1  28     72 FALSE
## 2    G2  30     70  TRUE
## 5    G2  40     68  TRUE

Summary

  • How to get data
    • Use default data sets data()
    • Load csv files with read.csv() and read_csv()
    • Download data from online
  • Data manipulation functions to keep in mind
    • table(), summary()
    • subset()
    • rbind(), cbind()
    • na.omit(), is.na()