Chapter 6 Working with Data Sets
In this chapter we discuss different methods for loading data sets into our R session. There are many file types we can create and import. We will focus on csv files because they tend to be easier to import and are among the most commonly used file types. In the second half of this chapter we introduce some basic data manipulation strategies, beyond indexing, along with helpful functions for working with data sets.
6.1 Getting Data Sets in Our Working Environment
Built-In Data
As discussed last week, there are built-in objects which are not loaded into the global environment but can be called upon at any time. For example, pi returns the value 3.1415927. Similarly, there are built-in data sets that are ready to be used and loaded at a moment's notice.
- To see a list of built-in data sets, type data() in the console.
- These data sets can be used even if they are not listed in the global environment. If you would like to load one into the global environment, pass its name to the data() function.
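The two steps above can be sketched as follows. The data set mtcars is chosen here purely for illustration; the text does not say which built-in data set was used.

```r
# Typing data() at the console opens a listing of the available data sets.
# Here the listing is stored so it can be inspected as a table.
available <- data()
head(available$results[, c("Item", "Title")])

# Load one built-in data set into the global environment.
# mtcars is an illustrative choice, not one named in the text.
data("mtcars")
head(mtcars)
```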
Importing From Your Computer
Although built-in data sets are convenient, most of the time we need to load our own data sets. We do this with a function designed for the file type of interest, which usually takes the file path as an argument. There are many ways to do this; we will only go over two.
Option 1
- Download the file InsectData.csv from ELearn. Save this file in a spot on your computer that you will remember.
- In the Environment window (upper right window in the default layout), click on the Import Dataset button. A drop-down menu will appear. Select the From Text (base)… option. Find the file InsectData.csv and select it.
- A pop-up menu will appear giving you options for loading the file and showing a preview of what the file will look like once loaded. Select the appropriate options and click Import.
A new line of code is generated in the console, which reads the data into your current environment. Copy and paste this line into your R script if you would like to save it for later. You will have to rerun it each time you start a new R session and want to use this file.
Option 2
- Download the file InsectData.csv from ELearn. Save this file in a spot on your computer that you will remember.
- In the lower right-hand window, select the Files tab. Now navigate to the file InsectData.csv which you have saved.
- Click on the file InsectData.csv to see a drop-down menu. Select Import Dataset…
- A window will appear with options and a preview of your file. Select the appropriate options if needed, then click Import.
A new line of code is generated in the console, which reads the data into your current environment. Copy and paste this line into your R script if you would like to save it for later. You will have to rerun it each time you start a new R session and want to use this file.
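The generated line is typically a call to read.csv() with the path to the file. The sketch below imitates that step. Because the real contents of InsectData.csv are not shown here, it first writes a small stand-in file with invented columns to a temporary folder.

```r
# Stand-in for the downloaded file: these columns and values are invented
# for illustration and are not the real contents of InsectData.csv.
demo <- data.frame(species = c("beetle", "moth", "ant"),
                   count = c(12, 7, 30))
csv_path <- file.path(tempdir(), "InsectData.csv")
write.csv(demo, csv_path, row.names = FALSE)

# The import step: read the csv file at the given path into a data frame.
InsectData <- read.csv(csv_path)
str(InsectData)
```

On your own computer, csv_path would be replaced by the location where you saved the file.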
Import From Online
We can also load data sets that are posted online, and there are a variety of ways to do so. Below is one option. With this method we use the same InsectData.csv file, but it has been posted online. We pass the URL where the data set is posted to the read.csv() function in order to open the file.
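The URL of the posted file is not given here, so the sketch below uses a stand-in: a small csv file with invented columns is written locally and addressed through a file:// URL. With a real web address, only the string passed to read.csv() would change.

```r
# Stand-in for the posted file. The real URL is not given in the text, so a
# small csv with invented columns is written locally and addressed by a
# file:// URL (this path form assumes a Unix-like system).
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(species = c("beetle", "moth"), count = c(12, 7)),
          demo_path, row.names = FALSE)
demo_url <- paste0("file://", demo_path)

# read.csv() accepts a URL in place of a file path.
InsectData <- read.csv(demo_url)
InsectData
```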
6.2 Basic Data Manipulation
Let's recall a few useful things about data frames. As we have already learned, data sets are stored in an object called a data frame. One can view this as a specialized table or matrix of rows and columns, where each column is a variable, such as height or age, and each row is a single observation. All of the values within a column must be the same data type (numeric, factor, logical, etc.). Data frames can be created or called within R, imported from text or spreadsheet files, or imported from the web.
group <- c("G1", "G2", "G1", "G1", "G2")
age <- c(35, 30, 31, 28, 40)
height <- c(65, 70, 60, 72, 68)
pets <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
mydata <- data.frame(group, age, height, pets)
mydata
## group age height pets
## 1 G1 35 65 TRUE
## 2 G2 30 70 TRUE
## 3 G1 31 60 FALSE
## 4 G1 28 72 FALSE
## 5 G2 40 68 TRUE
The summary() function is a powerful command that gives you some summary statistics about the variables in the data frame.
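The call itself is not shown, but output like the table below would come from applying summary() to the whole data frame. In this sketch mydata is rebuilt so that the example runs by itself.

```r
# Rebuild the example data frame from earlier in this section.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# One summary per column; the statistics reported depend on the column type.
summary(mydata)
```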
## group age height pets
## Length:5 Min. :28.0 Min. :60 Mode :logical
## Class :character 1st Qu.:30.0 1st Qu.:65 FALSE:2
## Mode :character Median :31.0 Median :68 TRUE :3
## Mean :32.8 Mean :67
## 3rd Qu.:35.0 3rd Qu.:70
## Max. :40.0 Max. :72
The summary statistics are listed below the names of the variables. Since group is a character variable, R reports only its length, class, and mode. Since pets is a logical variable, R gives you the frequencies of each unique value. In this example there are three values of TRUE and two values of FALSE. Since age and height are numeric, R computes and returns the minimum, 1st quartile (25th percentile), median, mean, 3rd quartile (75th percentile), and maximum values. If you have many data values, this is a quick way to get a feel for how the data are distributed.
Just as we did for vectors, we can also use the table() function to tabulate categorical data. Let's create a frequency table for the different groups.
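A minimal sketch of a call that gives these counts, assuming the group column is selected with the $ operator (the original command is not shown):

```r
# Rebuild the example data frame so the sketch is self-contained.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Count how many observations fall in each group.
table(mydata$group)
```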
##
## G1 G2
## 3 2
We can also create a frequency table of pet status for both groups.
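Passing two variables to table() gives a two-way table, with the first variable as rows and the second as columns. The sketch below assumes that is how the table that follows was produced.

```r
# Rebuild the example data frame so the sketch is self-contained.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Rows are groups, columns are pet status.
table(mydata$group, mydata$pets)
```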
##
## FALSE TRUE
## G1 2 1
## G2 0 2
Subset
We have already discussed how powerful indexing techniques can be, and the various ways to use indexing to subset a data set. The subset() function accomplishes many of the same tasks and can be used as an alternative to many indexing operations. For example, we can subset a data frame by isolating all rows that belong to group “G1”.
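A sketch of this subset, storing the result under the name group1, which is the name the chapter uses for it later. The equivalent indexing expression is included for comparison.

```r
# Rebuild the example data frame so the sketch is self-contained.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Keep only the rows where group equals "G1".
group1 <- subset(mydata, group == "G1")
group1

# The same rows selected with logical indexing.
mydata[mydata$group == "G1", ]
```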
## group age height pets
## 1 G1 35 65 TRUE
## 3 G1 31 60 FALSE
## 4 G1 28 72 FALSE
To subset by all values which are NOT equal to a condition, we can use the logical operator !=.
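A sketch of the negated condition, storing the result as group2 (the name used for it later in the chapter):

```r
# Rebuild the example data frame so the sketch is self-contained.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Keep only the rows where group is not "G1".
group2 <- subset(mydata, group != "G1")
group2
```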
## group age height pets
## 2 G2 30 70 TRUE
## 5 G2 40 68 TRUE
Adding Columns
One can add a new variable (column) to a data frame by defining a new variable and assigning values to it. Below we add a weight variable to the data frame.
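One way to do this is to assign a vector to a new column name with the $ operator. The weights below are taken from the output that follows; the exact command used is an assumption.

```r
# Rebuild the example data frame so the sketch is self-contained.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE))

# Assigning to a column name that does not exist yet creates the column.
mydata$weight <- c(169, 161, 149, 165, 155)
mydata$weight
mydata
```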
## [1] 169 161 149 165 155
## group age height pets weight
## 1 G1 35 65 TRUE 169
## 2 G2 30 70 TRUE 161
## 3 G1 31 60 FALSE 149
## 4 G1 28 72 FALSE 165
## 5 G2 40 68 TRUE 155
We can also add a new column using the cbind() (column bind) function.
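In the sketch below, a vector named wghts is bound to the data frame as a new column. The output that follows shows wghts holding the same values as weight; how the vector was originally created is not shown, so copying the weight column is an assumption.

```r
# Rebuild the example data frame, including the weight column.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                     weight = c(169, 161, 149, 165, 155))

# Assumed source of the new column: a copy of the existing weights.
wghts <- mydata$weight

# cbind() appends the vector as a column named after the object.
mydata <- cbind(mydata, wghts)
mydata
```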
## group age height pets weight wghts
## 1 G1 35 65 TRUE 169 169
## 2 G2 30 70 TRUE 161 161
## 3 G1 31 60 FALSE 149 149
## 4 G1 28 72 FALSE 165 165
## 5 G2 40 68 TRUE 155 155
NA values
In addition, if we have a missing value, or a blank value, we can use the object NA to indicate the lack of a value.
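A sketch of a column with missing entries, using the fav_color values shown in the output that follows. NA is written without quotes so that R treats it as missing rather than as the text "NA".

```r
# Rebuild the example data frame with the columns added so far.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                     weight = c(169, 161, 149, 165, 155),
                     wghts = c(169, 161, 149, 165, 155))

# Two people did not report a favorite color.
mydata$fav_color <- c("Red", NA, "Purple", NA, "Red")
mydata
```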
## group age height pets weight wghts fav_color
## 1 G1 35 65 TRUE 169 169 Red
## 2 G2 30 70 TRUE 161 161 <NA>
## 3 G1 31 60 FALSE 149 149 Purple
## 4 G1 28 72 FALSE 165 165 <NA>
## 5 G2 40 68 TRUE 155 155 Red
We can check for NA values using the is.na() function.
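Applied to a whole data frame, is.na() returns a logical matrix of the same shape, with TRUE wherever a value is missing. A minimal sketch, with a smaller version of the data so the example stays short:

```r
# A reduced version of the example data with two missing colors.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     fav_color = c("Red", NA, "Purple", NA, "Red"))

# TRUE marks each missing entry.
is.na(mydata)

# Summing the TRUE values by column counts the missing entries per variable.
colSums(is.na(mydata))
```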
## group age height pets weight wghts fav_color
## [1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## [5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We can remove all rows with NA values using na.omit().
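A sketch of the call on a smaller version of the data. Note that na.omit() returns a new data frame; the original is unchanged unless the result is assigned back to it.

```r
# A reduced version of the example data with two missing colors.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     fav_color = c("Red", NA, "Purple", NA, "Red"))

# Rows 2 and 4 contain an NA, so they are left out of the result.
complete_rows <- na.omit(mydata)
complete_rows

# The original data frame still has all five rows.
nrow(mydata)
```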
## group age height pets weight wghts fav_color
## 1 G1 35 65 TRUE 169 169 Red
## 3 G1 31 60 FALSE 149 149 Purple
## 5 G2 40 68 TRUE 155 155 Red
NULL
One can drop a variable (column) by setting it equal to the R value NULL.
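The output that follows no longer has the wghts column, so the sketch below assumes that is the column being dropped.

```r
# Rebuild the example data frame with all columns added so far.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                     weight = c(169, 161, 149, 165, 155),
                     wghts = c(169, 161, 149, 165, 155),
                     fav_color = c("Red", NA, "Purple", NA, "Red"))

# Assigning NULL to a column removes it from the data frame.
mydata$wghts <- NULL
mydata
```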
## group age height pets weight fav_color
## 1 G1 35 65 TRUE 169 Red
## 2 G2 30 70 TRUE 161 <NA>
## 3 G1 31 60 FALSE 149 Purple
## 4 G1 28 72 FALSE 165 <NA>
## 5 G2 40 68 TRUE 155 Red
Be careful using these methods. Once a variable or row is dropped, it’s gone.
Adding Rows
Rows can be added to a data frame using the rbind() (row bind) function. Because our columns have different data types, we will create a list object and then add it as a new row.
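A sketch of adding the sixth row shown in the output that follows. The name newrow is hypothetical, and naming the list elements after the columns is a choice made here for clarity. A list is used because, unlike a vector, it can hold character, numeric, and logical values together.

```r
# Rebuild the example data frame as it stands at this point.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE),
                     weight = c(169, 161, 149, 165, 155),
                     fav_color = c("Red", NA, "Purple", NA, "Red"))

# A list keeps each value in its own type.
newrow <- list(group = "G1", age = 23, height = 62, pets = FALSE,
               weight = 160, fav_color = "Blue")

# Append the list as a new row.
mydata <- rbind(mydata, newrow)
mydata
```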
## group age height pets weight fav_color
## 1 G1 35 65 TRUE 169 Red
## 2 G2 30 70 TRUE 161 <NA>
## 3 G1 31 60 FALSE 149 Purple
## 4 G1 28 72 FALSE 165 <NA>
## 5 G2 40 68 TRUE 155 Red
## 6 G1 23 62 FALSE 160 Blue
We can also use rbind() to append one data frame to another. We can do this if the variables group1 and group2 created above still exist in your R environment.
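A sketch of the append, with group1 and group2 recreated from the original four-column data frame so the example runs by itself. The name combined is hypothetical. Both data frames must have the same columns for rbind() to stack them.

```r
# Rebuild the original four-column data frame and the two subsets.
mydata <- data.frame(group = c("G1", "G2", "G1", "G1", "G2"),
                     age = c(35, 30, 31, 28, 40),
                     height = c(65, 70, 60, 72, 68),
                     pets = c(TRUE, TRUE, FALSE, FALSE, TRUE))
group1 <- subset(mydata, group == "G1")
group2 <- subset(mydata, group != "G1")

group1
group2

# Stack group2 underneath group1; the original row names are kept.
combined <- rbind(group1, group2)
combined
```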
## group age height pets
## 1 G1 35 65 TRUE
## 3 G1 31 60 FALSE
## 4 G1 28 72 FALSE
## group age height pets
## 2 G2 30 70 TRUE
## 5 G2 40 68 TRUE
## group age height pets
## 1 G1 35 65 TRUE
## 3 G1 31 60 FALSE
## 4 G1 28 72 FALSE
## 2 G2 30 70 TRUE
## 5 G2 40 68 TRUE