Chapter 16 Text Data
In this section we introduce strings and string operations, show how to extract and manipulate string objects, and introduce general search methods.
We focus on character objects in particular because a lot of “messy” data comes in character form. For example, web pages can be scraped, email can be analyzed for network properties, and survey responses must be processed and compared. Even if you only care about numbers, it helps to be able to extract them from text and manipulate them easily.
In general we will try to stick to the following distinction, although many people use the terms “character” and “string” interchangeably.
- Character: a symbol in a written language, specifically what you can enter at a keyboard: letters, numerals, punctuation, space, newlines, etc.
'L', 'i', 'n', 'c', 'o', 'l'
- String: a sequence of characters bound together
Lincoln
Note: R does not have a separate type for characters and strings; both are stored as type character.
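A quick check along these lines shows that a single character and a whole word report the same type. The use of `class()` here is an assumption; `typeof()` or `mode()` would report the same thing.

```r
# class() is assumed here; typeof() and mode() also return "character".
class("L")        # a single character
class("Lincoln")  # a whole word
```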
## [1] "character"
## [1] "character"
16.1 Making Strings
Use single or double quotes to construct a string, but in general it is recommended to use double quotes. This is because the R console displays character strings in double quotes regardless of how the string was created, and sometimes we might have single or double quotes in the string itself.
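A minimal sketch of the different quoting styles follows. Note how the console always prints double quotes, and escapes any double quotes that sit inside the string.

```r
print("Lincoln")   # double quotes
print('Lincoln')   # single quotes create the same string
print("Abraham Lincoln's Hat")  # a single quote inside double quotes
print("As Lincoln never said, 'Four score and seven beers ago'")
print('As Lincoln never said, "Four score and seven beers ago"')
```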
## [1] "Lincoln"
## [1] "Lincoln"
## [1] "Abraham Lincoln's Hat"
## [1] "As Lincoln never said, 'Four score and seven beers ago'"
## [1] "As Lincoln never said, \"Four score and seven beers ago\""
The space, " ", is a character; so are multiple spaces, "   ", and the empty string, "".
Some characters are special, so we have “escape characters” to specify them in strings.
- quotes within strings: \"
- tab: \t
- new line: \n
- carriage return: \r (use \n rather than \r when possible)
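As a small illustration of these escape characters, the sketch below uses `cat()` so that the escapes are interpreted rather than printed literally. The variable name `quote_text` is made up for this example.

```r
# quote_text is a hypothetical name used only for this illustration.
quote_text <- "She said \"hello\""
cat(quote_text, "\n")           # the escaped quotes appear as plain quotes
cat("column1\tcolumn2\n")       # \t inserts a tab
cat("line one\nline two\n")     # \n starts a new line

# An escape sequence counts as a single character.
nchar("\t")
```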
Recall that strings (or character objects) are one of the atomic data types, like numeric or logical. Thus strings can go into scalars, vectors, arrays, lists, or be the type of a column in a data frame. We can use nchar() to get the length of a single string.
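The difference between `length()` (number of elements) and `nchar()` (number of characters) can be sketched as follows. The example strings are assumptions chosen for illustration.

```r
# Example strings are illustrative assumptions.
beard <- "Abraham Lincoln's beard"
words <- c("Abraham", "Lincoln's", "beard")

length(beard)     # one string, so one element
length(words)     # a vector holding three strings
nchar("Abraham")  # number of characters in one string
nchar(beard)      # spaces and the apostrophe count as characters
nchar(words)      # nchar() is vectorized: one count per element
```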
## [1] 1
## [1] 3
## [1] 7
## [1] 23
## [1] 7 9 5
We can use print() to display a string, and cat() to write the string directly to the console without quotes or index markers. If you’re debugging, message() is R’s preferred function, since it writes to standard error rather than standard output.
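The three functions can be compared side by side. This sketch assumes a character vector named `presidents`, defined here so the example runs on its own.

```r
# presidents is assumed to hold these five names.
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

print("Abraham Lincoln")  # shows quotes and the [1] index
cat("Abraham Lincoln")    # writes the bare text
cat(presidents)           # elements are separated by spaces
message(presidents)       # elements are pasted together with no separator
```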
## [1] "Abraham Lincoln"
## Abraham Lincoln
## Fillmore Pierce Buchanan Davis Johnson
## FillmorePierceBuchananDavisJohnson
16.2 Substring Operations
Substring: a smaller string from the big string, but still a string in its own right.
A string is not a vector or a list, so we cannot use subscripts like [[ ]] or [ ] to extract substrings; we use substr() instead.
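For instance, assuming a string named `phrase` (a name chosen for this sketch), we can pull out characters 8 through 12:

```r
# phrase is an assumed example string.
phrase <- "Christmas Bonus"
substr(phrase, start = 8, stop = 12)
```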
## [1] "as Bo"
We can also use substr() to replace part of a string:
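A sketch of replacement by assignment, again using an assumed example string `phrase`; here the 13th character is overwritten.

```r
# phrase is an assumed example string.
phrase <- "Christmas Bonus"
substr(phrase, 13, 13) <- "g"  # overwrite the 13th character in place
phrase
```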
## [1] "Christmas Bogus"
The function substr() can also be used on vectors; it vectorizes over all its arguments:
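The following sketch, which assumes the same `presidents` vector as before, shows vectorization over both the strings and the start and stop positions. Positions past the end of a string give the empty string rather than an error.

```r
# presidents is assumed to hold these five names.
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

presidents
substr(presidents, 1, 2)                                    # first two characters
substr(presidents, nchar(presidents) - 1, nchar(presidents))  # last two characters
substr(presidents, 20, 21)                                  # beyond every string
substr(presidents, 7, 7)                                    # 7th character, if any
```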
## [1] "Fillmore" "Pierce" "Buchanan" "Davis" "Johnson"
## [1] "Fi" "Pi" "Bu" "Da" "Jo"
## [1] "re" "ce" "an" "is" "on"
## [1] "" "" "" "" ""
## [1] "r" "" "a" "" "n"
16.3 Dividing Strings into Vectors
strsplit() divides a string according to key characters, by splitting each element of the character vector x at appearances of the pattern split.
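As a sketch, compare splitting on a comma alone with splitting on a comma followed by a space. The variable name `scarborough.fair` is an assumption.

```r
# scarborough.fair is an assumed variable name.
scarborough.fair <- "parsley, sage, rosemary, thyme"

strsplit(scarborough.fair, ",")   # leading spaces remain on later pieces
strsplit(scarborough.fair, ", ")  # the space is part of the separator
```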
## [[1]]
## [1] "parsley" " sage" " rosemary" " thyme"
## [[1]]
## [1] "parsley" "sage" "rosemary" "thyme"
The pattern is recycled over the elements of the input vector:
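A sketch with three input strings and a single split pattern (the variable names are assumptions):

```r
# scarborough.fair and parts are assumed variable names.
scarborough.fair <- "parsley, sage, rosemary, thyme"
parts <- strsplit(c(scarborough.fair, "Garfunkel, Oates", "Clement, McKenzie"),
                  split = ", ")
parts
```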
## [[1]]
## [1] "parsley" "sage" "rosemary" "thyme"
##
## [[2]]
## [1] "Garfunkel" "Oates"
##
## [[3]]
## [1] "Clement" "McKenzie"
Note that strsplit() outputs a list of character vectors, one per input string.
16.4 Converting Objects into Strings
Explicitly converting one variable type to another is called casting. Notice that the number “7.2e12” is printed as supplied, but “7.2e5” is not. This is because if a number is exceedingly large or very close to zero, R will by default use scientific notation for that number.
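A sketch of casting numbers to strings with `as.character()`:

```r
as.character(7.2)             # a short number keeps its fixed form
as.character(7.2e12)          # a very large number stays in scientific notation
as.character(c(7.2, 7.2e12))  # works element by element on vectors
as.character(7.2e5)           # written out in full
```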
## [1] "7.2"
## [1] "7.2e+12"
## [1] "7.2" "7.2e+12"
## [1] "720000"
16.5 Versatility of the paste() Function
The paste() function is very flexible. With one vector argument, it works like as.character().
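For example:

```r
paste(41:45)  # each number becomes its own string
```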
## [1] "41" "42" "43" "44" "45"
With 2 or more vector arguments, it combines them with recycling.
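A sketch of pasting several vectors together, assuming the `presidents` vector used earlier. Shorter arguments such as `c("R", "D")` and the single strings are recycled to the length of the longest one.

```r
# presidents is assumed to hold these five names.
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

paste(presidents, 41:45)                          # element-wise, joined by a space
paste(presidents, c("R", "D"))                    # the two letters are recycled
paste(presidents, "(", c("R", "D"), 41:45, ")")   # any number of arguments
```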
## [1] "Fillmore 41" "Pierce 42" "Buchanan 43" "Davis 44" "Johnson 45"
## [1] "Fillmore R" "Pierce D" "Buchanan R" "Davis D" "Johnson R"
## [1] "Fillmore ( R 41 )" "Pierce ( D 42 )" "Buchanan ( R 43 )"
## [4] "Davis ( D 44 )" "Johnson ( R 45 )"
We can change the separator between pasted-together terms with the sep argument.
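A sketch using the `sep` argument, again assuming the `presidents` vector:

```r
# presidents is assumed to hold these five names.
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

paste(presidents, " (", 41:45, ")", sep = "_")  # underscore between every term
paste(presidents, " (", 41:45, ")", sep = "")   # no separator at all
```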
## [1] "Fillmore_ (_41_)" "Pierce_ (_42_)" "Buchanan_ (_43_)" "Davis_ (_44_)"
## [5] "Johnson_ (_45_)"
## [1] "Fillmore (41)" "Pierce (42)" "Buchanan (43)" "Davis (44)"
## [5] "Johnson (45)"
We can also condense multiple strings into one using the collapse argument.
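A sketch combining `sep` and `collapse`, assuming the `presidents` vector; the result is a single string rather than a vector of five.

```r
# presidents and one_string are assumed names.
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

one_string <- paste(presidents, " (", 41:45, ")", sep = "", collapse = "; ")
one_string
length(one_string)  # a single element
```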
## [1] "Fillmore (41); Pierce (42); Buchanan (43); Davis (44); Johnson (45)"
The default value of collapse is NULL; that is, paste() will not collapse the result unless asked to.
16.7 Text of Some Importance
Often we will want to study or analyze a block of text. Consider the following quote from Abraham Lincoln.
“If we shall suppose that American slavery is one of those offenses which, in the providence of God, must needs come, but which, having continued through His appointed time, He now wills to remove, and that He gives to both North and South this terrible war as the woe due to those by whom the offense came, shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away. Yet, if God wills that it continue until all the wealth piled by the bondsman’s two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said “the judgments of the Lord are true and righteous altogether”.”
We can read in the file with the following commands.
the_url <- "https://raw.githubusercontent.com/rpkgarcia/LearnRBook/main/data_sets/al1.txt"
al1 <- readLines(the_url, warn = FALSE)
# How many lines in the file
length(al1)
## [1] 1
## [1] "If we shall suppose that American slavery is one of those offenses which, in the providence of God, must needs come, but which, having continued through His appointed time, He now wills to remove, and that He gives to both North and South this terrible war as the woe due to those by whom the offense came, shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away. Yet, if God wills that it continue until all the wealth piled by the bondsman’s two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said “the judgments of the Lord are true and righteous altogether”."
Let’s create a new vector where each element is a portion of text separated by a comma “,”.
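The idea can be sketched on a short excerpt of the quote, so that the example runs without downloading the file. The names `excerpt` and `phrases` are hypothetical; applying the same call to the full `al1` string should give the longer vector of pieces.

```r
# excerpt and phrases are hypothetical names; excerpt is one sentence of the quote.
excerpt <- "Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away."

# strsplit() returns a list; [[1]] extracts the character vector for our one string.
phrases <- strsplit(excerpt, split = ",")[[1]]
phrases
```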
## [1] "If we shall suppose that American slavery is one of those offenses which"
## [2] " in the providence of God"
## [3] " must needs come"
## [4] " but which"
## [5] " having continued through His appointed time"
## [6] " He now wills to remove"
## [7] " and that He gives to both North and South this terrible war as the woe due to those by whom the offense came"
## [8] " shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope"
## [9] " fervently do we pray"
## [10] " that this mighty scourge of war may speedily pass away. Yet"
## [11] " if God wills that it continue until all the wealth piled by the bondsman’s two hundred and fifty years of unrequited toil shall be sunk"
## [12] " and until every drop of blood drawn with the lash shall be paid by another drawn with the sword"
## [13] " as was said three thousand years ago"
## [14] " so still it must be said “the judgments of the Lord are true and righteous altogether”."
16.8 Search
We can search through text strings for certain patterns. Two particularly helpful functions for doing this are grep() and grepl(). Use grep() to find which strings contain a matching search term; it returns the indices of the matching elements. grepl() instead returns a logical vector with one TRUE or FALSE per element.
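A sketch on a small hand-made vector of phrases (the vector `phrases` is an assumption, built from fragments of the quote):

```r
# phrases is a hypothetical vector built from fragments of the quote.
phrases <- c("If we shall suppose",
             " in the providence of God",
             " must needs come",
             " if God wills that it continue")

grep("God", phrases)                # indices of the matching elements
grepl("God", phrases)               # one TRUE/FALSE per element
phrases[grep("God", phrases)]       # the matching strings themselves
grep("God", phrases, value = TRUE)  # same strings in a single step
```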
## [1] 2 8 11
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
## [13] FALSE FALSE
## [1] " in the providence of God"
## [2] " shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope"
## [3] " if God wills that it continue until all the wealth piled by the bondsman’s two hundred and fifty years of unrequited toil shall be sunk"
16.9 Word Count Tables
Now let’s break up the text by spaces, in the hope that each word becomes its own element.
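Sketched on a one-sentence excerpt (the names `excerpt` and `words` are hypothetical; the same call on the full `al1` string gives the full word vector):

```r
# excerpt and words are hypothetical names.
excerpt <- "Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away."

words <- strsplit(excerpt, split = " ")[[1]]
head(words)  # note that "hope," keeps its comma
```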
## [1] "If" "we" "shall" "suppose" "that" "American"
We can now tabulate how often each word appears using the table() function. Then we can sort the frequencies in order using sort().
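A sketch of the tabulate-then-sort step on a short excerpt. The name `wc` matches the word count table used later in this section; `excerpt` and `words` are hypothetical.

```r
# excerpt and words are hypothetical names; wc is the word count table.
excerpt <- "Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away."
words <- strsplit(excerpt, split = " ")[[1]]

wc <- table(words)                # count how often each distinct word appears
wc <- sort(wc, decreasing = TRUE) # most frequent words first
head(wc)
```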
## al1.words
## the of and shall that to
## 9 6 5 4 4 4
## we be by those as do
## 4 3 3 3 2 2
## drawn God He in it must
## 2 2 2 2 2 2
## said this until war which, wills
## 2 2 2 2 2 2
## with years “the a ago, all
## 2 2 1 1 1 1
## altogether”. always American another any appointed
## 1 1 1 1 1 1
## are ascribe attributes away. believers blood
## 1 1 1 1 1 1
## bondsman’s both but came, come, continue
## 1 1 1 1 1 1
## continued departure discern divine drop due
## 1 1 1 1 1 1
## every fervently fifty Fondly from gives
## 1 1 1 1 1 1
## God, having Him? His hope, hundred
## 1 1 1 1 1 1
## if If is judgments lash living
## 1 1 1 1 1 1
## Lord may mighty needs North now
## 1 1 1 1 1 1
## offense offenses one paid pass piled
## 1 1 1 1 1 1
## pray, providence remove, righteous scourge slavery
## 1 1 1 1 1 1
## so South speedily still sunk, suppose
## 1 1 1 1 1 1
## sword, terrible therein thousand three through
## 1 1 1 1 1 1
## time, toil true two unrequited was
## 1 1 1 1 1 1
## wealth which whom woe Yet,
## 1 1 1 1 1
Notice that with these methods punctuation is still attached to the words.
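To see the consequence, here is a sketch on a tiny made-up word vector (not the chapter's own code): a word that only ever appears with punctuation attached cannot be looked up by its bare form.

```r
# words is a made-up vector for illustration.
words <- c("He", "gives", "to", "Him?", "He", "now", "wills")
wc <- table(words)

wc["He"]    # found: counted twice
wc["Him"]   # NA: only "Him?" (with the question mark) is in the table
wc["Him?"]  # found when the punctuation is included
```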
## He
## 2
## <NA>
## NA
In addition, all of our word lookups and string comparisons are case sensitive.
# What happens when we look for a word that is not in our
# word count table?
which(names(wc) == "That")
## integer(0)
## <NA>
## NA
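One possible way to sidestep case sensitivity, offered as a sketch rather than the chapter's own approach, is to convert every word to lower case with `tolower()` before tabulating. The vector `words` is made up for illustration.

```r
# words is a made-up vector for illustration.
words <- c("If", "we", "if", "we")

table(words)           # "If" and "if" are counted separately
table(tolower(words))  # after lower-casing they are counted together
```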