Chapter 16 Text Data

In this section we introduce strings and string operations, show how to extract and manipulate string objects, and give an introduction to general search methods.

We focus on character objects in particular because a lot of “messy” data comes in character form. For example, web pages can be scraped, email can be analyzed for network properties, and survey responses must be processed and compared. Even if you only care about numbers, it helps to be able to extract them from text and manipulate them easily.

In general we will try to stick to the following distinction, although many people use the terms “character” and “string” interchangeably.

  • Character: a symbol in a written language, specifically what you can enter at a keyboard: letters, numerals, punctuation, space, newlines, etc.
'L', 'i', 'n', 'c', 'o', 'l'
  • String: a sequence of characters bound together
Lincoln

Note: R does not have separate types for characters and strings; both have class "character".

class("L")
## [1] "character"
class("Lincoln")
## [1] "character"

16.1 Making Strings

Use single or double quotes to construct a string, but in general it is recommended to use double quotes. This is because the R console displays character strings in double quotes regardless of how the string was created, and sometimes we need single or double quotes inside the string itself.

'Lincoln'
## [1] "Lincoln"
"Lincoln"
## [1] "Lincoln"
"Abraham Lincoln's Hat"
## [1] "Abraham Lincoln's Hat"
"As Lincoln never said, 'Four score and seven beers ago'"
## [1] "As Lincoln never said, 'Four score and seven beers ago'"
'As Lincoln never said, "Four score and seven beers ago"'
## [1] "As Lincoln never said, \"Four score and seven beers ago\""

The space, " ", is a character; so are multiple spaces, "   ", and the empty string, "".

Some characters are special, so we have “escape characters” to specify them in strings:

  • quotes within strings: \"
  • tab: \t
  • new line: \n
  • carriage return: \r (use \n rather than \r when possible)
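As a small illustration of the escape characters above, the sketch below checks how they are counted and displayed. The variable names tabbed and quoted are made up for this example.

```r
# An escape sequence is typed with a backslash but counts as ONE character.
tabbed <- "name\tvalue"
nchar(tabbed)     # 4 letters + 1 tab + 5 letters
nchar("")         # the empty string has no characters
nchar("   ")      # three spaces are three characters

# print() shows the escapes as typed; cat() interprets them.
quoted <- "She said \"hi\"\n"
print(quoted)
cat(quoted)
```

Comparing the print() and cat() lines is a quick way to see what a string really contains when escapes are involved.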

Recall that strings (or character objects) are one of the atomic data types, like numeric or logical. Thus strings can go into scalars, vectors, arrays, lists, or be the type of a column in a data frame. Note that length() counts the elements of a vector, not the characters in a string; we use nchar() to get the number of characters in each string.

length("Abraham Lincoln's beard")
## [1] 1
length(c("Abraham", "Lincoln's", "beard"))
## [1] 3
nchar("Abraham")
## [1] 7
nchar("Abraham Lincoln's beard")
## [1] 23
nchar(c("Abraham", "Lincoln's", "beard"))
## [1] 7 9 5

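We can use print() to display a string as R represents it (with quotes and an index), while cat() writes the contents of the string directly to the console. If you’re debugging, message() is generally preferred, because it writes to standard error rather than standard output. Notice that cat() separates the elements of a vector with spaces, but message() does not.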

presidents = c("Fillmore","Pierce","Buchanan","Davis","Johnson")

print("Abraham Lincoln")
## [1] "Abraham Lincoln"
cat("Abraham Lincoln")
## Abraham Lincoln
cat(presidents)
## Fillmore Pierce Buchanan Davis Johnson
message(presidents)
## FillmorePierceBuchananDavisJohnson
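If the default spacing is not what you want, one option is to control it explicitly. The following is a minimal sketch using the sep argument of cat() and combining the vector with paste() before passing it to message().

```r
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

# cat() accepts a separator of our choosing
cat(presidents, sep = ", ")
cat("\n")                      # cat() does not add a trailing newline itself
cat(presidents, sep = "\n")    # one name per line
cat("\n")

# message() pastes its arguments with no separator,
# so join the elements first if spaces are wanted
message(paste(presidents, collapse = " "))
```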

16.2 Substring Operations

Substring: a smaller string from the big string, but still a string in its own right.

A string is not a vector or a list, so we cannot use subscripts like [[ ]] or [ ] to extract substrings; we use substr() instead.

phrase <- "Christmas Bonus"
substr(phrase, start=8, stop=12)
## [1] "as Bo"

We can also use substr to replace elements:

substr(phrase, 13, 13) = "g"
phrase
## [1] "Christmas Bogus"

The function substr() can also be used on vectors; it vectorizes over all its arguments:

presidents
## [1] "Fillmore" "Pierce"   "Buchanan" "Davis"    "Johnson"
substr(presidents,1,2)   # First two characters
## [1] "Fi" "Pi" "Bu" "Da" "Jo"
substr(presidents,nchar(presidents)-1,nchar(presidents))   # Last two
## [1] "re" "ce" "an" "is" "on"
substr(presidents,20,21)    # No such substrings so return the empty string
## [1] "" "" "" "" ""
substr(presidents,7,7)      # Explain!
## [1] "r" ""  "a" ""  "n"
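Because the start and stop arguments can themselves be vectors, the "last two characters" idiom above generalizes easily. The helper below, last_chars(), is a hypothetical name introduced only for this sketch; it is not part of base R.

```r
# Hypothetical helper: the last n characters of each string in x.
last_chars <- function(x, n) {
  len <- nchar(x)
  # pmax() keeps the start position at 1 or above for strings shorter than n
  substr(x, pmax(len - n + 1, 1), len)
}

presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")
last_chars(presidents, 3)
last_chars("Al", 3)    # shorter than n, so the whole string comes back
```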

16.3 Dividing Strings into Vectors

strsplit() divides a string according to key characters, by splitting each element of the character vector x at appearances of the pattern split.

scarborough.fair = "parsley, sage, rosemary, thyme"
strsplit(scarborough.fair, ",")
## [[1]]
## [1] "parsley"   " sage"     " rosemary" " thyme"
strsplit(scarborough.fair, ", ")
## [[1]]
## [1] "parsley"  "sage"     "rosemary" "thyme"

The pattern is recycled over the elements of the input vector:

strsplit(c(scarborough.fair, "Garfunkel, Oates", "Clement, McKenzie"), ", ")
## [[1]]
## [1] "parsley"  "sage"     "rosemary" "thyme"   
## 
## [[2]]
## [1] "Garfunkel" "Oates"    
## 
## [[3]]
## [1] "Clement"  "McKenzie"

Note that strsplit() returns a list of character vectors, with one list element per input string.
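Since the result is a list, it usually needs one more step before it can be used as an ordinary vector. The sketch below shows a few common ways to work with it; the object names duos and pieces are made up for the example.

```r
duos <- c("Garfunkel, Oates", "Clement, McKenzie")
pieces <- strsplit(duos, ", ")

pieces[[1]]                         # character vector for the first string
lengths(pieces)                     # number of pieces per input string
unlist(pieces)                      # flatten everything into one vector
sapply(pieces, function(p) p[1])    # first piece of each element

# Splitting on the empty string gives the individual characters
strsplit("Lincoln", "")[[1]]
```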

16.4 Converting Objects into Strings

Explicitly converting one variable type to another is called casting. Notice below that the number 7.2e12 is printed as supplied, but 7.2e5 is not. This is because R by default uses scientific notation when a number is exceedingly large or very close to zero, and fixed notation otherwise.

as.character(7.2)            # Obvious
## [1] "7.2"
as.character(7.2e12)         # Obvious
## [1] "7.2e+12"
as.character(c(7.2,7.2e12))  # Obvious
## [1] "7.2"     "7.2e+12"
as.character(7.2e5)          # Not quite so obvious
## [1] "720000"
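When the default choice of notation is not what you want, format() offers more control than as.character(). This is a minimal sketch; the variable name big is made up for the example.

```r
big <- 7.2e12

as.character(big)                   # scientific notation by default
format(big, scientific = FALSE)     # force fixed notation
format(7.2e5, scientific = TRUE)    # force scientific notation

# Casting also works in the other direction
as.numeric("7.2e+12")
```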

16.5 Versatility of the paste() Function

The paste() function is very flexible. With one vector argument, it works like as.character().

paste(41:45)
## [1] "41" "42" "43" "44" "45"

With 2 or more vector arguments, it combines them with recycling.

paste(presidents,41:45)
## [1] "Fillmore 41" "Pierce 42"   "Buchanan 43" "Davis 44"    "Johnson 45"
paste(presidents,c("R","D"))  # Not historically accurate!
## [1] "Fillmore R" "Pierce D"   "Buchanan R" "Davis D"    "Johnson R"
paste(presidents,"(",c("R","D"),41:45,")")
## [1] "Fillmore ( R 41 )" "Pierce ( D 42 )"   "Buchanan ( R 43 )"
## [4] "Davis ( D 44 )"    "Johnson ( R 45 )"

We can change the separator between pasted-together terms with the sep argument.

paste(presidents, " (", 41:45, ")", sep="_")
## [1] "Fillmore_ (_41_)" "Pierce_ (_42_)"   "Buchanan_ (_43_)" "Davis_ (_44_)"   
## [5] "Johnson_ (_45_)"
paste(presidents, " (", 41:45, ")", sep="")
## [1] "Fillmore (41)" "Pierce (42)"   "Buchanan (43)" "Davis (44)"   
## [5] "Johnson (45)"

We can also condense multiple strings together using the collapse argument.

paste(presidents, " (", 41:45, ")", sep="", collapse="; ")
## [1] "Fillmore (41); Pierce (42); Buchanan (43); Davis (44); Johnson (45)"

The default value of collapse is NULL, which means the results are not collapsed and paste() returns one string per element.
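Two related idioms are worth knowing. Base R provides paste0() as a shorthand for paste(..., sep = ""), and collapse can be used on its own to join a single vector into one string. A short sketch:

```r
presidents <- c("Fillmore", "Pierce", "Buchanan", "Davis", "Johnson")

# paste0() is paste() with sep = ""
paste0(presidents, " (", 41:45, ")")

# collapse alone joins the elements of one vector into a single string
one_line <- paste(presidents, collapse = ", ")
one_line
length(one_line)    # a single string, not five
```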

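16.6 gsub, sub

The functions sub() and gsub() search for a pattern in a string and replace it: gsub() replaces all occurrences, while sub() replaces only the first occurrence.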
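A minimal sketch of the difference, using a made-up example string. Both functions take the pattern first, then the replacement, then the text. By default the pattern is a regular expression, so characters such as "." have a special meaning unless fixed = TRUE is supplied.

```r
herbs <- "parsley, sage, rosemary, thyme"

sub(",", ";", herbs)     # only the first comma is replaced
gsub(",", ";", herbs)    # every comma is replaced

# "." is a regular-expression wildcard that matches any character
gsub(".", "!", "a.b.c")                  # replaces all five characters
gsub(".", "!", "a.b.c", fixed = TRUE)    # replaces only the literal dots
```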

16.7 Text of Some Importance

Often we will want to study or analyze a block of text. Consider the following quote from Abraham Lincoln.

“If we shall suppose that American slavery is one of those offenses which, in the providence of God, must needs come, but which, having continued through His appointed time, He now wills to remove, and that He gives to both North and South this terrible war as the woe due to those by whom the offense came, shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away. Yet, if God wills that it continue until all the wealth piled by the bondsman’s two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said ‘the judgments of the Lord are true and righteous altogether’.”

This text is stored in a file, which we can read in with the following commands.

the_url <- "https://raw.githubusercontent.com/rpkgarcia/LearnRBook/main/data_sets/al1.txt"
al1 <- readLines(the_url, warn = FALSE)

# How many lines in the file 
length(al1)
## [1] 1
# See the object
al1
## [1] "If we shall suppose that American slavery is one of those offenses which, in the providence of God, must needs come, but which, having continued through His appointed time, He now wills to remove, and that He gives to both North and South this terrible war as the woe due to those by whom the offense came, shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away. Yet, if God wills that it continue until all the wealth piled by the bondsman’s two hundred and fifty years of unrequited toil shall be sunk, and until every drop of blood drawn with the lash shall be paid by another drawn with the sword, as was said three thousand years ago, so still it must be said “the judgments of the Lord are true and righteous altogether”."

Let’s create a new vector where each element is a portion of the text separated by a comma “,”.

al1.phrases <- strsplit(al1, ",")[[1]]
al1.phrases 
##  [1] "If we shall suppose that American slavery is one of those offenses which"                                                                         
##  [2] " in the providence of God"                                                                                                                        
##  [3] " must needs come"                                                                                                                                 
##  [4] " but which"                                                                                                                                       
##  [5] " having continued through His appointed time"                                                                                                     
##  [6] " He now wills to remove"                                                                                                                          
##  [7] " and that He gives to both North and South this terrible war as the woe due to those by whom the offense came"                                    
##  [8] " shall we discern therein any departure from those divine attributes which the believers in a living God always ascribe to Him? Fondly do we hope"
##  [9] " fervently do we pray"                                                                                                                            
## [10] " that this mighty scourge of war may speedily pass away. Yet"                                                                                     
## [11] " if God wills that it continue until all the wealth piled by the bondsman’s two hundred and fifty years of unrequited toil shall be sunk"         
## [12] " and until every drop of blood drawn with the lash shall be paid by another drawn with the sword"                                                 
## [13] " as was said three thousand years ago"                                                                                                            
## [14] " so still it must be said “the judgments of the Lord are true and righteous altogether”."
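Every phrase after the first begins with a space, because only the comma itself was removed. One way to tidy this up is trimws(), which strips leading and trailing whitespace. The sketch below uses a single sentence from the quote (stored in a made-up variable, verse) so that it runs without downloading the file.

```r
verse <- "Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away."

phrases <- strsplit(verse, ",")[[1]]
phrases             # the second and third phrases start with a space

trimws(phrases)     # leading and trailing whitespace removed

# Alternative: split on comma-plus-space, if every comma is followed by one space
strsplit(verse, ", ")[[1]]
```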

16.9 Word Count Tables

Now let’s break up the text by spaces, in the hope that this separates each word into its own element.

al1.words <- strsplit(al1, split=" ")[[1]]
head(al1.words)
## [1] "If"       "we"       "shall"    "suppose"  "that"     "American"

We can now tabulate how often each word appears using the table() function. Then we can sort the frequencies in order using sort().

wc <- table(al1.words)
wc <- sort(wc, decreasing=TRUE)
head(wc, 250)
## al1.words
##          the           of          and        shall         that           to 
##            9            6            5            4            4            4 
##           we           be           by        those           as           do 
##            4            3            3            3            2            2 
##        drawn          God           He           in           it         must 
##            2            2            2            2            2            2 
##         said         this        until          war       which,        wills 
##            2            2            2            2            2            2 
##         with        years         “the            a         ago,          all 
##            2            2            1            1            1            1 
## altogether”.       always     American      another          any    appointed 
##            1            1            1            1            1            1 
##          are      ascribe   attributes        away.    believers        blood 
##            1            1            1            1            1            1 
##   bondsman’s         both          but        came,        come,     continue 
##            1            1            1            1            1            1 
##    continued    departure      discern       divine         drop          due 
##            1            1            1            1            1            1 
##        every    fervently        fifty       Fondly         from        gives 
##            1            1            1            1            1            1 
##         God,       having         Him?          His        hope,      hundred 
##            1            1            1            1            1            1 
##           if           If           is    judgments         lash       living 
##            1            1            1            1            1            1 
##         Lord          may       mighty        needs        North          now 
##            1            1            1            1            1            1 
##      offense     offenses          one         paid         pass        piled 
##            1            1            1            1            1            1 
##        pray,   providence      remove,    righteous      scourge      slavery 
##            1            1            1            1            1            1 
##           so        South     speedily        still        sunk,      suppose 
##            1            1            1            1            1            1 
##       sword,     terrible      therein     thousand        three      through 
##            1            1            1            1            1            1 
##        time,         toil         true          two   unrequited          was 
##            1            1            1            1            1            1 
##       wealth        which         whom          woe         Yet, 
##            1            1            1            1            1

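Notice that punctuation is still attached to the words when we split this way. For example, “which,” and “which” are counted as different words, as are “God” and “God,”.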

# These are different
wc["He"] # exists
## He 
##  2
wc["he"] # does not exist
## <NA> 
##   NA

In addition, all our words and string subsets are case sensitive.

# What happens when we look for a word that is not in our 
# word count table? 

which(names(wc)  == "That")
## integer(0)
wc["That"]
## <NA> 
##   NA
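One possible way to address both issues is to strip punctuation with gsub() and convert everything to lower case with tolower() before tabulating. This is a sketch under simplifying assumptions, not the only approach: it uses one plain-ASCII sentence from the quote (in a made-up variable, verse), and the [[:punct:]] character class may treat non-ASCII symbols such as curly quotes differently depending on the platform.

```r
verse <- "Fondly do we hope, fervently do we pray, that this mighty scourge of war may speedily pass away."
words <- strsplit(verse, split = " ")[[1]]

# Remove punctuation, then lower-case every word
clean <- tolower(gsub("[[:punct:]]", "", words))

wc_clean <- sort(table(clean), decreasing = TRUE)
head(wc_clean, 3)

wc_clean["we"]                       # counted regardless of punctuation or case
"hope" %in% names(wc_clean)          # "hope," is now just "hope"
"slavery" %in% names(wc_clean)       # a safe way to test for an absent word
```

Using %in% on the names of the table returns TRUE or FALSE, which is often easier to work with than the NA returned by indexing with a missing name.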