Chapter 18 Regular Expressions

With basic string-manipulation functions, we saw how to do things like split up entries in a data file which are separated by commas.

text = "One Fish, Two Fish, Red Fish, Blue Fish"
strsplit(text, split = ",")
## [[1]]
## [1] "One Fish"   " Two Fish"  " Red Fish"  " Blue Fish"

or by single spaces

strsplit(text, split = " ")
## [[1]]
## [1] "One"   "Fish," "Two"   "Fish," "Red"   "Fish," "Blue"  "Fish"

or even a comma followed by a space

strsplit(text, split = ", ")
## [[1]]
## [1] "One Fish"  "Two Fish"  "Red Fish"  "Blue Fish"

But we don’t know how to deal with situations like splitting on a comma, optionally followed by some number of spaces.

Not only is it annoying to have such a simple thing defeat us, it’s an instance of a much broader class of problems. If we’re trying to extract data from webpages, we may want to get rid of all the formatting instructions buried in the source of the webpage. We might want to extract all the personal names from a document which are preceded by titles (such as Mr., Ms., Miss, Dr.), without knowing what those names are, or how long they are. And so forth.

We need a language for telling R about patterns of strings. The most basic such language is that of regular expressions. Regular expressions are used to match sets of strings. Start with string constants, and build up by allowing “this and then that”, “either this or that”, “repeat this”, etc. These rules get expressed in a grammar, with special symbols.

Every regular expression is a sequence of symbols that specifies a set of text strings: those that follow a certain pattern and are said to match the regular expression. Regular expressions are themselves strings, so a regexp can be stored in a character variable. This means that regexps can be built up and changed using string-manipulating functions. A valid regular expression must conform to certain rules of grammar; the computer interprets it as a rule for matching certain strings, but not others.

We say “regexp” as shorthand for the phrase “regular expression as used in R”. Any string made up only of ordinary characters, such as letters and digits, is a valid regexp that matches itself wherever it occurs. For example, fly matches the end of “fruitfly” and of “why walk when you can fly”, but it does not match “time flies like an arrow; fruit flies like a banana; a banana flies poorly”.

fly_phrases = c("fruitfly", "why walk when you can fly", "time flies like an arrow; fruit flies like a banana; a banana flies poorly",
    "superman flew")

# Returns which elements have the term 'fly'
grep("fly", fly_phrases, value = T)
## [1] "fruitfly"                  "why walk when you can fly"
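Because a regexp is an ordinary character string, it can be stored in a variable and assembled with string functions. The following is a minimal sketch; the variable names stem, ending, fly_pattern and phrases are made up for illustration.

```r
# Hypothetical pieces of a pattern, stored as ordinary strings
stem <- "fl"
ending <- "y"

# paste0() glues the pieces together into the regexp "fly"
fly_pattern <- paste0(stem, ending)

phrases <- c("fruitfly", "why walk when you can fly", "superman flew")

# The stored pattern is used exactly like a pattern typed in directly
matched <- grep(fly_pattern, phrases, value = TRUE)
matched
## [1] "fruitfly"                  "why walk when you can fly"
```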

If we wish to match one regexp OR another regexp, we can use the “|” symbol. This symbol is used in the same way it was in section 1.4.

# Returns which elements have the term 'fly' or 'flies'
grep("fly|flies", fly_phrases, value = T)
## [1] "fruitfly"                                                                  
## [2] "why walk when you can fly"                                                 
## [3] "time flies like an arrow; fruit flies like a banana; a banana flies poorly"
# The order of the alternatives does not change the result
grep("flies|fly", fly_phrases, value = T)
## [1] "fruitfly"                                                                  
## [2] "why walk when you can fly"                                                 
## [3] "time flies like an arrow; fruit flies like a banana; a banana flies poorly"
# Returns which elements have 'time' or 'fruit fly' or
# 'flies'
grep("time|fruit fly|flies", fly_phrases, value = T)
## [1] "time flies like an arrow; fruit flies like a banana; a banana flies poorly"

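Parentheses are used to create groups. A group limits the reach of “|” to the alternatives inside the parentheses, so gr(e|a)y means “gr”, then either “e” or “a”, then “y”.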

text_colors = c("Some people say grey", "others say gray,", "and both are greatly used")

# Searches for the same thing
grep("gr(e|a)y", text_colors, value = T)
## [1] "Some people say grey" "others say gray,"
grep("grey|gray", text_colors, value = T)
## [1] "Some people say grey" "others say gray,"
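To see why the parentheses matter, it may help to compare the grouped pattern with an ungrouped one. This is a small sketch using the same three phrases; the names with_group and without_group are illustrative only.

```r
text_colors <- c("Some people say grey", "others say gray,",
    "and both are greatly used")

# With the group: 'gr', then 'e' or 'a', then 'y'
with_group <- grepl("gr(e|a)y", text_colors)
with_group
## [1]  TRUE  TRUE FALSE

# Without the group, | splits the whole pattern in two:
# either 'gre' or 'ay', so 'greatly' now matches as well
without_group <- grepl("gre|ay", text_colors)
without_group
## [1] TRUE TRUE TRUE
```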

18.1 Ranges, Escaping

A character class is a list of characters enclosed between [ and ] which matches any single character in that list. We also use these square brackets to indicate character ranges: [0-9] is shorthand for [0123456789], and either regexp matches any single digit.

text = c("1 Fish", "2 Fish", "red fish", "blue fish")

# Any numbers?
grep("[0123456789]", text, value = T)
## [1] "1 Fish" "2 Fish"

If the first character of the list is the caret ^, then it matches any character not in the list. For example, [^abc] matches anything except the characters a, b or c.

# Any elements with at least one character other than a, b, c?
grep("[^abc]", text, value = T)
## [1] "1 Fish"    "2 Fish"    "red fish"  "blue fish"
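Every element above matches because each one contains at least one character that is not a, b or c. A negated class only fails on strings made up entirely of the excluded characters, as this small sketch with made-up strings suggests.

```r
# Hypothetical test strings
only_abc <- c("abc", "cab", "abcd")

# TRUE only where some character falls outside a, b, c
has_other <- grepl("[^abc]", only_abc)
has_other
## [1] FALSE FALSE  TRUE
```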

Another example: t[aeiou] matches any two-character sequence in which “t” is followed by a lowercase vowel.

text_baseball = c("Ohhhhh", "take me out", "to the ball game")

# Match any two-character sequence in which “t” is followed
# by a lowercase vowel.
grep("t[aeiou]", text_baseball, value = T)
## [1] "take me out"      "to the ball game"

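Character classes are used a lot, so certain named classes of characters are predefined. Some of them are listed below. A named class such as [:digit:] is only valid inside a bracket expression, which is why the examples that follow write [[:digit:]].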

  • [:lower:] : Lower-case letters, equivalent to [a-z]
  • [:upper:] : Upper-case letters, equivalent to [A-Z]
  • [:alpha:] : Alphabetic characters: [:lower:] and [:upper:].
  • [:digit:] : Digits: 0 1 2 3 4 5 6 7 8 9, equivalent to [0-9]
  • \d : Digit, equivalent to [[:digit:]]
  • \D : Non-digit
  • [:alnum:] : Alphanumeric characters: [:alpha:] and [:digit:].
  • [:blank:] : Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.
  • [:punct:] : Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
  • [:graph:] : Graphical characters: [:alnum:] and [:punct:].
  • [:space:] : Space characters: tab, newline, vertical tab, form feed, carriage return, space and possibly other locale-dependent characters.
  • \s : space
  • \S : not space
  • \w : Word characters, equivalent to [[:alnum:]_] or [A-Za-z0-9_].
  • \W : Non-word characters, equivalent to [^A-Za-z0-9_].

# Are any digits in the text?
grep("[[:digit:]]", text, value = T)
## [1] "1 Fish" "2 Fish"
# Are any upper case letters in the text?
grep("[[:upper:]]", text, value = T)
## [1] "1 Fish" "2 Fish"
# Are any lower case letters in the text?
grep("[[:lower:]]", text, value = T)
## [1] "1 Fish"    "2 Fish"    "red fish"  "blue fish"
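The explicit list, the range, the named class and the \d shorthand are different spellings of the same idea. The sketch below checks that they agree on the fish example; the variable names are chosen for illustration only.

```r
fish <- c("1 Fish", "2 Fish", "red fish", "blue fish")

by_list <- grepl("[0123456789]", fish)   # every digit spelled out
by_range <- grepl("[0-9]", fish)         # range inside brackets
by_name <- grepl("[[:digit:]]", fish)    # named class inside brackets
by_short <- grepl("\\d", fish)           # shorthand, backslash doubled in R

by_list
## [1]  TRUE  TRUE FALSE FALSE
identical(by_list, by_range) && identical(by_list, by_name) &&
    identical(by_list, by_short)
## [1] TRUE
```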

18.2 Metacharacters

The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. However, we also use some characters to mean different things in order to match more intricate patterns. The characters reserved for special meanings (and not simply as characters to match) are called metacharacters. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?.

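We already saw | and [] in action above. Now let's talk about the . symbol. This symbol matches any single character. To match a literal “.” we need to escape it with a backslash. Because the backslash is itself an escape character inside R strings, it has to be doubled, so the regexp \. is typed as "\\." in R code.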

text = c("There was a monster", "Superman flew over there.",
    "He made it in time.", "Yumm, pancakes!", "MMmm Mmmm they were good...")

# Matches every element that contains at least one character
grep(".", text, value = T)
## [1] "There was a monster"         "Superman flew over there."  
## [3] "He made it in time."         "Yumm, pancakes!"            
## [5] "MMmm Mmmm they were good..."
# Matches only '.'
grep("\\.", text, value = T)
## [1] "Superman flew over there."   "He made it in time."        
## [3] "MMmm Mmmm they were good..."
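The doubled backslash is not the only way to match a literal period. As a brief sketch (the vector sentences is made up here), a metacharacter can also be placed inside square brackets, or the whole pattern can be treated as plain text with the fixed argument of grepl().

```r
sentences <- c("There was a monster", "He made it in time.")

# The R string "\\." holds two characters: a backslash and a period
nchar("\\.")
## [1] 2

escaped <- grepl("\\.", sentences)             # escaped metacharacter
in_class <- grepl("[.]", sentences)            # period inside a character class
literal <- grepl(".", sentences, fixed = TRUE) # pattern treated as plain text

escaped
## [1] FALSE  TRUE
```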

18.3 Quantifiers

The following metacharacters are repetition quantifiers: they control how many times the preceding item must be matched.

  • + : The preceding item will be matched one or more times.
  • * : The preceding item will be matched zero or more times.
  • ? : The preceding item is optional and will be matched at most once.
  • {n} : The preceding item is matched exactly “n” times.
  • {n,} : The preceding item is matched “n” or more times.
  • {n,m} : The preceding item is matched at least “n” times, but not more than “m” times.

# Detect the letter 'm'
grep("m", text, value = T)
## [1] "There was a monster"         "Superman flew over there."  
## [3] "He made it in time."         "Yumm, pancakes!"            
## [5] "MMmm Mmmm they were good..."
# 'm{1}' is the same pattern as 'm', so any element
# containing an 'm' still matches
grep("m{1}", text, value = T)
## [1] "There was a monster"         "Superman flew over there."  
## [3] "He made it in time."         "Yumm, pancakes!"            
## [5] "MMmm Mmmm they were good..."

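When a quantifier {n}, {n,} or {n,m} is applied to an item, the repetitions must occur consecutively. For example, the following command does not output “He made it in time.”, despite this phrase having two “m”s in it, because they are not next to each other. Note also that grep() only asks whether the pattern matches somewhere in the element, so m{2} would still match inside a run of three or more “m”s.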

# Detect the letter 'm' two or more times in a row
grep("m{2,}", text, value = T)
## [1] "Yumm, pancakes!"             "MMmm Mmmm they were good..."
# Another example
text = c("3, 2, 1, GO!", "On the count of 3,", "Count to 100",
    "Wish upon 1 star.")

# Detect at least two digits in a row ('100' contains '10').
grep("[[:digit:]]{2}", text, value = T)
## [1] "Count to 100"
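The remaining quantifiers from the list can be compared side by side on a few short strings. This is a minimal sketch with made-up inputs; each quantifier applies only to the single item just before it.

```r
spellings <- c("color", "colour", "colouur")

# 'u?' makes the u optional: zero or one u
optional_u <- grepl("colou?r", spellings)
optional_u
## [1]  TRUE  TRUE FALSE

letters_abc <- c("ac", "abc", "abbbc", "adc")

# 'b*' allows zero or more b's between a and c
star <- grepl("ab*c", letters_abc)
star
## [1]  TRUE  TRUE  TRUE FALSE

# 'b+' requires at least one b between a and c
plus <- grepl("ab+c", letters_abc)
plus
## [1] FALSE  TRUE  TRUE FALSE
```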

By default, quantifiers are “greedy” and they match as many repetitions as they can. Following a quantifier by ? makes it match as few as possible.

# Here ? follows an item rather than a quantifier, so it
# means 'zero or one digit'. Zero digits is allowed, so
# every element matches.
grep("[[:digit:]]?", text, value = T)
## [1] "3, 2, 1, GO!"       "On the count of 3," "Count to 100"      
## [4] "Wish upon 1 star."
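Since grep() only reports whether a match exists, it cannot show the difference between greedy and minimal matching. One way to see it is to extract the matched text with regexpr() and regmatches(), both base R functions. The sketch below assumes the default regular expression engine; the variable names are illustrative.

```r
x <- "Count to 100"

# Greedy: '+' takes as many digits as it can
greedy <- regmatches(x, regexpr("[[:digit:]]+", x))
greedy
## [1] "100"

# Minimal: '+?' stops as soon as one digit has been matched
lazy <- regmatches(x, regexpr("[[:digit:]]+?", x))
lazy
## [1] "1"
```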

18.4 Anchoring

Anchoring is used to find regexps at specific locations in a string. For example, at the beginning or end of a word, or at the beginning or end of a phrase.

  • $ means a pattern can only match at the end of a line or string
  • ^ means (outside of square brackets) the beginning of a line or string
  • \< and \> anchor to the beginning and ending of words, respectively
  • \b anchors to a word boundary (beginning or ending of a word), \B to anywhere else

# matches elements ending in a lower-case letter or comma
grep("[a-z,]$", text, value = T)
## [1] "On the count of 3,"
# matches capital letters that are not at the beginning of
# a word
grep("\\B[A-Z]", text, value = T)
## [1] "3, 2, 1, GO!"
# Find elements that end with a number
grep("[[:digit:]]$", text, value = T)
## [1] "Count to 100"
# Find elements that start with a number
grep("^[[:digit:]]", text, value = T)
## [1] "3, 2, 1, GO!"
# Find elements that have a word that starts with u
grep("\\bu", text, value = T)
## [1] "Wish upon 1 star."
# Find elements that have a word that starts with t
grep("\\bt", text, value = T)
## [1] "On the count of 3," "Count to 100"
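The position of the anchor relative to the letter decides which end of the word is tested. The sketch below uses three made-up words to contrast the two placements, and also combines ^ and $ so that the whole string has to match.

```r
words <- c("to", "count", "star")

# Boundary before the t: the word starts with t
starts_t <- grepl("\\bt", words)
starts_t
## [1]  TRUE FALSE FALSE

# Boundary after the t: the word ends with t
ends_t <- grepl("t\\b", words)
ends_t
## [1] FALSE  TRUE FALSE

# \\< and \\> express the same two conditions
starts_t2 <- grepl("\\<t", words)
ends_t2 <- grepl("t\\>", words)

# ^ and $ together: the entire string consists of digits
all_digits <- grepl("^[[:digit:]]+$", c("100", "Count to 100", "3,"))
all_digits
## [1]  TRUE FALSE FALSE
```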

18.5 Splitting on a Regexp

We can use regexps in many R functions; we are not limited to just grep() and grepl(). For example, strsplit() will take a regexp as its split argument. This means we can split a string into new strings at each match of the regexp, just as we would if split were a fixed string.
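This also settles the problem from the start of the chapter: splitting on a comma that is optionally followed by some number of spaces. A minimal sketch, using a made-up string with uneven spacing after the commas:

```r
# Hypothetical input: zero, one, or several spaces after each comma
messy <- "One Fish,Two Fish,   Red Fish, Blue Fish"

# A comma followed by zero or more spaces
pieces <- strsplit(messy, split = ", *")[[1]]
pieces
## [1] "One Fish"  "Two Fish"  "Red Fish"  "Blue Fish"
```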

# Load data:
al2 = readLines("http://www.stat.cmu.edu/~cshalizi/statcomp/14/lectures/04/al2.txt")
al2 = paste(al2, collapse = " ")

# Split words by space
al2.words1 = strsplit(al2, split = " ")

# Let's see what this looks like
head(sort(table(al2.words1)))
## al2.words1
##       -    "the    "Woe absorbs  accept achieve 
##       1       1       1       1       1       1
# Split words by space or by a punctuation symbol
al2.words2 = strsplit(al2, split = "(\\s|[[:punct:]])+")[[1]]
head(sort(table(al2.words2)))
## al2.words2
## absorbs  accept achieve against  agents     aid 
##       1       1       1       1       1       1

Closer examination shows there’s still a problem: “men’s” is split into “men” and “s”. To handle possessives, split on one or more white spaces, or on at least one punctuation mark followed by at least one space.

al2.words3 = strsplit(al2, split = "\\s+|([[:punct:]]+[[:space:]]+)")[[1]]
head(sort(table(al2.words3)))
## al2.words3
##            "the    "Woe absorbs  accept achieve 
##       1       1       1       1       1       1
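The effect of the two split patterns is easier to see on a short string than on the full speech. The sentence below is made up for illustration and is not taken from the data file.

```r
# Hypothetical sentence containing a possessive
sentence <- "the men's work, done. Now"

# Splitting on any run of spaces or punctuation breaks up "men's"
split_any <- strsplit(sentence, split = "(\\s|[[:punct:]])+")[[1]]
split_any
## [1] "the"  "men"  "s"    "work" "done" "Now"

# Requiring a space after the punctuation keeps "men's" whole
split_poss <- strsplit(sentence, split = "\\s+|([[:punct:]]+[[:space:]]+)")[[1]]
split_poss
## [1] "the"   "men's" "work"  "done"  "Now"
```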

18.6 Stringr Package

The stringr package is one of the many packages that help with text data. This package also has several vignettes that explain the available functions and how the package works. It can be thought of as an extension of the standard base R string functions. The stringr package is part of the tidyverse, but its functions and uses are much more similar to base R than those of dplyr and ggplot2.

Two functions that are particularly helpful are str_extract() and str_extract_all(). These functions find and isolate particular occurrences of words and phrases: str_extract() returns the first match in each string, while str_extract_all() returns every match. When used with regular expressions this can be very beneficial.

library(tidyverse)

fly_phrases <- c("fruitfly", "why walk when you can fly", "time flies like an arrow; fruit flies like a banana; a banana flies poorly",
    "superman flew")

# Returns occurrences of a phrase starting with a space,
# followed by 'fl', and ending with one or more letters
# (upper or lower case)
fl_occurances <- str_extract_all(fly_phrases, pattern = " fl[[:alpha:]]{1,}")

# Remove white space
fl_occurances <- lapply(fl_occurances, gsub, pattern = "[[:space:]]",
    replacement = "")
fl_occurances
## [[1]]
## character(0)
## 
## [[2]]
## [1] "fly"
## 
## [[3]]
## [1] "flies" "flies" "flies"
## 
## [[4]]
## [1] "flew"
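An alternative that avoids the white-space clean-up step is to anchor the pattern with \b instead of a leading space. The following sketch assumes the stringr package is installed and uses a shortened version of fly_phrases; it also shows how str_extract() differs from str_extract_all().

```r
library(stringr)

fly_phrases <- c("fruitfly", "why walk when you can fly",
    "time flies; fruit flies; a banana flies poorly", "superman flew")

# \\b requires 'fl' to begin a word, so no space is captured
# and 'fruitfly' is still skipped
first_fl <- str_extract(fly_phrases, "\\bfl[[:alpha:]]+")
first_fl
## [1] NA      "fly"   "flies" "flew"

# str_extract_all() returns a list holding every match per element
all_fl <- str_extract_all(fly_phrases, "\\bfl[[:alpha:]]+")
all_fl[[3]]
## [1] "flies" "flies" "flies"
```

str_extract() returns NA where nothing matches, so its result always has the same length as the input.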

18.7 Wordcloud Package

Another package that is particularly useful for working with text data is the wordcloud package. This package extends the standard base R plotting, and allows us to build word clouds. It uses the same general principles and features that we use with base R.

Let's use the same Abraham Lincoln speech from the previous section. This time we will remove all punctuation, make everything lower case, and replace all whitespace with a single space for consistency.

library(wordcloud)

the_url <- "https://raw.githubusercontent.com/rpkgarcia/LearnRBook/main/data_sets/al1.txt"
al1 <- readLines(the_url, warn = FALSE)

# make everything lower case
al1 <- tolower(al1)

# remove punctuation
al1 <- gsub("[[:punct:]]", "", al1)

# replace any run of whitespace by a single space
al1 <- gsub("[[:space:]]+", " ", al1)

# see all individual words by splitting on white space;
# unlist() pools the words from every line into one vector
al1.words <- unlist(strsplit(al1, " "))

# Word count table
wc <- table(al1.words)
wc <- sort(wc, decreasing = T)

# Fixed issues from previous section
head(wc, 30)
## al1.words
##   the    of   and shall  that    to    we    be    by   god those which    as 
##    10     6     5     4     4     4     4     3     3     3     3     3     2 
##    do drawn    he    if    in    it  must  said  this until   war wills  with 
##     2     2     2     2     2     2     2     2     2     2     2     2     2 
## years     a   ago   all 
##     2     1     1     1
# The unique words in the table
names(wc)[1:6]
## [1] "the"   "of"    "and"   "shall" "that"  "to"
# Default plot (there is some randomness to how this plot
# is made, see Help file)
wordcloud(words = names(wc), freq = wc)

# Changing a few settings (Feel free to change more!! :)
# Have fun with it)
wordcloud(words = names(wc), freq = wc, min.freq = 2, random.order = F,
    colors = c("purple", "blue"))

18.8 Replacements

Assigning to regmatches() changes the matched part of a string, just as assigning to substr() does.

sub() and gsub() find matches the way regexpr() and gregexpr() do, but take an extra replacement argument: sub() replaces only the first match in each string, while gsub() replaces every match.

sub() produces a new string, whereas assigning to regmatches() modifies the original one.

Really, assigning to regmatches() creates a new string, destroys the old one, and assigns the new string the old name.
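The differences between these approaches can be sketched on a short string. The example below uses only base R; the strings and variable names are made up for illustration.

```r
x <- "One Fish, Two Fish"

# sub() replaces the first match only and returns a new string
first_only <- sub("Fish", "Cat", x)
first_only
## [1] "One Cat, Two Fish"

# gsub() replaces every match and also returns a new string
every_match <- gsub("Fish", "Cat", x)
every_match
## [1] "One Cat, Two Cat"

# x itself has not changed
x
## [1] "One Fish, Two Fish"

# Assigning to regmatches() changes the variable it is applied to
y <- x
m <- regexpr("Fish", y)   # location of the first match
regmatches(y, m) <- "Dog"
y
## [1] "One Dog, Two Fish"
```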