Chapter 9 *apply Family of Functions

9.1 Introduction to the *apply family

You may have seen loops (like for, and while) in another class, which are used to repeatedly execute some code. However, they are often slow in execution when it comes to processing large data sets in R.R has a more efficient and quick approach to perform iterations – **The *apply family**.

The *apply functions are used to vectorized functions in R and make them more efficient. The *apply functions are technically functionals. A functional in mathematics (and programming) is a function that accepts another function as an input. Below are the most common forms of apply functions.

  • apply()
  • lapply()
  • sapply()
  • tapply()
  • mapply()
  • replicate()

These functions apply a function to different components of a vector/list/dataframe/array in a non-sequential way. In general, if each element in your object is not dependent on the other elements of your object then an apply function is usually faster than a loop.

9.2 apply()

The apply()function is used to apply a function to the rows or columns of arrays (matrices). It assembles the returned values, and then returns it in a vector, array, or list.

If you want to apply a function on a data frame, make sure that the data frame is homogeneous (i.e. all columns have numeric values, all columns have character strings, etc). Otherwise, R will force all columns to have identical types using as.matrix(). This may not be what you want. In that case, you might consider using the lapply() or sapply() functions instead.

Description of the required apply() arguments:

  • X: A array (or matrix)
  • MARGIN: A vector giving the subscripts which the function will be applied over.
    • 1 indicates rows
    • 2 indicates columns
    • c(1, 2) indicates rows and columns
  • FUN: The function to be applied
  • ...: Additional arguments to be passed to “FUN”

Example: Built In Function

# Get column means
data <- matrix(1:9, nrow = 3, ncol = 3)
data
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
apply(data, 2, mean)
## [1] 2 5 8
# Get row means
apply(data, 1, sum)
## [1] 12 15 18

Example: User Defined Function

You can use user-defined functions as well.

apply(data, 2, function(x) {

    # Standard deviation formula
    y <- sum(x - mean(x))^2/(length(x) - 1)

    return(y)
})
## [1] 0 0 0

9.2.1 Returned Objects

The values that apply() returns depends on the function FUN. If FUN returns an element of length 1, then apply() will return a vector. If FUN always returns an element of length n>1, then apply() will return a matrix with n rows, and the number of columns will correspond to how many rows/columns were iterated over. Lastly, if FUN returns an object that would vary in length, then apply() will return a list where each element corresponds to a row or column that was iterated over. In short, apply() prioritizes returning a vector, array (or matrix), and list (in that order). What is returned depends on the output of FUN.

9.2.2 Example: Extra Arguments, Array Output

x <- cbind(x1 = 3, x2 = c(4:1, 2:5))

fun1 <- function(x, c1, c2) {
    mean_vec <- c(mean(x[c1]), mean(x[c2]))
    return(mean_vec)
}

apply(x, 1, fun1, c1 = "x1", c2 = c("x1", "x2"))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]  3.0    3  3.0    3  3.0    3  3.0    3
## [2,]  3.5    3  2.5    2  2.5    3  3.5    4

9.2.3 Example: List Output

mat <- matrix(c(-1, 1, 0, 2, -2, 20, 62, -2, -6), nrow = 3)

CheckPos <- function(Vec) {
    # Subset values of Vec that are positive
    PosVec <- Vec[Vec > 0]

    # Return only the even values
    return(PosVec)
}

# Check Positive values by column
apply(mat, 2, CheckPos)
## [[1]]
## [1] 1
## 
## [[2]]
## [1]  2 20
## 
## [[3]]
## [1] 62

9.3 lapply()

The lapply() function is used to apply a function to each element of the list. It collects the returned values into a list, and then returns that list which is of the same length.

Description of the required lapply() arguments:

  • X: A list
  • FUN: The function to be applied
  • ...: Additional arguments to be passed to FUN
data_lst <- list(item1 = 1:5, item2 = seq(4, 36, 8), item3 = c(1,
    3, 5, 7, 9))
data_lst
## $item1
## [1] 1 2 3 4 5
## 
## $item2
## [1]  4 12 20 28 36
## 
## $item3
## [1] 1 3 5 7 9
data_vector <- c(1, 2, 3, 4, 5, 6, 7, 8)
data_vector
## [1] 1 2 3 4 5 6 7 8
lapply(data_lst, sum)
## $item1
## [1] 15
## 
## $item2
## [1] 100
## 
## $item3
## [1] 25
lapply(data_vector, sum)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 8
x <- list(a = 1:10, beta = exp(-3:3), logic = c(TRUE, FALSE,
    FALSE, TRUE))

# compute the list mean for each list element
lapply(x, mean)
## $a
## [1] 5.5
## 
## $beta
## [1] 4.535125
## 
## $logic
## [1] 0.5

Unlike apply(), lapply() will always return a list. If the argument X is an object that is something other than a list then the as.list() function will be used to convert that object. Consider the built-in data set iris in R. If we use the as.list() function, each column will be converted into an element of a list.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(as.list(iris))
## List of 5
##  $ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Then if we use lapply() it will iterate over the columns. We can find all values within a column that are bigger than the column mean (just looking at the numeric columns though).

lapply(iris[, 1:4], function(column) {
    big_values <- column[column > mean(column)]
    return(big_values)
})
## $Sepal.Length
##  [1] 7.0 6.4 6.9 6.5 6.3 6.6 5.9 6.0 6.1 6.7 6.2 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7
## [20] 6.0 6.0 6.0 6.7 6.3 6.1 6.2 6.3 7.1 6.3 6.5 7.6 7.3 6.7 7.2 6.5 6.4 6.8 6.4
## [39] 6.5 7.7 7.7 6.0 6.9 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7
## [58] 6.3 6.4 6.0 6.9 6.7 6.9 6.8 6.7 6.7 6.3 6.5 6.2 5.9
## 
## $Sepal.Width
##  [1] 3.5 3.2 3.1 3.6 3.9 3.4 3.4 3.1 3.7 3.4 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6
## [20] 3.3 3.4 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1 3.2 3.5 3.6 3.4 3.5 3.2 3.5 3.8
## [39] 3.8 3.2 3.7 3.3 3.2 3.2 3.1 3.3 3.1 3.2 3.4 3.1 3.3 3.6 3.2 3.2 3.8 3.2 3.3
## [58] 3.2 3.8 3.4 3.1 3.1 3.1 3.1 3.2 3.3 3.4
## 
## $Petal.Length
##  [1] 4.7 4.5 4.9 4.0 4.6 4.5 4.7 4.6 3.9 4.2 4.0 4.7 4.4 4.5 4.1 4.5 3.9 4.8 4.0
## [20] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.8 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0 4.4 4.6 4.0
## [39] 4.2 4.2 4.2 4.3 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5.0
## [58] 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0 4.8 4.9 5.6 5.8 6.1 6.4 5.6
## [77] 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5.0 5.2 5.4 5.1
## 
## $Petal.Width
##  [1] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1.3 1.4 1.5 1.4 1.3 1.4 1.5 1.5 1.8 1.3 1.5 1.2
## [20] 1.3 1.4 1.4 1.7 1.5 1.2 1.6 1.5 1.6 1.5 1.3 1.3 1.3 1.2 1.4 1.2 1.3 1.2 1.3
## [39] 1.3 1.3 2.5 1.9 2.1 1.8 2.2 2.1 1.7 1.8 1.8 2.5 2.0 1.9 2.1 2.0 2.4 2.3 1.8
## [58] 2.2 2.3 1.5 2.3 2.0 2.0 1.8 2.1 1.8 1.8 1.8 2.1 1.6 1.9 2.0 2.2 1.5 1.4 2.3
## [77] 2.4 1.8 1.8 2.1 2.4 2.3 1.9 2.3 2.5 2.3 1.9 2.0 2.3 1.8

9.4 sapply()

The sapply() and lapply() are very similar. The main difference is that lapply() always returns a list, whereas sapply() tries to simplify the result into a vector or matrix. The output for sapply() is just like the output of apply(), it depends on the dimensions of the returned value for FUN.

  • If the return value is a list where every element is length 1, you get a vector.

  • If the return value is a list where every element is a vector of the same length (> 1), you get a matrix.

  • If the lengths vary, simplification is impossible and you get a list.

Description of the required sapply() arguments:

  • X: A list
  • FUN: The function to be applied
data_lst <- list(item1 = 1:5, item2 = seq(4, 36, 8), item3 = c(1,
    3, 5, 7, 9))
data_lst
## $item1
## [1] 1 2 3 4 5
## 
## $item2
## [1]  4 12 20 28 36
## 
## $item3
## [1] 1 3 5 7 9
sapply(data_lst, sum)
## item1 item2 item3 
##    15   100    25
lapply(data_lst, sum)
## $item1
## [1] 15
## 
## $item2
## [1] 100
## 
## $item3
## [1] 25

9.5 tapply()

The tapply() function breaks the data set up into groups and applies a function to each group.

Description of the required sapply() arguments:

  • X: A 1 dimensional object
  • INDEX: A grouping factor or a list of factors
  • FUN: The function to be applied
data <- data.frame(name = c("Amy", "Max", "Ray", "Kim", "Sam",
    "Eve", "Bob"), age = c(24, 22, 21, 23, 20, 24, 21), gender = factor(c("F",
    "M", "M", "F", "M", "F", "M")))

data
##   name age gender
## 1  Amy  24      F
## 2  Max  22      M
## 3  Ray  21      M
## 4  Kim  23      F
## 5  Sam  20      M
## 6  Eve  24      F
## 7  Bob  21      M
tapply(data$age, data$gender, min)
##  F  M 
## 23 20

9.6 mapply()

The mapply() function is a multivariate version of sapply(). It applies FUN to the first elements of each … argument, the second elements, the third elements, and so on.

Description of the required mapply() arguments:

  • FUN: The function to be applied
  • ...: Arguments to vectorize over (vectors or lists of strictly positive length, or all of zero length).
mapply(rep, times = 1:4, x = 4:1)
## [[1]]
## [1] 4
## 
## [[2]]
## [1] 3 3
## 
## [[3]]
## [1] 2 2 2
## 
## [[4]]
## [1] 1 1 1 1

9.7 replicate()

The replicate() function is a wrapper for sapply(). If we want to repeat an evaluation of an function call or an expression that does not require us to iterate through a data set or vector we can use replicate().

Description of the required replicate() arguments:

  • n: An integer containing the number of replications.
  • expr: The expression (or function call) to evaluate repeatedly.
replicate(n = 4, "Hello")
## [1] "Hello" "Hello" "Hello" "Hello"
replicate(n = 10, factorial(4))
##  [1] 24 24 24 24 24 24 24 24 24 24
replicate(n = 5, sample(c("red", "blue")))
##      [,1]   [,2]   [,3]   [,4]   [,5]  
## [1,] "red"  "red"  "blue" "red"  "blue"
## [2,] "blue" "blue" "red"  "blue" "red"

9.8 How to Pick a Method

It can be difficult at first to decide which of these apply function you may want to use. In general, we can use the flow chart below as a quick guide.

9.9 Vectorizing If-Statments

The ifelse() function is useful when we are vectorizing a simple if-else statment. When we have something more complicated we either need to do a combination of ifelse() functions that are nested together, which can be hard to read, or we need to consider another option.

Are second option is to use one of the *apply functions. There are many different types of apply functions, for now we will just focus on the most common one, the sapply() function.

Let us consider a more complicated example where we might want to consider an *apply function. In the below example we want to find each value of x that is an even positive number, odd positive number, and non-positive number

x <- c(-3, 0, 3)

if (x > 0) {
    if (x%%2 == 0) {
        type <- "even positive"
    } else {
        type <- "odd positive"
    }
} else {
    type <- "non-positive"
}
## Warning in if (x > 0) {: the condition has length > 1 and only the first
## element will be used

With the above code we have an warning message because if-statements can only accept logical vectors of length 1, and our expression is generating a logical vector of length greater than 1. To use sapply() to vectorized this operation we need to change this chain of if-statements into a function.

EvenOdd <- function(num) {
    if (num > 0) {
        if (num%%2 == 0) {
            type <- "even positive"
        } else {
            type <- "odd positive"
        }
    } else {
        type <- "non-positive"
    }
    return(type)
}

This will still generate an error message if we call EvenOdd(x) because our input is still a vector greater than 1. To use sapply() we can do the following:

sapply(x, EvenOdd)
## [1] "non-positive" "non-positive" "odd positive"

The first argument, labeled as X in the help file, is the object that contains the elements we want to apply the function to. The second argument, labeled as FUN in the help file, is the function that is applied to each element of X. We can make much more complicated if-statement changes and apply them to vectors using sapply().

If we wanted to vectorize a function that had multiple inputs like the Toyfun function seen in 8.3. we can do that with another *apply function, mapply(). This function iterates over mutltiple inputs simultaneously. Consider the following vectors

X_input <- c(1, 2, 3)
Y_input <- c(100, 50, 25)
Do_input <- c("Add", "Subtract", "Multiply")

Suppose we wish to add the first elements of X_input and Y_input, subtract the second elements X_input and Y_input, and multiply the last elements of X_input and Y_input. We can do this iteratively all at once with mapply(). If we look at the mapply() help file the first two arguments are FUN and .... The FUN argument is the function we wish to apply, and the ... represents the inputted values we want to iterative over. The vectors supplied in ... will all be called one at a time. That is, the first elements of all the vectors passed via ... will be called first, followed by the second elements, and so on.

mapply(Toyfun, X = X_input, Y = Y_input, Do = Do_input)
## [1] 101 -48  75

9.10 More Examples

To see some more examples of these functions in action. We will use the iris data set which is a built in data set in R. This data set has four numeric columns, and one factor column, Species. Each row is a flower, and there are four different measurements of each flower.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Find the maximum value for the numeric variables for each observation.

numeric_iris <- iris[, -5]
max_in_row <- apply(numeric_iris, 1, max)
head(max_in_row)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4

Determine the (arithmetic) mean of the sepal width for each species.

mean_species <- tapply(iris$Sepal.Width, INDEX = iris$Species,
    mean)
mean_species
##     setosa versicolor  virginica 
##      3.428      2.770      2.974

Determine the (arithmetic) mean and the median of sepal width for each species.

my_avgs <- function(vec) {
    the_mean <- mean(vec)
    the_median <- median(vec)
    return_object <- c(the_mean, the_median)
    names(return_object) <- c("mean", "median")
    return(return_object)
}

species_avgs <- tapply(iris$Sepal.Width, iris$Species, my_avgs)
species_avgs
## $setosa
##   mean median 
##  3.428  3.400 
## 
## $versicolor
##   mean median 
##   2.77   2.80 
## 
## $virginica
##   mean median 
##  2.974  3.000

Make a plot of the sepal width and sepal length. Make the points differ depending on the species type.

# Starting plot, make it blank
plot(iris$Sepal.Length, iris$Sepal.Width, col = "white")


# Custom function to add the points
add_points <- function(the_data, ...) {

    if (the_data[5] == "setosa") {
        points(x = the_data[1], y = the_data[2], col = "red",
            pch = 0)
    } else if (the_data[5] == "virginica") {
        points(x = the_data[1], y = the_data[2], col = "blue",
            pch = 2)
    } else {
        points(x = the_data[1], y = the_data[2], col = "green",
            pch = 10)

    }
}


# Use apply to add points
apply(iris, 1, add_points)

## NULL

Make a plot of the sepal width and sepal length. Make the points differ depending on the species type. Add the (arithmetic) mean of these two variables for each group.

# ------ PLOT FROM BEFORE Starting plot, make it blank
plot(iris$Sepal.Length, iris$Sepal.Width, col = "white")
apply(iris, 1, add_points)
## NULL
# ------


# Split the data into a list by factor
split_iris = split(iris, f = iris$Species)

# Iterate through the list and add (black) points to the
# plot
lapply(split_iris, function(species_data) {
    points(mean(species_data$Sepal.Length), mean(species_data$Sepal.Width),
        pch = 16)
})

## $setosa
## NULL
## 
## $versicolor
## NULL
## 
## $virginica
## NULL

Lets try using another example. Suppose we wish to use the following formula (below) with a = Sepal.Length, b = Sepal.Width, and c = Petal.Length. \[ \frac{-b + \sqrt{b^2-4ac} }{2a}\]

Now there is more efficient ways to do this in R, but lets practice how we would do it with mapply as an example.

my_formula <- function(a, b, c) {
    num <- (-b + sqrt(b^2 + 4 * a * c))
    den <- 2 * a

    answer <- num/den
    return(answer)
}

formula_results <- mapply(my_formula, a = iris$Sepal.Length,
    b = iris$Sepal.Width, c = iris$Petal.Length)
head(formula_results)
## [1] 0.2831638 0.3098526 0.2860609 0.3260870 0.2800000 0.3061340

9.10.1 Create a New Variable

petal_size <- function(PLength, BigCutOff, ModerateCutOff, SmallCutOff){
  if(PLength > BigCutOff){
    PetalSize <- "Big"
  } else if (PLength > ModerateCutOff){
    PetalSize <- "Moderate"
  } else if(PLength > SmallCutOff){
    PetalSize <- "Small"
  } else{
    PetalSize <- "Very Small"
  }
  return(PetalSize)
}

sapply(iris$Petal.Length,   # Data
       petal_size,          # Function
       BigCutOff = 5,       # Optional Arguments for function
       ModerateCutOff = 4, 
       SmallCutOff = 1.5)
##   [1] "Very Small" "Very Small" "Very Small" "Very Small" "Very Small"
##   [6] "Small"      "Very Small" "Very Small" "Very Small" "Very Small"
##  [11] "Very Small" "Small"      "Very Small" "Very Small" "Very Small"
##  [16] "Very Small" "Very Small" "Very Small" "Small"      "Very Small"
##  [21] "Small"      "Very Small" "Very Small" "Small"      "Small"     
##  [26] "Small"      "Small"      "Very Small" "Very Small" "Small"     
##  [31] "Small"      "Very Small" "Very Small" "Very Small" "Very Small"
##  [36] "Very Small" "Very Small" "Very Small" "Very Small" "Very Small"
##  [41] "Very Small" "Very Small" "Very Small" "Small"      "Small"     
##  [46] "Very Small" "Small"      "Very Small" "Very Small" "Very Small"
##  [51] "Moderate"   "Moderate"   "Moderate"   "Small"      "Moderate"  
##  [56] "Moderate"   "Moderate"   "Small"      "Moderate"   "Small"     
##  [61] "Small"      "Moderate"   "Small"      "Moderate"   "Small"     
##  [66] "Moderate"   "Moderate"   "Moderate"   "Moderate"   "Small"     
##  [71] "Moderate"   "Small"      "Moderate"   "Moderate"   "Moderate"  
##  [76] "Moderate"   "Moderate"   "Moderate"   "Moderate"   "Small"     
##  [81] "Small"      "Small"      "Small"      "Big"        "Moderate"  
##  [86] "Moderate"   "Moderate"   "Moderate"   "Moderate"   "Small"     
##  [91] "Moderate"   "Moderate"   "Small"      "Small"      "Moderate"  
##  [96] "Moderate"   "Moderate"   "Moderate"   "Small"      "Moderate"  
## [101] "Big"        "Big"        "Big"        "Big"        "Big"       
## [106] "Big"        "Moderate"   "Big"        "Big"        "Big"       
## [111] "Big"        "Big"        "Big"        "Moderate"   "Big"       
## [116] "Big"        "Big"        "Big"        "Big"        "Moderate"  
## [121] "Big"        "Moderate"   "Big"        "Moderate"   "Big"       
## [126] "Big"        "Moderate"   "Moderate"   "Big"        "Big"       
## [131] "Big"        "Big"        "Big"        "Big"        "Big"       
## [136] "Big"        "Big"        "Big"        "Moderate"   "Big"       
## [141] "Big"        "Big"        "Big"        "Big"        "Big"       
## [146] "Big"        "Moderate"   "Big"        "Big"        "Big"