[Intro to Computational Thinking] R Basics 2

May 12 2023

📝 Lesson Preview

Functions

If objects are the nouns of the R language, functions are the verbs. They do things to objects. Let’s use the mean function as an example. This function takes the mean of a numeric vector object. Remember our numeric_vector object from Lesson 1:

numeric_vector <- c(2.8, 2, 14.8)

numeric_vector
## [1]  2.8  2.0 14.8

To find the mean of this object, type:

mean(x = numeric_vector)
## [1] 6.533333

We use the assignment operator to place a function’s output into an object. For example:

numeric_vector_mean <- mean(x = numeric_vector)

Notice that we typed the function’s name then enclosed the object name in parentheses immediately afterwards. This is the basic syntax that all functions use, i.e. FUNCTION(ARGUMENTS). Even if you don’t want to explicitly include an argument, you still need to type the parentheses after the function.
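To make the point about parentheses concrete, here is a minimal sketch using Sys.Date(), a base R function that needs no arguments. The object name today is just an illustrative choice.

```r
# Calling a function with no arguments still needs the parentheses
today <- Sys.Date()
class(today)
## [1] "Date"

# Without the parentheses, R hands back the function object itself
# instead of running it
class(Sys.Date)
## [1] "function"
```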

Arguments

Arguments modify what functions do. In our most recent example, we gave the mean function one argument (x = numeric_vector) telling it that we wanted to find the mean of numeric_vector. Arguments use the ARGUMENT_LABEL = VALUE syntax. In this case, x is the argument label.

To find all of the arguments that a function can accept, look at the Arguments section of the function’s help file. To access the help file, type: ?FUNCTION. For example:

?mean

The help file will also tell you the default values that the arguments are set to. You do not need to explicitly set an argument if you want to use its default value.
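As a small illustration of default values, the sketch below relies on the digits argument of round(), which defaults to 0. The object names default_rounded and explicit_rounded are made up for this example.

```r
numeric_vector <- c(2.8, 2, 14.8)

# digits is left out, so round() uses its default value of 0
default_rounded <- round(x = mean(numeric_vector))
default_rounded
## [1] 7

# Setting digits = 0 explicitly gives the same result
explicit_rounded <- round(x = mean(numeric_vector), digits = 0)
explicit_rounded
## [1] 7

# args() is a quick way to see a function's arguments and defaults
# without opening the help file
args(round)
```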

You do need to be fairly precise with the syntax for your argument’s values. Values for logical arguments must be written as TRUE or FALSE. Arguments that accept character strings require quotation marks.
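Here is a short sketch of both kinds of values. It uses the na.rm argument of mean() (a logical) and the sep argument of paste() (a character string). The vector values_with_missing is invented for the example.

```r
values_with_missing <- c(2.8, NA, 14.8)

# Logical values are written in capitals and without quotation marks
mean_without_na <- mean(x = values_with_missing, na.rm = TRUE)
mean_without_na
## [1] 8.8

# Character strings need quotation marks
joined_text <- paste("R", "Basics", sep = "-")
joined_text
## [1] "R-Basics"
```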

Let’s see how to use multiple arguments with the round() function. This function rounds a vector of numbers. We can use the digits argument to specify how many decimal places we want the numbers rounded to. To round the object numeric_vector_mean to one decimal place, type:

round(x = numeric_vector_mean, digits = 1)
## [1] 6.5

Note that arguments are always separated by commas.

Some arguments do not need to be explicitly labeled. For example, we could write:

# Find mean of numeric_vector
mean(numeric_vector)
## [1] 6.533333

R will do its best to figure out what you want, matching unlabeled values to arguments by their position, and will only give up when it can’t. When it gives up, it generates an error message. However, to avoid any misunderstandings between yourself and R, it is good practice to label your argument values. This will also make your code easier for other people to read, which helps make your work more reproducible.
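The sketch below shows why labels help, using round() from earlier in the lesson. The object names are illustrative only. Unlabeled values are matched by position, so swapping them changes the answer without any error message.

```r
numeric_vector_mean <- mean(c(2.8, 2, 14.8))

# Unlabeled: values are matched to arguments by position (x, then digits)
by_position <- round(numeric_vector_mean, 1)
by_position
## [1] 6.5

# Labeled: the order no longer matters
by_label <- round(digits = 1, x = numeric_vector_mean)
by_label
## [1] 6.5

# Unlabeled and in the wrong order: R rounds the number 1 instead
swapped <- round(1, numeric_vector_mean)
swapped
## [1] 1
```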

You can stack functions inside of arguments. For example, have R find the mean of numeric_vector and round it to one decimal place:

round(mean(numeric_vector), digits = 1)
## [1] 6.5

Stacking functions inside of each other can create code that is difficult to read. An option that often produces more readable code is piping with the pipe operator (%>%), which you can access from the magrittr or dplyr packages. The basic idea behind the pipe is that the output of one function is set as the first argument of the next. For example, to find the mean of numeric_vector and then round it to one decimal place, use:

# Load magrittr package
library(magrittr)

# Find mean of numeric_vector and round to 1 decimal place
mean(numeric_vector) %>%
    round(digits = 1)
## [1] 6.5

The %>% pipe also allows you to pipe objects to arguments other than the first argument by inserting a . where you want the object to be piped to. For example:

TRUE %>% mean(1:10, na.rm = .)
## [1] 5.5

The value TRUE was piped to the na.rm argument by writing na.rm = . in the call.

From R version 4.1.0, there is a native pipe operator |>. It works similarly to the magrittr/dplyr pipe, but is not exactly the same.

mean(numeric_vector) |>
    round(digits = 1)
## [1] 6.5
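Two of the differences can be sketched as follows, assuming R 4.1.0 or later. The native pipe always needs a function call with parentheses on its right-hand side, and it does not understand the . placeholder. One workaround, shown here, is an anonymous function. Later R releases also added a _ placeholder, so check ?pipeOp for your version.

```r
numeric_vector <- c(2.8, 2, 14.8)

# The right-hand side must be a call, so the parentheses are required
native_mean <- numeric_vector |> mean()

# No . placeholder: wrap the call in an anonymous function to send the
# piped value to an argument other than the first
piped_flag <- TRUE |> (\(flag) mean(1:10, na.rm = flag))()
piped_flag
## [1] 5.5
```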

Creating functions

Functions are just objects. You can create your own functions with the function() function! Why would you create your own functions? If you ever find yourself repeating code with only minor variation, e.g. just changing the inputs, then it is time to create a function.

For example, imagine we want to repeatedly find vector means and then round them to one digit. We can create a new function called mean_rounder() to do this for us.

mean_rounder <- function(x) {
  mean(x) %>% 
    round(digits = 1)
}

mean_rounder(numeric_vector)
## [1] 6.5

Notice that the contents of the function are enclosed inside of curly brackets {}.
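Your own functions can also take several arguments and give them default values. The sketch below is a hypothetical variation called mean_rounder_digits() (not part of the lesson's code) that lets the caller choose the number of decimal places and falls back to 1.

```r
# Hypothetical variant of mean_rounder() with a digits argument
# that defaults to 1
mean_rounder_digits <- function(x, digits = 1) {
  round(mean(x), digits = digits)
}

numeric_vector <- c(2.8, 2, 14.8)

# Uses the default of one decimal place
one_digit <- mean_rounder_digits(numeric_vector)
one_digit
## [1] 6.5

# Overrides the default
three_digits <- mean_rounder_digits(numeric_vector, digits = 3)
three_digits
## [1] 6.533
```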

When you make a function, remember to document it. You want to be able to use this function over and over. If it isn’t well documented, this will be harder.

A common convention is to use @param to document function arguments:[1]

# @param x a numeric vector to find the mean of

mean_rounder <- function(x) {
  mean(x) %>% 
    round(digits = 1)
}

Notice that we started the @param line with a #. In R (and many other languages), # is the comment character. Everything that follows a # on a line is ignored by R.

Control flow

We can use conditional evaluation (like if statements) and loops to control the flow of our programs.

If statements

So far we have seen functions that always (try to) do the same thing for any inputs. There are many cases when you want to do different things depending on the type of input.

For example, imagine we want to create a new binary [0, 1] variable from a vector. We want a new vector with a 0 if the original value was less than 5 and 1 if it is greater than or equal to 5. We can use ifelse():

numeric_vector
## [1]  2.8  2.0 14.8
greater_than_5 <- ifelse(numeric_vector >= 5, 
                         yes = 1, 
                         no = 0) 

greater_than_5
## [1] 0 0 1

The way we read this is if the value is greater than or equal to 5 (yes) then the new value is 1, if not (no) then it is 0.

Note that >= is one of R’s logical operators. See the additional material for a table of common logical operators in R.

For more complex if statements, use if () {. . .} else {. . .}.
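Here is a minimal sketch of the if/else syntax. The function describe_mean() and its messages are invented for illustration. Unlike ifelse(), which works element by element, if () tests a single TRUE or FALSE value and then runs a whole block of code.

```r
# Hypothetical helper: returns a different message depending on the input
describe_mean <- function(x) {
  if (!is.numeric(x)) {
    "not numeric"
  } else if (mean(x) >= 5) {
    "mean is at least 5"
  } else {
    "mean is below 5"
  }
}

numeric_result <- describe_mean(c(2.8, 2, 14.8))
numeric_result
## [1] "mean is at least 5"

character_result <- describe_mean(c("a", "b"))
character_result
## [1] "not numeric"
```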

for loops

Sometimes you want to do the same thing to each part of an object. We already saw an example with ifelse(), which ran an if statement on each element of numeric_vector.

A general solution to iterating through elements of an object is the for() loop function. Here is an example using R’s built-in swiss data set:

head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

For each column in the data set, we’ll take the mean, round it to 1 digit, and then print the results with R’s cat() (concatenate) function.

for (i in 1:ncol(swiss)) {
  swiss[, i] %>%
    mean() %>%
    round(digits = 1) %>%
    paste(names(swiss)[i], ., "\n") %>%  # the . directs the pipe
    cat()
}
## Fertility 70.1 
## Agriculture 50.7 
## Examination 16.5 
## Education 11 
## Catholic 41.1 
## Infant.Mortality 19.9

The first line reads: “for each i in the vector 1 through the number of columns in the swiss data frame . . .”

We then use i in the subscripts [] to select each column of swiss to find and print its mean.
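If you want to keep the results rather than print them, one common pattern is to create an empty container first and fill it inside the loop. The sketch below assumes base R only. The name column_means is an illustrative choice, and seq_along() is used as an alternative way of writing the vector of column positions.

```r
# Create a numeric vector with one slot per column, named after the columns
column_means <- numeric(ncol(swiss))
names(column_means) <- names(swiss)

# Fill each slot with the rounded mean of the matching column
for (i in seq_along(swiss)) {
  column_means[i] <- round(mean(swiss[, i]), digits = 1)
}

column_means
```

The values should match those printed by the loop above, e.g. 70.1 for Fertility.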

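FYI: the following alternative approach uses only base R inside the loop (although mean_rounder(), as defined above, still uses the pipe) and stores the results in a data frame instead of printing them. Because each new row is bound on top of the earlier ones with rbind(tmp, means_df), the rows end up in reverse column order: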

means_df <- data.frame()
swiss_names <- names(swiss)

for (i in 1:ncol(swiss)) {
  tmp <- data.frame(name = swiss_names[i], 
                    mean_value = mean_rounder(swiss[, i]))
  means_df <- rbind(tmp, means_df)
  
}
means_df
##               name mean_value
## 1 Infant.Mortality       19.9
## 2         Catholic       41.1
## 3        Education       11.0
## 4      Examination       16.5
## 5      Agriculture       50.7
## 6        Fertility       70.1

Apply

Loops in R are easy-ish to read, but can be slow for many iterations. The apply family of functions does something similar, is usually more compact, and can be faster.

For example, let’s find the mean of each column in swiss:

means_apply <- apply(swiss, MARGIN = 2, mean_rounder)
as.data.frame(means_apply)
##                  means_apply
## Fertility               70.1
## Agriculture             50.7
## Examination             16.5
## Education               11.0
## Catholic                41.1
## Infant.Mortality        19.9

For data frames and matrices, we need to specify whether we are applying the function to each row (MARGIN = 1) or each column (MARGIN = 2).
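To see the difference between the two MARGIN values, here is a small sketch on a made-up 2 x 3 matrix called small_matrix.

```r
# matrix() fills column by column, so the rows are (1, 3, 5) and (2, 4, 6)
small_matrix <- matrix(1:6, nrow = 2)

# MARGIN = 1: apply the function to each row
row_means <- apply(small_matrix, MARGIN = 1, FUN = mean)
row_means
## [1] 3 4

# MARGIN = 2: apply the function to each column
col_means <- apply(small_matrix, MARGIN = 2, FUN = mean)
col_means
## [1] 1.5 3.5 5.5
```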

Advanced: Parallelisation

By default, all computation in R is done sequentially, one step after the other. However, your computer likely has multiple processor cores and so can run computations in “parallel”.

To find the number of cores on your computer use:

library(parallel)
detectCores()
## [1] 8

Not all computation can be done in parallel. Some things have to be done in sequence, because the output of one operation is needed for the computation of the subsequent operations. However, pretty much anything that can be looped or applied can be done in parallel. This will often improve the speed of the computation.[2]

mclapply() (for macOS and Linux) and parLapply() (which also works on Windows), both from base R’s parallel package, work like lapply() (list apply) for lists. Imagine we have a list with 3 numeric vectors:

three_list <- list(A = rnorm(n = 100, mean = 1),
                   B = rnorm(n = 100, mean = 2),
                   C = rnorm(n = 100, mean = 5))

Note that rnorm() simulates random draws from the normal distribution, so your values will differ slightly from the ones shown below.

Let’s compute the mean of each vector in the list in parallel:

mclapply(three_list, mean, mc.cores = 3)
## $A
## [1] 0.9701775
## 
## $B
## [1] 2.08874
## 
## $C
## [1] 5.052912

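mc.cores sets the number of cores to use. A common rule of thumb is to use n minus 1, where n is the number of cores you have available, so that one core stays free for the rest of your system.

For Windows, we need to use parLapply(). First create a cluster with cl <- makeCluster(2) (for 2 worker processes), then run the function with the cluster as its first argument, e.g. parLapply(cl, three_list, mean). When you are finished, shut the cluster down with stopCluster(cl).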
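Put together, a minimal parLapply() sketch might look like the following. It should also run on macOS and Linux. To keep the result predictable, it uses a small made-up list of fixed numbers instead of rnorm(), and the choice of 2 workers is arbitrary.

```r
library(parallel)

# Fixed example data so that the result is the same on every run
three_list <- list(A = c(1, 2, 3),
                   B = c(4, 5, 6),
                   C = c(7, 8, 9))

# Start a cluster with 2 worker processes
cl <- makeCluster(2)

# Compute the mean of each list element on the cluster
parallel_means <- parLapply(cl, three_list, mean)

# Always shut the workers down when you are finished
stopCluster(cl)

parallel_means$A
## [1] 2
```

The result is an ordinary list, the same as lapply(three_list, mean) would return.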

🥅 Exercises

  • Create a new function that summarises numeric vectors using a method of your choosing other than mean() (explore R’s built-in statistical capabilities).

  • Create a data frame with numeric and character columns. Then, for each column find the standard deviation (sd()) if it is numeric and return the message() "This column has characters" if it is a character type column.

  • nchar() counts the number of characters in a string. Create a list of character vectors with strings of different lengths. In parallel, count the number of characters in each vector.

Additional Materials

(Some) Logical Operators

Logical Operator    Meaning
&                   and
|                   or
==                  equality
<                   less than
>                   greater than
>=                  greater than or equal to
<=                  less than or equal to
!=                  not equal
isTRUE(x)           test if x is TRUE
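A few of these operators in action, sketched on the numeric_vector object from the lesson (the result names are illustrative):

```r
numeric_vector <- c(2.8, 2, 14.8)

# and: both conditions must hold for an element
between_2_and_10 <- numeric_vector > 2 & numeric_vector < 10
between_2_and_10
## [1]  TRUE FALSE FALSE

# or: at least one condition must hold
two_or_max <- numeric_vector == 2 | numeric_vector >= 14.8
two_or_max
## [1] FALSE  TRUE  TRUE

# not equal
not_two <- numeric_vector != 2
not_two
## [1]  TRUE FALSE  TRUE

# isTRUE() checks for a single TRUE value
first_below_5 <- isTRUE(numeric_vector[1] < 5)
first_below_5
## [1] TRUE
```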

📚 References


  1. This stands for “parameter”, a more general programming term for “argument”. If you want to go deep on documenting functions, see the roxygen2 package.

  2. It won’t necessarily be faster, as there is “overhead” in breaking up the operation into parallel computations. Also, while the computation time can go down, the total energy usage will be roughly the same, because the same amount of computation is carried out overall.