R bootcamp

In this chapter, we will be exploring the basics of the R language. We will focus on topics which are normally taken for granted and never explained in basic data science courses, which generally immediately jump to data manipulation and plotting. I strongly believe that getting familiar with the fundamentals of R as a complete programming language from this “lower-level” perspective, although it might seem a little overwhelming at the beginning, will pay dividends over and over your scientific career.

When we get to data science work in later chapters, you will see that many things which otherwise remain quite obscure and magical boil down to a set of very simple principles and components.

This knowledge will make you much more confident in the results of your work, and make it much easier for you to debug issues and problems.

Finally, we call this chapter a “bootcamp” on purpose – we only have a limited amount of time to go through all of these topics. After all, the primary reason for the existence of this course is to make you competent researchers in computational population genomics, so the emphasis will be on practical applications and solving concrete data science issues. That said, if you ever want more information, I encourage you to take a look at the relevant chapter of the Advanced R textbook.

And now, open RStudio, create a new R script (File -> New file -> R Script), save it somewhere on your computer as r-bootcamp.R (File -> Save) and let’s get started!

Exercise 0: Getting help

Before we even get started, there’s one thing you should remember: R (and R packages) come with an absolutely stellar documentation and help system. What’s more, this documentation is standardized, always has the same format, and is always accessible in the same way. The primary way of interacting with it from inside R (and RStudio) is the ? operator. For instance, to get help about the hist() function (histograms), you can type ?hist in the R console. The documentation appears in the “Help” pane of your RStudio window.
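For the record, the main ways of pulling up documentation look like this (all three are part of base R):

```r
?hist            # help page for a function whose name you know
help("hist")     # exactly the same thing, written as a normal function call
??histogram      # fuzzy full-text search across all installed packages
```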

There are a couple of things to look for:

  1. On the top of the documentation page, you will always see a brief description of the arguments of each function. This is what you’ll be looking for most of the time (“How do I specify this or that? How do I modify the behavior of the function?”).

  2. On the bottom of the page are examples. These are small bits of code which often explain the behavior of some functionality in a very helpful way.

Whenever you’re lost or can’t remember some detail about some piece of R functionality, looking up ? documentation is always very helpful.

As a practice and to build a habit, whenever we introduce a new function in this course, use ?<name of the function> to open its documentation.

Exercise 1: Basic data types

Create the following variables in your R script and then evaluate this code in your R console:

Hint: I suggest you always write your code in a script in RStudio (click File -> New file -> R script). You can execute the line (or block) of code under cursor in the script window by pressing CTRL+Enter (on Windows or Linux) or CMD+Enter (on a Mac). For quick tests, feel free to type directly in the R console.

w1 <- 3.14
x1 <- 42
y1 <- "hello"
z1 <- TRUE

The <- operator can be read as “assign the value”. I.e., “assign the value 3.14 to a variable w1”.

w1
[1] 3.14
x1
[1] 42
y1
[1] "hello"
z1
[1] TRUE

What are the data “types” you get when you apply function typeof() on each of these variables?

typeof(w1)
[1] "double"
typeof(x1)
[1] "double"
typeof(y1)
[1] "character"
typeof(z1)
[1] "logical"

You can test whether or not a specific variable is of a specific type using functions such as is.numeric(), is.integer(), is.character(), is.logical(). See what results you get when you apply these functions on these four variables w1, x1, y1, z1. Pay close attention to the difference (or lack thereof?) between applying is.numeric() and is.integer() on variables containing “numbers”.

Note: This might seem incredibly boring and useless, but trust me. In your real work, you will be dealing with data frames (discussed below) with thousands of rows, sometimes millions. Being able to make sure that the values in your data-frame columns are of the expected type is something you will be doing often.

is.numeric(w1)
[1] TRUE
is.integer(w1)
[1] FALSE
is.numeric(x1)
[1] TRUE
is.integer(x1)
[1] FALSE
is.character(y1)
[1] TRUE
is.logical(z1)
[1] TRUE
is.numeric(z1)
[1] FALSE
is.integer(z1)
[1] FALSE

To summarize (and oversimplify a little bit), R allows variables to have several types of data, most importantly:

  • integers (such as 42L – note the L suffix; a plain 42 is stored as a numeric, as the is.integer() results above showed!)
  • numerics (such as 42.13)
  • characters (such as "text value")
  • logicals (TRUE or FALSE)

We will also encounter two types of “non-values”. We will not be discussing them in detail here, but they will be relevant later. For the time being, just remember that there are also:

  • undefined values represented by NULL
  • missing values represented by NA

What do you think is the practical difference between NULL and NA? In other words, when you encounter one or the other in the data, how would you interpret this?
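If you would like to check your intuition against R’s actual behavior, here is a minimal sketch of how the two differ:

```r
# NULL is "no value at all": it has zero length and silently
# disappears when placed into a vector
length(NULL)        # 0
c(1, NULL, 3)       # just c(1, 3) -- the NULL is dropped

# NA is "a value exists here, but it is unknown/missing":
# it occupies a slot like any other element
length(NA)          # 1
c(1, NA, 3)         # three elements: 1 NA 3
is.na(c(1, NA, 3))  # FALSE TRUE FALSE
```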

Exercise 2: Vectors

Vectors are, roughly speaking, collections of values. We create a vector by calling the c() function (the “c” stands for “combine”, i.e. joining values together).

Create the following variables containing these vectors. Then inspect their data types using typeof() again.

w2 <- c(1.0, 2.72, 3.14)
x2 <- c(1, 13, 42)
y2 <- c("hello", "folks", "!")
z2 <- c(TRUE, FALSE)
typeof(w2)
[1] "double"
typeof(x2)
[1] "double"
typeof(y2)
[1] "character"
typeof(z2)
[1] "logical"

We can use the function is.vector() to test that a given object really is a vector. Try this on your vector variables.

is.vector(w2)
[1] TRUE
is.vector(x2)
[1] TRUE
is.vector(y2)
[1] TRUE
is.vector(z2)
[1] TRUE

What happens when you call is.vector() on the variables x1, y1, etc. from the previous Exercise (i.e., those which contain single values)?

is.vector(42)
[1] TRUE

Yes, even scalars (i.e., singular values) are formally vectors!

This is why we see the [1] index when we type a single number:

1
[1] 1

In fact, even when we explicitly create a vector of length 1, the result is indistinguishable from the scalar itself:

c(1)
[1] 1

The conclusion is, R doesn’t actually distinguish between scalars and vectors! A scalar (a single value) is simply a vector of length 1. Think of it this way: in a strange mathematically-focused way, even a single tree is a forest. 🙃
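We can even ask R directly whether a “scalar” and a length-1 vector are distinguishable:

```r
identical(42, c(42))  # TRUE -- there is literally no difference
length(42)            # 1 -- even a "single number" has a length
is.vector("hello")    # TRUE -- the same holds for a single string
```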


Do elements of vectors need to be homogeneous (i.e., of the same data type)? Try creating a vector with values 1, "42", and "hello" using the c() function again. Can you do it? What happens when you try? Inspect the result in the R console (take a close look at how the result is presented in text and the quotes that you will see), or use the typeof() function again.

mixed_vector <- c(1, "42", "hello")
mixed_vector
[1] "1"     "42"    "hello"
typeof(mixed_vector)
[1] "character"

If vectors are not created with values of the same type, they are converted by a cascade of so-called “coercions”. A vector defined with a mixture of different values (i.e., the four atomic types we discussed in the first Exercise) will be coerced to only one of those types, following certain rules.

Try to figure out some of these coercion rules. Make a couple of vectors with mixed values of different types using the function c(), and observe what type of vector you get in return.

Hint: Try creating a vector which has integers and strings, integers and floats, integers and logicals, floats and logicals, floats and strings, and logicals and strings. Observe the format of the result that you get, and build your intuition by calling typeof() on each vector object to verify this.

v1 <- c(1, "42", "hello")
v1
[1] "1"     "42"    "hello"
typeof(v1)
[1] "character"
v2 <- c(1, 42.13, 123)
v2
[1]   1.00  42.13 123.00
typeof(v2)
[1] "double"
v3 <- c(1, 42, TRUE)
v3
[1]  1 42  1
typeof(v3)
[1] "double"
v4 <- c(1.12, 42.13, FALSE)
v4
[1]  1.12 42.13  0.00
typeof(v4)
[1] "double"
v5 <- c(42.13, "hello")
v5
[1] "42.13" "hello"
typeof(v5)
[1] "character"
v6 <- c(TRUE, "hello")
v6
[1] "TRUE"  "hello"
typeof(v6)
[1] "character"

Out of all these data type explorations, this Exercise is probably the most crucial for any kind of data science work. Why is that? Think about what can happen when someone does manual data entry in Excel.

Imagine what kinds of trouble can happen if you just load table data from somewhere and the values are not properly formatted. For instance, a “numeric” column of your table may accidentally contain some characters (which can very easily happen when manually entering data in Excel, etc.). This will be much clearer when we get to data frames below.
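As a small preview of that trouble, here is a sketch (with made-up values) of what happens when a supposedly numeric column arrives with a stray piece of text in it:

```r
# a column that was supposed to be numeric, but someone typed "N/A" in Excel
raw_column <- c("15.1", "48.8", "N/A", "36.5")

# thanks to coercion, the whole vector silently became a character vector...
typeof(raw_column)      # "character"

# ...and converting it back to numbers turns the bad entry into NA
# (R emits a warning: "NAs introduced by coercion")
as.numeric(raw_column)  # 15.1 48.8 NA 36.5
```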


You can create vectors of consecutive values using several approaches. Try these options:

  1. Create a sequence of values from i to j as i:j. Create a vector of numbers 1:20

  2. Do the same using the function seq(). Read ?seq to find out what parameters you should specify (and how) to get the same result as the i:j shortcut.

  3. Modify the arguments given to seq() so that you create a vector of numbers from 20 to 1.

  4. Use the by = argument of seq() to create a vector of only the odd values between 1 and 20.

# 1
1:20
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# another option is this ("give me a sequence of length N")
seq_len(20)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 2
seq(from = 1, to = 20)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
# 3
seq(from = 20, to = 1)
 [1] 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
# 4
seq(1, 20, by = 2)
 [1]  1  3  5  7  9 11 13 15 17 19

This might look boring, but these functions are super useful to generate indices for data, adding indices as columns to tabular data, etc.


Another very useful built-in helper function (especially when we get to the iteration Exercise below) is seq_along(). What does it give you when you run it on this vector, for instance?

v <- c(1, "42", "hello", 3.1416)
seq_along(v)
[1] 1 2 3 4

This function allows you to quickly iterate over elements of a vector (or a list) using indices into that vector (or a list).
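One reason to prefer seq_along(v) over the seemingly equivalent 1:length(v) is how the two behave on an empty vector:

```r
v <- c("a", "b", "c")
seq_along(v)      # 1 2 3 -- same as 1:length(v) here

empty <- character(0)
seq_along(empty)  # integer(0) -- a loop over this runs zero times
1:length(empty)   # 1 0 -- oops! a loop over this would run twice
```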


Exercise 3: Lists

Lists are a little similar to vectors but very different in a couple of important respects. Remember how we tested what happens when we put different types of values in a vector (reminder: vectors must be “homogeneous” in terms of the data types of their elements!)? What happens when you create lists with different types of values using the code in the following chunk? Use typeof() on the resulting objects and compare your results to those you got on “mixed value” vectors above.

w3 <- list(1.0, "2.72", 3.14)
x3 <- list(1, 13, 42, "billion")
y3 <- list("hello", "folks", "!", 123, "wow a number follows again", 42)
z3 <- list(TRUE, FALSE, 13, "string")

When we type the list variable in the R console, we no longer see the “coercion” we observed for vectors (numbers remain numbers even though the list contains strings):

y3
[[1]]
[1] "hello"

[[2]]
[1] "folks"

[[3]]
[1] "!"

[[4]]
[1] 123

[[5]]
[1] "wow a number follows again"

[[6]]
[1] 42

Calling typeof() will (disappointingly) not tell us much about the data types of each individual element. Why is that?

typeof(y3)
[1] "list"

Try also a different function called str() (short for “structure”) and apply it on one of those lists. Which of typeof() and str() is more useful to inspect what kind of data is stored in a list, and why? (str() will be very useful when we get to data frames for – spoiler alert! – exactly this reason.)

str(y3)
List of 6
 $ : chr "hello"
 $ : chr "folks"
 $ : chr "!"
 $ : num 123
 $ : chr "wow a number follows again"
 $ : num 42

Use is.vector() and is.list() on one of the lists above (like w3 perhaps). Why do you get the result that you got? Then run both functions on one of the vectors you created above (like w2). What does this mean?

  • Let’s take this list:
w3
[[1]]
[1] 1

[[2]]
[1] "2.72"

[[3]]
[1] 3.14

Lists are vectors!

is.vector(w3)
[1] TRUE

Lists are lists (obviously!):

is.list(w3)
[1] TRUE
  • Now let’s take this vector:
w2
[1] 1.00 2.72 3.14

Vectors are not lists!

is.list(w2)
[1] FALSE

So:

  1. Every list is also a vector.
  2. But not every vector is a list.

This makes sense because lists can (but don’t have to) contain values of mixed types, whereas vectors must always be homogeneous.


Not only can lists contain arbitrary values of mixed types (the atomic data types from Exercise 1), they can also contain “non-atomic” data, such as other lists! In fact, you can, in principle, create lists of lists of lists of… lists!

Try creating a list() which, in addition to a couple of normal values (numbers, strings, doesn’t matter) also contains one or two other lists (we call them “nested”). Don’t think about this too much, just create something arbitrary to get a bit of practice. Save this in a variable called weird_list and type it back in your R console, just to see how R presents such data back to you. In the next Exercise, we will learn how to explore this type of data better.

Here’s an example of such “nested list”:

weird_list <- list(
  1,
  "two",
  list(
    "three",
    4,
    list(5, "six", 7)
  )
)

When we type it out in the R console, we see that R lays out the structure of this data with numerical indices (we’ll talk about indices below!) indicating the “depth” of each nested piece of data (either a plain number or character, or another list!).

weird_list
[[1]]
[1] 1

[[2]]
[1] "two"

[[3]]
[[3]][[1]]
[1] "three"

[[3]][[2]]
[1] 4

[[3]][[3]]
[[3]][[3]][[1]]
[1] 5

[[3]][[3]][[2]]
[1] "six"

[[3]][[3]][[3]]
[1] 7

Note: If you are confused (or even annoyed) about why we are even doing this, it will become much clearer in the later discussion of data frames and spatial data structures why putting lists into other lists enables a whole new level of data science work. Please bear with me for now! This is just laying the groundwork for some very cool things later down the line.

Exercise 4: Boolean expressions and conditionals

Let’s recap some basic Boolean logic. The following basic rules apply (take a look at the truth tables below for a bit of a high school refresher) for the “and”, “or”, and “negation” operations:

  1. The AND operator (represented by & in R, or often ∧ in math):

Both conditions must be TRUE for the expression to be TRUE.

  • TRUE & TRUE == TRUE
  • TRUE & FALSE == FALSE
  • FALSE & TRUE == FALSE
  • FALSE & FALSE == FALSE
  2. The OR operator (represented by | in R, or often ∨ in math):

At least one condition must be TRUE for the expression to be TRUE.

  • TRUE | TRUE == TRUE
  • TRUE | FALSE == TRUE
  • FALSE | TRUE == TRUE
  • FALSE | FALSE == FALSE
  3. The NOT operator (represented by ! in R, or often ¬ in math):

The opposite of the expression.

  • !TRUE == FALSE
  • !FALSE == TRUE
  4. Comparison operators == (“equal to”), != (“not equal to”), < / > (“less than / greater than”), and <= / >= (“less than or equal to / greater than or equal to”):

Comparing two things with any of these gives a TRUE or FALSE result.

Note: There are other operations and more complex rules, but we will be using these almost exclusively (plus, the more complex rules can be derived from these basic operations anyway).
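In fact, you can let R verify the whole truth table for you. The base function stopifnot() throws an error if any expression given to it is not TRUE, so if this block runs silently, all the rules above check out:

```r
stopifnot(
  (TRUE  & TRUE)  == TRUE,
  (TRUE  & FALSE) == FALSE,
  (FALSE & TRUE)  == FALSE,
  (FALSE & FALSE) == FALSE,

  (TRUE  | TRUE)  == TRUE,
  (TRUE  | FALSE) == TRUE,
  (FALSE | TRUE)  == TRUE,
  (FALSE | FALSE) == FALSE,

  (!TRUE)  == FALSE,  # parentheses around !TRUE, because == binds tighter than !
  (!FALSE) == TRUE
)
```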


Create two logical vectors with three elements each using the c() function (pick random TRUE and FALSE values for each of them), and store them in variables named A and B. What happens when you do A & B, A | B, and !A or !B?

It turns out that we can compare not just single values (scalars) but also multiple values like vectors. When we do this, R performs the given operation for every pair of elements at once!

A <- c(TRUE, FALSE, TRUE)
B <- c(FALSE, FALSE, TRUE)
A & B
[1] FALSE FALSE  TRUE
A | B
[1]  TRUE FALSE  TRUE
!A
[1] FALSE  TRUE FALSE
!B
[1]  TRUE  TRUE FALSE

What happens when you apply the base R functions all() and any() on your A and B (or !A and !B) vectors? Remember these, because they are very useful!

These functions reduce a logical vector down to a single TRUE or FALSE value.

A
[1]  TRUE FALSE  TRUE
all(A)
[1] FALSE
any(A)
[1] TRUE

If this all feels too technical and mathematical, you’re kind of correct. That said, when you do data science, you will be using these logical expressions literally every single day. Think about a table which has a column with some values, like sequencing coverage. Every time you filter for samples with, for instance, coverage > 10, you’re performing this exact operation! You essentially ask, for each sample (each value in the column), which samples have coverage above 10 (giving you TRUE) and which have coverage of 10 or less (giving you FALSE). Filtering data is, in essence, about applying logical operations on vectors of TRUE and FALSE values (which boils down to the “logical indexing” introduced below), even though those logical values rarely feature as data in the tables we generally work with. Keep this in mind!
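To make that coverage example concrete, here is a minimal sketch (with made-up coverage values; the filtering step uses the “logical indexing” covered in the next Exercise):

```r
my_coverage <- c(3.2, 12.5, 8.1, 25.0, 17.3)

# step 1: the comparison yields one TRUE/FALSE per sample
my_coverage > 10               # FALSE TRUE FALSE TRUE TRUE

# step 2: the logical vector then picks out the matching samples
my_coverage[my_coverage > 10]  # 12.5 25.0 17.3
```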


Consider the following vectors of sample coverages and origins (let’s imagine these are columns in a table you got from a bioinformatics lab) and copy them into your R script:

coverage <- c(15.09, 48.85, 36.5, 1.12, 16.65, 0.79, 16.9, 46.09, 12.76, 11.51)
origin <- c("mod", "mod", "mod", "anc", "mod", "anc", "mod", "mod", "mod", "mod")

Create a variable is_high which will contain a TRUE / FALSE vector indicating whether a given coverage value is higher than 5. Then create a variable is_modern which will contain another logical vector indicating whether a given sample is "mod" (i.e., “modern”).

is_high <- coverage > 5
is_high
 [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
is_modern <- origin == "mod"
is_modern
 [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

Use logical operations to test whether the high-coverage samples (is_high) are exactly the modern samples (is_modern).

Hint: Use the == operator in combination with the all() function.

This tests, for each individual sample, whether its high-coverage status matches its modern status:

is_high == is_modern
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

And this tests whether that holds across all samples at once:

all(is_high == is_modern)
[1] TRUE

Note: You don’t always create dedicated throwaway temporary variables like this. You could easily do the same much more concisely (although maybe less readably). Both approaches are useful.

all((coverage > 5) == (origin == "mod"))
[1] TRUE

Exercise 5: Indexing into vectors and lists

To extract specific element(s) of a vector or a list (or to assign values to given position(s)), we use a so-called “indexing” operation. Generally speaking, we can do indexing in three ways:

  1. numerical-based indexing (by specifying a set of integer numbers),

  2. logical-based indexing (by specifying a vector of TRUE / FALSE values of the same length as the vector we’re indexing into)

  3. name-based indexing (by specifying names of elements to index)

Let’s practice those for vectors and lists separately. Later, when we introduce data frames, we will return to the topic of indexing again.

Vectors

1. Numerical-based indexing

To extract an i-th element of a vector xyz, we can use the [] operator like this: xyz[i]. For instance, we can take the 13-th element of this vector as xyz[13].

Familiarize yourselves with the [] operator by taking out some specific values from this vector, let’s say its 5-th element.

v <- c("hi", "folks", "what's", "up", "folks")
v[5]
[1] "folks"

The [] operator is “vectorized”, meaning that it can actually accept multiple values given as a vector themselves (i.e., something like v[c(1, 3, 4)] will extract the first, third, and fourth element of the vector v).

In this way, extract the first and fifth element of the vector v. What happens if you try to extract a tenth element from v?

v[c(1, 5)]
[1] "hi"    "folks"

Accessing a non-existent element gives us a “not available” or “missing” value.

v[10]
[1] NA

2. Logical-based indexing

Rather than giving the [] operator a specific set of integer numbers, we can provide a vector of TRUE / FALSE values specifying which elements of the input vector we want to “extract”. Note that this TRUE / FALSE indexing vector must have the same length as our original vector!


Create a variable containing a vector of five TRUE or FALSE values (i.e., with something like index <- c(TRUE, FALSE, ...) but with five TRUE or FALSE values total), and use that index variable in a v[index] indexing operation.

index <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
v[index]
[1] "hi"     "what's" "up"    

Usually we never want to create this “indexing vector” manually (imagine doing this for a vector of million values – impossible!). Instead, we create this indexing vector “programmatically”, based on a certain condition, like this:

index <- v == "up"

This checks which values of v are equal to “up”, creating a logical TRUE / FALSE vector in the process, and stores it in the variable index:

index
[1] FALSE FALSE FALSE  TRUE FALSE

Use the same principle to extract the elements of the vector matching the value “folks”.

index <- v == "folks"
index
[1] FALSE  TRUE FALSE FALSE  TRUE

A nice trick is that summing a logical vector using sum() gives you the number of TRUE matches:

sum(index)
[1] 2

This is actually why we do this indexing operation on vectors in the first place, most of the time – when we want to count how many data points match a certain criterion.

Let’s extract our matching values:

v[index]
[1] "folks" "folks"

Another very useful operator is %in%, which tests which elements of one vector are among the elements of another vector. You will be using it all the time when doing data analysis, so it’s good to get familiar with it.

For instance, if we take this vector again:

v <- c("hi", "folks", "what's", "up", "folks", "I", "hope", "you",
       "aren't", "(too)", "bored")

We can then ask, for instance, “which elements of v are among a set of given values?”:

v %in% c("folks", "up", "bored")
 [1] FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE

With our example vector v, it’s very easy to check this by eye, of course. But when working with real-world data, we often operate on tables with thousands or even millions of rows.


Let’s imagine we don’t want to test which elements of our data are among a set of pre-defined values, but want to ask the opposite question: “are my values of interest present in the data?”. Let’s say your values of interest are values <- c("hope", "(too)", "blah blah") and your whole data is again v. How would you use %in% to get a TRUE or FALSE answer for each of your values?

Although the question is phrased differently, it’s still the same logical operation. So, if this is our set of values of interest:

values <- c("hope", "(too)", "blah blah")

Our test is then this:

values %in% v
[1]  TRUE  TRUE FALSE

Indeed, the third element “blah blah” is not among the elements of v.

Lists

This section will be a repetition of the previous exercises about vectors. Don’t forget – lists are just vectors, except that they can contain values of heterogeneous types (numbers, characters, anything). As a result, everything that applies to vectors above also applies here.

But practice makes perfect, so let’s go through a couple of examples anyway:

l <- list("hello", "folks", "!", 123, "wow a number follows again", 42)
l
[[1]]
[1] "hello"

[[2]]
[1] "folks"

[[3]]
[1] "!"

[[4]]
[1] 123

[[5]]
[1] "wow a number follows again"

[[6]]
[1] 42

1. Numerical-based indexing

Numerical-based indexing works for lists just as we’ve shown for vectors.


Extract the second and fourth elements from l.

l[c(2, 4)]
[[1]]
[1] "folks"

[[2]]
[1] 123

2. Logical-based indexing

Similarly, you can use TRUE / FALSE indexing vectors on lists just as we did with normal (single-type) vectors. Rather than going through variations of the same exercises, let’s introduce another very useful pattern related to logical-based indexing: removing invalid elements.

Consider this vector:

v <- c("hello", "folks", "!", NA, "wow another NAs are coming", NA, NA, "42")

v
[1] "hello"                      "folks"                     
[3] "!"                          NA                          
[5] "wow another NAs are coming" NA                          
[7] NA                           "42"                        

Notice the NA values. One operation we have to do very often (particularly on data frames, whose columns are vectors, as we will see below!) is to remove those invalid elements. The function is.na() helps us detect them.

This function returns a TRUE / FALSE vector which, as you now already know, can be used for logical-based indexing!

is.na(v)
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE

A very useful trick in programming is negation (using the ! operator), which flips the TRUE / FALSE states. In other words, prefixing with ! returns a vector saying which elements of the input vector are not NA:

!is.na(v)
[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE

Use is.na(v) and the negation operator ! to remove the NA elements of the vector v!

v[!is.na(v)]
[1] "hello"                      "folks"                     
[3] "!"                          "wow another NAs are coming"
[5] "42"                        

[] vs [[]] operators

Let’s move to a more interesting topic. There’s another operator useful for lists, and that’s [[ ]] (not [ ]!). Extract the fourth element of the list l using l[4] and l[[4]]. What’s the difference between the results? If you’re unsure, use the mode() function on l[4] and l[[4]] to help you.

Strange, isn’t it? The [ ] operator seems to return a list, even though we expect the result 123?

l[4]
[[1]]
[1] 123
mode(l[4])
[1] "list"

On the other hand, l[[4]] gives us just a number!

l[[4]]
[1] 123
mode(l[[4]])
[1] "numeric"

I simply cannot not link to this brilliant figure, which explains this result in a very fun way. The left picture shows our list l, the middle picture shows l[4], the right picture shows l[[4]]. Spend some time experimenting with the behavior of [ ] and [[ ]] on our list l! This will come in handy many times in your R career!

Traversing nested lists

Remember our nested list from earlier? Here it is again (now with one more level of nesting added):

weird_list <- list(
  1,
  "two",
  list(
    "three",
    4,
    list(
      5, "six",
      list("seven", 8)
    )
  )
)

What do you get when you run weird_list[[1]]? How about weird_list[[3]]? And how about weird_list[[3]][[2]]? Things are getting a little complicated (or interesting, depending on how nerdy you are :)).

This takes out the first value of the list:

weird_list[[1]]
[1] 1

Note the [[]] operator. Here’s what we would get with the [] operator (basically, a “sublist”):

weird_list[1]
[[1]]
[1] 1
mode(weird_list[1])
[1] "list"

This extracts the 3rd value, which is itself a list!

weird_list[[3]]
[[1]]
[1] "three"

[[2]]
[1] 4

[[3]]
[[3]][[1]]
[1] 5

[[3]][[2]]
[1] "six"

[[3]][[3]]
[[3]][[3]][[1]]
[1] "seven"

[[3]][[3]][[2]]
[1] 8

And because this is just another list (only nested), we can also index into it! Glancing at the result of weird_list[[3]] just above, we see that the 2nd value of that list is the number 4. Let’s verify that:

weird_list[[3]][[2]]
[1] 4

What’s the sequence of this “chaining” of indexing operators to extract the number 8?

Hint: You can leverage the interactive nature of evaluating intermediate results in the R console, adding things to the expression (i.e., a chunk of code) in sequence.

Let’s take it step by step, interactively:

  1. We know that the nested list sits at the 3rd position of the whole list:
weird_list[[3]]
[[1]]
[1] "three"

[[2]]
[1] 4

[[3]]
[[3]][[1]]
[1] 5

[[3]][[2]]
[1] "six"

[[3]][[3]]
[[3]][[3]][[1]]
[1] "seven"

[[3]][[3]][[2]]
[1] 8
  2. The other nested list is at index number 3 again:
weird_list[[3]][[3]]
[[1]]
[1] 5

[[2]]
[1] "six"

[[3]]
[[3]][[1]]
[1] "seven"

[[3]][[2]]
[1] 8
  3. And the final list, the one carrying the number 8, is again at the 3rd position:
weird_list[[3]][[3]][[3]]
[[1]]
[1] "seven"

[[2]]
[1] 8
  4. Finally, we can extract the number 8, which sits at the second position:
weird_list[[3]][[3]][[3]][[2]]
[1] 8

Whew! That last exercise was something, wasn’t it? Kind of annoying, if you ask me.

Luckily, you will not have to do these kinds of complex shenanigans in R very often (maybe even never). Still, nested lists are sometimes used for capturing more complex types of data than just lists of numbers or tables (for instance, nested lists can capture tree-like structures). In any case, using names instead of just integers as indices makes the whole process much easier, as we will see below.

In data you encounter in practice, the most extreme case of data indexing you will have to do probably won’t be more complex than two nested indexing operators in a row (i.e., the equivalent of doing data[[2]][[3]]).

Particularly when we discuss some very convenient tidyverse operations later, having an idea about what a nested list even is will be very useful, so bear with me please!
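To give you a glimpse of what “tree-like structures” means in practice, here is a hypothetical sketch of a tiny family tree stored as a nested list (all names made up; it uses the named indexing introduced in the next section):

```r
family <- list(
  name = "Kim",
  children = list(
    list(name = "Alex", children = list(list(name = "Sam"))),
    list(name = "Robin")
  )
)

# walk down the tree: Kim -> first child (Alex) -> Alex's first child (Sam)
family[["children"]][[1]][["children"]][[1]][["name"]]  # "Sam"
```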

Named indexing for vectors and lists

Here’s a neat thing we can do with vectors and lists. They don’t have to contain just values themselves (which can be then extracted using integer or logical indices as we’ve done above), but those values can be assigned names too.

Consider this vector and list:

v <- c(1, 2, 3, 4, 5)
v
[1] 1 2 3 4 5
l <- list(1, 2, 3, 4, 5)
l
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] 4

[[5]]
[1] 5

As a recap, we can index into them in the usual manner like this:

v[c(1, 3, 5)]
[1] 1 3 5
l[c(1, 3)]
[[1]]
[1] 1

[[2]]
[1] 3

But we can also name the values like this (note that the names appear in the print out you get from R in the console):

v <- c(one = 1, two = 2, three = 3, four = 4, five = 5)
v
  one   two three  four  five 
    1     2     3     4     5 
l <- list(one = 1, two = 2, three = 3, four = 4, five = 5)
l
$one
[1] 1

$two
[1] 2

$three
[1] 3

$four
[1] 4

$five
[1] 5

When you have a named data structure like this, you can index into it using those names as well, which can be very convenient. Imagine having data described not by indices but by actually readable names (such as names of people, or excavation sites!):

l[["three"]]
[1] 3
l[c("two", "five")]
$two
[1] 2

$five
[1] 5

Note: This is exactly what data frames are, under the hood (named lists!), as we’ll see in the next section.
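As an aside: for named lists (and therefore also for data frames), R offers the $ operator as a convenient shorthand for [[ ]] with a quoted name. You will see it everywhere in R code in the wild:

```r
l <- list(one = 1, two = 2, three = 3)

l$three                           # the same as l[["three"]]
identical(l$three, l[["three"]])  # TRUE
```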

Let’s return (one last time, I promise!) to our nested list example, this time presenting it in a more convenient way.

weird_list <- list(
  1,
  "two",
  nested1 = list(
    "three",
    4,
    nested2 = list(
      5, "six",
      nested3 = list("seven", 8)
    )
  )
)

With a list like that, when we previously had to extract the element 8 like this:

weird_list[[3]][[3]][[3]][[2]]
[1] 8

we can now do this:

weird_list[["nested1"]][["nested2"]][["nested3"]][[2]]
[1] 8

Much more readable!

Negative indexing

Consider this vector again:

v <- c("hi", "folks", "what's", "up", "folks")

What happens when you index into v using the [] operator but give it a negative index (a number between -1 and -5)?

Negative indices remove elements!

v[-1]
[1] "folks"  "what's" "up"     "folks" 

When we exclude all indices 1:5, we remove everything, oops!

v[-(1:5)]
character(0)

A very useful function is length(), which gives the length of a given vector (or a list – remember, lists are vectors!). Use it to remove the last element of v. How would you remove both the first and last element of a vector or a list (assuming you don’t know the length beforehand, i.e., you can’t put a fixed number as the index of the last element)?

v[-length(v)]
[1] "hi"     "folks"  "what's" "up"    
# this gives us the index of the first and last element
c(1, length(v))
[1] 1 5
# then we can prefix this with the minus sign to remove them
v[-c(1, length(v))]
[1] "folks"  "what's" "up"    

Exercise 6: Data frames

Every scientist works with tables of data, in one way or another. R provides first-class support for working with tables, which are formally called “data frames”. We will spend most of our time in this workshop learning to manipulate, filter, modify, and plot data frames, oftentimes with data that is too big to look at all at once. For simplicity, just to get started and to explain the basic fundamentals, let’s begin with something trivially easy, like this little data frame here:

df <- data.frame(
  v = c("one", "two", "three", "four", "five"),
  w = c(1.0, 2.72, 3.14, 1000.1, 1e6),
  x = c(1, 13, 42, NA, NA),
  y = c("folks", "hello", "from", "data frame", "!"),
  z = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)

df
      v          w  x          y     z
1   one       1.00  1      folks  TRUE
2   two       2.72 13      hello FALSE
3 three       3.14 42       from FALSE
4  four    1000.10 NA data frame  TRUE
5  five 1000000.00 NA          !  TRUE

First, here’s the first killer bit of information: data frames are normal lists!

is.list(df)
[1] TRUE

How is this even possible? And why is this even the case? Explaining this in full would be too much detail, even for a course which tries to go beyond “R only as a plotting tool”, as I promised you in the introduction. Still, for now let’s say that R objects can store so-called “attributes”, which – in the case of data frame objects – make them behave as “something more than just a list”. The attribute responsible for this behavior is called the “class”.


You can poke into these internals by “unclassing” an object. Call unclass(df) in your R console and observe the result you get (just to convince yourself that data frames really are lists under the hood).

Honest admission – you will never need this unclass() stuff in practice, ever. I’m only showing it to demonstrate what a “data frame” actually is at a lower level of R programming. If you’re confused, don’t worry. The fact that data frames are lists matters infinitely more than knowing exactly how that is accomplished inside R.
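If you’re curious, here is a tiny sketch (my own, not part of the exercise) showing how class() and attributes() let you peek at this machinery without stripping anything away the way unclass() does:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

class(df)       # the "class" attribute: "data.frame"
attributes(df)  # the names, row.names, and class attributes
is.list(df)     # TRUE -- underneath, it is still a list
```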

This is what a normal data frame looks like to us:

df
      v          w  x          y     z
1   one       1.00  1      folks  TRUE
2   two       2.72 13      hello FALSE
3 three       3.14 42       from FALSE
4  four    1000.10 NA data frame  TRUE
5  five 1000000.00 NA          !  TRUE

Here is how a data frame is represented under the hood:

unclass(df)
$v
[1] "one"   "two"   "three" "four"  "five" 

$w
[1]       1.00       2.72       3.14    1000.10 1000000.00

$x
[1]  1 13 42 NA NA

$y
[1] "folks"      "hello"      "from"       "data frame" "!"         

$z
[1]  TRUE FALSE FALSE  TRUE  TRUE

attr(,"row.names")
[1] 1 2 3 4 5

It really is just a list!


Remember how we talked about “named lists” in the previous section? Yes, data frames really are just normal named lists with an extra bit of behavior added to them (namely, the fact that these lists are printed in a nice, readable, tabular form).

Selecting columns

Quite often we need to extract values of an entire column of a data frame. In the Exercise about indexing, you have learned about the [] operator (for vectors and lists), and also about the $ and [[]] operator (for lists). Now that you’ve learned that data frames are (on a lower level) just lists, what does it mean for wanting to extract a column from a data frame?


Try to use the three indexing options to extract the column named "z" from your data frame df. How do the results differ depending on the indexing method chosen? Is the indexing (and its result) different to indexing a plain list?

We can extract a given column with…

  1. the $ operator (column name as a symbol), which gives us a vector:
df$z
[1]  TRUE FALSE FALSE  TRUE  TRUE
  2. the [ operator (column name as a string), which gives us a (single-column) data frame:
df["z"]
      z
1  TRUE
2 FALSE
3 FALSE
4  TRUE
5  TRUE
  3. the [[ operator (column name as a string), which gives us a vector again:
df[["w"]]
[1]       1.00       2.72       3.14    1000.10 1000000.00

Let’s create a list-version of this data frame:

df_list <- as.list(df)
df_list
$v
[1] "one"   "two"   "three" "four"  "five" 

$w
[1]       1.00       2.72       3.14    1000.10 1000000.00

$x
[1]  1 13 42 NA NA

$y
[1] "folks"      "hello"      "from"       "data frame" "!"         

$z
[1]  TRUE FALSE FALSE  TRUE  TRUE

The indexing results match what we get for the data frame. After all, a data frame really is just a list (with some very convenient behavior, such as presenting the data in a tabular form). The only exception is the [ operator: df["v"] returns a (single-column) data frame, while df_list["v"] returns a plain list:

df_list$v
[1] "one"   "two"   "three" "four"  "five" 
df_list["v"]
$v
[1] "one"   "two"   "three" "four"  "five" 
df_list[["v"]]
[1] "one"   "two"   "three" "four"  "five" 

The tidyverse approach

In the chapter on tidyverse, we will learn much more powerful and easier tools to do these types of data-frame operations, particularly the select() function. That said, even if you use tidyverse exclusively, you will still encounter code in the wild which uses this base R way of doing things. Additionally, for certain trivial actions, doing “the base R thing” is just quicker to type. This is why knowing the basics of $, [], and [[]] will always be useful.


Selecting rows (“filtering”)

Of course, we often need to refer not just to specific columns of data frames, but also to given rows. Let’s consider our data frame again:

df
      v          w  x          y     z
1   one       1.00  1      folks  TRUE
2   two       2.72 13      hello FALSE
3 three       3.14 42       from FALSE
4  four    1000.10 NA data frame  TRUE
5  five 1000000.00 NA          !  TRUE

In the section on indexing into vectors and lists above, we learned primarily about two means of indexing into vectors. Let’s revisit them in the context of data frames:

  1. Integer-based indexing

What happens when you use the [1:3] index on the df data frame, just as you would to extract the first three elements of a vector?

Somewhat curiously, you get the first three columns, not rows!

df[1:3]
      v          w  x
1   one       1.00  1
2   two       2.72 13
3 three       3.14 42
4  four    1000.10 NA
5  five 1000000.00 NA

When indexing into a data frame, you need to distinguish the dimension along which you’re indexing: either a row, or a column dimension. Just like in referring to a cell coordinate in Excel, for example.

The way you do this for data frames in R is to separate the dimensions into which you’re indexing with a comma in this way: [row-index, column-name-or-index].

Try to extract the first three elements (1:3) of the data frame df by df[1:3, ]. Note the empty space after the comma ,! Then select a subset of the df data frame to only show the row #1 and #4 for columns "x" and "z".

Extract the first three rows of df:

df[1:3, ]
      v    w  x     y     z
1   one 1.00  1 folks  TRUE
2   two 2.72 13 hello FALSE
3 three 3.14 42  from FALSE

Extract rows 1 and 4 for columns “x” and “z”:

df[c(1, 4), c("x", "z")]
   x    z
1  1 TRUE
4 NA TRUE

Of course, the actual indexing dimensions can be (and often are) specified in variables. For instance, we often have code which first computes the indices, and then uses them to index into the data frame. The equivalent of this here would be:

row_indices <- 1:3

df[row_indices, ]
      v    w  x     y     z
1   one 1.00  1 folks  TRUE
2   two 2.72 13 hello FALSE
3 three 3.14 42  from FALSE

Extract rows 1 and 4 for columns “x” and “z”:

row_indices <- c(1, 4)
col_indices <- c("x", "z")

df[row_indices, col_indices]
   x    z
1  1 TRUE
4 NA TRUE
  2. Logical-based indexing

Similarly to indexing into vectors, you can also specify which rows should be extracted “at once” using a single logical vector (you can also do this for columns, but I honestly don’t remember the last time I had to do that).


The most frequent use for this is to select all rows of a data frame for which a given column (or multiple columns) carry a certain value. Select only those rows for which the column “y” has a value “hello”:

Let’s first use a vectorized comparison to get a TRUE / FALSE vector indicating which values of the “y” column contain the string “hello”. Remember that if you take a vector (of arbitrary length) and compare it to some value, you will get a TRUE / FALSE vector of the same length:

# this is what the column (vector, really) contains
df$y
[1] "folks"      "hello"      "from"       "data frame" "!"         
# this is how we can find out, which element(s) of the vector match
df$y == "hello"
[1] FALSE  TRUE FALSE FALSE FALSE
# let's save the result to a new variable
row_indices <- df$y == "hello"
row_indices
[1] FALSE  TRUE FALSE FALSE FALSE

Now we can use this vector as a row index into our data frame (don’t forget the comma ,, without which you’d be indexing into the column-dimension, not the row-dimension!).

df[row_indices, ]
    v    w  x     y     z
2 two 2.72 13 hello FALSE

Of course, you can also filter (remember this word) for a subset of rows and, at the same time, select (remember this word too) a subset of columns:

df[row_indices, c("v", "y", "z")]
    v     y     z
2 two hello FALSE



Now instead of filtering rows where column y matches “hello”, filter for rows where w column is less than 1000.

We can again store the filtered rows in a separate variable, and then use that variable to index into the data frame:

row_indices <- df$w < 1000
row_indices
[1]  TRUE  TRUE  TRUE FALSE FALSE
df[row_indices, ]
      v    w  x     y     z
1   one 1.00  1 folks  TRUE
2   two 2.72 13 hello FALSE
3 three 3.14 42  from FALSE

Often, we want to be more concise and do everything in one go:

df[df$w < 1000, ]
      v    w  x     y     z
1   one 1.00  1 folks  TRUE
2   two 2.72 13 hello FALSE
3 three 3.14 42  from FALSE
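You can also combine several conditions in a single logical index using the vectorized & (and) and | (or) operators. A small self-contained sketch (using a simplified version of our data frame):

```r
df <- data.frame(
  w = c(1.0, 2.72, 3.14, 1000.1, 1e6),
  z = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)

# rows where w is below 1000 AND z is FALSE
# (note the single, vectorized `&` -- not the scalar `&&`)
df[df$w < 1000 & !df$z, ]
```

This picks out the second and third rows, the only ones satisfying both conditions at once.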

Remember how we used to filter out elements of a vector using the !is.na(...) operation? You can see that df contains some NA values in the x column. Use the fact that you can filter rows of a data frame using logical-based vectors (as demonstrated above) to filter out rows of df at which the x column contains NA values.

Hint: You can get indices of the rows of df you want to retain with !is.na(df$x).

This gives us indices of the rows we want to keep:

!is.na(df$x)
[1]  TRUE  TRUE  TRUE FALSE FALSE

This is how we can filter out unwanted rows:

df[!is.na(df$x), ]
      v    w  x     y     z
1   one 1.00  1 folks  TRUE
2   two 2.72 13 hello FALSE
3 three 3.14 42  from FALSE

Creating (and deleting) columns

The $ and [] operators can be used to create new columns. For instance, the paste() function in R can be used to combine a pair of values into one. Try running paste(df$v, df$y) to see what the result of this operation is.

The general pattern to do this is:

df$<name of the new column> <- <vector of values to assign to it>

Create a new column called "new_col" and assign to it the result of paste(df$v, df$y).

df["new_col"] <- paste(df$v, df$y)

# new column appears!
df
      v          w  x          y     z         new_col
1   one       1.00  1      folks  TRUE       one folks
2   two       2.72 13      hello FALSE       two hello
3 three       3.14 42       from FALSE      three from
4  four    1000.10 NA data frame  TRUE four data frame
5  five 1000000.00 NA          !  TRUE          five !

When we want to remove a column from a data frame (for instance, we only used it to store some temporary result in a script), we actually do the same thing, except we assign to it the value NULL.


Remove the column new_col

df$new_col <- NULL

# and the column is gone
df
      v          w  x          y     z
1   one       1.00  1      folks  TRUE
2   two       2.72 13      hello FALSE
3 three       3.14 42       from FALSE
4  four    1000.10 NA data frame  TRUE
5  five 1000000.00 NA          !  TRUE

“Improper” column names

Most column names you will be using in your own scripts will (well, should!) follow the same rules as apply to variable names – they can’t start with a number and have to be composed of alphanumeric characters, with no other characters allowed except for underscores (and occasionally dots). To quote from the venerable R language reference:

Identifiers consist of a sequence of letters, digits, the period (‘.’) and the underscore. They must not start with a digit or an underscore, or with a period followed by a digit.

For instance, these are examples of proper identifiers which can serve as variable names, column names and (later) function names:

  • variable1
  • a_longer_var_42
  • anotherVariableName

Unfortunately, when you encounter data in the wild, especially in tables you get from other people or download as supplementary information from the internet, they are rarely this perfect. Here’s a little example of such data frame:

weird_df
  v with spaces          w y with % sign
1           one       1.00         folks
2           two       2.72         hello
3         three       3.14          from
4          four    1000.10    data frame
5          five 1000000.00             !

If you look closely, you can see that some column names contain spaces " " and also strange characters like %, which are normally not allowed. Which of the $, [] and [[]] operators can you use to extract the v with spaces and y with % sign columns as vectors?

[] and [[]] work just as before, because they accept a string by default anyway, so spaces and other characters are not a problem:

weird_df["v with spaces"]
  v with spaces
1           one
2           two
3         three
4          four
5          five
weird_df[["y with % sign"]]
[1] "folks"      "hello"      "from"       "data frame" "!"         

The $ operator needs a bit more work. When you encounter an “improper” column name in a data frame, you have to enclose the whole “symbol” or “identifier” in backticks, like this:

weird_df$`y with % sign`
[1] "folks"      "hello"      "from"       "data frame" "!"         

This is super useful when working with tabular data you get from someone else, especially if they prepared it in Excel. But you should never create data frames with these weird column names yourself. Always use names that would be appropriate as normal standard R identifiers on their own (just alphanumeric symbols or underscores).
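If you ever need to clean up such names in bulk, base R has a helper for exactly this: make.names(), which replaces every invalid character with a dot (a quick sketch):

```r
ugly_names <- c("v with spaces", "y with % sign")

# every character not allowed in an identifier becomes a "."
make.names(ugly_names)
```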


The tidyverse approach

Similarly to the previous section on column selection, there’s a much more convenient and faster-to-type way of doing filtering, using the tidyverse function filter(). Still, as with column selection, sometimes doing the quick and easy base R thing is just more convenient. The minimum on filtering rows of data frames introduced in this section will serve you well, even in the long run!

Exercise 7: Inspecting column types

Let’s go back to our example data frame:

df1 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, 13, 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)

df1
     w  x     y     z
1 1.00  1 hello  TRUE
2 2.72 13 folks FALSE
3 3.14 42     ! FALSE

Use the function str() – by calling str(df1) – to inspect the types of the columns in the table.

str(df1)
'data.frame':   3 obs. of  4 variables:
 $ w: num  1 2.72 3.14
 $ x: num  1 13 42
 $ y: chr  "hello" "folks" "!"
 $ z: logi  TRUE FALSE FALSE

Sometimes (usually when we read data from disk, for example data produced by another software), a data point sneaks in which makes a column apparently non-numeric. Consider this new table called df2:

df2 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, "13", 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)

df2
     w  x     y     z
1 1.00  1 hello  TRUE
2 2.72 13 folks FALSE
3 3.14 42     ! FALSE

Just by looking at this, the table looks the same as df1 above. Use str() again to see where the problem is.

str(df2)
'data.frame':   3 obs. of  4 variables:
 $ w: num  1 2.72 3.14
 $ x: chr  "1" "13" "42"
 $ y: chr  "hello" "folks" "!"
 $ z: logi  TRUE FALSE FALSE

Exercise 8: Functions

The motivation for this Exercise could be summarized by an ancient motto of programming: Don’t repeat yourself (DRY): “[…] a modification of any single element of a system does not require a change in other logically unrelated elements.”.

Let’s demonstrate this idea in practice.

Let’s say you have the following numeric vector (these could be base qualities, genotype qualities, \(f\)-statistics, sequencing coverage, anything):

vec <- c(0.32, 0.78, 0.68, 0.28, 1.96, 0.21, 0.07, 1.01, 0.06, 0.74,
         0.37, 0.6, 0.08, 1.81, 0.65, 1.23, 1.28, 0.11, 1.74,  1.68)

With numeric vectors, we often need to compute some summary statistics (mean, median, quartile, minimum, maximum, etc.). What’s more, in a given project, we often have to do this computation multiple times in a number of places.

In R, we have a very useful built-in function summary(), which does exactly that. But let’s ignore this for the moment, for learning purposes.

Here is how we can compute those summary statistics individually:

min(vec)
[1] 0.06
# first quartile (a value which is higher than the bottom 25% of the data)
quantile(vec, probs = 0.25)
   25% 
0.2625 
median(vec)
[1] 0.665
mean(vec)
[1] 0.783
# third quartile (a value which is higher than the bottom 75% of the data)
quantile(vec, probs = 0.75)
   75% 
1.2425 
max(vec)
[1] 1.96

Now, you can imagine that you have many more of such vectors (results for different sequenced samples, different computed population genetic metrics, etc.). Having to type out all of these commands for every single one of those vectors would very quickly get extremely tiresome. Worse still, in doing so we would certainly resort to copy-pasting, which is guaranteed to lead to errors.


Write a custom function called my_summary, which will accept a single input named values, and returns a list which binds all the six summary statistics together. Name the elements of that list as "min", "quartile_1", "median", "mean", "quartile_3", and "max".

Here is how we could write the function:

my_summary <- function(values) {
  a <- min(values)
  b <- quantile(values, probs = 0.25)
  c <- median(values)
  d <- mean(values)
  e <- quantile(values, probs = 0.75)
  f <- max(values)
  
  result <- list(min = a, quartile_1 = b, median = c, mean = d, quartile_3 = e, max = f)
  
  return(result)
}

Although I would probably prefer to write it a bit more tersely like this:

my_summary <- function(values) {
  result <- list(
    min = min(values),
    quartile_1 = quantile(values, probs = 0.25),
    median = median(values),
    mean = mean(values),
    quartile_3 = quantile(values, probs = 0.75),
    max = max(values)
  )
  
  return(result)
}

Yes, we had to write the code anyway – we even had to do the extra work of wrapping it inside other code (the function body, naming the one input argument values, which could be multiple arguments for a more complex function). So one could argue that we didn’t actually save any time. However, that code is now “encapsulated” in a fully self-contained form and can be called repeatedly, without any copy-pasting.

In other words, if you now create these three vectors of numeric values:

vec1 <- runif(10)
vec2 <- runif(10)
vec3 <- runif(10)

You can now compute our summary statistics by calling our function my_summary() on these vectors, without any code repetition:

my_summary(vec1)
my_summary(vec2)
my_summary(vec3)

Each call prints a named list of the six summary statistics computed from its own input vector (and because vec1, vec2, and vec3 are random draws from runif(), the exact numbers will differ from run to run, so they are not shown here).

And, surprise! This is exactly what the incredibly useful built-in function summary(), provided with every R installation, does!

summary(vec1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05448 0.27491 0.46080 0.49725 0.73522 0.98144 
summary(vec2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.06231 0.20432 0.27654 0.35114 0.48744 0.87767 
summary(vec3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1097  0.3447  0.5941  0.5788  0.8650  0.9482 

The punchline is this: if we ever need to modify how our summary statistics are computed, we only have to make a single change in the function code, instead of having to modify multiple copies of the code in multiple locations in our project.

Exercise 9: Base R plotting

As a final short section, it’s worth pointing out some very basic base R plotting functions. We won’t be getting into detail because tidyverse provides a much more powerful and more convenient set of tools for visualizing data. Still, base R is often convenient for quick troubleshooting, or for quick plotting in the initial phases of data exploration.

So far we’ve worked with a really oversimplified data frame. For a more interesting demonstration, R comes bundled with a realistic data frame of penguin data:

head(penguins)
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007

Use the function hist() to plot a histogram of the body mass of the entire penguins data set.

hist(penguins$body_mass)

Sometimes it is convenient to adjust the bin width:

hist(penguins$body_mass, breaks = 50)


The dataset also contains the measure of bill length. Use the function plot() to visualize a scatter plot of two vectors: penguins$flipper_len against penguins$body_mass. Is there an indication of a relationship between the two metrics?

plot(penguins$flipper_len, penguins$body_mass)

# we can also overlay a linear fit (first computed with the `lm()` function,
# then visualized as a red dashed line)
lm_fit <- lm(body_mass ~ flipper_len, data = penguins)
abline(lm_fit, col = "red", lty = 2)

We can also see that we have data for three different species of penguins. We can therefore partition the visualization for each species individually:

plot(penguins$flipper_len, penguins$body_mass, col = penguins$species)


Base R plotting is very convenient for quick-and-dirty data summaries, particularly immediately after reading unknown data. However, for anything more complex (and anything prettier), ggplot2 is unbeatable. We will be looking at ggplot2 visualization in the session on tidyverse but, as a sneak peek, you can take a look at the beautiful figures you’ll learn how to make later.

Look how comparatively little code we need to make beautiful informative figures which immediately tell a clear story! Stay tuned for later! :)

library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
ggplot(penguins) +
  geom_histogram(aes(flipper_len, fill = species), alpha = 0.5) +
  theme_minimal() +
  ggtitle("Distribution of flipper lengths across species")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(penguins, aes(flipper_len, body_mass, shape = sex, color = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~ species) +
  theme_minimal() +
  ggtitle("Body mass as a function of flipper length across penguin species")
`geom_smooth()` using formula = 'y ~ x'

Let’s revisit the concept of a function introduced earlier and explore it in a much more practically useful setting – defining customized plotting functions. This is, honestly, one of the most useful applications of custom functions in your own research.


In the example above, you used hist(penguins$body_mass) to plot a histogram of the penguins’ body mass. Write a custom function penguin_hist() which will accept a data frame plus two more arguments: 1. the name of a column in the penguins data frame, and 2. the number of histogram breakpoints to use as the breaks = argument in a hist(<vector>, breaks = ...) call.

Hint: Remember that you can extract a column of a data frame as a vector not just using a symbolic identifier with the $ operator but also as a string name with the [[]] operator.

Hint: A useful new argument of the hist() function is main =. You can specify this as a string and the contents of this argument will be plotted as the figure’s title.

If you need further help, feel free to use this template and fill the insides of the function body with your code:

penguin_hist <- function(df, column, breaks) {
  # ... put your hist() code here
}

We can extract any column column of a data frame df with df[[column]], so we only need to do this:

penguin_hist <- function(df, column, breaks) {
  title <- paste("Histogram of the column", column)
  hist(df[[column]], breaks = breaks, main = title)
}

We can then use our fancy new function like this:

penguin_hist(penguins, "bill_len", breaks = 50)

penguin_hist(penguins, "bill_dep", breaks = 50)

penguin_hist(penguins, "flipper_len", breaks = 50)

penguin_hist(penguins, "body_mass", breaks = 50)


If this function stuff appears overwhelming, don’t worry. In the remainder of this workshop you will get plenty of opportunities to practice coding, so that by the end it becomes completely trivial.

Exercise 10: Conditional expressions

Oftentimes, especially when writing custom code, we need to make automated decisions whether something should or shouldn’t happen given a certain value. We can do this using the if expression which has the following form:

if (<condition resulting in TRUE or FALSE>) {
  ... code which should be executed if condition is TRUE...
}

An extension of this is the if-else expression, taking the following form:

if (<condition resulting in TRUE or FALSE>) {
  ... code which should be executed if condition is TRUE...
} else {
  ... code which should be executed if condition is FALSE...
}
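Before the exercise, here is a minimal runnable sketch of the if-else form in action:

```r
x <- -5

# the condition (x >= 0) evaluates to FALSE here,
# so the `else` branch is executed
if (x >= 0) {
  cat("x is non-negative\n")
} else {
  cat("x is negative\n")
}
```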

Go back to your new penguin_hist() function and add an if expression which will make sure that breaks is greater than 0. In other words, if breaks < 1 (the condition you will be testing against), execute the command stop("Incorrect breaks argument given!").

Here is our new modified function:

penguin_hist <- function(df, column, breaks) {
  if (breaks < 1) {
    stop("Incorrect breaks argument given!", call. = FALSE) # call. = FALSE makes errors less verbose
  }

  title <- paste("Histogram of the column", column)
  hist(df[[column]], breaks = breaks, main = title)
}

And here is how our new modification guards against incorrect use of our function:

penguin_hist(penguins, "bill_len", breaks = -123)
Error: Incorrect breaks argument given!

This was just a little sneak peek. As you get more comfortable with programming, if and if-else expressions like this will be very useful for making your code more robust. Whenever you’re coding, catching errors as soon as they happen is extremely important!

Exercise 11: Iteration and loops

Functions help us take pieces of code and generalize them, reducing the amount of code needed to do similar things repeatedly and avoiding code duplication by copy-pasting (nearly the same) chunks of code over and over. You could think of iteration as generalizing those repetitions even further: instead of manually calling a bit of code again and again, we let a loop execute it for us over a whole collection of inputs.

In general, there are two types of loops:

1. Loops producing a value for each iteration

The loops in this category which we are going to encounter most often are those of the apply family, like lapply() or sapply(). The general pattern looks like this:

result <- lapply(<vector/list of values>, <function>)

The lapply and sapply functions take, at minimum, a vector or a list as their first argument, and then a function which takes a single argument. Then they apply the given function to each element of the vector/list, and return either a list (if we use the lapply() function) or a vector (if we use the sapply() function).

Let’s consider this more concrete example:

input_list <- list("hello", "this", 123, "is", "a mixed", 42, "list")

result_list <- lapply(input_list, is.numeric)

result_list
[[1]]
[1] FALSE

[[2]]
[1] FALSE

[[3]]
[1] TRUE

[[4]]
[1] FALSE

[[5]]
[1] FALSE

[[6]]
[1] TRUE

[[7]]
[1] FALSE

We took an input_list and applied a function to each element, automatically, gathering the results in another list!


Create the following function which, for a given number, returns TRUE if the number is even and FALSE if it is odd. Then use sapply() to test which of the numbers in the input_vector are even and which are odd. Notice that we could do thousands or even millions of operations like this (for very, very long input vectors) with a single sapply() command!

input_vector <- c(23, 11, 8, 36, 47, 6, 66, 94, 20, 2)

is_even <- function(x) {
  x %% 2 == 0 # this answers the question "does dividing x by 2 leave a remainder of 0?"
}

Let’s first test our custom-made function on a couple of examples. This is always a good idea when we want to do iteration using lapply() / sapply().

is_even(2)
[1] TRUE
is_even(7)
[1] FALSE
is_even(10)
[1] TRUE
is_even(11)
[1] FALSE

Now we apply is_even to every element of our input_vector:

result <- sapply(input_vector, is_even)
result
 [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

As a bit of indexing practice, use the result of the sapply() call you just ran to filter the input_vector values down to only the odd numbers.

Hint: Remember that you can negate any TRUE / FALSE value (even vector) using the ! operator.

Again, this is how we detected which numbers are even:

even_numbers <- sapply(input_vector, is_even)
even_numbers
 [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

This is how we can flip this vector to get odd number indices:

odd_numbers <- !even_numbers
odd_numbers
 [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

And finally, this is how we can filter the original vector to only numbers that are odd:

input_vector[odd_numbers]
[1] 23 11 47

Here’s a secret: there’s a much quicker way to do this, even without looping, utilizing the fact that many operations can be performed on vectors in a “vectorized” way (for every element at the same time):

This marks the odd numbers right away:

input_vector %% 2 != 0
 [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

This is how we can do the filtering in one go:

input_vector[input_vector %% 2 != 0]
[1] 23 11 47

These examples are perhaps too boring to see immediate usefulness of sapply() and lapply() for real-world applications. In later sessions, we will see more complex (and more useful) applications of these functions in daily data analysis tasks.

That said, you can hopefully see how automating an action over many things at once can be a very useful means of saving yourself repeated typing of the same command over and over. Again, consider a situation in which you have thousands or even millions of data points!
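One more trick worth knowing (a small sketch of mine): the function passed to lapply() / sapply() doesn’t have to be named at all – you can write an anonymous function inline:

```r
input_vector <- c(23, 11, 8, 36, 47)

# the same even/odd test, without defining is_even() first
sapply(input_vector, function(x) x %% 2 == 0)
```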

2. Loops which don’t necessarily return a value

This category of loops most often takes the form of a for loop, which generally has the following shape:

for (<item> in <vector or list>) {
  ... some commands ...
}

The most trivial runnable example I could think of is this:

# "input" vector of x values
xs <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# take each x out of all given xs in sequence...
for (x in xs) {
  # ... and print it out on a new line
  cat("The square of", x, "is", x^2, "\n")
}
The square of 1 is 1 
The square of 2 is 4 
The square of 3 is 9 
The square of 4 is 16 
The square of 5 is 25 
The square of 6 is 36 
The square of 7 is 49 
The square of 8 is 64 
The square of 9 is 81 
The square of 10 is 100 

Note: cat() is a very useful function which prints out a given value (or, as here, multiple values!). If we append "\n", it will add an (invisible) “new line” character, the equivalent of hitting ENTER on your keyboard while writing.
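To see the effect of "\n" for yourself, try comparing these calls (a small sketch):

```r
# without "\n", consecutive outputs run together on one line
cat("hello")
cat("world")   # prints right after "hello", no line break

cat("\n")      # manually end the line

# with "\n", each call finishes its own line
cat("hello\n")
cat("world\n")

# cat() also accepts multiple values, separated by spaces by default
cat("The answer is", 42, "\n")
```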


Let’s say you want to automate the plotting of several numeric variables from your penguins data frame to a PDF, using the custom penguin_hist() function you defined above. Fill in the necessary bits of code in the following template to do this! This is a super common pattern that comes in handy very often in data science work.

Note: You can see the cat() call in the body of the for loop (we call the insides of the { ... } block the “body” of a loop). When you iterate over many things, it’s very useful to print out this sort of “log” information, particularly if the for loop can take very long.

# create an empty PDF file
pdf("penguins_hist.pdf")

# let's define our variables of interest (columns of a data frame)
variables <- c("bill_len", "bill_dep", "flipper_len", "body_mass")

# let's now "loop over" each of those variables
for (var in variables) {
  cat("Plotting variable", var, "...\n")
  # ... here is where you should put your call to `penguin_hist()` like
  #     you did above manually ...
}
Plotting variable bill_len ...
Plotting variable bill_dep ...
Plotting variable flipper_len ...
Plotting variable body_mass ...
dev.off() # this closes the PDF
quartz_off_screen 
                2 

If everything worked correctly, take a look at the penguins_hist.pdf file you just created! It was all done in a fully automated way, which is hugely important for reproducibility.


If you want, look up ?pdf to see how you could modify the width and height of the figures that will be created.
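For instance, according to ?pdf, the width and height arguments set the dimensions of each page in inches (both default to 7), so you could open the device with something like:

```r
# each page of the PDF will be 8 inches wide and 5 inches tall
pdf("penguins_hist.pdf", width = 8, height = 5)

plot(1:10)  # any plotting commands go here

dev.off()   # always close the device when you're done
```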

pdf("penguins_hist.pdf")

variables <- c("bill_len", "bill_dep", "flipper_len", "body_mass")
for (var in variables) {
  penguin_hist(penguins, var, 100)  
}

dev.off()
quartz_off_screen 
                2 

Further practice

If you have energy and time, take a look at the following chapters of the freely available Advanced R textbook (the best resource for R programming out there). First work through the quiz at the beginning of each chapter. If you’re not sure about the answers (the questions are very hard, so if you can’t answer them, that’s completely OK), work through the material of each chapter and try to solve the exercises.

  1. Names and values
  2. Vectors
  3. Subsetting
  4. Control flow
  5. Functions

Pick whichever topic seems interesting to you. Don’t try to ingest everything – even isolated little bits and details that stick with you will pay off in the long run!