R programming language
⚠️⚠️⚠️ This chapter is a work-in-progress! ⚠️⚠️⚠️
In the wider programming world, R sometimes has a slightly unfortunate reputation as a badly designed “calculator language”: a computing environment which is (maybe!) good for working with data frames and creating figures, but that’s about it. However, while certainly very useful for data science, R is a full-blown programming language which is actually quite powerful even from a purely computer science perspective.
But still, this is a book about population genomics and data science in R. Why does it matter how much “real coding” we do in it?
Well, although this entire workshop will focus on R primarily as a statistical and visualization environment, neglecting the aspects of R which make it a “proper” programming language (or even considering data science as “not real programming”) is a huge problem.
First, even when “just” doing data science and statistics, we still use typical programming constructs, we need to be aware of the underlying data types behind our data (mostly the contents of tables), and we need to think algorithmically. Neglecting these things makes it easy to introduce bugs into our code, makes those bugs hard to find, and makes our programs less efficient even when they do work.
This chapter will help you get familiar with some of the less obvious aspects of the R language and of programming in general, certainly the parts which are often skipped in undergraduate courses in the life sciences in favor of just teaching plotting and running statistical tests. The good thing is, there isn’t that much you need to learn. And what you do learn will continue paying dividends for the rest of your research career!
Let’s say it again, because people with non-computational backgrounds often feel inadequate when it comes to the computational aspects of their work: even when you’re “just” writing data analysis scripts, even when you’re “just” plotting results, you’re still writing programs. You’re a programmer. How exciting, right? The exercises in this chapter are designed to make you comfortable with programming and algorithmic thinking.
Part 1: Basic data types
Create the following variables in your R script and then evaluate this code in your R console:
Hint: I suggest you always write your code in a script in RStudio (click File -> New file -> R script). You can execute the line (or block) of code under the cursor in the script window by pressing CTRL+Enter (on Windows or Linux) or CMD+Enter (on a Mac). For quick tests, feel free to type directly into the R console.
w1 <- 3.14
x1 <- 42
y1 <- "hello"
z1 <- TRUE
w1
[1] 3.14
x1
[1] 42
y1
[1] "hello"
z1
[1] TRUE
What are the data “types” you get when you apply the function mode() on each of these variables?
You can test whether or not a specific variable is of a specific type using functions such as is.numeric(), is.integer(), is.character(), and is.logical(). See what results you get when you apply these functions to the four variables w1, x1, y1, and z1. Pay close attention to the difference (or lack thereof?) between applying is.numeric() and is.integer() to variables containing “numbers”.
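For instance, here is a minimal check you might run (the expected results are noted in the comments):
mode(w1)       # "numeric"
is.numeric(x1) # TRUE
is.integer(x1) # FALSE -- a plain 42 is stored as a double; only 42L would be an integer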
Note: This might seem incredibly boring and useless, but trust me. In your real data, you will be dealing with data frames (discussed below) with thousands of rows, sometimes millions. Being able to make sure that the values you get in your data-frame columns are of the expected type is something you will be doing often.
To summarize (and oversimplify a little bit), R allows variables to hold several types of data, most importantly:
- integers (such as 42L; note the L suffix)
- numerics (such as 42.13)
- characters (such as "text value")
- logicals (TRUE or FALSE)
We will also encounter two types of “non-values”. We will not be discussing them in detail here, but they will be relevant later. For the time being, just remember that there are also:
- undefined values, represented by NULL
- missing values, represented by NA
What do you think is the practical difference between NULL and NA? In other words, when you encounter one or the other in the data, how would you interpret it?
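One way to build some intuition (a small sketch; the expected behavior is noted in the comments):
length(NULL)  # 0 -- NULL is the complete absence of a value
length(NA)    # 1 -- NA is a value that exists but is unknown
c(1, NULL, 3) # 1 3 -- NULL simply disappears from a vector
c(1, NA, 3)   # 1 NA 3 -- NA keeps its place as a missing data point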
Part 2: Vectors
Vectors are, roughly speaking, collections of values. We could also say “a list”, but that’s not entirely precise in the context of R as we’ll see below.
We can create a vector by calling the c() function (“c” stands for “concatenate”, or “joining together”). Create the following variables containing these vectors. Then inspect their data types using mode() again.
w2 <- c(1.0, 2.72, 3.14)
x2 <- c(1, 13, 42)
y2 <- c("hello", "folks", "!")
z2 <- c(TRUE, FALSE)
w2
[1] 1.00 2.72 3.14
x2
[1] 1 13 42
y2
[1] "hello" "folks" "!"
z2
[1] TRUE FALSE
We can use the function is.vector() to test that a given object really is a vector. Try this on your vector variables.
What happens when you call is.vector() on the variables x1, y1, etc. from the previous part (i.e., those which contain single values)?
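If you want to check your answer (a quick sketch; results in the comments):
is.vector(w2) # TRUE
is.vector(x1) # TRUE -- in R, a single value is just a vector of length 1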
Do elements of vectors need to be homogeneous (i.e., of the same data type)? Try creating a vector with the values 1, "42", and "hello". Can you do it? What happens when you try? Inspect the result in the R console (take a close look at how the result is printed and at the quotes that you will see), or use the mode() function again.
As you could see, vectors must carry values of just one type. If they are given a mixture, the values are converted by a straightforward cascade of so-called “coercions”. A vector defined with a mixture of different values (i.e., of the four atomic types we discussed in the first part) will be coerced to just one of those types, following certain rules.
Try to figure out some of these coercion rules. Make a couple of vectors with mixed values of different types using the function c(), and observe what type of vector you get in return.
Hint: Try creating a vector which has integers and strings, integers and floats, integers and logicals, floats and logicals, floats and strings, and logicals and strings. Observe the format of the result that you get, and build your intuition by calling mode() on each vector object to verify this.
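A few examples of the cascade you should discover (the resulting modes are in the comments):
mode(c(1L, 2.5))     # "numeric"   -- integer + double  -> double
mode(c(1, TRUE))     # "numeric"   -- logical + number  -> number
mode(c(1, "hello"))  # "character" -- anything + string -> string
mode(c(TRUE, "yes")) # "character"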
Of all this data type exploration, this part is probably the most crucial for any kind of data science work. Why is that? Think about what can happen when someone does manual data entry in Excel.
You can create vectors of consecutive values using several approaches. Try these options (a solution sketch follows the list):
- Create a sequence of values from i to j as i:j. Create a vector of the numbers 1:20.
- Do the same using the function seq(). Read ?seq to find out what parameters you should specify (and how) to get the same result as the i:j shortcut.
- Modify the arguments given to seq() so that you create a vector of numbers from 20 to 1.
- Use the by = argument of seq() to create a vector of only odd values starting from 1.
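One possible solution sketch (there are several equivalent ways to write these):
1:20                           # the shortcut
seq(from = 1, to = 20)         # the same sequence via seq()
seq(from = 20, to = 1)         # counting down
seq(from = 1, to = 20, by = 2) # odd values: 1 3 5 ... 19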
Another very useful built-in helper function (especially when we get to the iteration part below) is seq_along(). What does it give you when you run it on this vector, for instance?
v <- c(1, "42", "hello", 3.1416)
Part 3: Lists
Lists are a little similar to vectors but very different in a couple of important respects. Remember how we tested what happens when we put different types of values in a vector (reminder: vectors must be “homogeneous” in terms of the data types of their elements!)? What happens when you create lists with different types of values using the code in the following chunk? Use mode() on the resulting objects and compare your results to those you got on the “mixed value” vectors above.
w3 <- list(1.0, "2.72", 3.14)
x3 <- list(1, 13, 42, "billion")
y3 <- list("hello", "folks", "!", 123, "wow a number follows again", 42)
z3 <- list(TRUE, FALSE, 13, "string")
Try also a different function called str() (for “structure”) and apply it to one of those lists. Is mode() or str() more useful for inspecting what kind of data is stored in a list? Why? (str() will be very useful when we get to data frames for, spoiler alert, exactly this reason.)
is.list(w3)
[1] TRUE
Use is.vector() and is.list() on one of the lists above (like w3, perhaps). Why do you get the result that you got? Then run both functions on one of the vectors you created above (like w2). What does this mean?
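To check your findings (a quick sketch; the results are in the comments):
is.list(w3)   # TRUE
is.vector(w3) # TRUE -- a plain list is also a (non-atomic) vector
is.list(w2)   # FALSE -- an atomic vector is not a list
is.vector(w2) # TRUE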
Not only can lists contain arbitrary values of mixed types (the atomic data types from Part 1 of this exercise), they can also contain non-atomic data, such as other lists! In fact, you can, in principle, create lists of lists of lists of… lists! Try creating a list() which, in addition to a couple of normal values (numbers, strings, it doesn’t matter), also contains one or two other lists (we call these “nested”). Don’t think about this too much, just create something arbitrary to get a bit of practice. Save this in a variable called weird_list and type it back into your R console, just to see how R presents such data back to you. In the next part, we will learn how to explore this type of data better.
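One arbitrary possibility (just a sketch; any nested structure will do):
weird_list <- list(1, "two", list(3, "four"), list(TRUE, list(5.5)))
weird_list       # plain printing of nested lists gets hard to read quickly
str(weird_list)  # str() shows the nesting much more clearly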
Note: If you are confused (or even annoyed) about why we are even doing this: in the later discussion of data frames and spatial data structures, it will become much clearer why putting lists into other lists enables a whole new level of data science work. Please bear with me for now! This is just laying the groundwork for some very cool things later down the line.
Part 4: Indexing
To extract the i-th element of a vector xyz, we can use the [] operator like this: xyz[i]. For instance, we can take the 13-th element of a vector as xyz[13].
Familiarize yourselves with the [] operator by taking out some specific values from this vector, like the 5-th element.
vector <- c(1, 10, 100, 1000, 10000, 100000, 1000000)
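For instance (expected elements in the comments):
vector[5]       # the 5-th element: 10000
vector[c(1, 7)] # [] also accepts a vector of indices: elements 1 and 7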
Part 5: Data frames
Every scientist works with tables of data, in one way or another. R provides first-class support for working with tables, which are formally called “data frames”. We will spend most of our time in this workshop learning to manipulate, filter, modify, and plot data frames, oftentimes with data that is too big to look at all at once. For simplicity, just to get started and to explain the basic fundamentals, let’s begin with something trivially easy, like this little data frame here:
df <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, 13, 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Now, here’s the first killer bit of information: data frames are normal lists!
is.list(df)
[1] TRUE
How is this even possible? And why is it even the case? Explaining this in full would be too much detail, even for a course which tries to go beyond “R only as a plotting tool”, as I promised you in the introduction. Still, for now let’s say that R objects can store so-called “attributes”, which, in the case of data frame objects, make them behave as “something more than just a list”. These attributes are called “classes”.
You can poke into these internals by “unclassing” an object. Call unclass(df) in your R console and observe the result you get (just to convince yourself that data frames really are lists under the hood).
Note: Honest admission – you will never need this unclass() stuff in practice, ever. I’m really only showing it to demonstrate what a “data frame” actually is at a lower level of R programming. If you’re confused, don’t worry. The fact that data frames are lists matters infinitely more than knowing exactly how that is accomplished inside R.
There are two ways of extracting a data-frame column:
The $ operator (column name as a symbol):
df$w
[1] 1.00 2.72 3.14
The [[ operator (column name as a string):
df[["w"]]
[1] 1.00 2.72 3.14
Note that the operator must be [[, because [ does something slightly different.
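To see the difference between the two bracket operators (a quick check; class() reveals what each one returns):
class(df[["w"]]) # "numeric"    -- [[ extracts the column itself, as a vector
class(df["w"])   # "data.frame" -- [ subsets the table, keeping its shape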
Part 6: Inspecting column types
Let’s go back to our example data frame:
df1 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, 13, 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df1
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Using the function str(), i.e. by calling str(df1), inspect the types of the columns in the table.
str(df1)
'data.frame': 3 obs. of 4 variables:
$ w: num 1 2.72 3.14
$ x: num 1 13 42
$ y: chr "hello" "folks" "!"
$ z: logi TRUE FALSE FALSE
Sometimes (usually when we read data from disk, e.g. a file produced by another software), a data point sneaks in which makes a column unexpectedly non-numeric. Consider this new table called df2:
df2 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, "13", 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df2
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Just by looking at it, this table appears the same as df1 above. Use str() again to see where the problem is.
str(df2)
'data.frame': 3 obs. of 4 variables:
$ w: num 1 2.72 3.14
$ x: chr "1" "13" "42"
$ y: chr "hello" "folks" "!"
$ z: logi TRUE FALSE FALSE
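A common remedy once you spot such a column (a sketch; note that as.numeric() turns any value it cannot convert into NA, with a warning):
df2$x <- as.numeric(df2$x)
str(df2) # the x column is now num again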
Part 7: Functions
The theme of this part could be summarized by an ancient motto of programming, Don’t Repeat Yourself (DRY): “[…] a modification of any single element of a system does not require a change in other logically unrelated elements.”
Let’s say you have the following numeric vector:
vec <- c(-0.32, 0.78, -0.68, 0.28, -1.96, 0.21, -0.07, -1.01, 0.06, 0.74,
         -0.37, -0.6, -0.08, 1.81, -0.65, -1.23, -1.28, 0.11, -1.74, -1.68)
With numeric vectors (which could be base qualities, genotype qualities, \(f\)-statistics, etc.), we often need to compute a couple of summary statistics (mean, median, quartile, minimum, maximum, etc.).
In R, we have a very useful built-in function, summary(), which does exactly that. But let’s ignore it for the moment, for learning purposes.
Here is how you can compute those summary statistics individually:
min(vec)
[1] -1.96
quantile(vec, probs = 0.25)
25%
-1.065
median(vec)
[1] -0.345
mean(vec)
[1] -0.384
quantile(vec, probs = 0.75)
75%
0.135
max(vec)
[1] 1.81
Now, you can imagine having many more such vectors (results for different sequenced samples, different computed population genetic metrics, etc.). Having to type out all of these commands for every single one of those vectors would very quickly get extremely tiresome. Worse still, if we did this, we would certainly resort to copy-pasting, which is guaranteed to lead to errors.
Write a custom function called my_summary which accepts a single input named values and returns a list which binds all six summary statistics together. Name the elements of that list "min", "quartile_1", "median", "mean", "quartile_3", and "max".
my_summary <- function(values) {
  # note that we compute everything from the function's argument `values`,
  # not from any global variable such as `vec`
  a <- min(values)
  b <- quantile(values, probs = 0.25)
  c <- median(values)
  d <- mean(values)
  e <- quantile(values, probs = 0.75)
  f <- max(values)

  result <- list(min = a, quartile_1 = b, median = c,
                 mean = d, quartile_3 = e, max = f)

  return(result)
}
Note that yes, we had to write the code anyway, but it is now “encapsulated” in a fully self-contained form and can be called repeatedly, without any copy-pasting.
In other words, if we imagine having these three vectors of numeric values:
vec1 <- runif(10)
vec2 <- runif(10)
vec3 <- runif(10)
We can now compute our summary statistics for any of them by calling our function my_summary().
For example, applied to our original vector vec:
my_summary(vec)
$min
[1] -1.96
$quartile_1
25%
-1.065
$median
[1] -0.345
$mean
[1] -0.384
$quartile_3
75%
0.135
$max
[1] 1.81
We can do the same for each new vector simply by calling my_summary(vec1), my_summary(vec2), and my_summary(vec3). Because runif() draws random numbers, your exact values for those will differ from anyone else’s (and from run to run).
And, surprise! This is exactly what the incredibly useful built-in function summary(), provided with every R installation, does!
summary(vec1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.06962 0.23824 0.69048 0.57455 0.88080 0.93092
summary(vec2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09554 0.21031 0.26917 0.34678 0.47034 0.83012
summary(vec3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.004007 0.187542 0.481866 0.437755 0.630130 0.894811
Part 8: Iteration and loops
Functions help us generalize pieces of code, reducing the amount of code needed to do similar things multiple times. You could think of iteration as generalizing the repetition itself: instead of calling the same code once per object by hand, we let R walk over a collection of objects for us.
That is still quite vague and abstract, so let’s make it concrete with a small example.
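A minimal sketch using the vectors and the my_summary() function from the previous part (lapply() is just one of several ways to iterate in R):
# collect the vectors in a list...
vectors <- list(vec1, vec2, vec3)

# ...and apply our function to each element, getting a list of results back
results <- lapply(vectors, my_summary)

# the same computation written as an explicit for loop
results <- list()
for (i in seq_along(vectors)) {
  results[[i]] <- my_summary(vectors[[i]])
}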
Take home message
It’s worth acknowledging that developing code as proper functions, using iteration, and generally thinking about solving problems using “proper” programming approaches like these is a lot of work. It might seem that “just copying a bit of code a couple of times” is easier. In the moment, it actually is.
Except we rarely, if ever, do things only once. We have to come back to old scripts, update them, or add new steps. Doing this for every single bit of copy-pasted code (sometimes across multiple scripts) is awful, and almost always results in bugs. Missing even one edit in one copy of a piece of code will make our results wrong.
It’s always worth investing a bit of extra time into extracting repeated pieces of code into individual functions. It will always result in less work in the long run.