R programming language
⚠️⚠️⚠️ This chapter is a work-in-progress! ⚠️⚠️⚠️
In the wider programming world, R sometimes has a slightly unfortunate reputation as a badly designed “calculator language”: a computing environment which is (maybe!) good for working with data frames and creating figures, but that’s about it. However, while certainly very useful for data science, R is a full-blown programming language which is actually quite powerful even from a purely computer science perspective.
But still, this is a book about population genomics and data science in R. Why does it matter how much “real coding” we do in it?
Well, although this entire workshop will focus on R primarily as a statistical and visualization environment, neglecting the aspects of R which make it a “proper” programming language (or even considering data science as “not real programming”) is a huge problem.
First, even when “just” doing data science and statistics, we still use typical programming constructs, we need to be aware of the underlying data types behind our data (mostly the contents of tables), and we need to think algorithmically. Neglecting these things makes it easy to introduce bugs into our code, makes those bugs hard to find, and makes our programs less efficient even when they do work.
This chapter will help you get familiar with some of the less obvious aspects of the R language and of programming in general, certainly the parts which are often skipped in undergraduate courses in the life sciences in favor of just teaching plotting and running statistical tests. The good thing is, there isn’t that much you need to learn. And what you do learn will continue paying dividends for the rest of your research career!
Let’s say it again, because people with non-computational backgrounds often feel inadequate when it comes to the computational aspects of their work: even when you’re “just” writing data analysis scripts, even when you’re “just” plotting results, you’re still writing programs. You’re a programmer. How exciting, right? The exercises in this chapter are designed to make you comfortable with programming and algorithmic thinking.
Part 1: Basic data types
Create the following variables in your R script and then evaluate this code in your R console:
Hint: I suggest you always write your code in a script in RStudio (click File -> New file -> R script). You can execute the line (or block) of code under the cursor in the script window by pressing CTRL+Enter (on Windows or Linux) or CMD+Enter (on a Mac). For quick tests, feel free to type directly into the R console.
w1 <- 3.14
x1 <- 42
y1 <- "hello"
z1 <- TRUE
w1
[1] 3.14
x1
[1] 42
y1
[1] "hello"
z1
[1] TRUE
What are the data “types” you get when you apply the function mode() on each of these variables?
You can test whether or not a specific variable is of a specific type using functions such as is.numeric(), is.integer(), is.character(), and is.logical(). See what results you get when you apply these functions to the four variables w1, x1, y1, and z1. Pay close attention to the difference (or lack thereof?) between applying is.numeric() and is.integer() to variables containing “numbers”.
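For instance, here is a minimal check you might run (the expected results are noted in the comments):
mode(w1)       # "numeric"
is.numeric(x1) # TRUE
is.integer(x1) # FALSE -- a plain 42 is stored as a double; only 42L would be an integer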
Note: This might seem incredibly boring and useless, but trust me. In your real data, you will be dealing with data frames (discussed below) with thousands of rows, sometimes millions. Being able to make sure that the values you get in your data-frame columns are of the expected type is something you will be doing often.
To summarize (and oversimplify a little bit), R allows variables to hold several types of data, most importantly:
- integers (such as 42L; note the L suffix)
- numerics (such as 42.13)
- characters (such as "text value")
- logicals (TRUE or FALSE)
We will also encounter two types of “non-values”. We will not be discussing them in detail here, but they will be relevant later. For the time being, just remember that there are also:
- undefined values, represented by NULL
- missing values, represented by NA
What do you think is the practical difference between NULL and NA? In other words, when you encounter one or the other in the data, how would you interpret it?
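One way to build some intuition (a small sketch; the expected behavior is noted in the comments):
length(NULL)  # 0 -- NULL is the complete absence of a value
length(NA)    # 1 -- NA is a value that exists but is unknown
c(1, NULL, 3) # 1 3 -- NULL simply disappears from a vector
c(1, NA, 3)   # 1 NA 3 -- NA keeps its place as a missing data point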
Part 2: Vectors
Vectors are, roughly speaking, collections of values. We could also say “a list”, but that’s not entirely precise in the context of R as we’ll see below.
We can create a vector by calling the c() function (“c” stands for “concatenate”, or “joining together”). Create the following variables containing these vectors. Then inspect their data types using mode() again.
w2 <- c(1.0, 2.72, 3.14)
x2 <- c(1, 13, 42)
y2 <- c("hello", "folks", "!")
z2 <- c(TRUE, FALSE)
w2
[1] 1.00 2.72 3.14
x2
[1] 1 13 42
y2
[1] "hello" "folks" "!"
z2
[1] TRUE FALSE
We can use the function is.vector() to test that a given object really is a vector. Try this on your vector variables.
What happens when you call is.vector() on the variables x1, y1, etc. from the previous part (i.e., those which contain single values)?
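If you want to check your answer (a quick sketch; results in the comments):
is.vector(w2) # TRUE
is.vector(x1) # TRUE -- in R, a single value is just a vector of length 1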
Do elements of vectors need to be homogeneous (i.e., of the same data type)? Try creating a vector with the values 1, "42", and "hello". Can you do it? What happens when you try? Inspect the result in the R console (take a close look at how the result is printed and at the quotes that you will see), or use the mode() function again.
As you could see, vectors must carry values of just one type. If they are given a mixture, the values are converted by a straightforward cascade of so-called “coercions”. A vector defined with a mixture of different values (i.e., of the four atomic types we discussed in the first part) will be coerced to just one of those types, following certain rules.
Try to figure out some of these coercion rules. Make a couple of vectors with mixed values of different types using the function c(), and observe what type of vector you get in return.
Hint: Try creating a vector which has integers and strings, integers and floats, integers and logicals, floats and logicals, floats and strings, and logicals and strings. Observe the format of the result that you get, and build your intuition by calling mode() on each vector object to verify this.
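A few examples of the cascade you should discover (the resulting modes are in the comments):
mode(c(1L, 2.5))     # "numeric"   -- integer + double  -> double
mode(c(1, TRUE))     # "numeric"   -- logical + number  -> number
mode(c(1, "hello"))  # "character" -- anything + string -> string
mode(c(TRUE, "yes")) # "character"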
Of all this data type exploration, this part is probably the most crucial for any kind of data science work. Why is that? Think about what can happen when someone does manual data entry in Excel.
You can create vectors of consecutive values using several approaches. Try these options (a solution sketch follows the list):
- Create a sequence of values from i to j as i:j. Create a vector of the numbers 1:20.
- Do the same using the function seq(). Read ?seq to find out what parameters you should specify (and how) to get the same result as the i:j shortcut.
- Modify the arguments given to seq() so that you create a vector of numbers from 20 to 1.
- Use the by = argument of seq() to create a vector of only odd values starting from 1.
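One possible solution sketch (there are several equivalent ways to write these):
1:20                           # the shortcut
seq(from = 1, to = 20)         # the same sequence via seq()
seq(from = 20, to = 1)         # counting down
seq(from = 1, to = 20, by = 2) # odd values: 1 3 5 ... 19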
Another very useful built-in helper function (especially when we get to the iteration part below) is seq_along(). What does it give you when you run it on this vector, for instance?
v <- c(1, "42", "hello", 3.1416)
Part 3: Lists
Lists are a little similar to vectors but very different in a couple of important respects. Remember how we tested what happens when we put different types of values in a vector (reminder: vectors must be “homogeneous” in terms of the data types of their elements!)? What happens when you create lists with different types of values using the code in the following chunk? Use mode() on the resulting objects and compare your results to those you got on the “mixed value” vectors above.
w3 <- list(1.0, "2.72", 3.14)
x3 <- list(1, 13, 42, "billion")
y3 <- list("hello", "folks", "!", 123, "wow a number follows again", 42)
z3 <- list(TRUE, FALSE, 13, "string")
Try also a different function called str() (for “structure”) and apply it to one of those lists. Is mode() or str() more useful for inspecting what kind of data is stored in a list? Why? (str() will be very useful when we get to data frames for, spoiler alert, exactly this reason.)
is.list(w3)
[1] TRUE
Use is.vector() and is.list() on one of the lists above (like w3, perhaps). Why do you get the result that you got? Then run both functions on one of the vectors you created above (like w2). What does this mean?
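To check your findings (a quick sketch; the results are in the comments):
is.list(w3)   # TRUE
is.vector(w3) # TRUE -- a plain list is also a (non-atomic) vector
is.list(w2)   # FALSE -- an atomic vector is not a list
is.vector(w2) # TRUE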
Not only can lists contain arbitrary values of mixed types (the atomic data types from Part 1 of this exercise), they can also contain non-atomic data, such as other lists! In fact, you can, in principle, create lists of lists of lists of… lists! Try creating a list() which, in addition to a couple of normal values (numbers, strings, it doesn’t matter), also contains one or two other lists (we call these “nested”). Don’t think about this too much, just create something arbitrary to get a bit of practice. Save this in a variable called weird_list and type it back into your R console, just to see how R presents such data back to you. In the next part, we will learn how to explore this type of data better.
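One arbitrary possibility (just a sketch; any nested structure will do):
weird_list <- list(1, "two", list(3, "four"), list(TRUE, list(5.5)))
weird_list       # plain printing of nested lists gets hard to read quickly
str(weird_list)  # str() shows the nesting much more clearly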
Note: If you are confused (or even annoyed) about why we are even doing this: in the later discussion of data frames and spatial data structures, it will become much clearer why putting lists into other lists enables a whole new level of data science work. Please bear with me for now! This is just laying the groundwork for some very cool things later down the line.
Part 4: Indexing
To extract the i-th element of a vector xyz, we can use the [] operator like this: xyz[i]. For instance, we can take the 13-th element of a vector as xyz[13].
Familiarize yourselves with the [] operator by taking out some specific values from this vector, like the 5-th element.
vector <- c(1, 10, 100, 1000, 10000, 100000, 1000000)
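For instance (expected elements in the comments):
vector[5]       # the 5-th element: 10000
vector[c(1, 7)] # [] also accepts a vector of indices: elements 1 and 7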
Part 5: Data frames
Every scientist works with tables of data, in one way or another. R provides first-class support for working with tables, which are formally called “data frames”. We will spend most of our time in this workshop learning to manipulate, filter, modify, and plot data frames, oftentimes with data that is too big to look at all at once. For simplicity, just to get started and to explain the basic fundamentals, let’s begin with something trivially easy, like this little data frame here:
df <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, 13, 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Now, here’s the first killer bit of information: data frames are normal lists!
is.list(df)
[1] TRUE
How is this even possible? And why is it even the case? Explaining this in full would be too much detail, even for a course which tries to go beyond “R only as a plotting tool”, as I promised you in the introduction. Still, for now let’s say that R objects can store so-called “attributes”, which, in the case of data frame objects, make them behave as “something more than just a list”. These attributes are called “classes”.
You can poke into these internals by “unclassing” an object. Call unclass(df) in your R console and observe the result you get (just to convince yourself that data frames really are lists under the hood).
Note: Honest admission – you will never need this unclass() stuff in practice, ever. I’m really only showing it to demonstrate what a “data frame” actually is at a lower level of R programming. If you’re confused, don’t worry. The fact that data frames are lists matters infinitely more than knowing exactly how that is accomplished inside R.
There are two ways of extracting a data-frame column:
The $ operator (column name as a symbol):
df$w
[1] 1.00 2.72 3.14
The [[ operator (column name as a string):
df[["w"]]
[1] 1.00 2.72 3.14
Note that the operator must be [[, because [ does something slightly different.
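To see the difference between the two bracket operators (a quick check; class() reveals what each one returns):
class(df[["w"]]) # "numeric"    -- [[ extracts the column itself, as a vector
class(df["w"])   # "data.frame" -- [ subsets the table, keeping its shape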
Part 6: Inspecting column types
Let’s go back to our example data frame:
df1 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, 13, 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df1
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Using the function str(), i.e. by calling str(df1), inspect the types of the columns in the table.
str(df1)
'data.frame': 3 obs. of 4 variables:
$ w: num 1 2.72 3.14
$ x: num 1 13 42
$ y: chr "hello" "folks" "!"
$ z: logi TRUE FALSE FALSE
Sometimes (usually when we read data from disk, e.g. a file produced by another software), a data point sneaks in which makes a column unexpectedly non-numeric. Consider this new table called df2:
df2 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, "13", 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df2
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Just by looking at it, this table appears the same as df1 above. Use str() again to see where the problem is.
str(df2)
'data.frame': 3 obs. of 4 variables:
$ w: num 1 2.72 3.14
$ x: chr "1" "13" "42"
$ y: chr "hello" "folks" "!"
$ z: logi TRUE FALSE FALSE
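A common remedy once you spot such a column (a sketch; note that as.numeric() turns any value it cannot convert into NA, with a warning):
df2$x <- as.numeric(df2$x)
str(df2) # the x column is now num again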
Part 7: Functions
The theme of this part could be summarized by an ancient motto of programming, Don’t Repeat Yourself (DRY): “[…] a modification of any single element of a system does not require a change in other logically unrelated elements.”
Let’s say you have the following numeric vector:
vec <- c(-0.32, 0.78, -0.68, 0.28, -1.96, 0.21, -0.07, -1.01, 0.06, 0.74,
         -0.37, -0.6, -0.08, 1.81, -0.65, -1.23, -1.28, 0.11, -1.74, -1.68)
With numeric vectors (which could be base qualities, genotype qualities, \(f\)-statistics, etc.), we often need to compute a couple of summary statistics (mean, median, quartile, minimum, maximum, etc.).
In R, we have a very useful built-in function, summary(), which does exactly that. But let’s ignore it for the moment, for learning purposes.
Here is how you can compute those summary statistics individually:
min(vec)
[1] -1.96
quantile(vec, probs = 0.25)
25%
-1.065
median(vec)
[1] -0.345
mean(vec)
[1] -0.384
quantile(vec, probs = 0.75)
75%
0.135
max(vec)
[1] 1.81
Now, you can imagine having many more such vectors (results for different sequenced samples, different computed population genetic metrics, etc.). Having to type out all of these commands for every single one of those vectors would very quickly get extremely tiresome. Worse still, if we did this, we would certainly resort to copy-pasting, which is guaranteed to lead to errors.
Write a custom function called my_summary which accepts a single input named values and returns a list which binds all six summary statistics together. Name the elements of that list "min", "quartile_1", "median", "mean", "quartile_3", and "max".
my_summary <- function(values) {
  # note that we compute everything from the function's argument `values`,
  # not from any global variable such as `vec`
  a <- min(values)
  b <- quantile(values, probs = 0.25)
  c <- median(values)
  d <- mean(values)
  e <- quantile(values, probs = 0.75)
  f <- max(values)

  result <- list(min = a, quartile_1 = b, median = c,
                 mean = d, quartile_3 = e, max = f)

  return(result)
}
Note that yes, we had to write the code anyway, but it is now “encapsulated” in a fully self-contained form and can be called repeatedly, without any copy-pasting.
In other words, if we imagine having these three vectors of numeric values:
vec1 <- runif(10)
vec2 <- runif(10)
vec3 <- runif(10)
We can now compute our summary statistics for any of them by calling our function my_summary().
For example, applied to our original vector vec:
my_summary(vec)
$min
[1] -1.96
$quartile_1
25%
-1.065
$median
[1] -0.345
$mean
[1] -0.384
$quartile_3
75%
0.135
$max
[1] 1.81
We can do the same for each new vector simply by calling my_summary(vec1), my_summary(vec2), and my_summary(vec3). Because runif() draws random numbers, your exact values for those will differ from anyone else’s (and from run to run).
And, surprise! This is exactly what the incredibly useful built-in function summary(), provided with every R installation, does!
summary(vec1)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.06962 0.23824 0.69048 0.57455 0.88080 0.93092
summary(vec2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.09554 0.21031 0.26917 0.34678 0.47034 0.83012
summary(vec3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.004007 0.187542 0.481866 0.437755 0.630130 0.894811
Part 8: Iteration and loops
Functions help us generalize pieces of code, reducing the amount of code needed to do similar things multiple times. You could think of iteration as generalizing the repetition itself: instead of calling the same code once per object by hand, we let R walk over a collection of objects for us.
That is still quite vague and abstract, so let’s make it concrete with a small example.
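A minimal sketch using the vectors and the my_summary() function from the previous part (lapply() is just one of several ways to iterate in R):
# collect the vectors in a list...
vectors <- list(vec1, vec2, vec3)

# ...and apply our function to each element, getting a list of results back
results <- lapply(vectors, my_summary)

# the same computation written as an explicit for loop
results <- list()
for (i in seq_along(vectors)) {
  results[[i]] <- my_summary(vectors[[i]])
}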
Take home message
It’s worth acknowledging that developing code as proper functions, using iteration, and generally thinking about solving problems using “proper” programming approaches like these is a lot of work. It might seem that “just copying a bit of code a couple of times” is easier. In the moment, it actually is.
Except we rarely, if ever, do things only once. We have to come back to old scripts, update them, or add new steps. Doing this for every single bit of copy-pasted code (sometimes across multiple scripts) is awful, and almost always results in bugs. Missing even one edit in one copy of a piece of code will make our results wrong.
It’s always worth investing a bit of extra time into extracting repeated pieces of code into individual functions. It will always result in less work in the long run.