w1 <- 3.14
x1 <- 42
y1 <- "hello"
z1 <- TRUE
R bootcamp
In this chapter, we will be exploring the basics of the R language. We will focus on topics which are normally taken for granted and never explained in basic data science courses, which generally jump immediately to data manipulation and plotting. I strongly believe that getting familiar with the fundamentals of R as a complete programming language from this “lower-level” perspective, although it might seem a little overwhelming at the beginning, will pay dividends over and over throughout your scientific career.
When we get to data science work in later chapters, you will see that many things which otherwise remain quite obscure and magical boil down to a set of very simple principles and components.
This knowledge will make you much more confident in the results of your work, and will make it much easier to debug issues and problems.
Finally, we call this chapter a “bootcamp” on purpose – we only have a limited amount of time to go through all of these topics. After all, the primary reason for the existence of this course is to make you competent researchers in computational population genomics, so the emphasis will be on practical applications and solving concrete data science issues. That said, if you ever want more information, I encourage you to take a look at the relevant chapters of the Advanced R textbook.
And now, open RStudio, create a new R script (File -> New file -> R Script), save it somewhere on your computer as r-bootcamp.R (File -> Save) and let’s get started!
Exercise 0: Getting help
Before we even get started, there’s one thing you should remember: R (and R packages) have an absolutely stellar documentation and help system. What’s more, this documentation is standardized, always has the same format, and is accessible in the same way. The primary way of interacting with it from inside R (and RStudio) is the ? operator. For instance, to get help about the hist() function (histograms), you can type ?hist in the R console. The documentation then appears in the “Help” pane in your RStudio window.
There are a couple of things to look for:
- At the top of the documentation page, you will always see a brief description of the arguments of each function. This is what you’ll be looking for most of the time (“How do I specify this or that? How do I modify the behavior of the function?”).
- At the bottom of the page are examples. These are small bits of code which often explain the behavior of some functionality in a very helpful way.
Whenever you’re lost or can’t remember some detail about some piece of R functionality, looking up its ? documentation is always very helpful.
As a practice and to build a habit, whenever we introduce a new function in this course, use ?<name of the function> to open its documentation.
Exercise 1: Basic data types
Create the following variables in your R script and then evaluate this code in your R console:
Hint: I suggest you always write your code in a script in RStudio (click File -> New file -> R script). You can execute the line (or block) of code under the cursor in the script window by pressing CTRL+Enter (on Windows or Linux) or CMD+Enter (on a Mac). For quick tests, feel free to type directly in the R console.
The <- operator can be read as “assign the value”, i.e., “assign the value 3.14 to a variable w1”.
w1
[1] 3.14
x1
[1] 42
y1
[1] "hello"
z1
[1] TRUE
What are the data “types” you get when you apply the function typeof() to each of these variables?
You can test whether or not a specific variable is of a specific type using functions such as is.numeric(), is.integer(), is.character(), is.logical(). See what results you get when you apply these functions to these four variables w1, x1, y1, z1. Pay close attention to the difference (or lack thereof?) between applying is.numeric() and is.integer() to variables containing “numbers”.
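As a small illustration of these type checks, here is a sketch using two hypothetical variables, a and b, which are not part of the exercise. It assumes only base R. A number typed without a suffix is stored as a double, while the L suffix requests a true integer:

```r
# Hypothetical example values, not the exercise variables
a <- 42    # a plain number is stored as a double
b <- 42L   # the L suffix creates an integer

typeof(a)      # "double"
typeof(b)      # "integer"

is.numeric(a)  # TRUE
is.numeric(b)  # TRUE: integers also count as numeric
is.integer(a)  # FALSE
is.integer(b)  # TRUE
```

In other words, is.numeric() asks "is this any kind of number?", while is.integer() asks about the specific storage type.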
Note: This might seem incredibly boring and useless but trust me. In your real data, you will see data frames (discussed below) with thousands of rows, sometimes millions. Being able to make sure that the values you get in your data-frame columns are of the expected type is something you will be doing often.
To summarize (and oversimplify a little bit) R allows variables to have several types of data, most importantly:
- integers (such as 42)
- numerics (such as 42.13)
- characters (such as "text value")
- logicals (TRUE or FALSE)
We will also encounter two types of “non-values”. We will not be discussing them in detail here, but they will be relevant later. For the time being, just remember that there are also:
- undefined values represented by NULL
- missing values represented by NA
What do you think is the practical difference between NULL and NA? In other words, when you encounter one or the other in the data, how would you interpret this?
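One way to probe this question yourself is to see how each of the two behaves inside a vector. The following is a minimal sketch with hypothetical variable names, using only base R functions:

```r
# NA is a placeholder for one missing value; it still occupies a position
with_na <- c(1, NA, 3)
length(with_na)    # 3
is.na(with_na)     # FALSE TRUE FALSE

# NULL means "nothing at all"; it disappears when combined with other values
with_null <- c(1, NULL, 3)
length(with_null)  # 2
is.null(NULL)      # TRUE
length(NULL)       # 0
```

A rough way to read this: NA says "there should be a value here, but we don't know it", whereas NULL says "there is no value here at all".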
Exercise 2: Vectors
Vectors are, roughly speaking, collections of values. We create a vector by calling the c() function (the “c” stands for “concatenate”, or “joining together”).
Create the following variables containing these vectors. Then inspect their data types using typeof() again.
w2 <- c(1.0, 2.72, 3.14)
x2 <- c(1, 13, 42)
y2 <- c("hello", "folks", "!")
z2 <- c(TRUE, FALSE)
We can use the function is.vector() to test that a given object really is a vector. Try this on your vector variables.
What happens when you call is.vector() on the variables x1, y1, etc. from the previous Exercise (i.e., those which contain single values)?
Do elements of vectors need to be homogeneous (i.e., of the same data type)? Try creating a vector with values 1, "42", and "hello" using the c() function again. Can you do it? What happens when you try? Inspect the result in the R console (take a close look at how the result is presented in text and the quotes that you will see), or use the typeof() function again.
If vectors are not created with values of the same type, they are converted by a cascade of so-called “coercions”. A vector defined with a mixture of different values (i.e., the four atomic types we discussed in the first Exercise) will be coerced to be only one of those types, given certain rules.
Try to figure out some of these coercion rules. Make a couple of vectors with mixed values of different types using the function c(), and observe what type of vector you get in return.
Hint: Try creating a vector which has integers and strings, integers and floats, integers and logicals, floats and logicals, floats and strings, and logicals and strings. Observe the format of the result that you get, and build your intuition by calling typeof() on each vector object to verify this.
Out of all these data type explorations, this Exercise is probably the most crucial for any kind of data science work. Why is that? Think about what can happen when someone does manual data entry in Excel.
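To make the manual data entry scenario concrete, here is a small sketch. The measurements vector and its typo are made up for illustration. A single mistyped entry is enough to turn a whole column of numbers into text:

```r
# Hypothetical measurements where one value was mistyped
# (the letter "O" instead of the digit zero)
measurements <- c(12.5, 13.1, "1O.2")
typeof(measurements)   # "character": every value was coerced to text

# Converting back to numbers turns the unparseable entry into NA
# (R also emits a warning, suppressed here to keep the output quiet)
fixed <- suppressWarnings(as.numeric(measurements))
fixed                  # 12.5 13.1   NA
is.na(fixed)           # FALSE FALSE  TRUE
```

Checking the type of a column first, and only then converting it, is a cheap way to catch this kind of problem early.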
You can create vectors of consecutive values of certain forms using several approaches. Try these options:
- Create a sequence of values from i to j as i:j. Create a vector of numbers 1:20.
- Do the same using the function seq(). Read ?seq to find out what parameters you should specify (and how) to get the same result as the i:j shortcut.
- Modify the arguments given to seq() so that you create a vector of numbers from 20 to 1.
- Use the by = argument of seq() to create a vector of only odd values starting from 1.
Another very useful built-in helper function (especially when we get to the iteration Exercise below) is seq_along(). What does it give you when you run it on this vector, for instance?
v <- c(1, "42", "hello", 3.1416)
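As a hedged aside on why seq_along() is often preferred over writing 1:length(x) by hand, here is a sketch on a hypothetical samples vector (not the vector v above). The two agree for ordinary vectors and differ for empty ones:

```r
# Hypothetical sample names
samples <- c("s1", "s2", "s3")

seq_along(samples)   # 1 2 3
1:length(samples)    # 1 2 3, the same here

# The two differ for an empty vector
empty <- character(0)
seq_along(empty)     # integer(0): a loop over this runs zero times
1:length(empty)      # 1 0: a loop over this would run twice
```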
Exercise 3: Lists
Lists are a little similar to vectors but very different in a couple of important respects. Remember how we tested what happens when we put different types of values in a vector (reminder: vectors must be “homogeneous” in terms of the data types of their elements!)? What happens when you create lists with different types of values using the code in the following chunk? Use typeof() on the resulting objects and compare your results to those you got on “mixed value” vectors above.
w3 <- list(1.0, "2.72", 3.14)
x3 <- list(1, 13, 42, "billion")
y3 <- list("hello", "folks", "!", 123, "wow a number follows again", 42)
z3 <- list(TRUE, FALSE, 13, "string")
Try also a different function called str() (for “structure”) and apply it to one of those lists. Is typeof() or str() more useful to inspect what kind of data is stored in a list? Why? (str() will be very useful when we get to data frames for – spoiler alert! – exactly this reason.)
is.list(w3)
[1] TRUE
Use is.vector() and is.list() on one of the lists above (like w3 perhaps). Why do you get the result that you got? Then run both functions on one of the vectors you created above (like w2). What does this mean?
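The following sketch puts these inspection functions side by side. The object names atomic_vec and mixed_list are hypothetical, and is.atomic() is an extra base R function not mentioned above, included here only because it makes the distinction explicit:

```r
# Hypothetical objects for illustration
atomic_vec <- c(1, 13, 42)
mixed_list <- list(1, "two", TRUE)

typeof(atomic_vec)     # "double"
typeof(mixed_list)     # "list": says nothing about the elements inside

is.vector(atomic_vec)  # TRUE
is.list(atomic_vec)    # FALSE
is.vector(mixed_list)  # TRUE: a list is a vector, too
is.list(mixed_list)    # TRUE

is.atomic(atomic_vec)  # TRUE: all elements share one type
is.atomic(mixed_list)  # FALSE

# str() prints one line per element, including each element's type
str(mixed_list)
```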
Not only can lists contain arbitrary values of mixed types (the atomic data types from Exercise 1), they can also contain “non-atomic” data, such as other lists! In fact, you can, in principle, create lists of lists of lists of… lists!
Try creating a list() which, in addition to a couple of normal values (numbers, strings, doesn’t matter) also contains one or two other lists (we call them “nested”). Don’t think about this too much, just create something arbitrary to get a bit of practice. Save this in a variable called weird_list and type it back in your R console, just to see how R presents such data back to you. In the next Exercise, we will learn how to explore this type of data better.
Note: If you are confused (or even annoyed) about why we are even doing this, in the later discussion of data frames and spatial data structures, it will become much clearer why putting lists into other lists enables a whole other level of data science work. Please bear with me for now! This is just laying the groundwork for some very cool things later down the line.
Exercise 4: Boolean expressions and conditionals
Let’s recap some basic Boolean algebra in logic. The following basic rules apply (take a look at the truth table for a bit of a high school refresher) for the “and”, “or”, and “negation” operations:
- The AND operator (represented by & in R, or often ∧ in math): both conditions must be TRUE for the expression to be TRUE.
  TRUE & TRUE == TRUE
  TRUE & FALSE == FALSE
  FALSE & TRUE == FALSE
  FALSE & FALSE == FALSE
- The OR operator (represented by | in R, or often ∨ in math): at least one condition must be TRUE for the expression to be TRUE.
  TRUE | TRUE == TRUE
  TRUE | FALSE == TRUE
  FALSE | TRUE == TRUE
  FALSE | FALSE == FALSE
- The NOT operator (represented by ! in R, or often ¬ in math): the opposite of the expression.
  !TRUE == FALSE
  !FALSE == TRUE
- Comparison operators == (“equal to”), != (“not equal to”), < / > (“less / greater than”), and <= / >= (“less / greater than or equal to”): comparing two things with any of these results in TRUE or FALSE.
Note: There are other operations and more complex rules, but we will be using these three exclusively (plus, the more complex rules can be derived using these basic operations anyway).
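The rules above can be sketched directly in R. The vectors a and b below are hypothetical and simply enumerate all four combinations of two logical values. One caveat worth knowing: in actual R code, comparison operators are evaluated before & and |, so the truth-table lines above are best read as notation, and parentheses make the intended grouping explicit:

```r
# All four combinations of two logical values, as vectors
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

a & b   # TRUE FALSE FALSE FALSE
a | b   # TRUE  TRUE  TRUE FALSE
!a      # FALSE FALSE  TRUE  TRUE

# Comparisons produce logical values
3 < 5              # TRUE
"anc" == "mod"     # FALSE
c(1, 5, 10) >= 5   # FALSE TRUE TRUE

# Parentheses make the grouping explicit
(TRUE & FALSE) == FALSE   # TRUE
```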
Create two logical vectors with three elements each using the c() function (pick random TRUE and FALSE values for each of them), and store them in variables named A and B. What happens when you do A & B, A | B, and !A or !B?
What happens when you apply base R functions all() and any() on your A and B (or !A and !B) vectors? Remember those because they are very useful!
If this all feels too technical and mathematical, you’re kind of correct. That said, when you do data science, you will be using these logical expressions literally every single day. Think about a table which has a column with some values, like sequencing coverage. Every time you filter for samples with, for instance, coverage > 10, you’re performing this exact operation! You essentially ask, for each sample (each value in the column), which samples have coverage > 10 (giving you TRUE) and which do not (giving you FALSE). Filtering data is, in essence, about applying logical operations on vectors of TRUE and FALSE values (which boils down to “logical indexing” introduced below), even though those logical values rarely feature as data in the tables we generally work with. Keep this in mind!
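Here is a minimal sketch of that idea on a made-up depth vector (hypothetical values, not the exercise data that follows). It peeks ahead slightly by using the comparison result to pick out values:

```r
# Hypothetical read depths for four samples
depth <- c(3, 12, 25, 8)

keep <- depth > 10   # one TRUE/FALSE per sample
keep                 # FALSE TRUE TRUE FALSE

depth[keep]          # 12 25: only values where keep is TRUE
sum(keep)            # 2: TRUE counts as 1, FALSE as 0
```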
Consider the following vectors of sample coverages and origins (let’s imagine these are columns in a table you got from a bioinformatics lab) and copy them into your R script:
coverage <- c(15.09, 48.85, 36.5, 1.12, 16.65, 0.79, 16.9, 46.09, 12.76, 11.51)
origin <- c("mod", "mod", "mod", "anc", "mod", "anc", "mod", "mod", "mod", "mod")
Create a variable is_high which will contain a TRUE / FALSE vector indicating whether a given coverage value is higher than 5. Then create a variable is_modern which will contain another logical vector indicating whether a given sample is "mod" (i.e., “modern”).
Use the AND operator & to test if every high coverage sample (is_high) is also a modern sample (is_modern).
Hint: Use the == operator in combination with the all() function.
Exercise 5: Indexing into vectors and lists
To extract specific elements of a vector or a list (or to assign values to given positions), we use a so-called “indexing” operation. Generally speaking, we can do indexing in three ways:
- numerical-based indexing (by specifying a set of integer numbers),
- logical-based indexing (by specifying a vector of TRUE / FALSE values of the same length as the vector we’re indexing into),
- name-based indexing (by specifying names of elements to index).
Let’s practice those for vectors and lists separately. Later, when we introduce data frames, we will return to the topic of indexing again.
Vectors
1. Numerical-based indexing
To extract an i-th element of a vector xyz, we can use the [] operator like this: xyz[i]. For instance, we can take the 13-th element of this vector as xyz[13].
Familiarize yourselves with the [] operator by taking out some specific values from this vector, let’s say its 5-th element.
v <- c("hi", "folks", "what's", "up", "folks")
The [] operator is “vectorized”, meaning that it can actually accept multiple values given as a vector themselves (i.e., something like v[c(1, 3, 4)] will extract the first, third, and fourth element of the vector v).
In this way, extract the first and fifth element of the vector v. What happens if you try to extract a tenth element from v?
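A short sketch of numerical indexing on a hypothetical three-element vector x (deliberately not the exercise vector v):

```r
x <- c("a", "b", "c")   # hypothetical vector

x[2]           # "b"
x[c(1, 3)]     # "a" "c"
x[c(3, 3, 1)]  # "c" "c" "a": indices can repeat and come in any order
x[10]          # NA: indexing past the end gives a missing value, not an error
```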
2. Logical-based indexing
Rather than giving the [] operator a specific set of integer numbers, we can provide a vector of TRUE / FALSE values which specify which elements of the input vector we want to “extract”. Note that this TRUE / FALSE indexing vector must have the same length as our original vector!
Create a variable containing a vector of five TRUE or FALSE values (i.e., with something like index <- c(TRUE, FALSE, ...) but with five TRUE or FALSE values in total), and use that index variable in a v[index] indexing operation.
Usually we never want to create this “indexing vector” manually (imagine doing this for a vector of a million values – impossible!). Instead, we create this indexing vector “programmatically”, based on a certain condition, like this:
index <- v == "up"
This checks which values of v are equal to "up", creating a logical TRUE / FALSE vector in the process and storing it in the variable index:
index
[1] FALSE FALSE FALSE TRUE FALSE
Use the same principle to extract the elements of the vector matching the value “folks”.
Another very useful operator is %in%, which tests which elements of a vector are among the elements of another vector. This is an extremely useful operation which you will be doing all the time when doing data analysis. It’s good for you to get familiar with it.
For instance, if we take this vector again:
v <- c("hi", "folks", "what's", "up", "folks", "I", "hope", "you",
       "aren't", "(too)", "bored")
We can then ask, for instance, “which elements of v are among a set of given values?”:
v %in% c("folks", "up", "bored")
[1] FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
With our example vector v, it’s very easy to see this at a glance, of course. But when working with real-world data, we often operate on tables with thousands or even millions of rows.
Let’s imagine we don’t need to test whether the elements of a given vector are part of a set of pre-defined values, but we want to ask the opposite question: “are any of my values of interest in my data?” Let’s say your values of interest are values <- c("hope", "(too)", "blah blah") and your whole data is again v. How would you use %in% to get a TRUE or FALSE value for each of your values?
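The key point is that the direction of %in% matters. Here is a sketch on made-up identifiers (the names available and wanted are hypothetical):

```r
# Hypothetical data: IDs present in a table, and IDs we are interested in
available <- c("ind1", "ind2", "ind3", "ind4")
wanted    <- c("ind2", "ind9")

# The result always has the length of the left-hand side
available %in% wanted   # FALSE TRUE FALSE FALSE
wanted %in% available   # TRUE FALSE

# Combined with logical indexing: keep only the wanted IDs that exist
wanted[wanted %in% available]   # "ind2"
```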
Lists
This section will be a repetition of the previous exercises about vectors. Don’t forget – lists are just vectors, except that they can contain values of heterogeneous types (numbers, characters, anything). As a result, everything that applies to vectors above applies also here.
But practice makes perfect, so let’s go through a couple of examples anyway:
l <- list("hello", "folks", "!", 123, "wow a number follows again", 42)
l
[[1]]
[1] "hello"
[[2]]
[1] "folks"
[[3]]
[1] "!"
[[4]]
[1] 123
[[5]]
[1] "wow a number follows again"
[[6]]
[1] 42
1. Numerical-based indexing
The same applies to numerical-based indexing as what we’ve shown for vectors.
Extract the second and fourth elements from l.
2. Logical-based indexing
Similarly, you can do the same with TRUE / FALSE indexing vectors for lists as what we did with normal (single-type) vectors. Rather than go through variations of the same exercises, let’s introduce another very useful pattern related to logical-based indexing, and that’s removing invalid elements.
Consider this vector:
v <- c("hello", "folks", "!", NA, "wow another NAs are coming", NA, NA, "42")
v
[1] "hello" "folks"
[3] "!" NA
[5] "wow another NAs are coming" NA
[7] NA "42"
Notice the NA values. One operation we have to do very often (particularly in data frames, whose columns are vectors as we will see below!) is to remove those invalid elements, using the function is.na().
This function returns a TRUE / FALSE vector which, as you now already know, can be used for logical-based indexing!
is.na(v)
[1] FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE
A very useful trick in programming is negation (using the ! operator), which flips the TRUE / FALSE states. In other words, prefixing with ! returns a vector saying which elements of the input vector are not NA:
!is.na(v)
[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
Use is.na(v) and the negation operator ! to remove the NA elements of the vector v!
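The same pattern on a hypothetical numeric vector x shows why dropping missing values matters in practice: many summary functions return NA as soon as a single NA is present.

```r
x <- c(1.5, NA, 3, NA)   # hypothetical numeric vector

keep <- !is.na(x)        # TRUE FALSE TRUE FALSE
x[keep]                  # 1.5 3.0

mean(x)                  # NA: missing values propagate
mean(x[keep])            # 2.25
```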
[] vs [[]] operators
Let’s move to a more interesting topic. There’s another operator useful for lists, and that’s [[ ]] (not [ ]!). Extract the third element of the list l using l[3] and l[[3]]. What’s the difference between the results? If you’re unsure, use the mode() function on l[3] and l[[3]] to help you.
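A sketch of the distinction on a small hypothetical list lst, using typeof() rather than mode() (either works for this purpose):

```r
lst <- list("hello", 42)   # hypothetical list

lst[2]          # a list of length 1 that contains 42
lst[[2]]        # the number 42 itself

typeof(lst[2])    # "list"
typeof(lst[[2]])  # "double"

lst[[2]] + 1      # 43
# lst[2] + 1 would fail, because arithmetic is not defined for lists
```

A common mental picture: [ ] returns a smaller box of the same kind, while [[ ]] opens the box and hands you its contents.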
Traversing nested lists
Remember our nested list from earlier? Here it is again:
weird_list <- list(
  1,
  "two",
  list(
    "three",
    4,
    list(
      5, "six",
      list("seven", 8)
    )
  )
)
What do you get when you run weird_list[[1]]? How about weird_list[[3]]? And how about weird_list[[3]][[2]]? Things are getting a little complicated (or interesting, depending on how nerdy you are :)).
What’s the sequence of this “chaining” of indexing operators to extract the number 8?
Hint: You can leverage the interactive nature of evaluating intermediate results in the R console, adding things to the expression (i.e., a chunk of code) in sequence.
Whew! That last exercise was something, wasn’t it? Kind of annoying, if you ask me.
Luckily, you will not have to do this kind of complex shenanigans in R very often (maybe even never). Still, nested lists are sometimes used in capturing more complex types of data than just lists of numbers or tables (for instance, nested lists capture tree-like structures). In any case, using names instead of just integers as indices makes the whole process much easier, as we will see below.
In data you encounter in practice, the most extreme case of data indexing you will have to do probably won’t be more complex than two nested indexing operators in a row (i.e., the equivalent of doing data[[2]][[3]]).
Particularly when we discuss some very convenient tidyverse operations later, having an idea about what a nested list even is will be very useful, so bear with me please!
Named indexing for vectors and lists
Here’s a neat thing we can do with vectors and lists. They don’t have to contain just values themselves (which can be then extracted using integer or logical indices as we’ve done above), but those values can be assigned names too.
Consider this vector and list:
v <- c(1, 2, 3, 4, 5)
v
[1] 1 2 3 4 5
l <- list(1, 2, 3, 4, 5)
l
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
[[4]]
[1] 4
[[5]]
[1] 5
As a recap, we can index into them in the usual manner like this:
v[c(1, 3, 5)]
[1] 1 3 5
l[c(1, 3)]
[[1]]
[1] 1
[[2]]
[1] 3
But we can also name the values like this (note that the names appear in the print out you get from R in the console):
v <- c(one = 1, two = 2, three = 3, four = 4, five = 5)
v
  one   two three  four  five
    1     2     3     4     5
l <- list(one = 1, two = 2, three = 3, four = 4, five = 5)
l
$one
[1] 1
$two
[1] 2
$three
[1] 3
$four
[1] 4
$five
[1] 5
When you have a named data structure like this, you can index into it using those names as well, which can be very convenient. Imagine having data described not by indices but by actually readable names (such as names of people, or excavation sites!):
l[["three"]]
[1] 3
l[c("two", "five")]
$two
[1] 2
$five
[1] 5
Note: This is exactly what data frames are, under the hood (named lists!), as we’ll see in the next section.
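Named lists can also be indexed with the $ operator, which appears in the printed output above. As a minimal sketch (with shortened versions of the vector and list above, so the values here are assumptions for illustration only):

```r
l <- list(one = 1, two = 2, three = 3)
v <- c(one = 1, two = 2, three = 3)

names(l)        # "one" "two" "three"

# For lists, $ is shorthand for [[ ]] with a name
l$three         # 3
l[["three"]]    # 3

# Atomic vectors support names with [ ], but not $
v["two"]        # a named element: two = 2
v[["two"]]      # 2, with the name dropped
```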
Let’s return (one last time, I promise!) to our nested list example, this time presenting it in a more convenient way.
weird_list <- list(
  1,
  "two",
  nested1 = list(
    "three",
    4,
    nested2 = list(
      5, "six",
      nested3 = list("seven", 8)
    )
  )
)
With a list like that, where we previously had to extract the element 8 like this:
weird_list[[3]][[3]][[3]][[2]]
[1] 8
we can now do this:
weird_list[["nested1"]][["nested2"]][["nested3"]][[2]]
[1] 8
Much more readable!
Negative indexing
Consider this vector again:
v <- c("hi", "folks", "what's", "up", "folks")
What happens when you index into v using the [] operator but give it a negative number between -1 and -5?
A very useful function is length(), which gives the length of a given vector (or a list – remember, lists are vectors!). Use it to remove the last element of v. How would you remove both the first and last element of a vector or a list (assuming you don’t know the length beforehand, i.e., you can’t put a fixed number as the index of the last element)?
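If you get stuck, here is a sketch of negative indexing on a different, hypothetical vector x. Try the exercise on v yourself first:

```r
x <- c(10, 20, 30, 40)   # hypothetical vector

x[-1]                    # 20 30 40: everything except the first element
x[-length(x)]            # 10 20 30: everything except the last element
x[-c(1, length(x))]      # 20 30: drop both ends
```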
Exercise 6: Data frames
Every scientist works with tables of data, in one way or another. R provides first class support for working with tables, which are formally called “data frames”. We will be spending most of our time in this workshop learning to manipulate, filter, modify, and plot data frames, oftentimes with data that is too big to look at all at once. For simplicity, just to get started and to explain the basic fundamentals, let’s begin with something trivially easy, like this little data frame here:
df <- data.frame(
  v = c("one", "two", "three", "four", "five"),
  w = c(1.0, 2.72, 3.14, 1000.1, 1e6),
  x = c(1, 13, 42, NA, NA),
  y = c("folks", "hello", "from", "data frame", "!"),
  z = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)
df
      v          w  x          y     z
1   one       1.00  1      folks  TRUE
2   two       2.72 13      hello FALSE
3 three       3.14 42       from FALSE
4  four    1000.10 NA data frame  TRUE
5  five 1000000.00 NA          !  TRUE
First, here’s the first killer bit of information: data frames are normal lists!
is.list(df)
[1] TRUE
How is this even possible? And why is this even the case? Explaining this in full would be too much detail, even for a course which tries to go beyond “R only as a plotting tool” as I promised you in the introduction. Still, for now let’s say that R objects can store so-called “attributes”, which – in the case of data frame objects – make them behave as “something more than just a list”. The attribute that does this is called the “class”.
You can poke into these internals by “unclassing” an object. Call unclass(df) in your R console and observe what result you get (just to convince yourself that data frames really are lists under the hood).
Honest admission – you will never need this unclass() stuff in practice, ever. I’m only showing it to demonstrate what a “data frame” actually is on a lower level of R programming. If you’re confused, don’t worry. The fact that data frames are lists matters infinitely more than knowing exactly how that is accomplished inside R.
Remember how we talked about “named lists” in the previous section? Yes, data frames really are just normal named lists with an extra bit of behavior added to them (namely the fact that these lists are printed in a nice, readable, tabular form).
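A few base R functions make the "data frame is a list of columns" idea visible. The tiny table d below is hypothetical, used only to keep the output short:

```r
d <- data.frame(id = c("a", "b"), depth = c(10, 20))  # hypothetical table

class(d)     # "data.frame"
typeof(d)    # "list"
length(d)    # 2: the number of columns, i.e. list elements
names(d)     # "id" "depth"

# Each element of the underlying list is one column vector
plain <- unclass(d)
is.data.frame(plain)   # FALSE
plain$depth            # 10 20
```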
Selecting columns
Quite often we need to extract values of an entire column of a data frame. In the Exercise about indexing, you have learned about the [] operator (for vectors and lists), and also about the $ and [[]] operators (for lists). Now that you’ve learned that data frames are (on a lower level) just lists, what does this mean for extracting a column from a data frame?
Try to use the three indexing options to extract the column named "z" from your data frame df. How do the results differ depending on the indexing method chosen? Is the indexing (and its result) different from indexing a plain list?
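For comparison after you have tried it on df, here is a sketch of the three options on a hypothetical two-column table d:

```r
d <- data.frame(id = c("a", "b"), ok = c(TRUE, FALSE))  # hypothetical table

d$ok        # TRUE FALSE, a plain logical vector
d[["ok"]]   # the same vector
d["ok"]     # still a data frame, with a single column

is.data.frame(d["ok"])      # TRUE
is.logical(d[["ok"]])       # TRUE
identical(d$ok, d[["ok"]])  # TRUE
```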
The tidyverse approach
In the chapter on tidyverse, we will learn much more powerful and easier tools to do these types of data-frame operations, particularly the select() function. That said, even when you use tidyverse exclusively, you will still encounter code in the wild which uses this base R way of doing things. Additionally, for certain trivial actions, doing “the base R thing” is just quicker to type. This is why knowing the basics of $, [], and [[]] will always be useful.
Selecting rows (“filtering”)
Of course, we often need to refer not just to specific columns of data frames, but also to given rows. Let’s consider our data frame again:
df
      v          w  x          y     z
1   one       1.00  1      folks  TRUE
2   two       2.72 13      hello FALSE
3 three       3.14 42       from FALSE
4  four    1000.10 NA data frame  TRUE
5  five 1000000.00 NA          !  TRUE
In the section on indexing into vectors and lists above, we learned primarily about two means of indexing into vectors. Let’s revisit them in the context of data frames:
- Integer-based indexing
What happens when you use the [1:3] index into the df data frame, just as you would do when extracting the first three elements of a vector?
When indexing into a data frame, you need to distinguish the dimension along which you’re indexing: either a row, or a column dimension. Just like in referring to a cell coordinate in Excel, for example.
The way you do this for data frames in R is to separate the dimensions into which you’re indexing with a comma in this way: [row-index, column-name-or-index].
Try to extract the first three rows (1:3) of the data frame df by df[1:3, ]. Note the empty space after the comma! Then select a subset of the df data frame to only show rows #1 and #4 for columns "x" and "z".
- Logical-based indexing
Similarly to indexing into vectors, you can also specify which rows should be extracted “at once”, using a single logical vector (you can also do this for columns but I honestly don’t remember the last time I had to do this).
The most frequent use for this is to select all rows of a data frame for which a given column (or multiple columns) carry a certain value. Select only those rows for which the column “y” has a value “hello”:
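A minimal sketch of how such a row index can be built, assuming the df defined at the start of this exercise and using the name row_indices that the following lines rely on:

```r
# df as defined at the start of this exercise
df <- data.frame(
  v = c("one", "two", "three", "four", "five"),
  w = c(1.0, 2.72, 3.14, 1000.1, 1e6),
  x = c(1, 13, 42, NA, NA),
  y = c("folks", "hello", "from", "data frame", "!"),
  z = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)

# One TRUE/FALSE value per row: is the y value equal to "hello"?
row_indices <- df$y == "hello"
row_indices   # FALSE TRUE FALSE FALSE FALSE
```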
Now we can use this vector as a row index into our data frame (don’t forget the comma, without which you’d be indexing into the column-dimension, not the row-dimension!).
df[row_indices, ]
v w x y z
2 two 2.72 13 hello FALSE
Of course, you can also filter (remember this word) for a subset of rows and, at the same time, select (remember this word too) a subset of columns:
df[row_indices, c("v", "y", "z")]
v y z
2 two hello FALSE
Now instead of filtering rows where column y matches “hello”, filter for rows where the w column is less than 1000.
Remember how we used to filter out elements of a vector using the !is.na(...) operation? You can see that df contains some NA values in the x column. Use the fact that you can filter rows of a data frame using logical-based vectors (as demonstrated above) to filter out rows of df at which the x column contains NA values.
Hint: You can get indices of the rows of df you want to retain with !is.na(df$x).
Creating (and deleting) columns
The $ and [] operators can be used to create new columns. For instance, the paste() function in R can be used to combine a pair of values into one. Try running paste(df$v, df$y) to see what the result of this operation is.
The general pattern to do this is:
df$<name of the new column> <- <vector of values to assign to it>
Create a new column called "new_col" and assign to it the result of paste(df$v, df$y).
When we want to remove a column from a data frame (for instance, we only used it to store some temporary result in a script), we actually do the same thing, except we assign to it the value NULL.
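The full create-then-remove cycle, sketched on a hypothetical table d with a made-up column name label (so it does not give away the new_col exercise verbatim):

```r
d <- data.frame(a = c("x", "y"), b = c(1, 2))  # hypothetical table

d$label <- paste(d$a, d$b)   # new column built from two existing ones
d$label                      # "x 1" "y 2"
names(d)                     # "a" "b" "label"

d$label <- NULL              # assigning NULL removes the column
names(d)                     # "a" "b"
```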
Remove the column new_col
“Improper” column names
Most column names you will be using in your own scripts will (well, should!) follow the same rules as apply for variable names – they can’t start with a number, have to consist of alphanumeric characters, and can’t contain any other characters except for underscores (and occasionally dots). To quote from the venerable R language reference:
Identifiers consist of a sequence of letters, digits, the period (‘.’) and the underscore. They must not start with a digit or an underscore, or with a period followed by a digit.
For instance, these are examples of proper identifiers which can serve as variable names, column names and (later) function names:
variable1
a_longer_var_42
anotherVariableName
Unfortunately, when you encounter data in the wild, especially in tables you get from other people or download as supplementary information from the internet, they are rarely this perfect. Here’s a little example of such a data frame:
weird_df
  v with spaces          w y with % sign
1           one       1.00         folks
2           two       2.72         hello
3         three       3.14          from
4          four    1000.10    data frame
5          five 1000000.00             !
If you look closely, you see that some columns have spaces " " and also strange characters % which are not allowed. Which of the $, [] and [[]] operators can you use to extract the v with spaces and y with % sign columns as vectors?
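The code that creates weird_df is not shown here, so the following is only a sketch of how a table with such names could be built and accessed. It assumes data.frame() with check.names = FALSE (which keeps names as they are instead of "repairing" them), and stringsAsFactors = FALSE is spelled out so that the example behaves the same on older R versions:

```r
# Hypothetical table with non-syntactic column names
d <- data.frame(
  `v with spaces` = c("one", "two"),
  `y with % sign` = c("folks", "hello"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

names(d)               # "v with spaces" "y with % sign"

d[["v with spaces"]]   # "one" "two"
d$`y with % sign`      # "folks" "hello": backticks quote the unusual name
```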
The tidyverse approach
Similarly to the previous section on column selection, there’s a much more convenient and faster-to-type way of doing filtering, using the tidyverse function filter(). Still, as with the column selection, sometimes doing the quick and easy thing is just more convenient. The bare minimum on filtering rows of data frames introduced in this section will be enough for you, even in the long run!
Exercise 7: Inspecting column types
Let’s go back to our example data frame:
df1 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, 13, 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df1
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Use the function str(), by calling str(df1), to inspect the types of columns in the table.
Sometimes (usually when we read data from disk, like from another software), a data point sneaks in which makes a column apparently non-numeric. Consider this new table called df2:
df2 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, "13", 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)
df2
w x y z
1 1.00 1 hello TRUE
2 2.72 13 folks FALSE
3 3.14 42 ! FALSE
Just by looking at this, the table looks the same as df1 above. Use str() again to see where the problem is.
str(df2)
'data.frame': 3 obs. of 4 variables:
$ w: num 1 2.72 3.14
$ x: chr "1" "13" "42"
$ y: chr "hello" "folks" "!"
$ z: logi TRUE FALSE FALSE
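The str() output shows that a single quoted value was enough to turn the whole x column into character strings. The sketch below shows why this happens and one common way to repair it, assuming every value in the column really is meant to be a number.

```r
# c() can only hold one type, so a single string turns every element into a string
class(c(1, "13", 42))

df2 <- data.frame(
  w = c(1.0, 2.72, 3.14),
  x = c(1, "13", 42),
  y = c("hello", "folks", "!"),
  z = c(TRUE, FALSE, FALSE)
)

# convert the column back to numbers; any value that does not look like a
# number would become NA (with a warning)
df2$x <- as.numeric(df2$x)

str(df2)
```

If as.numeric() produces NA values, that is a useful signal that the column contains something other than numbers and needs a closer look.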
Exercise 8: Functions
The motivation for this exercise can be summarized by an ancient motto of programming, Don’t repeat yourself (DRY): “[…] a modification of any single element of a system does not require a change in other logically unrelated elements.”
Let’s demonstrate this idea in practice.
Let’s say you have the following numeric vector (these could be base qualities, genotype qualities, \(f\)-statistics, sequencing coverage, anything):
vec <- c(0.32, 0.78, 0.68, 0.28, 1.96, 0.21, 0.07, 1.01, 0.06, 0.74,
         0.37, 0.6, 0.08, 1.81, 0.65, 1.23, 1.28, 0.11, 1.74, 1.68)
With numeric vectors, we often need to compute some summary statistics (mean, median, quartile, minimum, maximum, etc.). What’s more, in a given project, we often have to do this computation multiple times in a number of places.
In R, we have a very useful built-in function, summary(), which does exactly that. But let’s ignore this for the moment, for learning purposes.
Here is how we can compute those summary statistics individually:
min(vec)
[1] 0.06
# first quartile (a value which is higher than the bottom 25% of the data)
quantile(vec, probs = 0.25)
25%
0.2625
median(vec)
[1] 0.665
mean(vec)
[1] 0.783
# third quartile (a value which is higher than the bottom 75% of the data)
quantile(vec, probs = 0.75)
75%
1.2425
max(vec)
[1] 1.96
Now, you can imagine that you have many more such vectors (results for different sequenced samples, different computed population genetic metrics, etc.). Having to type out all of these commands for every single one of those vectors would very quickly get extremely tiresome. Worse still, we would certainly resort to copy-pasting, which is guaranteed to lead to errors.
Write a custom function called my_summary, which accepts a single input named values and returns a list which binds all six summary statistics together. Name the elements of that list "min", "quartile_1", "median", "mean", "quartile_3", and "max".
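One possible solution is sketched below, as an illustration rather than the official answer. The names = FALSE argument to quantile() is my own addition: it only drops the "25%" and "75%" labels so that each list element is a plain number.

```r
# A possible my_summary(): bundle six summary statistics into a named list
my_summary <- function(values) {
  list(
    min        = min(values),
    quartile_1 = quantile(values, probs = 0.25, names = FALSE),
    median     = median(values),
    mean       = mean(values),
    quartile_3 = quantile(values, probs = 0.75, names = FALSE),
    max        = max(values)
  )
}

vec <- c(0.32, 0.78, 0.68, 0.28, 1.96, 0.21, 0.07, 1.01, 0.06, 0.74,
         0.37, 0.6, 0.08, 1.81, 0.65, 1.23, 1.28, 0.11, 1.74, 1.68)

result <- my_summary(vec)

# individual statistics can be pulled out of the list by name
result$median
result[["quartile_3"]]
```

The values should match those computed one by one above (for example, a median of 0.665).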
Yes, we had to write the code anyway, and we even had to do the extra work of wrapping it inside other code (the function body, plus naming the one input argument values; a more complex function could take multiple arguments). So, one could argue that we didn’t actually save any time. However, that code is now “encapsulated” in a fully self-contained form and can be called repeatedly, without any copy-pasting.
In other words, if you now create these three vectors of numeric values:
vec1 <- runif(10)
vec2 <- runif(10)
vec3 <- runif(10)
You can now compute our summary statistics by calling our function my_summary() on these vectors, without any code repetition.
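A minimal sketch of what those calls could look like follows. The my_summary() definition is just one possible implementation of the task above, and the set.seed() call is my own addition so that the random vectors are the same on every run.

```r
# One possible implementation of my_summary() (any version following the
# task description would work the same way here)
my_summary <- function(values) {
  list(
    min        = min(values),
    quartile_1 = quantile(values, probs = 0.25, names = FALSE),
    median     = median(values),
    mean       = mean(values),
    quartile_3 = quantile(values, probs = 0.75, names = FALSE),
    max        = max(values)
  )
}

set.seed(42)  # arbitrary seed, only to make the random numbers reproducible

vec1 <- runif(10)
vec2 <- runif(10)
vec3 <- runif(10)

# one short call per vector instead of six copy-pasted commands each
summary1 <- my_summary(vec1)
summary2 <- my_summary(vec2)
summary3 <- my_summary(vec3)

summary1$mean
```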
The punchline is this: if we ever need to modify how our summary statistics are computed, we only have to make a single change in the function code instead of having to modify multiple copies of the code in multiple locations in our project.
Exercise 9: Base R plotting
As a final short section, it’s worth pointing out some very basic base R plotting functions. We won’t be getting into detail because tidyverse provides a much more powerful and more convenient set of functionality for visualizing data. Still, base R is often convenient for quick troubleshooting or quick plotting of data at least in the initial phases of data exploration.
So far we’ve worked with a really oversimplified data frame. For a more interesting demonstration, R ships with a realistic data frame of penguin data (the penguins dataset, included since R 4.5.0):
head(penguins)
  species    island bill_len bill_dep flipper_len body_mass    sex year
1  Adelie Torgersen     39.1     18.7         181      3750   male 2007
2  Adelie Torgersen     39.5     17.4         186      3800 female 2007
3  Adelie Torgersen     40.3     18.0         195      3250 female 2007
4  Adelie Torgersen       NA       NA          NA        NA   <NA> 2007
5  Adelie Torgersen     36.7     19.3         193      3450 female 2007
6  Adelie Torgersen     39.3     20.6         190      3650   male 2007
Use the function hist() to plot a histogram of the body mass of the entire penguins data set.
The dataset also contains a measure of flipper length. Use the function plot() to draw a scatter plot of two vectors: penguins$flipper_len against penguins$body_mass. Is there an indication of a relationship between the two metrics?
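A minimal sketch of both plots follows, assuming R 4.5.0 or later so that the penguins data frame is available. The axis labels and the cor() call are my own additions: the correlation coefficient simply puts a number on what the scatter plot shows.

```r
# histogram of a single numeric column (missing values are dropped by hist())
hist(penguins$body_mass)

# scatter plot of two numeric columns: x values first, y values second
plot(penguins$flipper_len, penguins$body_mass,
     xlab = "flipper length", ylab = "body mass")

# Pearson correlation; use = "complete.obs" skips rows with missing values
r <- cor(penguins$flipper_len, penguins$body_mass, use = "complete.obs")
r
```

A value of r close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates none.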
Base R plotting is very convenient for quick and dirty data summaries, particularly immediately after reading unknown data. However, for anything more complex (and anything prettier), ggplot2 is unbeatable. We will be looking at ggplot2 visualization in the session on tidyverse but, as a sneak peek, you can take a look at the beautiful figures you’ll learn how to make later.
Let’s revisit the concept of a function introduced earlier and explore it in a much more practically useful setting: defining customized plotting functions. This is, honestly, one of the most useful applications of custom functions in your own research.
In the example above, you used hist(penguins$body_mass) to plot a histogram of the penguins’ body mass. Write a custom function penguin_hist() which, in addition to the data frame itself, accepts two arguments: 1. the name of a column in the penguins data frame, and 2. the number of histogram breakpoints to pass as the breaks = argument in a hist(<vector>, breaks = ...) call.
Hint: Remember that you can extract a column of a data frame as a vector not just using a symbolic identifier with the $ operator but also as a string name with the [[]] operator.
Hint: A useful new argument of the hist() function is main =. You can specify this as a string and the contents of this argument will be plotted as the figure’s title.
If you need further help, feel free to use this template and fill the insides of the function body with your code:
penguin_hist <- function(df, column, breaks) {
  # ... put your hist() code here
}
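For comparison, here is one way the template could be filled in. It is a sketch rather than the official solution: the title text and the xlab argument are my own choices, and the example call assumes R 4.5.0 or later so that penguins is available.

```r
penguin_hist <- function(df, column, breaks) {
  # [[ ]] extracts the column named by the string stored in `column`
  values <- df[[column]]

  # main = sets the figure title, xlab = labels the x axis
  hist(values, breaks = breaks,
       main = paste("Histogram of", column), xlab = column)
}

# the same function now works for any numeric column
penguin_hist(penguins, "body_mass", breaks = 20)
h <- penguin_hist(penguins, "flipper_len", breaks = 10)
```

Note that breaks given as a single number is treated by hist() as a suggestion, so the number of bars drawn may differ slightly from the value you pass.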
If this function stuff appears overwhelming, don’t worry. In the remainder of this workshop you will get a lot of opportunities to practice coding so that it becomes completely trivial at the end.
Exercise 10: Conditional expressions
Oftentimes, especially when writing custom code, we need to make automated decisions whether something should or shouldn’t happen given a certain value. We can do this using the if expression which has the following form:
if (<condition resulting in TRUE or FALSE>) {
  ... code which should be executed if condition is TRUE...
}
An extension of this is the if-else expression, taking the following form:
if (<condition resulting in TRUE or FALSE>) {
  ... code which should be executed if condition is TRUE...
} else {
  ... code which should be executed if condition is FALSE...
}
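To make the two forms concrete, here is a small runnable sketch. The variable names and the numbers (a sequencing coverage value and a minimum threshold) are hypothetical and chosen only for illustration.

```r
coverage  <- 12  # hypothetical coverage of one sample
threshold <- 10  # hypothetical minimum acceptable coverage

# the condition must evaluate to a single TRUE or FALSE
if (coverage >= threshold) {
  verdict <- "sufficient"
} else {
  verdict <- "too low"
}

cat("Coverage is", verdict, "\n")
```

Changing coverage to a value below 10 makes the else branch run instead.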
Go back to your new penguin_hist() function and add an if expression which will make sure that breaks is greater than 0. In other words, if breaks < 1 (the condition you will be testing against), execute the command stop("Incorrect breaks argument given!").
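A possible version of the function with this check added is sketched below. The histogram code itself is just one way of writing penguin_hist(), and the tryCatch() wrapper is my own addition so that the error message can be shown without halting the script. The valid call assumes R 4.5.0 or later for the penguins data.

```r
penguin_hist <- function(df, column, breaks) {
  # validate the input first, before doing any work
  if (breaks < 1) {
    stop("Incorrect breaks argument given!")
  }

  hist(df[[column]], breaks = breaks,
       main = paste("Histogram of", column), xlab = column)
}

# a valid call draws the histogram as before
penguin_hist(penguins, "body_mass", breaks = 20)

# an invalid call triggers stop(); tryCatch() captures the error message
msg <- tryCatch(
  penguin_hist(penguins, "body_mass", breaks = 0),
  error = function(e) conditionMessage(e)
)
msg
```

Without tryCatch(), the invalid call would simply abort with the error, which is exactly what we want in real code.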
This was just a little sneak peek. As you get more comfortable with programming, if and if-else expressions like this will be very useful for making your code more robust. Whenever you’re coding, catching errors as soon as they happen is extremely important!
Exercise 11: Iteration and loops
Functions help us take pieces of code and generalize them, reducing the amount of code needed to do similar things multiple times and avoiding code duplication by copy-pasting (nearly the same) chunks of code over and over. You could think of iteration as generalizing those repetitions even further. Instead of manually calling a bit of code repeatedly, we can let R run that code once for each element of a collection.
In general, there are two types of loops:
1. Loops producing a value for each iteration
The loops in this category which we are going to encounter most often are those of the apply family, like lapply() or sapply(). The general pattern looks like this:
result <- lapply(<vector/list of values>, <function>)
The lapply() and sapply() functions take, at minimum, a vector or a list as their first argument, and then a function which takes a single argument. They apply the given function to each element of the vector/list and return either a list (if we use lapply()) or, whenever the individual results can be simplified to one, a vector (if we use sapply()).
Let’s consider this more concrete example:
input_list <- list("hello", "this", 123, "is", "a mixed", 42, "list")

result_list <- lapply(input_list, is.numeric)

result_list
[[1]]
[1] FALSE
[[2]]
[1] FALSE
[[3]]
[1] TRUE
[[4]]
[1] FALSE
[[5]]
[1] FALSE
[[6]]
[1] TRUE
[[7]]
[1] FALSE
We took an input_list and applied a function to each element, automatically, gathering the results in another list!
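For contrast, the sketch below runs the same check with sapply(), which collapses the results into a single logical vector. That vector can then be used directly for indexing, here to pull out only the numeric elements.

```r
input_list <- list("hello", "this", 123, "is", "a mixed", 42, "list")

# sapply() returns a plain logical vector instead of a list
result_vector <- sapply(input_list, is.numeric)
result_vector
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

# use the logical vector to keep only the numeric elements of the list
numbers <- unlist(input_list[result_vector])
numbers
# [1] 123  42
```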
Create the following function which, for a given number, returns TRUE if this number is even and FALSE if the number is odd. Then use sapply() to test which of the numbers in input_vector are odd or even. Notice that we can do thousands or even millions of operations like this (for very, very long input vectors) with a single sapply() command!
input_vector <- c(23, 11, 8, 36, 47, 6, 66, 94, 20, 2)

is_even <- function(x) {
  x %% 2 == 0 # this answers the question "does x divided by 2 leave a remainder of 0?"
}
As a practice of indexing, use the result of the sapply() call you just got to filter the input_vector values to only odd numbers.
Hint: Remember that you can negate any TRUE / FALSE value (even a whole vector of them) using the ! operator.
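Put together, the two steps could look like the sketch below. It is one possible solution, shown to illustrate how the logical vector from sapply() feeds into indexing.

```r
input_vector <- c(23, 11, 8, 36, 47, 6, 66, 94, 20, 2)

is_even <- function(x) {
  x %% 2 == 0
}

# one TRUE/FALSE per element of input_vector
even <- sapply(input_vector, is_even)
even
# [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE

# negate the logical vector with ! and use it as an index to keep odd numbers
odd_numbers <- input_vector[!even]
odd_numbers
# [1] 23 11 47
```

Because %% and == already work on whole vectors, is_even(input_vector) would give the same result without sapply(); the sapply() form is used here only to practice the pattern.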
These examples are perhaps too boring to show the immediate usefulness of sapply() and lapply() for real-world applications. In later sessions, we will see more complex (and more useful) applications of these functions in daily data analysis tasks.
That said, you can hopefully see how applying an action to many things at once can be a very useful means of saving yourself from typing the same command over and over. Again, consider a situation in which you have thousands or even millions of data points!
2. Loops which don’t necessarily return a value
This category of loops most often takes the form of a for loop, which generally has the following shape:
for (<item> in <vector or list>) {
  ... some commands ...
}
The most trivial runnable example I could think of is this:
# "input" vector of x values
xs <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# take each x out of all given xs in sequence...
for (x in xs) {
  # ... and print it out on a new line
  cat("The square of", x, "is", x^2, "\n")
}
The square of 1 is 1
The square of 2 is 4
The square of 3 is 9
The square of 4 is 16
The square of 5 is 25
The square of 6 is 36
The square of 7 is 49
The square of 8 is 64
The square of 9 is 81
The square of 10 is 100
Note: cat() is a very useful function which prints out a given value (or here, actually, multiple values!). If we append "\n", it will add an (invisible) “new line” character, the equivalent of hitting ENTER on your keyboard when writing.
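A for loop can also collect results if we store them ourselves. The sketch below is my own illustration of the difference between the two kinds of loops: the for loop fills a pre-allocated vector element by element, while sapply() returns the collected values directly.

```r
xs <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# for loop: we have to create the result vector and fill it in ourselves
squares <- numeric(length(xs))
for (i in seq_along(xs)) {
  squares[i] <- xs[i]^2
}

# sapply(): the results are gathered and returned for us
squares_sapply <- sapply(xs, function(x) x^2)

identical(squares, squares_sapply)
# [1] TRUE
```

When the goal is a value per element, the apply family is usually shorter; when the goal is a side effect such as printing or plotting, a for loop is the natural fit.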
Let’s say you want to automate the plotting of several numeric variables from your penguins data frame to a PDF using the custom function penguin_hist() you created above. Fill in the necessary bits of code in the following template to do this! This is a super common pattern that comes in handy very often in data science work.
Note: You can see the cat() call in the body of the for loop (we call the insides of the { ... } block the “body” of a loop). When you iterate over many things, it’s very useful to print out this sort of “log” information, particularly if the for loop can take very long.
# create an empty PDF file
pdf("penguins_hist.pdf")

# let's define our variables of interest (columns of a data frame)
variables <- c("bill_len", "bill_dep", "flipper_len", "body_mass")

# let's now "loop over" each of those variables
for (var in variables) {
  cat("Plotting variable", var, "...\n")

  # ... here is where you should put your call to `penguin_hist()` like
  # you did above manually ...
}
Plotting variable bill_len ...
Plotting variable bill_dep ...
Plotting variable flipper_len ...
Plotting variable body_mass ...
dev.off() # this closes the PDF
quartz_off_screen
2
If everything works correctly, look at the penguins_hist.pdf file you just created! It was all done in a fully automated way, which is hugely important for reproducibility.
If you want, look up ?pdf to see how you could modify the width and height of the figures that will be created.
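For reference, a completed version of the template might look like the sketch below. The penguin_hist() definition is one possible implementation, the width, height and breaks values are arbitrary choices of mine, and the code assumes R 4.5.0 or later for the penguins data.

```r
# one possible penguin_hist(), repeated so that this example runs on its own
penguin_hist <- function(df, column, breaks) {
  if (breaks < 1) {
    stop("Incorrect breaks argument given!")
  }
  hist(df[[column]], breaks = breaks,
       main = paste("Histogram of", column), xlab = column)
}

# open the PDF; width and height are given in inches
pdf("penguins_hist.pdf", width = 8, height = 5)

variables <- c("bill_len", "bill_dep", "flipper_len", "body_mass")

for (var in variables) {
  cat("Plotting variable", var, "...\n")

  # each call adds a new page to the open PDF
  penguin_hist(penguins, var, breaks = 20)
}

# close the PDF so that the file is written to disk completely
dev.off()
```

Forgetting dev.off() is a common source of empty or corrupted PDF files, so it is worth keeping the pdf() and dev.off() calls visibly paired.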
Further practice
If you have energy and time, take a look at the following chapters of the freely available Advanced R textbook (the best resource for R programming out there). First work through the quiz at the beginning of each chapter. If you’re not sure about the answers (the questions are very hard, so if you can’t answer them, that’s completely OK), work through the material of each chapter and try to solve the exercises.
Pick whichever topic seems interesting to you. Don’t try to ingest everything – even isolated little bits and details that stick with you will pay off in the long run!