Introduction to tidyverse

Packages introduced in this session

Before we begin, let’s introduce some crucial parts of the tidyverse ecosystem, which we will be using extensively during exercises in this chapter (and throughout the remaining chapters).

  1. dplyr is a centerpiece of the entire R data science universe, providing important functions for data manipulation, data summarization, and filtering of tabular data. Many data-frame operations that were annoying or cumbersome to do during the R bootcamp session are easy and natural to do with dplyr.

  2. readr provides useful functions for reading (and writing) tabular data. Think of it as a set of better alternatives to base R functions such as read.table(). (Another handy package is readxl, which does the same for Excel data.)
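For instance, reading an Excel sheet with readxl looks very similar to what we will do with readr below (a quick sketch; the file path here is just a placeholder):

library(readxl)
my_table <- read_excel("path/to/some_spreadsheet.xlsx")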

Every single script you will be writing in this session will therefore begin with these two lines of code, which load functions from these two packages.

library(dplyr)
library(readr)

Our example data set

Let’s also introduce a second star in this session, our example data set. The command below will read a metadata table with information about individuals published in a recent aDNA paper on the history of the Holocene in West Eurasia, dubbed “MesoNeo” (reference). Feel free to take a couple of minutes to skim the paper to get an idea about the sort of data we will be using during our exercises.

We will read the metadata of individuals sequenced as part of this paper using the read_tsv() function (from the above-mentioned readr package), and store the metadata table in a variable df (it’s a short name and we’ll be typing it a lot in this session).

Yes, this function is quite magical – it can read data directly from a file stored on the internet. If you’re curious about the file itself, just paste the URL into your browser. Even though R gives us superpowers in analyzing data, it’s never a bad thing to rely on more mundane ways to look at the information we’re dealing with.

df <- read_tsv("https://tinyurl.com/simgen-metadata")

Exercise 0: Creating a new script

Now begin by creating a new R script in RStudio (File -> New file -> R Script) and save it somewhere on your computer as tidy-basics.R (File -> Save). Put the library() calls and the read_tsv() command above in this script.

Unlike the previous R Bootcamp session, in which it didn’t really matter whether you wrote in a script or the R console, in this session, whenever you’re done experimenting in the R console, always record your solution in your R script. I recommend separating code for different exercises with comments, perhaps something like this:

########################################
# Exercise 1 solutions
########################################

Now let’s get started!

A selection of data-frame inspection functions

Whenever you get a new source of data, like a table from a collaborator, or a data sheet downloaded from supplementary materials of a paper (our situation in this session!), you need to familiarize yourself with it before you do anything else.

Here is a list of functions that you will be using constantly when doing data science to answer the following basic questions about your data.

  1. “How many observations (rows) and variables (columns) does my data have?” – nrow(), ncol()

  2. “What variable names (columns) am I going to be working with?” – colnames()

  3. “What data types (numbers, strings, logicals, etc.) does it contain?” – str(), or better, glimpse()

  4. “How can I take a quick look at a subset of the data (first/last couple of rows)?” – head(), tail()

  5. “For a specific variable (column), what is the distribution of values I can expect in that column?” – table() for “categorical types” (types which take only a couple of discrete values); summary(), min(), max(), which.min(), which.max() for “numeric types”. Remember that you can get the values of a given column col in a data frame df by using the named-list syntax df$col!
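For instance, here is what a few of these look like applied to the df table we just loaded (a quick sketch; we will explore the actual outputs in the exercise below):

nrow(df)      # number of rows (observations)
ncol(df)      # number of columns (variables)
max(df$coverage, na.rm = TRUE)  # highest value of a numeric column (dropping NAs)
which.max(df$coverage)          # row index of that highest value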

Note: Feel free to use this list as another cheatsheet of sorts! Also, don’t forget to use the ?function command in the R console to look up the documentation to see the possibilities for options and additional features, or even just to refresh your memory. Every time I look up the ? help for a function I’ve been using for a decade, I always learn new tricks.


Before we move on, note that when you type df into your R console, you will see a slightly different output format than when we worked with plain R data frames in the previous chapter. This format of data frame is called a “tibble” and represents tidyverse’s more user-friendly and modern take on data frames. For almost all practical purposes, from now on, we’ll treat tibbles as data frames (they behave the same anyway).
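You can verify this yourself (a tibble carries a few extra classes, but is still a data frame underneath):

class(df)
is.data.frame(df)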


Exercise 1: Exploring new data

Try to answer the following questions using the functions from the list above (you should decide which function is appropriate for which question).

Before you use one of these functions for the first time, take a moment to skim through its help using the ?FUNCTIONNAME command, just to build that habit of not forgetting that the help pages are always there.


What variables (i.e., columns) do we have in our data? Think about what they could mean (some will be obvious, some less so, as is always the case with unknown data from a supplementary material of a paper). What do the first couple of rows of the data look like (i.e., what form does the data take)?

The colnames() function gives us the column names:

colnames(df)
 [1] "sampleId"     "popId"        "site"         "country"      "region"      
 [6] "continent"    "groupLabel"   "groupAge"     "flag"         "latitude"    
[11] "longitude"    "dataSource"   "age14C"       "ageHigh"      "ageLow"      
[16] "ageAverage"   "datingSource" "coverage"     "sex"          "hgMT"        
[21] "gpAvg"        "ageRaw"       "hgYMajor"     "hgYMinor"    

The first couple of rows:

head(df)
# A tibble: 6 × 24
  sampleId popId site  country region     continent groupLabel groupAge flag 
  <chr>    <chr> <chr> <chr>   <chr>      <chr>     <chr>      <chr>    <chr>
1 NA18486  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
2 NA18488  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
3 NA18489  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
4 NA18498  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
5 NA18499  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
6 NA18501  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

How many individuals do we have metadata for?

Number of rows:

nrow(df)
[1] 4072

Number of unique() sample IDs (this should ideally always give the same number, but there are never enough sanity checks in data science):

length(unique(df$sampleId))
[1] 4072

What data types (numbers, strings, logicals, etc.) are our variables of?

Hint: Look at the list of questions and suitable functions above!

glimpse(df)
Rows: 4,072
Columns: 24
$ sampleId     <chr> "NA18486", "NA18488", "NA18489", "NA18498", "NA18499", "N…
$ popId        <chr> "YRI", "YRI", "YRI", "YRI", "YRI", "YRI", "YRI", "YRI", "…
$ site         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ country      <chr> "Nigeria", "Nigeria", "Nigeria", "Nigeria", "Nigeria", "N…
$ region       <chr> "WestAfrica", "WestAfrica", "WestAfrica", "WestAfrica", "…
$ continent    <chr> "Africa", "Africa", "Africa", "Africa", "Africa", "Africa…
$ groupLabel   <chr> "YRI", "YRI", "YRI", "YRI", "YRI", "YRI", "YRI", "YRI", "…
$ groupAge     <chr> "Modern", "Modern", "Modern", "Modern", "Modern", "Modern…
$ flag         <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0…
$ latitude     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ longitude    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ dataSource   <chr> "1000g", "1000g", "1000g", "1000g", "1000g", "1000g", "10…
$ age14C       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ageHigh      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ageLow       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ageAverage   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ datingSource <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ coverage     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ sex          <chr> "XY", "XX", "XX", "XY", "XX", "XY", "XX", "XY", "XX", "XY…
$ hgMT         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ gpAvg        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ ageRaw       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ hgYMajor     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ hgYMinor     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

What are the columns we have that describe the ages (maybe look at those which have “age” in their name, which you detected with the colnames() function earlier)? How many missing values (NA) are in the ageAverage column? What information does the groupAge column probably contain?

Hint: Remember that for a column vector df$COLUMN_NAME, the command is.na(df$COLUMN_NAME) gives you the value TRUE for each NA element in that column variable, and sum(is.na(df$COLUMN_NAME)) then counts the number of those NA values (because TRUE counts as 1, FALSE as 0). Alternatively, mean(is.na(df$COLUMN_NAME)) gives the proportion of NA values (because it computes the proportion of 1s).
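Applied to the ageAverage column from the question, that hint translates to:

sum(is.na(df$ageAverage))   # how many values are missing
mean(is.na(df$ageAverage))  # what proportion of values is missing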

From colnames(df) above we see a number of columns which seem to have something to do with “age”:

columns <- colnames(df)
columns
 [1] "sampleId"     "popId"        "site"         "country"      "region"      
 [6] "continent"    "groupLabel"   "groupAge"     "flag"         "latitude"    
[11] "longitude"    "dataSource"   "age14C"       "ageHigh"      "ageLow"      
[16] "ageAverage"   "datingSource" "coverage"     "sex"          "hgMT"        
[21] "gpAvg"        "ageRaw"       "hgYMajor"     "hgYMinor"    

Our age columns of interest are these:

[1] "groupAge"   "age14C"     "ageHigh"    "ageLow"     "ageAverage"
[6] "ageRaw"    

Now let’s take a look at the values available in the column ageAverage:

mean(!is.na(df$ageAverage))
[1] 0.4093811
unique(df$groupAge)
[1] "Modern"  "Ancient" "Archaic"

It looks like about 40.94% of the values in the ageAverage column are non-NA (i.e., not missing).

We also seem to have another column, groupAge, which clusters individuals into three groups. We’ll stick to these two variables whenever we have a question regarding the date of an individual.


How many “ancient” individuals do we have in our data? How many “modern” (i.e., present-day humans) individuals (column groupAge)?

Hint: Maybe the table() function is most useful here?

table() is probably the best solution here:

table(df$groupAge)

Ancient Archaic  Modern 
   1664       3    2405 

Who are the mysterious “Archaic” individuals? What are their names (sampleId column) and which publications do they come from (dataSource column)? Use your data-frame row- and column-indexing knowledge from the R bootcamp session here.

Hint: We need to filter our table down to rows which have groupAge == "Archaic". This is an indexing operation which you learned about in the R bootcamp session! Remember that data frames can be indexed into along two dimensions: rows and columns. You want to filter by the rows here using a logical vector obtained by df$groupAge == "Archaic".

Again, this gets us a logical vector which has TRUE for each element corresponding to an “Archaic” individual (whoever that might be):

which_archaic <- df$groupAge == "Archaic"
head(which_archaic)
[1] FALSE FALSE FALSE FALSE FALSE FALSE

And we can then use this vector as an index into our overall data frame, just like we learned in the bootcamp session:

archaic_df <- df[which_archaic, ]
archaic_df$sampleId
[1] "Vindija33.19"    "AltaiNeandertal" "Denisova"       

Our mysterious “Archaic” individuals are two Neanderthals and a Denisovan!

Here are the publications:

archaic_df$dataSource
[1] "Prufer_Science_2017" "Prufer_Nature_2014"  "Pruefer_2017"       

Do we have geographical information in our metadata? Maybe countries or geographical coordinates (or even both)? Which countries do we have individuals from?

Hint: Again, the function table() ran on an appropriate column will help here.

Looking at colnames(df) we have a number of columns which have something to do with geography: country, region, and also the traditional longitude and latitude coordinates.

How can we get an idea about the distribution of countries? table() counts the occurrences of each unique value in a vector (basically, a histogram without the plotting), and sort() then sorts those counts for easier reading:

sort(table(df$country), decreasing = TRUE)

         India          China         Russia          Italy        Nigeria 
           394            308            285            256            207 
       Denmark         Sweden             UK          Spain         Gambia 
           184            174            167            132            114 
           USA        Vietnam          Japan     PuertoRico        Finland 
           108            108            107            104            102 
         Kenya       Barbados       Pakistan       Colombia           Peru 
            98             96             96             94             91 
   SierraLeone         Mexico     Kazakhstan        Estonia        Ukraine 
            85             67             64             60             60 
        Poland         France         Norway        Hungary        Iceland 
            56             54             42             41             40 
    Kyrgyzstan         Turkey        Lebanon       Portugal        Ireland 
            39             22             19             18             16 
        Faroes        Germany      Greenland        Moldova  CzechRepublic 
            15             12             11             10              9 
  South Africa           Iran         Latvia        Romania        Armenia 
             9              8              7              7              6 
        Canada         Serbia      Argentina         Brazil Canary Islands 
             6              6              5              5              5 
      Mongolia          Chile        Georgia           Laos       Malaysia 
             5              4              4              4              4 
      Thailand   Turkmenistan         Greece       Cameroon      Indonesia 
             4              4              3              2              2 
       Croatia       Ethiopia      Lithuania     Luxembourg    Philippines 
             1              1              1              1              1 
      Slovakia    Switzerland 
             1              1 
sort(table(df$region), decreasing = TRUE)

      NorthernEurope            SouthAsia             EastAsia 
                 635                  489                  420 
      SouthernEurope           WestAfrica         NorthAmerica 
                 409                  406                  382 
CentralEasternEurope        WesternEurope         SouthAmerica 
                 330                  251                  199 
           NorthAsia        SouthEastAsia          CentralAsia 
                 146                  124                  107 
          EastAfrica          WesternAsia          SouthAfrica 
                  99                   59                    9 
         NorthAfrica        CentralAfrica 
                   5                    2 

We will ignore longitude and latitude for now, because they are most useful in a truly geographical data-analysis setting (which we will delve into a bit later).


What’s the distribution of coverage of the samples (coverage column)? Compute the summary() function on the coverage column. Finally, use the hist() function to visualize this information to get a rough idea about the data.

summary() computes various useful summary statistics, and also reports the number of missing (NA) values:

summary(df$coverage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0124  0.2668  0.7824  1.6238  1.5082 58.8661    2405 

We can see that the coverage information is missing for 2405 individuals, which is exactly the number of (present-day) individuals from the 1000 Genomes Project in our data. This makes sense: we only have coverage for the lower-coverage aDNA samples, while for present-day individuals (who have imputed genomes) coverage isn’t even meaningful.

Let’s plot the coverage values using the base R hist() function:

hist(df$coverage, breaks = 100)

Exercise 2: Selecting columns

We now have some basic familiarity with the format of our data and the kinds of variables/columns we have for every individual. We also got a bit more practice using base R for basic data-frame operations, mostly indexing (or subsetting). It’s time to learn techniques for manipulating, modifying, and filtering this data using tidyverse, specifically the dplyr package.

Oftentimes we end up in a situation in which we don’t want a large data frame with a huge number of columns. Not so much because the data takes up too much memory, but for convenience. You can see that our df metadata table has 24 columns, which don’t fit on the screen if we just print it out (note the “15 more variables” in the output). Just try this yourself again in your R console and observe what kind of (cluttered) output you get:

df
# A tibble: 4,072 × 24
   sampleId popId site  country region     continent groupLabel groupAge flag 
   <chr>    <chr> <chr> <chr>   <chr>      <chr>     <chr>      <chr>    <chr>
 1 NA18486  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 2 NA18488  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 3 NA18489  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 4 NA18498  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 5 NA18499  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 6 NA18501  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 7 NA18502  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 8 NA18504  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
 9 NA18505  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
10 NA18507  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
# ℹ 4,062 more rows
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

You can select which columns to pick from a (potentially very large) data frame with the function select(), the simplest and most basic dplyr function. It has the following general format:

select(<data frame>, <column 1>, <column 2>, ...)

As a reminder, this is how we would select columns using a normal base R subsetting/indexing operation:

smaller_df <- df[, c("sampleId", "region", "coverage", "ageAverage")]

smaller_df
# A tibble: 4,072 × 4
   sampleId region     coverage ageAverage
   <chr>    <chr>         <dbl>      <dbl>
 1 NA18486  WestAfrica       NA         NA
 2 NA18488  WestAfrica       NA         NA
 3 NA18489  WestAfrica       NA         NA
 4 NA18498  WestAfrica       NA         NA
 5 NA18499  WestAfrica       NA         NA
 6 NA18501  WestAfrica       NA         NA
 7 NA18502  WestAfrica       NA         NA
 8 NA18504  WestAfrica       NA         NA
 9 NA18505  WestAfrica       NA         NA
10 NA18507  WestAfrica       NA         NA
# ℹ 4,062 more rows

This is the tidyverse approach using select():

smaller_df <- select(df, sampleId, region, coverage, ageAverage)

smaller_df
# A tibble: 4,072 × 4
   sampleId region     coverage ageAverage
   <chr>    <chr>         <dbl>      <dbl>
 1 NA18486  WestAfrica       NA         NA
 2 NA18488  WestAfrica       NA         NA
 3 NA18489  WestAfrica       NA         NA
 4 NA18498  WestAfrica       NA         NA
 5 NA18499  WestAfrica       NA         NA
 6 NA18501  WestAfrica       NA         NA
 7 NA18502  WestAfrica       NA         NA
 8 NA18504  WestAfrica       NA         NA
 9 NA18505  WestAfrica       NA         NA
10 NA18507  WestAfrica       NA         NA
# ℹ 4,062 more rows

Note: The most important thing for you to notice here is the absence of “double quotes”. It might not look like much, but saving yourself from having to type double quotes for every data-frame operation (as in base R) is incredibly convenient.


Practice select() by creating two new data frame objects and assigning them to the following two new variables:

  1. Data frame df_ages which contains all variables related to sample ages you found above
  2. Data frame df_geo which contains all variables related to geography you found above

Additionally, the first column of these data frames should always be sampleId, so make sure this column is included in the selection.

Check visually, by typing out those two data frames into your console or by using ncol() on them, that you indeed have a smaller number of columns in these two new data frames.

df_ages <- select(df, sampleId, groupAge, age14C, ageHigh, ageLow, ageAverage, ageRaw)
df_ages
# A tibble: 4,072 × 7
   sampleId groupAge age14C ageHigh ageLow ageAverage ageRaw
   <chr>    <chr>     <dbl>   <dbl>  <dbl>      <dbl> <chr> 
 1 NA18486  Modern       NA      NA     NA         NA <NA>  
 2 NA18488  Modern       NA      NA     NA         NA <NA>  
 3 NA18489  Modern       NA      NA     NA         NA <NA>  
 4 NA18498  Modern       NA      NA     NA         NA <NA>  
 5 NA18499  Modern       NA      NA     NA         NA <NA>  
 6 NA18501  Modern       NA      NA     NA         NA <NA>  
 7 NA18502  Modern       NA      NA     NA         NA <NA>  
 8 NA18504  Modern       NA      NA     NA         NA <NA>  
 9 NA18505  Modern       NA      NA     NA         NA <NA>  
10 NA18507  Modern       NA      NA     NA         NA <NA>  
# ℹ 4,062 more rows
df_geo <- select(df, sampleId, site, country, region, latitude, longitude)
df_geo
# A tibble: 4,072 × 6
   sampleId site  country region     latitude longitude
   <chr>    <chr> <chr>   <chr>         <dbl>     <dbl>
 1 NA18486  <NA>  Nigeria WestAfrica       NA        NA
 2 NA18488  <NA>  Nigeria WestAfrica       NA        NA
 3 NA18489  <NA>  Nigeria WestAfrica       NA        NA
 4 NA18498  <NA>  Nigeria WestAfrica       NA        NA
 5 NA18499  <NA>  Nigeria WestAfrica       NA        NA
 6 NA18501  <NA>  Nigeria WestAfrica       NA        NA
 7 NA18502  <NA>  Nigeria WestAfrica       NA        NA
 8 NA18504  <NA>  Nigeria WestAfrica       NA        NA
 9 NA18505  <NA>  Nigeria WestAfrica       NA        NA
10 NA18507  <NA>  Nigeria WestAfrica       NA        NA
# ℹ 4,062 more rows

Note: select() allows us to see the contents of columns of interest much more easily. For instance, in a situation in which we want to analyse the geographical location of samples, we don’t want to see unrelated columns (like haplogroups, the sex of an individual, etc., which are all part of the huge original table), and select() is the solution to this.


If your table has many columns of interest that you might want to select (and save to a new variable like you did above), typing them all out by hand like you did just now can become tiresome really quickly. Here are a few helper functions which can be very useful in that situation, and which you can use inside the select() function as an alternative to manually typing out column names:

  • starts_with("age") – matches columns starting with the string “age”
  • ends_with("age") – matches columns ending with the string “age”
  • contains("age") – matches columns containing the string “age”

You can use them in place of normal column names. For instance, modifying our select() “template” above, we could do this:

select(df, starts_with("text1"), ends_with("text2"))

Check out the ?help belonging to those functions (note that they have ignore.case = TRUE set by default!). Then create the df_ages table again, but this time use the three helper functions listed above.

df_ages1 <- select(df, sampleId, groupAge, age14C, ageHigh, ageLow, ageAverage, ageRaw)

df_ages2 <- select(df,
                   sampleId,
                   starts_with("age", ignore.case = FALSE),
                   ends_with("Age", ignore.case = FALSE))

The function contains() unfortunately doesn’t quite work here, because the coverage column also contains the string “age”:

select(df, contains("age"))
# A tibble: 4,072 × 7
   groupAge age14C ageHigh ageLow ageAverage coverage ageRaw
   <chr>     <dbl>   <dbl>  <dbl>      <dbl>    <dbl> <chr> 
 1 Modern       NA      NA     NA         NA       NA <NA>  
 2 Modern       NA      NA     NA         NA       NA <NA>  
 3 Modern       NA      NA     NA         NA       NA <NA>  
 4 Modern       NA      NA     NA         NA       NA <NA>  
 5 Modern       NA      NA     NA         NA       NA <NA>  
 6 Modern       NA      NA     NA         NA       NA <NA>  
 7 Modern       NA      NA     NA         NA       NA <NA>  
 8 Modern       NA      NA     NA         NA       NA <NA>  
 9 Modern       NA      NA     NA         NA       NA <NA>  
10 Modern       NA      NA     NA         NA       NA <NA>  
# ℹ 4,062 more rows

Exercise 3: Filtering rows

In the session on base R, we learned how to filter rows using the indexing operation along the row dimension of (two-dimensional) data frames. For instance, in order to find out which individuals in the df metadata are archaics, we can first create a TRUE / FALSE vector marking the rows corresponding to those individuals like this:

# get a vector of the indices belonging to archaic individuals
which_archaic <- df$groupAge == "Archaic"

# take a peek at the result to make sure we got TRUE / FALSE vector
tail(which_archaic, 10)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

And then we would use it to index into the data frame (in its first dimension, before the , comma):

archaic_df <- df[which_archaic, ]
archaic_df
# A tibble: 3 × 24
  sampleId        popId site  country region continent groupLabel groupAge flag 
  <chr>           <chr> <chr> <chr>   <chr>  <chr>     <chr>      <chr>    <chr>
1 Vindija33.19    Vind… Vind… Croatia Centr… Europe    Neanderta… Archaic  0    
2 AltaiNeandertal Alta… Deni… Russia  North… Asia      Neanderta… Archaic  0    
3 Denisova        Deni… Deni… Russia  North… Asia      Denisova_… Archaic  0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

However, what if we need to filter on multiple conditions? For instance, what if we want to find all archaic individuals older than 50000 years?

One option would be to create multiple TRUE / FALSE vectors, each corresponding to one of those conditions, and then join them into a single logical vector by combining them with the & and | logical operators like you did in the R bootcamp chapter. For example, we could do this:

# who is archaic in our data?
which_archaic <- df$groupAge == "Archaic"
# who is older than 50000 years?
which_old <- df$ageAverage > 50000

# who is BOTH? note the AND logical operator!
which_combined <- which_archaic & which_old

Then we can use that logical vector for row-based indexing (in this case, subsetting of rows) again:

df[which_combined, c("sampleId", "ageAverage")]
# A tibble: 2 × 2
  sampleId        ageAverage
  <chr>                <dbl>
1 AltaiNeandertal     125000
2 Denisova             80000

But this gets tiring very quickly, requires an unnecessary amount of typing, and is very error prone. Imagine having to do this for many more conditions! The filter() function from tidyverse again fixes all of these problems.

We can rephrase the example situation with the archaics to use the new function filter() like this (please try it yourself!):

filter(df, groupAge == "Archaic" & ageAverage > 50000)

I hope that, even if you have never really programmed much before, you appreciate that this single command involves very little typing and is immediately readable, almost like this English sentence:

“Filter data frame df for individuals whose groupAge column is ‘Archaic’ and who are older than 50000 years.”

Over time you will see that all tidyverse packages follow these ergonomic principles.

Note: It’s worth pointing out again that, just as everywhere across the tidyverse, we refer to columns of our data frame df (like the columns groupAge or ageAverage right above) as if they were normal variables! Just like we did in select() above. In any tidyverse function we don’t write "groupAge", but simply groupAge! When you start out, you’ll probably make this kind of mistake quite often, so this is a reminder to keep an eye out for it.


Practice filtering with the filter() function by finding out which individual(s) in your df metadata table are from a country with the value "CzechRepublic".

filter(df, country == "CzechRepublic")
# A tibble: 9 × 24
  sampleId popId   site       country region continent groupLabel groupAge flag 
  <chr>    <chr>   <chr>      <chr>   <chr>  <chr>     <chr>      <chr>    <chr>
1 NEO128   NEO128  Vedrovice  CzechR… Centr… Europe    Czech_Neo… Ancient  lowc…
2 kol6     kol6    Kol√≠n     CzechR… Centr… Europe    Czech_Neo… Ancient  0    
3 kol2     kol2    Kol√≠n     CzechR… Centr… Europe    Czech_Neo… Ancient  lowG…
4 RISE566  RISE566 Knezeves   CzechR… Centr… Europe    Czech_Bro… Ancient  0    
5 RISE586  RISE586 Moravsk√°… CzechR… Centr… Europe    Czech_Bro… Ancient  0    
6 RISE577  RISE577 Velk√© P≈… CzechR… Centr… Europe    Czech_Bro… Ancient  0    
7 DA111    DA111   Lovosice … CzechR… Centr… Europe    Czech_Iro… Ancient  0    
8 DA112    DA112   Lovosice … CzechR… Centr… Europe    Czech_Iro… Ancient  0    
9 RISE569  RISE569 Brandysek  CzechR… Centr… Europe    Czech_Iro… Ancient  0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Which of these “Czech” individuals have coverage higher than 1? Which individuals have coverage higher than 10?

filter(df, country == "CzechRepublic" & coverage > 1)
# A tibble: 3 × 24
  sampleId popId   site       country region continent groupLabel groupAge flag 
  <chr>    <chr>   <chr>      <chr>   <chr>  <chr>     <chr>      <chr>    <chr>
1 kol6     kol6    Kol√≠n     CzechR… Centr… Europe    Czech_Neo… Ancient  0    
2 RISE577  RISE577 Velk√© P≈… CzechR… Centr… Europe    Czech_Bro… Ancient  0    
3 RISE569  RISE569 Brandysek  CzechR… Centr… Europe    Czech_Iro… Ancient  0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>
filter(df, country == "CzechRepublic" & coverage > 10)
# A tibble: 0 × 24
# ℹ 24 variables: sampleId <chr>, popId <chr>, site <chr>, country <chr>,
#   region <chr>, continent <chr>, groupLabel <chr>, groupAge <chr>,
#   flag <chr>, latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Exercise 4: The pipe %>%

The pipe operator, %>%, is the rockstar of the tidyverse R ecosystem, and the primary thing that makes the tidy data workflow so efficient and easy to read.

First, what is “the pipe”? Whenever you see code like this:

something %>% f()

you can read it as:

“take something and put it as the first argument of f()”.
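For instance, these two commands are completely equivalent:

head(df)
df %>% head()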

Why would you want to do this? Imagine some complex data processing operation like this:

h(f(g(i(j(input_data)))))

This means: take input_data, compute j(input_data), then compute i() on that, so i(j(input_data)), then compute g(i(j(input_data))), etc. Of course, this is an extreme example, but it is surprisingly not that far off from what we often have to do in data science.

One way to make this easier to read would be perhaps this:

tmp1 <- j(input_data)
tmp2 <- i(tmp1)
tmp3 <- g(tmp2)
tmp4 <- f(tmp3)
result <- h(tmp4)

But that’s too much typing when we want to get insights into our data as quickly as possible with as little work as possible.

The pipe approach of tidyverse would make the same thing easier to write and read like this:

input_data %>% j %>% i %>% g %>% f %>% h

This kind of “data transformation chain” is so frequent that RStudio even provides a built-in keyboard shortcut for typing %>%:

  • CMD + Shift + M on macOS
  • CTRL + Shift + M on Windows and Linux

Whenever you pipe something like this in your solutions, get in the habit of using these shortcuts. Eventually this will allow you to write code as quickly as you can think, trust me! (And take a peek at the cheatsheet of RStudio shortcuts to refresh your memory on the other useful shortcuts! For instance, Alt + - or Option + - inserts the <- assignment operator.)


Use your newly acquired select() and filter() skills, powered by the %>% piping operator, and perform the following transformation on the df metadata table, “chaining” the filter-ing and select-ion operations on the indicated columns:

  1. First filter() the df metadata to get only those rows / individuals who are:
  • “Ancient” individuals (column groupAge)
  • older than 10000 years (column ageAverage)
  • from Italy (column country)
  • with the coverage column higher than 3
  2. Then pipe the result of step 1 to select() the columns: sampleId, site, sex, and hgMT and hgYMajor (mt and Y haplogroups).

For practice, try to be as silly as you can and write the entire command with as many filter() and select() calls in sequence as you can.

Hint: Don’t write the entire pipeline at once! For filter(), start with one condition, evaluate it, then add another one, etc., inspecting the intermediate results as you’re seeing them in the R console after every evaluation from your script (CTRL / CMD + Enter). Alternatively, first build up everything in the console one step at a time, then paste the result into your script to save the command on disk. This is the tidyverse way of doing data science!

Hint: What I mean by this is that writing this command:

new_df1 <- filter(df, col1 == "MatchString" & col2 > 10000 & col3 == TRUE) %>%
  select(colX, colY, starts_with("ColNamePrefix"))

produces the exact same result as doing this instead:

new_df1 <- df %>%
  filter(col1 == "MatchString") %>%
  filter(col2 > 10000) %>%
  filter(col3 == TRUE) %>%
  select(colX, colY, starts_with("ColNamePrefix"))

(Just take a moment to read both of these versions and compare them to convince yourself they do the same thing.)

This works because “the thing on the left” (which is always a data frame) is placed by the %>% pipe operator as “the first argument of the function on the right” (which, again, always expects a data frame as its first argument)!

Hopefully you now see what this idea of “building a more complex %>% pipeline one step at a time” can mean in practice. Now apply these ideas to solve the filter() and select() exercise above.

df %>%
  filter(groupAge == "Ancient") %>%
  filter(ageAverage > 10000) %>%
  filter(country == "Italy") %>%
  filter(coverage > 3) %>%
  select(sampleId, site, sex, hgMT, hgYMajor)
# A tibble: 1 × 5
  sampleId site              sex   hgMT              hgYMajor
  <chr>    <chr>             <chr> <chr>             <chr>   
1 R7       Grotta Continenza XY    U5b1+16189+@16192 I2a     

We could also write the same thing more concisely (note the single filter() call):

df %>%
  filter(groupAge == "Ancient" & ageAverage > 10000 & country == "Italy" & coverage > 1) %>%
  select(sampleId, site, sex, hgMT, hgYMajor)
# A tibble: 1 × 5
  sampleId site              sex   hgMT              hgYMajor
  <chr>    <chr>             <chr> <chr>             <chr>   
1 R7       Grotta Continenza XY    U5b1+16189+@16192 I2a     

Note: It is always a good practice to split long chains of %>% piping commands into multiple lines, and indenting them neatly one after the other. Readability matters to avoid bugs in your code!


Exercise 5: More complex conditions

Recall our exercises about logical conditional expressions (&, |, !, etc.).

Whenever you need to do a more complex operation, such as saying that a variable columnX should have a value "ABC" or "ZXC", you can already guess that you can do this by writing filter(df, columnX == "ABC" | columnX == "ZXC").

Similarly, you can condition on numerical variables, just as we did in the exercises on TRUE / FALSE expressions. For instance, if you want to condition on a column varX being varX < 1 or varX > 10, you could write filter(df, varX < 1 | varX > 10).


Practice combining multiple filtering conditions into a tidyverse %>% “piping chain” by filtering our metadata table to find individuals for which the following conditions hold:

  1. They are “Ancient” (groupAge column)
  2. They are from “France” or “Canary Islands” (country column)
  3. Their coverage is less than 0.1 or more than 3 (coverage column)

Hint: The easiest solution is to write three filter() function calls, one for each condition above:

df %>%
  filter(groupAge == "Ancient") %>%
  filter(country == "France" | country == "Canary Islands") %>%
  filter(coverage < 0.1 | coverage > 3)
# A tibble: 3 × 24
  sampleId popId  site        country region continent groupLabel groupAge flag 
  <chr>    <chr>  <chr>       <chr>   <chr>  <chr>     <chr>      <chr>    <chr>
1 gun011   gun011 Tenerife    Canary… North… Africa    CanaryIsl… Ancient  0    
2 NEO812   NEO812 Grotte du … France  Weste… Europe    France_Ne… Ancient  0    
3 NEO813   NEO813 Grotte du … France  Weste… Europe    France_Ne… Ancient  1d_r…
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Whenever you want to test whether a variable is of a set of multiple possible values (like you wanted here for the country filter), you can use the %in% operator:

df %>%
  filter(groupAge == "Ancient") %>%
  filter(country %in% c("France", "Canary Islands")) %>%
  filter(coverage < 0.1 | coverage > 3)
# A tibble: 3 × 24
  sampleId popId  site        country region continent groupLabel groupAge flag 
  <chr>    <chr>  <chr>       <chr>   <chr>  <chr>     <chr>      <chr>    <chr>
1 gun011   gun011 Tenerife    Canary… North… Africa    CanaryIsl… Ancient  0    
2 NEO812   NEO812 Grotte du … France  Weste… Europe    France_Ne… Ancient  0    
3 NEO813   NEO813 Grotte du … France  Weste… Europe    France_Ne… Ancient  1d_r…
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Now filter for individuals who are from Germany and have coverage higher than 3, or individuals who are from Estonia with coverage less than or equal to 1. Save your result to the df_subset variable and print everything in this table by executing print(df_subset, n = Inf).

Supplementary question: Why do we need to print() the result like this? What happens when you just type df_subset into your R console (or execute this from your script using CTRL / CMD + Enter on its own?)

We can of course compose (infinitely) complex combinations of & and | and !:

df_subset <- filter(df,
                    (country == "Germany" & coverage > 3) |
                    (country == "Estonia" & coverage <= 1))

print(df_subset, n = Inf)
# A tibble: 34 × 24
   sampleId  popId     site   country region continent groupLabel groupAge flag 
   <chr>     <chr>     <chr>  <chr>   <chr>  <chr>     <chr>      <chr>    <chr>
 1 Stuttgart Stuttgart Stutt… Germany Weste… Europe    Germany_N… Ancient  0    
 2 FN2       FN2       Bavar… Germany Weste… Europe    Germany_R… Ancient  0    
 3 Alh1      Alh1      Bavar… Germany Weste… Europe    Germany_M… Ancient  0    
 4 Alh10     Alh10     Bavar… Germany Weste… Europe    Germany_M… Ancient  0    
 5 Ardu1     Ardu1     Ardu,… Estonia North… Europe    Estonia_N… Ancient  0    
 6 Ardu2     Ardu2     Ardu,… Estonia North… Europe    Estonia_N… Ancient  0    
 7 RISE00    RISE00    J√§ba… Estonia North… Europe    Estonia_N… Ancient  0    
 8 Kunila1   Kunila1   Kursi… Estonia North… Europe    Estonia_N… Ancient  0    
 9 V14       V14       Harju… Estonia North… Europe    Estonia_B… Ancient  0    
10 X10       X10       Harju… Estonia North… Europe    Estonia_B… Ancient  0    
11 V9        V9        Harju… Estonia North… Europe    Estonia_B… Ancient  0    
12 0LS11     0LS11     Harju… Estonia North… Europe    Estonia_B… Ancient  0    
13 X17       X17       Harju… Estonia North… Europe    Estonia_B… Ancient  0    
14 X08       X08       Harju… Estonia North… Europe    Estonia_B… Ancient  0    
15 X14       X14       Harju… Estonia North… Europe    Estonia_B… Ancient  0    
16 V16       V16       Harju… Estonia North… Europe    Estonia_B… Ancient  0    
17 X15       X15       Tartu… Estonia North… Europe    Estonia_B… Ancient  0    
18 X11       X11       Ida-V… Estonia North… Europe    Estonia_B… Ancient  0    
19 V10       V10       Laane… Estonia North… Europe    Estonia_I… Ancient  0    
20 0LS10     0LS10     Laane… Estonia North… Europe    Estonia_I… Ancient  0    
21 VII4      VII4      Laane… Estonia North… Europe    Estonia_I… Ancient  0    
22 V11       V11       Saare… Estonia North… Europe    Estonia_I… Ancient  0    
23 V12       V12       Saare… Estonia North… Europe    Estonia_I… Ancient  0    
24 X04       X04       Saare… Estonia North… Europe    Estonia_I… Ancient  0    
25 VK480     VK480     Salme  Estonia North… Europe    Estonia_V… Ancient  0    
26 VK485     VK485     Salme  Estonia North… Europe    Estonia_V… Ancient  1d_r…
27 VK488     VK488     Salme  Estonia North… Europe    Estonia_V… Ancient  0    
28 VK490     VK490     Salme  Estonia North… Europe    Estonia_V… Ancient  1d_r…
29 VK504     VK504     Salme  Estonia North… Europe    Estonia_V… Ancient  0    
30 VK507     VK507     Salme  Estonia North… Europe    Estonia_V… Ancient  0    
31 VK554     VK554     Salme  Estonia North… Europe    Estonia_V… Ancient  0    
32 IIa       IIa       Saare… Estonia North… Europe    Estonia_M… Ancient  0    
33 IIf       IIf       Valga… Estonia North… Europe    Estonia_M… Ancient  0    
34 IVLS09KT  IVLS09KT  Tartu… Estonia North… Europe    Estonia_M… Ancient  0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

How is “chaining” several filter() commands one after another different from using a single filter() command with multiple conditional expressions joined by &? How about the difference from joining them with |?

Exercise 6: Dropping columns

This one will be easy. If you want to drop a column from a table, just prefix its name with a minus sign (-) in a select() function.

Note: Yes, this also works with starts_with() and its friends above, just put - in front of them!
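For example, this drops all three haplogroup columns (hgMT, hgYMajor, hgYMinor) in one go:

df %>% select(-starts_with("hg")) %>% colnames()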

To demonstrate the dropping of columns in practice, here’s our df table again (just one row for brevity). Observe the columns of the resulting table:

df %>% head(1)
# A tibble: 1 × 24
  sampleId popId site  country region     continent groupLabel groupAge flag 
  <chr>    <chr> <chr> <chr>   <chr>      <chr>     <chr>      <chr>    <chr>
1 NA18486  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Observe what happens when we do this (and compare to the above):

df %>% select(-sampleId) %>% head(1)
# A tibble: 1 × 23
  popId site  country region     continent groupLabel groupAge flag  latitude
  <chr> <chr> <chr>   <chr>      <chr>     <chr>      <chr>    <chr>    <dbl>
1 YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0           NA
# ℹ 14 more variables: longitude <dbl>, dataSource <chr>, age14C <dbl>,
#   ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>, datingSource <chr>,
#   coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>,
#   hgYMajor <chr>, hgYMinor <chr>

And this:

df %>% select(-sampleId, -site) %>% head(1)
# A tibble: 1 × 22
  popId country region    continent groupLabel groupAge flag  latitude longitude
  <chr> <chr>   <chr>     <chr>     <chr>      <chr>    <chr>    <dbl>     <dbl>
1 YRI   Nigeria WestAfri… Africa    YRI        Modern   0           NA        NA
# ℹ 13 more variables: dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, ageAverage <dbl>, datingSource <chr>, coverage <dbl>,
#   sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>, hgYMajor <chr>,
#   hgYMinor <chr>

And this:

df %>% select(-sampleId, -site, -popId) %>% head(1)
# A tibble: 1 × 21
  country region     continent groupLabel groupAge flag  latitude longitude
  <chr>   <chr>      <chr>     <chr>      <chr>    <chr>    <dbl>     <dbl>
1 Nigeria WestAfrica Africa    YRI        Modern   0           NA        NA
# ℹ 13 more variables: dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, ageAverage <dbl>, datingSource <chr>, coverage <dbl>,
#   sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>, hgYMajor <chr>,
#   hgYMinor <chr>

The columns prefixed with - are dropped from the resulting table!

Rather than typing out a long list of columns to drop, we can also specify a range of consecutive columns like this (notice the minus - sign):

df %>% select(-(sampleId:popId)) %>% head(1)
# A tibble: 1 × 22
  site  country region    continent groupLabel groupAge flag  latitude longitude
  <chr> <chr>   <chr>     <chr>     <chr>      <chr>    <chr>    <dbl>     <dbl>
1 <NA>  Nigeria WestAfri… Africa    YRI        Modern   0           NA        NA
# ℹ 13 more variables: dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, ageAverage <dbl>, datingSource <chr>, coverage <dbl>,
#   sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>, hgYMajor <chr>,
#   hgYMinor <chr>

Alternatively, we can also use our well-known c() function, which is very useful whenever we want to drop a non-consecutive set of columns (again notice the minus - sign):

df %>% select(-c(sampleId, site, popId)) %>% head(1)
# A tibble: 1 × 21
  country region     continent groupLabel groupAge flag  latitude longitude
  <chr>   <chr>      <chr>     <chr>      <chr>    <chr>    <dbl>     <dbl>
1 Nigeria WestAfrica Africa    YRI        Modern   0           NA        NA
# ℹ 13 more variables: dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, ageAverage <dbl>, datingSource <chr>, coverage <dbl>,
#   sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>, hgYMajor <chr>,
#   hgYMinor <chr>

Note: The same “range syntax” using : and listing columns with c() also applies to selecting which columns to keep, not just to dropping them.
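For instance, this keeps (rather than drops) the consecutive range of columns from sampleId to country:

df %>% select(sampleId:country) %>% head(1)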


Use the : range in select() to drop every column after country (i.e., all the way to the last column in your table, whichever column this is). Do not save the result back to the df variable though! Just run the select() command on its own.

Hint: There’s another useful helper function called last_col().

df %>% select(-(region:hgYMinor))
# A tibble: 4,072 × 4
   sampleId popId site  country
   <chr>    <chr> <chr> <chr>  
 1 NA18486  YRI   <NA>  Nigeria
 2 NA18488  YRI   <NA>  Nigeria
 3 NA18489  YRI   <NA>  Nigeria
 4 NA18498  YRI   <NA>  Nigeria
 5 NA18499  YRI   <NA>  Nigeria
 6 NA18501  YRI   <NA>  Nigeria
 7 NA18502  YRI   <NA>  Nigeria
 8 NA18504  YRI   <NA>  Nigeria
 9 NA18505  YRI   <NA>  Nigeria
10 NA18507  YRI   <NA>  Nigeria
# ℹ 4,062 more rows

I was personally a bit surprised that this also works without the parentheses. R is smart!

df %>% select(-region:-hgYMinor)
# A tibble: 4,072 × 4
   sampleId popId site  country
   <chr>    <chr> <chr> <chr>  
 1 NA18486  YRI   <NA>  Nigeria
 2 NA18488  YRI   <NA>  Nigeria
 3 NA18489  YRI   <NA>  Nigeria
 4 NA18498  YRI   <NA>  Nigeria
 5 NA18499  YRI   <NA>  Nigeria
 6 NA18501  YRI   <NA>  Nigeria
 7 NA18502  YRI   <NA>  Nigeria
 8 NA18504  YRI   <NA>  Nigeria
 9 NA18505  YRI   <NA>  Nigeria
10 NA18507  YRI   <NA>  Nigeria
# ℹ 4,062 more rows
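And, following the hint, the helper last_col() lets us write the same drop without having to know the name of the last column:

df %>% select(-(region:last_col()))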

Exercise 7: Renaming columns

Very often you read data frames in which columns have names that are very long, contain characters which are not allowed, or are generally inconvenient. Imagine a situation in which you refer to a particular column very often in your workflow, but it takes too much effort to type it out, or it uses awkward characters.

After discussing select() and filter(), let’s introduce another member of the tidyverse—the function rename().

The template for using it is again very easy (again, you would replace the text in <pointy brackets> with appropriate symbols):

rename(<data frame>, <new name> = <old name>)

Create a new data frame df_subset by doing the following:

  1. First select() the columns sampleId, popId, country, continent, groupAge, ageAverage, and coverage.
  2. Pipe the result of the select() operation using %>% into:
  3. rename() function to give some columns a shorter name: sampleId -> sample, popId -> population, groupAge -> set, ageAverage -> age. Leave the country and coverage columns as they are (i.e., don’t rename those).
df_subset <-
  df %>%
  select(sampleId, popId, country, continent, groupAge, ageAverage, coverage) %>%
  rename(sample = sampleId, population = popId, set = groupAge, age = ageAverage)

df_subset
# A tibble: 4,072 × 7
   sample  population country continent set      age coverage
   <chr>   <chr>      <chr>   <chr>     <chr>  <dbl>    <dbl>
 1 NA18486 YRI        Nigeria Africa    Modern    NA       NA
 2 NA18488 YRI        Nigeria Africa    Modern    NA       NA
 3 NA18489 YRI        Nigeria Africa    Modern    NA       NA
 4 NA18498 YRI        Nigeria Africa    Modern    NA       NA
 5 NA18499 YRI        Nigeria Africa    Modern    NA       NA
 6 NA18501 YRI        Nigeria Africa    Modern    NA       NA
 7 NA18502 YRI        Nigeria Africa    Modern    NA       NA
 8 NA18504 YRI        Nigeria Africa    Modern    NA       NA
 9 NA18505 YRI        Nigeria Africa    Modern    NA       NA
10 NA18507 YRI        Nigeria Africa    Modern    NA       NA
# ℹ 4,062 more rows

We now have a much cleaner table which is much easier to work with!


A shortcut which can sometimes be quite useful is that select() also accepts the new_name = old_name renaming pattern used by the rename() function, which allows you to both select columns and rename some of them, all at once. To practice this, create the df_subset data frame again, but this time using just select().

df_subset <- df %>%
  select(sample = sampleId, population = popId,
         country,
         continent,
         set = groupAge, age = ageAverage,
         coverage)

df_subset
# A tibble: 4,072 × 7
   sample  population country continent set      age coverage
   <chr>   <chr>      <chr>   <chr>     <chr>  <dbl>    <dbl>
 1 NA18486 YRI        Nigeria Africa    Modern    NA       NA
 2 NA18488 YRI        Nigeria Africa    Modern    NA       NA
 3 NA18489 YRI        Nigeria Africa    Modern    NA       NA
 4 NA18498 YRI        Nigeria Africa    Modern    NA       NA
 5 NA18499 YRI        Nigeria Africa    Modern    NA       NA
 6 NA18501 YRI        Nigeria Africa    Modern    NA       NA
 7 NA18502 YRI        Nigeria Africa    Modern    NA       NA
 8 NA18504 YRI        Nigeria Africa    Modern    NA       NA
 9 NA18505 YRI        Nigeria Africa    Modern    NA       NA
10 NA18507 YRI        Nigeria Africa    Modern    NA       NA
# ℹ 4,062 more rows

When would you use one or the other (select() vs rename())?

Answer: select() always drops the columns which are not explicitly listed. rename() only renames the columns which are listed, but retains all other columns even though they are not listed.
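You can see the difference directly in the number of columns each one returns:

ncol(rename(df, id = sampleId))  # all 24 columns retained
ncol(select(df, id = sampleId))  # only the 1 listed column kept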


Exercise 8: Reorganizing columns

Let’s look at another useful application of the select() function: reordering columns. Our df metadata table has 24 columns. When we print it out, we only see a couple of them:

head(df, 3)
# A tibble: 3 × 24
  sampleId popId site  country region     continent groupLabel groupAge flag 
  <chr>    <chr> <chr> <chr>   <chr>      <chr>     <chr>      <chr>    <chr>
1 NA18486  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
2 NA18488  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
3 NA18489  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

When doing data analysis, we often work interactively in the console, focusing on a specific subset of columns, and we need to see the values of our columns of interest immediately, rather than having them buried in the rest of the (non-visible) output. How can we do this?

We already know that we can use select() to pick those columns of interest, but this removes the non-selected columns from the data frame we get back. Whenever we want to retain them, we can add a call to everything(), like this:

select(<data frame>, <column 1>, <column 2>, ..., everything())

This effectively moves <column 1>, <column 2>, … to the “front” of our table, and adds everything else at the end.

Select the subset of columns you selected in the previous exercise on renaming, in exactly the same way, but this time add a call to everything() at the end to keep the entire data set intact (with just the columns rearranged).

Please make sure to save the result to df again, so that the metadata table has renamed columns from now on (it will be important later). Just assign the select() result back into df using the <- assignment operator.

Our data frame before:

head(df, 3)
# A tibble: 3 × 24
  sampleId popId site  country region     continent groupLabel groupAge flag 
  <chr>    <chr> <chr> <chr>   <chr>      <chr>     <chr>      <chr>    <chr>
1 NA18486  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
2 NA18488  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
3 NA18489  YRI   <NA>  Nigeria WestAfrica Africa    YRI        Modern   0    
# ℹ 15 more variables: latitude <dbl>, longitude <dbl>, dataSource <chr>,
#   age14C <dbl>, ageHigh <dbl>, ageLow <dbl>, ageAverage <dbl>,
#   datingSource <chr>, coverage <dbl>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>
df <- select(df, sample = sampleId, population = popId, country, continent,
             set = groupAge, age = ageAverage, coverage,
             everything())

Our data frame after:

head(df, 3)
# A tibble: 3 × 24
  sample  population country continent set      age coverage site  region    
  <chr>   <chr>      <chr>   <chr>     <chr>  <dbl>    <dbl> <chr> <chr>     
1 NA18486 YRI        Nigeria Africa    Modern    NA       NA <NA>  WestAfrica
2 NA18488 YRI        Nigeria Africa    Modern    NA       NA <NA>  WestAfrica
3 NA18489 YRI        Nigeria Africa    Modern    NA       NA <NA>  WestAfrica
# ℹ 15 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Notice that we prioritized the selected columns of interest (and also renamed some for more readability), but we still have all the other columns available!


Experiment with the function relocate() (it uses the same format as select()). Try it with our df table, giving it the names of a couple of columns, similarly to what you did with select() above. What result do you get?

What happens when you do the same with select() (just use select() instead of relocate()) and add everything() after the last column?

relocate(df, flag, ageLow, continent)
# A tibble: 4,072 × 24
   flag  ageLow continent sample  population country set      age coverage site 
   <chr>  <dbl> <chr>     <chr>   <chr>      <chr>   <chr>  <dbl>    <dbl> <chr>
 1 0         NA Africa    NA18486 YRI        Nigeria Modern    NA       NA <NA> 
 2 0         NA Africa    NA18488 YRI        Nigeria Modern    NA       NA <NA> 
 3 0         NA Africa    NA18489 YRI        Nigeria Modern    NA       NA <NA> 
 4 0         NA Africa    NA18498 YRI        Nigeria Modern    NA       NA <NA> 
 5 0         NA Africa    NA18499 YRI        Nigeria Modern    NA       NA <NA> 
 6 0         NA Africa    NA18501 YRI        Nigeria Modern    NA       NA <NA> 
 7 0         NA Africa    NA18502 YRI        Nigeria Modern    NA       NA <NA> 
 8 0         NA Africa    NA18504 YRI        Nigeria Modern    NA       NA <NA> 
 9 0         NA Africa    NA18505 YRI        Nigeria Modern    NA       NA <NA> 
10 0         NA Africa    NA18507 YRI        Nigeria Modern    NA       NA <NA> 
# ℹ 4,062 more rows
# ℹ 14 more variables: region <chr>, groupLabel <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>,
#   hgYMajor <chr>, hgYMinor <chr>
select(df, flag, ageLow, continent, everything())
# A tibble: 4,072 × 24
   flag  ageLow continent sample  population country set      age coverage site 
   <chr>  <dbl> <chr>     <chr>   <chr>      <chr>   <chr>  <dbl>    <dbl> <chr>
 1 0         NA Africa    NA18486 YRI        Nigeria Modern    NA       NA <NA> 
 2 0         NA Africa    NA18488 YRI        Nigeria Modern    NA       NA <NA> 
 3 0         NA Africa    NA18489 YRI        Nigeria Modern    NA       NA <NA> 
 4 0         NA Africa    NA18498 YRI        Nigeria Modern    NA       NA <NA> 
 5 0         NA Africa    NA18499 YRI        Nigeria Modern    NA       NA <NA> 
 6 0         NA Africa    NA18501 YRI        Nigeria Modern    NA       NA <NA> 
 7 0         NA Africa    NA18502 YRI        Nigeria Modern    NA       NA <NA> 
 8 0         NA Africa    NA18504 YRI        Nigeria Modern    NA       NA <NA> 
 9 0         NA Africa    NA18505 YRI        Nigeria Modern    NA       NA <NA> 
10 0         NA Africa    NA18507 YRI        Nigeria Modern    NA       NA <NA> 
# ℹ 4,062 more rows
# ℹ 14 more variables: region <chr>, groupLabel <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>,
#   hgYMajor <chr>, hgYMinor <chr>

everything() is another useful helper function for select() operations, like starts_with() et al. above.


Is the select() & everything() combination needed when we have relocate()?

I don’t think so — I only learned about relocate() two days ago, after using R for ten years professionally. But there’s a lot of code out there using select() for this, so it’s good to be aware of this technique.

Exercise 9: Sorting rows based on column values

When you need to sort the rows of a data frame based on the values of one or multiple columns, you will need the function arrange(). Again, in the spirit of consistency across the tidyverse ecosystem, it follows exactly the same format of first giving the function a data frame to operate on, followed by a list of columns to sort by.

arrange(<data frame>, <column 1>, <column 2>, ...)

Note: When you want to reverse the order of the sort, you can wrap the column name in the helper function desc() (standing for “descending”).

arrange(<data frame>, desc(<column 1>), desc(<column 2>), ...)

Who is the oldest individual in your data who is not an archaic individual?

Hint: Remember that you can filter() out rows corresponding to archaic individuals with the condition set != "Archaic" and then pipe %>% the result into arrange() for sorting based on the column age.

It’s the famous Ust’-Ishim individual!

df %>% filter(set != "Archaic") %>% arrange(desc(age))
# A tibble: 4,069 × 24
   sample   population country continent set        age coverage site     region
   <chr>    <chr>      <chr>   <chr>     <chr>    <dbl>    <dbl> <chr>    <chr> 
 1 UstIshim UstIshim   Russia  Asia      Ancient 45020    35.2   Ust'-Is… North…
 2 Kostenki Kostenki   Russia  Europe    Ancient 37470     2.53  Kostenki Centr…
 3 SII      SII        Russia  Europe    Ancient 34234     4.25  Sunghir  Centr…
 4 SIII     SIII       Russia  Europe    Ancient 34092.   11.2   Sunghir  Centr…
 5 SIV      SIV        Russia  Europe    Ancient 33992     4.04  Sunghir  Centr…
 6 SI       SI         Russia  Europe    Ancient 32822.    1.16  Sunghir  Centr…
 7 Yana     Yana       Russia  Asia      Ancient 31950    26.5   Yana RHS North…
 8 Yana2    Yana2      Russia  Asia      Ancient 31950     6.81  Yana RHS North…
 9 NEO283   NEO283     Georgia Asia      Ancient 25635     0.786 Kotias … Weste…
10 MA1      MA1        Russia  Asia      Ancient 24305     1.09  Mal'ta   North…
# ℹ 4,059 more rows
# ℹ 15 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Similarly, who is the youngest ancient individual in your data, in terms of their dating?

Apparently it’s an individual from the Andaman Islands, dated to just 90 years ago:

df %>% filter(set == "Ancient") %>% arrange(age)
# A tibble: 1,664 × 24
   sample  population country   continent set       age coverage site     region
   <chr>   <chr>      <chr>     <chr>     <chr>   <dbl>    <dbl> <chr>    <chr> 
 1 Andaman Andaman    India     Asia      Ancient   90    16.6   Great A… South…
 2 890     890        Argentina America   Ancient  100     0.508 Beagle … South…
 3 894     894        Argentina America   Ancient  100     0.983 Beagle … South…
 4 895     895        Argentina America   Ancient  100     1.32  Beagle … South…
 5 MA577   MA577      Argentina America   Ancient  100     1.82  Tierra … South…
 6 Nr74    Nr74       Chile     America   Ancient  100     0.438 Strait … South…
 7 AM71    AM71       Chile     America   Ancient  100     0.103 Strait … South…
 8 523a_C  523a_C     USA       America   Ancient  125     0.956 Palm Si… North…
 9 Vt719   Vt719      Vietnam   Asia      Ancient  154.    0.257 Northea… South…
10 KOV-A-2 KOV-A-2    Iceland   Europe    Ancient  236     0.685 Kopavog… North…
# ℹ 1,654 more rows
# ℹ 15 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Does the same approach work for sorting text? What do you get when you try sorting based on the country column?

Yes, this gives us a standard alphabetical sort!

df %>% arrange(country)
# A tibble: 4,072 × 24
   sample    population country   continent set       age coverage site   region
   <chr>     <chr>      <chr>     <chr>     <chr>   <dbl>    <dbl> <chr>  <chr> 
 1 Aconcagua Aconcagua  Argentina America   Ancient  500    2.67   Cerro… South…
 2 890       890        Argentina America   Ancient  100    0.508  Beagl… South…
 3 894       894        Argentina America   Ancient  100    0.983  Beagl… South…
 4 895       895        Argentina America   Ancient  100    1.32   Beagl… South…
 5 MA577     MA577      Argentina America   Ancient  100    1.82   Tierr… South…
 6 NEO110    NEO110     Armenia   Asia      Ancient 7594    0.0605 Aknas… Weste…
 7 RISE423   RISE423    Armenia   Asia      Ancient 3256.   0.366  Nerqu… Weste…
 8 DA31      DA31       Armenia   Asia      Ancient 3314.   0.345  Lchas… Weste…
 9 DA35      DA35       Armenia   Asia      Ancient 3314.   1.62   Lchas… Weste…
10 RISE407   RISE407    Armenia   Asia      Ancient 2955    0.187  Norab… Weste…
# ℹ 4,062 more rows
# ℹ 15 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

What do you get when you try sorting based on country and then coverage? What happens when you sort based on coverage and then country? Why is there a difference?

df %>% arrange(country, coverage)
# A tibble: 4,072 × 24
   sample    population country   continent set       age coverage site   region
   <chr>     <chr>      <chr>     <chr>     <chr>   <dbl>    <dbl> <chr>  <chr> 
 1 890       890        Argentina America   Ancient  100    0.508  Beagl… South…
 2 894       894        Argentina America   Ancient  100    0.983  Beagl… South…
 3 895       895        Argentina America   Ancient  100    1.32   Beagl… South…
 4 MA577     MA577      Argentina America   Ancient  100    1.82   Tierr… South…
 5 Aconcagua Aconcagua  Argentina America   Ancient  500    2.67   Cerro… South…
 6 NEO110    NEO110     Armenia   Asia      Ancient 7594    0.0605 Aknas… Weste…
 7 RISE407   RISE407    Armenia   Asia      Ancient 2955    0.187  Norab… Weste…
 8 DA31      DA31       Armenia   Asia      Ancient 3314.   0.345  Lchas… Weste…
 9 RISE423   RISE423    Armenia   Asia      Ancient 3256.   0.366  Nerqu… Weste…
10 RISE397   RISE397    Armenia   Asia      Ancient 2902.   0.392  Kapan  Weste…
# ℹ 4,062 more rows
# ℹ 15 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>
df %>% arrange(coverage, country)
# A tibble: 4,072 × 24
   sample population country continent set       age coverage site        region
   <chr>  <chr>      <chr>   <chr>     <chr>   <dbl>    <dbl> <chr>       <chr> 
 1 NEO496 NEO496     Ukraine Europe    Ancient 10655   0.0124 Vasilevka-I Centr…
 2 NEO7   NEO7       Denmark Europe    Ancient  5240   0.0127 Sigersdal … North…
 3 NEO13  NEO13      Denmark Europe    Ancient  9511   0.0134 Hedegaard … North…
 4 NEO580 NEO580     Denmark Europe    Ancient  4612   0.0134 Klokkehøj   North…
 5 NEO1   NEO1       Denmark Europe    Ancient  6594   0.0150 Holmegård-… North…
 6 NEO916 NEO916     Russia  Asia      Ancient  5727   0.0175 Vengerovo-2 North…
 7 NEO41  NEO41      Denmark Europe    Ancient  5531   0.0195 Rude        North…
 8 NEO915 NEO915     Russia  Asia      Ancient  5727   0.0202 Vengerovo-2 North…
 9 NEO566 NEO566     Denmark Europe    Ancient  5135   0.0206 Døjringe    North…
10 NEO961 NEO961     Denmark Europe    Ancient  5149   0.0229 Avlebjerg … North…
# ℹ 4,062 more rows
# ℹ 15 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>

Does it matter in which order you apply filter() (on one column) and arrange() (on another column) if you’re using both functions? If yes, why? If not, why not? Think about the amount of work these functions have to do in each of the two scenarios.
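If you want to experiment, here is a minimal sketch comparing the two orderings (both return the same rows; think about how many rows arrange() has to sort in each case):

df %>% filter(set == "Ancient") %>% arrange(coverage)
df %>% arrange(coverage) %>% filter(set == "Ancient")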

Exercise 10: Mutating tables

Mutating a table means adding a column for a new variable. Again, as with the previously introduced functions select(), filter(), rename(), and arrange(), it follows a consistent tidyverse pattern:

df %>% mutate(<new column name> = <vector of the required length>)

Not surprisingly, the new column is assigned a vector of the same length as the number of rows in a data frame. What we often do is this:

df %>% mutate(<new column name> = <expression involving other columns>)

because mutate() allows us to refer to other columns in the data frame already present.

Please note that you can also modify an existing column using the same command. In this case, <new column name> above could simply be the name of an already existing column.
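For example, here is a minimal sketch which first creates a new column from an existing one and then modifies that column in place (age_ky is a hypothetical column name used only for illustration):

df %>%
  mutate(age_ky = age / 1000) %>%    # create a new column from an existing one
  mutate(age_ky = round(age_ky, 1))  # overwrite the column we just created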


To demonstrate this with a couple of exercises, let’s first remove a column from our data frame. It was originally created by mutate() in the first place, and this gives us a useful opportunity for practice (because we’ll soon add it back again ourselves):

df <- select(df, -set)

An important aspect of mutate() is that it is a vectorized operation: we can’t create a column and give values to only some of its rows. Here are two functions which help us generate values for entire columns at once, starting with a simple example of assigning a sex description based on sex chromosomes:

1. if_else()

if_else() accepts a logical vector (i.e., a vector of TRUE / FALSE values), and produces another vector which contains one value for each TRUE, and another value for each FALSE. Here is an example:

v <- c(TRUE, TRUE, FALSE, TRUE, FALSE)

if_else(v, 123, -123)
[1]  123  123 -123  123 -123

Notice that we get 123 for each TRUE and -123 for each FALSE.


The above can be a bit confusing, so spend a bit of time playing around with if_else(). For instance, create a different logical vector variable (containing arbitrary TRUE or FALSE values, up to you), and have the function produce the values “hello” and “bye” depending on the TRUE / FALSE state of each element of that vector.

Just get familiar with this vectorized thinking because whenever we do data science, it almost always happens in this vectorized fashion (such as operating on every single value in a column of a table at once, like we’ll do soon).

v <- c(FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)

if_else(v, "hello", "bye")
[1] "bye"   "bye"   "bye"   "hello" "hello" "bye"   "hello" "bye"  

As mentioned, this function is extremely useful whenever we need to generate values of a new column based on the values of other columns. The general pattern for using it inside a mutate() call is this:

df %>%
  mutate(<new_column> = if_else(<logical vector>, value_for_true, value_for_false))

You can see that the column df$sex contains the chromosomal sex determination of each individual as either XY or XX. Use the if_else() function in a mutate() call to create a new column sex_desc which will contain the character value “male” (if sex == "XY") and “female” otherwise.

Hint: Again, you might want to build an intuition first. Create a small vector of a mix of “XY” and “XX” values and store it in a variable sex. Then experiment with if_else()-based assignment of “female” and “male”. When you’re sure you got it, apply it to the data frame in the mutate() call to solve this exercise.

df <- mutate(df, sex_desc = if_else(sex == "XY", "male", "female"))

Run table(df$sex, df$sex_desc) to make sure the assignment of sex description worked as expected. How do you interpret the results? Did we miss something? If we did, what is wrong?

table(df$sex, df$sex_desc)
     
      female male
  XX    1862    0
  XXY      1    0
  XY       0 2209

Oh, we can see that there’s an “XXY” individual who – because of our improperly specified if_else() condition (which assigns individuals as male only when they are “XY”) – got assigned to the “female” category.

We would’ve noticed this in our exploratory phase, had we checked the range of possible values of the sex column. This is why getting familiar with the data the first time we look at it is so crucial. We could’ve (and should’ve) run this at the beginning:

table(df$sex)

  XX  XXY   XY 
1862    1 2209 
unique(df$sex)
[1] "XY"  "XX"  "XXY"

It turns out there is an individual with Klinefelter syndrome: a male carrying an extra X chromosome. Clearly the logic of our if_else() is wrong, because this individual was assigned as “female”. Take a look at your mutate() and if_else() code again to make sure you see why. How would you fix the mutate() (or rather the if_else()) command so that it correctly assigns the XXY individual as “male”?

There are a number of options. The simplest is to flip the condition:

df <- mutate(df, sex_desc = if_else(sex == "XX", "female", "male"))

But it may be better to be more explicit:

df <- mutate(df, sex_desc = if_else(sex %in% c("XY", "XXY"), "male", "female"))

We can verify that the assignment of sex description worked correctly now:

table(df$sex, df$sex_desc)
     
      female male
  XX    1862    0
  XXY      0    1
  XY       0 2209

This is a lesson to always remember to check assumptions in your data, even the ones you consider trivial. Functions such as table() and unique() are extremely useful for this.

Note: I originally got this wrong myself when preparing these materials, so this is a real-world cautionary tale.


Let’s practice if_else() a bit more. First, use filter() to look at the values of the age column for present-day individuals in the 1000 Genomes Project data (dataSource == "1000g"). What ages do you see?

The individuals have NA age. This could be annoying if we ever want to plot some population genetic statistics as functions of age:

df %>% filter(dataSource == "1000g")
# A tibble: 2,405 × 24
   sample  population country continent   age coverage site  region   groupLabel
   <chr>   <chr>      <chr>   <chr>     <dbl>    <dbl> <chr> <chr>    <chr>     
 1 NA18486 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 2 NA18488 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 3 NA18489 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 4 NA18498 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 5 NA18499 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 6 NA18501 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 7 NA18502 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 8 NA18504 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
 9 NA18505 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
10 NA18507 YRI        Nigeria Africa       NA       NA <NA>  WestAfr… YRI       
# ℹ 2,395 more rows
# ℹ 15 more variables: flag <chr>, latitude <dbl>, longitude <dbl>,
#   dataSource <chr>, age14C <dbl>, ageHigh <dbl>, ageLow <dbl>,
#   datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>,
#   hgYMajor <chr>, hgYMinor <chr>, sex_desc <chr>

Is there anyone in the 1000 GP data who has an age specified?

df %>% filter(dataSource == "1000g") %>% filter(!is.na(age))
# A tibble: 0 × 24
# ℹ 24 variables: sample <chr>, population <chr>, country <chr>,
#   continent <chr>, age <dbl>, coverage <dbl>, site <chr>, region <chr>,
#   groupLabel <chr>, flag <chr>, latitude <dbl>, longitude <dbl>,
#   dataSource <chr>, age14C <dbl>, ageHigh <dbl>, ageLow <dbl>,
#   datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>, ageRaw <chr>,
#   hgYMajor <chr>, hgYMinor <chr>, sex_desc <chr>

The 1000 GP individuals are missing an age value. It should be set to 0. Let’s fix that now and correct the metadata table.

You already know that you can get a logical vector indicating whether an element of another vector (here, the column age) is NA or not via the function is.na(). Use is.na(age) inside if_else() in a mutate() call to update the age column so that:

  1. Rows where is.na(age) is TRUE will be set to 0, and
  2. Rows where is.na(age) is FALSE will be set to age (because those values don’t need to be replaced).

Note: Using mutate() to replace values of an existing column (rather than creating a new column) is done very often in data science, particularly in “clean up” operations like this one. Your data will never be “perfect” (especially if you get it from some other source) and you will need to do a lot of so-called “table munging” to prepare it for analysis and plotting, just like we’re doing here.

Hint: Again, if you need help, you can start slowly by creating a toy example variable age which will have a vector of a mix of numbers, including some NA values. Then practice creating an if_else() expression which will return 0 in place of NA values in this vector, and keep every other value intact.
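Following the hint, here is a minimal sketch of that logic on a toy vector (this age is a stand-alone toy variable, not the column in df):

age <- c(100, NA, 250, NA, 42)
if_else(is.na(age), 0, age)
[1] 100   0 250   0  42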

Here’s the fix applied to our metadata table:

df <- df %>% mutate(age = if_else(is.na(age), 0, age))

2. case_when()

if_else() only works with binary TRUE or FALSE conditions. But what if we want to create a new column with values from multiple categories, not just two? Recall our original column set (which we dropped from our df table earlier), which had the values “Archaic”, “Ancient”, or “Modern”.

For this purpose, case_when() is perfect. It works in a very similar manner to if_else(), but allows not-just-binary categorization. Consider this vector of numbers between 1 and 20:

v <- 1:20

If we wanted to assign the value “less than 10” or “10 or more” to each element, we could use the function if_else() like this:

if_else(v < 10, "less than 10", "10 or more")
 [1] "less than 10" "less than 10" "less than 10" "less than 10" "less than 10"
 [6] "less than 10" "less than 10" "less than 10" "less than 10" "10 or more"  
[11] "10 or more"   "10 or more"   "10 or more"   "10 or more"   "10 or more"  
[16] "10 or more"   "10 or more"   "10 or more"   "10 or more"   "10 or more"  

What if we want to introduce three or more categories? case_when() to the rescue! Try running this yourself on the vector v created above.

case_when(
  v < 10  ~ "less than 10",
  v == 10 ~ "exactly 10",
  v > 10 ~ "more than 10"
)
 [1] "less than 10" "less than 10" "less than 10" "less than 10" "less than 10"
 [6] "less than 10" "less than 10" "less than 10" "less than 10" "exactly 10"  
[11] "more than 10" "more than 10" "more than 10" "more than 10" "more than 10"
[16] "more than 10" "more than 10" "more than 10" "more than 10" "more than 10"

Remember the slightly annoying time we had with vectorized conditional logical expressions on TRUE / FALSE vectors in our Bootcamp session? Now it’s all paying off!

Each of the three conditions actually results in a logical vector, and case_when() decides which value to produce for each element based on the first condition that evaluates to TRUE.

The case_when() function has a very useful optional argument called .default =, which determines the value it should return whenever none of the conditions apply to a particular element – either because they all evaluate to FALSE, or because they produce an NA value (so none of the conditions can apply even in principle).
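.default is easiest to see in action. Here is a minimal sketch on the vector v from above, deliberately leaving out the v == 10 condition:

case_when(
  v < 10 ~ "less than 10",
  v > 10 ~ "more than 10",
  .default = "neither"
)

The element equal to 10 matches neither condition, so it receives the “neither” value instead of NA.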


First, let’s start simple and reimplement the mutate() operation to add the sex_desc column again (just to practice on something we already know), this time using case_when(), implementing the three conditions (“XX”, “XY”, “XXY”) individually. Additionally, use the .default = argument of case_when() to assign “unknown” as a fallback value of the sex_desc column.

df <-
  df %>%
  mutate(sex_desc = case_when(
    sex == "XX" ~ "female",
    sex == "XY" ~ "male",
    sex == "XXY" ~ "male",
    .default = "unknown"
  ))

Let’s check that this worked correctly:

table(df$sex, df$sex_desc)
     
      female male
  XX    1862    0
  XXY      0    1
  XY       0 2209

In the exercise on conditional expressions, you learned about the & and | operators, which make it possible to combine multiple conditions into a single TRUE / FALSE vector. Of course, this means that the conditions inside the case_when() function can also use those operators.

Create a new column sex_set which will contain the following values:

  1. sex == "XX" & age == 0 should produce "female (present-day)",
  2. (sex == "XY" | sex == "XXY") & age == 0 should produce "male (present-day)",
  3. sex == "XX" & age > 0 should produce "female (ancient)",
  4. (sex == "XY" | sex == "XXY") & age > 0 should produce "male (ancient)",
  5. any other combination should default to “other”

I provided the code of the logical conditions for you; you just have to put them in a case_when() call appropriately, modifying the solution to the previous exercise. Verify the outcome by running table(df$sex_set) again.

Why did I put parentheses around the sub-clauses involving the | OR operator? What happens if we don’t do this?

df <-
  df %>%
  mutate(sex_set = case_when(
    sex == "XX" & age == 0                   ~ "female (present-day)",
    sex == "XX" & age > 0                    ~ "female (ancient)",
    (sex == "XY" | sex == "XXY") & age == 0  ~ "male (present-day)",
    (sex == "XY" | sex == "XXY") & age > 0   ~ "male (ancient)",
    .default = "other"
  ))

Let’s check that this worked correctly:

table(df$sex_set)

    female (ancient) female (present-day)       male (ancient) 
                 641                 1221                 1026 
  male (present-day) 
                1184 

Note: Again, admittedly this might seem arbitrary to you (why would we need something like this for data analysis and statistics?), but this kind of “arbitrary categorization” is extremely useful for generating factor categories, especially for plotting with ggplot2 later. So keep this in mind for the time being.


Remember how we removed the set column, which originally contained the values “Modern”, “Ancient”, or “Archaic”, depending on whether the individual was a present-day modern human, an ancient modern human, or an archaic individual, respectively? Use what you learned in this exercise about if_else() or case_when() to reconstruct this column based on the information in the remaining columns. (There are many possible solutions, so don’t be afraid to think creatively!)

Hint: Using the information in age and the names of the three archaic individuals (combined with the .default argument of case_when()) is probably the easiest solution.

# let's pick out individual names of the archaics
archaics <- c("Vindija33.19", "AltaiNeandertal", "Denisova")

df <-
  df %>%
  mutate(set = case_when(
    age == 0             ~ "Modern",  # whoever has sampling date 0 is "Modern"
    sample %in% archaics ~ "Archaic", # whoever is among archaics is "Archaic"
    .default             = "Ancient"  # and only the third category remains
  ),
  .after = coverage)

Note the use of .after = in my solution to the previous exercise. Look up the help of mutate() in ?mutate to see what it does and why I used it.

Usually, when we add a new column using mutate(), we don’t want to add it to the very end of the table (where we cannot see it), which is what the function does by default. .after and .before are options which allow us to specify where among the columns already present in the data frame we want to add the new column.
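A minimal sketch of the difference (age_ky is again a hypothetical column used only for illustration):

df %>% mutate(age_ky = age / 1000)                # the new column ends up last
df %>% mutate(age_ky = age / 1000, .after = age)  # the new column sits right after age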

Exercise 11: Summarizing tables

You have already learned how to operate on rows (filter(), arrange()), columns (select(), rename()), and both rows and columns (mutate()). We have one remaining piece of the puzzle, and that is operating on groups of rows. This takes the tidyverse functionality to a whole new level and allows you to do many more powerful things with tabular data, such as computing summary statistics.

In this section, we will cover the functions group_by(), summarize(), and various associated “slice functions”.

1. group_by()

Take a look at the first couple of rows of our metadata table again:

df
# A tibble: 4,072 × 26
   sample  population country continent   age coverage set    site  region    
   <chr>   <chr>      <chr>   <chr>     <dbl>    <dbl> <chr>  <chr> <chr>     
 1 NA18486 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 2 NA18488 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 3 NA18489 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 4 NA18498 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 5 NA18499 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 6 NA18501 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 7 NA18502 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 8 NA18504 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 9 NA18505 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
10 NA18507 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
# ℹ 4,062 more rows
# ℹ 17 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>, sex_desc <chr>, sex_set <chr>

Now run the following code in your R console and carefully inspect the output (and compare it to the result of the previous command):

group_by(df, continent)
# A tibble: 4,072 × 26
# Groups:   continent [4]
   sample  population country continent   age coverage set    site  region    
   <chr>   <chr>      <chr>   <chr>     <dbl>    <dbl> <chr>  <chr> <chr>     
 1 NA18486 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 2 NA18488 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 3 NA18489 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 4 NA18498 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 5 NA18499 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 6 NA18501 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 7 NA18502 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 8 NA18504 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 9 NA18505 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
10 NA18507 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
# ℹ 4,062 more rows
# ℹ 17 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>, sex_desc <chr>, sex_set <chr>

You can probably see that the data hasn’t changed at all (same number of columns, same number of rows, all remains the same), but you can notice a new piece of information in the output:

[...]
# Groups:   continent [4]
[...]

The output says that the data has been grouped by the column/variable continent. This means that any subsequent operation will be applied not to individual rows, like it was with our mutate() operations earlier, but they will now work “per continent”. You can imagine this as the group_by() function adding a tiny bit of invisible annotation data which instructs downstream functions to work per group.

Before we move on to computing summary statistics, let’s also note that we can, of course, group based on multiple columns. For instance, in our data, we should probably summarize based not only on continent but also on age, splitting individuals by the broad age sets like this:

group_by(df, continent, set)
# A tibble: 4,072 × 26
# Groups:   continent, set [10]
   sample  population country continent   age coverage set    site  region    
   <chr>   <chr>      <chr>   <chr>     <dbl>    <dbl> <chr>  <chr> <chr>     
 1 NA18486 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 2 NA18488 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 3 NA18489 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 4 NA18498 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 5 NA18499 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 6 NA18501 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 7 NA18502 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 8 NA18504 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
 9 NA18505 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
10 NA18507 YRI        Nigeria Africa        0       NA Modern <NA>  WestAfrica
# ℹ 4,062 more rows
# ℹ 17 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>, sex_desc <chr>, sex_set <chr>

Grouping is just the first step though, and doesn’t do anything on its own…

2. summarize()

Why is the grouping introduced above even useful? The simplest use case is computing summary statistics on the data on a “per group” basis using the summarize() function. This function works a little bit like mutate(), because it creates new columns, but it creates only one row for each group.

For instance, we could compute the mean coverage of the individuals in each group like this, and arrange the groups by the coverage available (try running this in your R console to see this in practice!):

df %>%
  group_by(continent, set) %>%                   # first create groups
  summarize(mean_coverage = mean(coverage)) %>%  # then create summaries for each group
  arrange(desc(mean_coverage))
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 10 × 3
# Groups:   continent [4]
   continent set     mean_coverage
   <chr>     <chr>           <dbl>
 1 Asia      Archaic         35.3 
 2 Europe    Archaic         24.4 
 3 Africa    Ancient          4.99
 4 America   Ancient          2.67
 5 Asia      Ancient          1.71
 6 Europe    Ancient          1.41
 7 Africa    Modern          NA   
 8 America   Modern          NA   
 9 Asia      Modern          NA   
10 Europe    Modern          NA   

Notice that the result has only three columns, unlike the many columns in the original df table! We have a column for each variable we grouped over, plus a new column with our summary statistic. Additionally, we only have 10 rows, one for each combination of continent and set. So, this is similar to mutate() in that it creates new variables/columns, but unlike mutate(), it collapses each group into a single row.


What does the NA value for “Modern” individuals in the output of the previous code chunk mean? Take a look at the original data df to answer this question.

The reason is simple – the present-day individuals are high-quality imputed genotypes, which is why the sequencing coverage column is full of NA (not available, i.e., missing data) values. The mean of a vector containing NA values is, itself, NA too:

mean(c(1, NA, 2, NA, 42, NA))
[1] NA
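As a side note, whenever we do want the mean of just the non-missing values, the na.rm = TRUE argument of mean() drops the NAs first (we’ll use this later when averaging coordinates):

mean(c(1, NA, 2, NA, 42, NA), na.rm = TRUE)
[1] 15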

summarize() is even more powerful than that because we can compute many different things at the same time, for each group! This next exercise is an example of that.

As you know, the mean (such as the mean_coverage we computed right above) is computed as a sum of a set of values divided by their count. Use the group_by() and summarize() code exactly like you did above, but instead of computing the mean() directly, compute it manually by doing (in a summarize() call):

  1. sum_coverage = sum(coverage) (the sum of coverages for each group),
  2. n = n() (the number of values in each group),
  3. mean_coverage = sum_coverage / n (using the two quantities just computed to calculate the mean)

To make things a little tidier, arrange() the results by the column mean_coverage.

Once you have your results, save this new table in a variable called df_summary.

Here’s our entire pipeline to compute the mean coverage “manually” instead of the built-in function mean() like we did in the previous exercise:

df_summary <-
  df %>%                               # take the data frame df...
  group_by(continent, set) %>%         # ... group rows on continent and set...
  summarize(                           # ... and compute for each group:
    sum_coverage = sum(coverage),      #       1. the sum of all coverage
    n = n(),                           #       2. the number of coverages
    mean_coverage = sum_coverage / n   #       3. the mean using 1. and 2. above
  ) %>%
  arrange(mean_coverage)               # ... then sort by average coverage
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
df_summary
# A tibble: 10 × 5
# Groups:   continent [4]
   continent set     sum_coverage     n mean_coverage
   <chr>     <chr>          <dbl> <int>         <dbl>
 1 Europe    Ancient       1722.   1220          1.41
 2 Asia      Ancient        600.    350          1.71
 3 America   Ancient        205.     77          2.67
 4 Africa    Ancient         84.8    17          4.99
 5 Europe    Archaic         24.4     1         24.4 
 6 Asia      Archaic         70.6     2         35.3 
 7 Africa    Modern          NA     504         NA   
 8 America   Modern          NA     504         NA   
 9 Asia      Modern          NA     993         NA   
10 Europe    Modern          NA     404         NA   

Look up the function ?ungroup. Why is it useful?

Although grouping is very useful for computing summary statistics across groups of values, we often need to discard the grouping afterwards, because we might want to continue working with the summarized table using other tidyverse functions without operating on the pre-defined groups. For instance, we might want to follow up a summarize() operation with a mutate() operation that works on the whole table, not per group.
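To see where this matters, here is a minimal sketch: summarize() drops only the last grouping level (here set), so a follow-up mutate() would still operate per continent unless we ungroup() first:

df %>%
  group_by(continent, set) %>%
  summarize(n = n()) %>%
  ungroup() %>%              # without this, sum(n) would be computed per continent
  mutate(prop = n / sum(n))  # proportion across the entire table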


Bonus exercise (feel free to skip this): How would you compute a 95% confidence interval for the mean coverage using group_by() and summarize()? Remember that you will need the standard deviation (R function sd()) and also the number of observations (function n()). Look up the confidence interval equation on Wikipedia.

group_by(df, continent, set) %>%
    filter(set == "Ancient") %>%
    summarise(mean_cov=mean(coverage), sd_cov=sd(coverage), n_cov=n()) %>%
    mutate(
        se_cov = sd_cov / sqrt(n_cov),
        lower_ci = mean_cov - qt(1 - (0.05 / 2), n_cov - 1) * se_cov,
        upper_ci = mean_cov + qt(1 - (0.05 / 2), n_cov - 1) * se_cov
    )
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
# A tibble: 4 × 8
# Groups:   continent [4]
  continent set     mean_cov sd_cov n_cov se_cov lower_ci upper_ci
  <chr>     <chr>      <dbl>  <dbl> <int>  <dbl>    <dbl>    <dbl>
1 Africa    Ancient     4.99   5.47    17 1.33       2.18     7.80
2 America   Ancient     2.67   4.77    77 0.543      1.59     3.75
3 Asia      Ancient     1.71   3.90   350 0.208      1.30     2.12
4 Europe    Ancient     1.41   2.81  1220 0.0804     1.25     1.57

3. slice_ functions

Finally, in addition to computing various summaries on groups using summarize(), we have five “slicing” functions at our disposal, each of which extracts specific row(s) from each defined group. These functions are:

  • slice_head(n = 1) takes the first row,
  • slice_tail(n = 1) takes the last row,
  • slice_min(<column>, n = 1) takes the row with the smallest value of <column>,
  • slice_max(<column>, n = 1) takes the row with the largest value of <column>,
  • slice_sample(n = 1) takes one random row.
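For instance, a quick sketch of the one function the exercises below don’t cover, picking one random individual per continent:

df %>% group_by(continent) %>% slice_sample(n = 1)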

Which individual has the lowest coverage in each continent and set group? While you’re doing this, filter() out the set of “Modern” individuals, because they are not meaningful here anyway (they have NA coverages), composing the entire command as a series of steps in a %>% pipeline.

At the end of your %>% pipeline, pipe your results into relocate(continent, set) to make everything even tidier.

df %>%
  filter(set != "Modern") %>%
  group_by(continent, set) %>%
  slice_min(coverage, n = 1) %>%
  relocate(continent, set)
# A tibble: 6 × 26
# Groups:   continent, set [6]
  continent set     sample       population country    age coverage site  region
  <chr>     <chr>   <chr>        <chr>      <chr>    <dbl>    <dbl> <chr> <chr> 
1 Africa    Ancient gun002       gun002     Canary…   860.   0.213  Tene… North…
2 America   Ancient AM71         AM71       Chile     100    0.103  Stra… South…
3 Asia      Ancient NEO916       NEO916     Russia   5727    0.0175 Veng… North…
4 Asia      Archaic Denisova     Denisova   Russia  80000   25.8    Deni… North…
5 Europe    Ancient NEO496       NEO496     Ukraine 10655    0.0124 Vasi… Centr…
6 Europe    Archaic Vindija33.19 Vindija33… Croatia 41950   24.4    Vind… Centr…
# ℹ 17 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>, sex_desc <chr>, sex_set <chr>

Modify your %>% pipeline from the exercise above to answer this: who is the oldest individual we have available on each continent?

df %>%
  filter(set != "Modern") %>%
  group_by(continent) %>%
  slice_max(age) %>%
  relocate(continent, set, age)
# A tibble: 4 × 26
# Groups:   continent [4]
  continent set        age sample       population country coverage site  region
  <chr>     <chr>    <dbl> <chr>        <chr>      <chr>      <dbl> <chr> <chr> 
1 Africa    Ancient   7885 I10871       I10871     Camero…     15.2 Shum… Centr…
2 America   Ancient  12649 Clovis       Clovis     USA         15.1 Mont… North…
3 Asia      Archaic 125000 AltaiNeande… AltaiNean… Russia      44.8 Deni… North…
4 Europe    Archaic  41950 Vindija33.19 Vindija33… Croatia     24.4 Vind… Centr…
# ℹ 17 more variables: groupLabel <chr>, flag <chr>, latitude <dbl>,
#   longitude <dbl>, dataSource <chr>, age14C <dbl>, ageHigh <dbl>,
#   ageLow <dbl>, datingSource <chr>, sex <chr>, hgMT <chr>, gpAvg <dbl>,
#   ageRaw <chr>, hgYMajor <chr>, hgYMinor <chr>, sex_desc <chr>, sex_set <chr>

Exercise 12: Writing (and reading) tables

Above you created a summary table df_summary. Let’s try writing it to disk using write_tsv(). The .tsv stands for “tab-separated values” (TSV), a file format similar to another one you might have heard about, the “comma-separated values” (CSV) format. Both can be opened in Excel, in case that’s useful at some point, but they are also very handy for computational workflows.


Look up the help in ?write_tsv. Which parameters do you need to specify, at minimum, to save the file? This function has all of the useful options set as defaults, so there’s not much you have to do. Still, look up the options!

The only thing we really want to do is specify the data frame to write to disk and the filename:

write_tsv(df_summary, "~/Desktop/df_summary.tsv")

What are other means of writing files you found in ?write_tsv? How would you read a CSV file into R? What about other file formats?

Hint: Look up ?read_csv.
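For instance, a minimal sketch of a CSV round trip, reusing the file path from above (write_csv() and read_csv() are readr’s CSV counterparts of the TSV functions):

write_csv(df_summary, "~/Desktop/df_summary.csv")  # write a CSV version
df_csv <- read_csv("~/Desktop/df_summary.csv")     # and read it back in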


A very useful R package is readxl, for reading Excel files into R. Google this package, and install it with install.packages("readxl"). If you want, and you have some Excel file on your computer, try reading it into R with library(readxl); read_excel("path/to/your/excel/file").

“Take home exercises”

Which country has the highest number of aDNA samples (i.e., samples for which set == "Ancient")? How many samples from Estonia, Lithuania, or Latvia do we have?

Hint: Use the group_by(country) and summarize() combo again, together with the n() helper function.

Count the number of samples from each country, and order them by this count:

countries_df <-
  df %>%
  filter(set == "Ancient") %>%
  group_by(country) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

countries_df
# A tibble: 58 × 2
   country        n
   <chr>      <int>
 1 Russia       283
 2 Denmark      184
 3 Sweden       174
 4 Italy        149
 5 UK            76
 6 Kazakhstan    64
 7 Estonia       60
 8 Ukraine       60
 9 Poland        56
10 France        54
# ℹ 48 more rows
countries_df %>% filter(country %in% c("Estonia", "Lithuania", "Latvia"))
# A tibble: 3 × 2
  country       n
  <chr>     <int>
1 Estonia      60
2 Latvia        7
3 Lithuania     1

Is there any evidence (real or not, just out of curiosity!) of a potential bias between how many set == "Ancient" samples we have from a country and the geographical location of that country, such as the average longitude or latitude of samples from there? Using the lm() function, compute a linear model of the count n of samples from each country as a function of avg_lat or avg_lon (the average latitude and longitude of samples in each country), all of these quantities computed with group_by() and summarize() like you did above.

Hint: If you’re not familiar with this, look up the linear model function ?lm or perhaps this tutorial to see how you can build a linear regression between a response variable (like our count n) and an explanatory variable (such as average longitude or latitude of a country), as computed by the group_by() and summarize() combination (the result of which you should save to a new data frame variable, to be used in the lm() function).

Let’s first compute our summary statistics:

countries_df <-
  df %>%
  filter(set == "Ancient") %>%
  group_by(country) %>%
  summarise(n = n(),
            avg_lat = mean(latitude, na.rm = TRUE),
            avg_lon = mean(longitude, na.rm = TRUE))

countries_df
# A tibble: 58 × 4
   country            n avg_lat avg_lon
   <chr>          <int>   <dbl>   <dbl>
 1 Argentina          5  -50.2    -69.5
 2 Armenia            6   40.1     45.3
 3 Brazil             5  -19.5    -43.9
 4 Cameroon           2    5.86    10.1
 5 Canada             6   47.0    -94.1
 6 Canary Islands     5   28.2    -16.2
 7 Chile              4  -52.7    -72.9
 8 China              7   43.6     93.2
 9 CzechRepublic      9   49.9     15.0
10 Denmark          184   55.6     11.0
# ℹ 48 more rows

Compute the linear regression of n as a function of avg_lon – there doesn’t appear to be any significant relationship between the two variables:

lm_res <- lm(n ~ avg_lon, data = countries_df)
summary(lm_res)

Call:
lm(formula = n ~ avg_lon, data = countries_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-27.785 -24.543 -21.474   7.008 254.543 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.822477   7.431781   3.878 0.000279 ***
avg_lon     -0.005886   0.123753  -0.048 0.962235    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 52.45 on 56 degrees of freedom
Multiple R-squared:  4.039e-05, Adjusted R-squared:  -0.01782 
F-statistic: 0.002262 on 1 and 56 DF,  p-value: 0.9622
plot(countries_df$avg_lon, countries_df$n)
abline(lm_res)

Compute the linear regression of n as a function of avg_lat – it looks like there is a significant relationship between the number of samples from a country and its geographical latitude! Is this a signal of aDNA surviving better in colder climates? :)

lm_res <- lm(n ~ avg_lat, data = countries_df)
summary(lm_res)

Call:
lm(formula = n ~ avg_lat, data = countries_df)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.085 -28.042 -13.834   7.972 242.851 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   8.2158    10.9508   0.750    0.456  
avg_lat       0.5848     0.2501   2.338    0.023 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 50.07 on 56 degrees of freedom
Multiple R-squared:  0.08891,   Adjusted R-squared:  0.07264 
F-statistic: 5.465 on 1 and 56 DF,  p-value: 0.023
plot(countries_df$avg_lat, countries_df$n)
abline(lm_res)

Does it mean anything? Hard to tell, but it does make for a fun exercise. :)


Take your own data and play around with it using the concepts you learned above. If the data isn’t in a form that’s readily readable as a table with something like read_tsv(), please talk to me! tidyverse is huge and there are packages for munging all kinds of data. I’m happy to help you out.

Don’t worry about “getting somewhere”. Playing and experimenting (and doing silly things) is the best way to learn.

For more inspiration on other things you could do with your data, take a look at the dplyr cheatsheet.

In each following session, you’ll have the opportunity to do the same. The next session will focus on more real-world insights from data, and the section after that will focus on creating beautiful visualizations using the ggplot2 package.

Closing remarks

Hopefully you’re now seeing that really internalizing these few tidyverse “verbs” gives you an entirely new, natural way of not just modifying and filtering data, but of “thinking in data” (as pretentious and hippy as this must seem to you right now).

When you watch experts doing data science using tidyverse in real time, like an experienced colleague helping you out, you will see that they can “think about the data” at the same time as they are typing tidyverse commands. Over time, with practice, you will get there too.

And this is actually just the beginning! The true power of data science arrives when we learn ggplot2 visualization later. ggplot2 uses the same philosophical principles as the dplyr commands we practiced above, but allows you to “visualize information at the same time as you’re thinking about it”.