Introduction to tidyverse

(A few remarks and tips before the practical session)

Quick recap from our R bootcamp yesterday

We were not supposed to finish everything, so no stress.

The goal was to get familiar with the building blocks of a “data frame”.

Vectors and lists

  • Vectors are collections of values of the same type:
sample   <- c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal")
coverage <- c(18.2,        35.2,       13.4,     44.8)
archaic  <- c(FALSE,       FALSE,      FALSE,    TRUE)

  • Lists are collections of anything:
list("Hello", TRUE, 123)
[[1]]
[1] "Hello"

[[2]]
[1] TRUE

[[3]]
[1] 123

… and that “anything” can also include other vectors!
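
A minimal sketch of the difference, reusing the vectors from above. The names `mixed` and `l` are made up for this illustration: a vector forces all its values into one type, while a list keeps each element as it is, whole vectors included.

```r
sample   <- c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal")
coverage <- c(18.2, 35.2, 13.4, 44.8)

# A vector holds one type only: mixing types converts everything to text
mixed <- c("Loschbour", 18.2, FALSE)
class(mixed)     # "character"

# A list keeps each element unchanged, including entire vectors
l <- list(sample, coverage)
class(l[[1]])    # "character"
class(l[[2]])    # "numeric"
length(l)        # 2 (two elements, each of them a vector)
```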

An example of such a list of vectors…


From vectors stored as individual variables…


sample   <- c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal")
coverage <- c(18.2,        35.2,       13.4,     44.8)
age      <- c(8050,        45020,      3885,     125000)


… to the same vectors stored as a (named) list:


list(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  age      = c(8050,        45020,      3885,     125000)
)
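
Once the vectors live in a named list, each one can be pulled out by name. A small sketch, assuming we store the list in a variable called `samples` (a name chosen here for illustration only):

```r
samples <- list(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  age      = c(8050,        45020,      3885,     125000)
)

names(samples)         # "sample" "coverage" "age"
samples$coverage       # the whole coverage vector
samples[["age"]][2]    # second element of the age vector: 45020
```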

A data frame is just that


A list of vectors…


list(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  age      = c(8050,        45020,      3885,     125000)
)


… all of the same length, which is simply printed as a table.


data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  age      = c(8050,        45020,      3885,     125000)
)
           sample coverage    age
1       Loschbour     18.2   8050
2        UstIshim     35.2  45020
3          Saqqaq     13.4   3885
4 AltaiNeandertal     44.8 125000
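
One way to convince yourself that a data frame really is a list of vectors is to ask R directly. A short sketch using only base R functions on the table above:

```r
df <- data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  age      = c(8050,        45020,      3885,     125000)
)

is.list(df)         # TRUE: a data frame is a list under the hood
length(df)          # 3: one list element per column
names(df)           # "sample" "coverage" "age"
df[["coverage"]]    # list-style access returns the column as a vector
nrow(df)            # 4: every column vector has the same length
```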

Indexing into tables: df[rows, cols]


Indexing by columns (“selecting columns”)

df[, c("sample", "coverage")]
           sample coverage
1       Loschbour     18.2
2        UstIshim     35.2
3          Saqqaq     13.4
4 AltaiNeandertal     44.8


Indexing by rows (“filtering rows”)

  1. using row numbers:
df[c(2, 3), ]
    sample coverage   age
2 UstIshim     35.2 45020
3   Saqqaq     13.4  3885

  2. using TRUE/FALSE for each row:
df[c(FALSE, TRUE, FALSE, TRUE), ]
           sample coverage    age
2        UstIshim     35.2  45020
4 AltaiNeandertal     44.8 125000
df[df$coverage > 30, ] # same thing!
           sample coverage    age
2        UstIshim     35.2  45020
4 AltaiNeandertal     44.8 125000

We can also extract columns with $

If df is our data frame:

           sample coverage    age
1       Loschbour     18.2   8050
2        UstIshim     35.2  45020
3          Saqqaq     13.4   3885
4 AltaiNeandertal     44.8 125000

  • We can do this:
df$age
[1]   8050  45020   3885 125000

  • And also this:
mean(df$age)
[1] 45488.75

  • Or maybe this, etc.:
is.na(df$age)
[1] FALSE FALSE FALSE FALSE
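
Because `$` gives back an ordinary vector, it can be combined with everything we know about vectors. A small sketch; the age cutoff of 10000 is an arbitrary value picked for illustration.

```r
df <- data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  age      = c(8050,        45020,      3885,     125000)
)

# Names of the samples older than 10000 years
df$sample[df$age > 10000]         # "UstIshim" "AltaiNeandertal"

# Name of the oldest sample
df$sample[which.max(df$age)]      # "AltaiNeandertal"
```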

The bootcamp was “a trial by fire”


tidyverse makes everything we had to do the hard way infinitely easier.


tidyverse.org



Nine “core” R packages and a “philosophy of data science design” which inspired many more specialized packages.

link to the paper

What is tidyverse?

The tidyverse is a language for solving data science challenges with R code. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of R packages that share a high-level design philosophy […] so that learning one package makes it easier to learn the next.

The tidyverse encompasses the repeated tasks at the heart of every data science project: data import, tidying, manipulation, visualisation, and programming.
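
To make the “manipulation” part a little more concrete, here is a minimal sketch that compares base R indexing from the bootcamp with the dplyr verbs `filter()` and `select()`. It assumes the dplyr package is installed; the small table and the names `base_result` and `tidy_result` are just for illustration, not the data used in the exercises.

```r
library(dplyr)  # assumption: the dplyr package is installed

df <- data.frame(
  sample   = c("Loschbour", "UstIshim", "Saqqaq", "AltaiNeandertal"),
  coverage = c(18.2,        35.2,       13.4,     44.8),
  age      = c(8050,        45020,      3885,     125000)
)

# Base R: filter rows and select columns in a single indexing step
base_result <- df[df$coverage > 30, c("sample", "coverage")]

# dplyr: the same two steps, each spelled out as a verb
tidy_result <- df %>%
  filter(coverage > 30) %>%
  select(sample, coverage)

tidy_result
```

Both versions keep the same two samples. The dplyr version reads top to bottom as a sequence of steps, which is the kind of “conversation about data” the quote refers to.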

This is still very abstract

In the spirit of hands-on interactivity, we will let “theory” and practice work hand in hand during the exercises.

Further companion study material

https://r4ds.hadley.nz

Let’s talk about our example data

“Western Eurasia witnessed several large-scale human migrations during the Holocene. Here, to investigate the cross-continental effects of these migrations, we shotgun-sequenced 317 genomes—mainly from the Mesolithic and Neolithic periods—from across northern and western Eurasia. These were imputed alongside published data to obtain diploid genotypes from more than 1,600 ancient humans [and about 2,500 present-day humans].”

Our exercises will focus on two MesoNeo data sets:

  • Table of metadata information associated with each sample
  • Genome-wide data set of Identity-by-Descent segments

Why those two data sets?

  1. Best representatives of modern population genetic data
  2. Lots of opportunities to practice tidyverse data processing
  3. Even more opportunities to showcase ggplot2 possibilities

The main reason…

A great example of how to approach totally unfamiliar data!


True story.


Recently, I was given this exact data set. I had to find my way around it, and figure out how to build a project around it.


The exercises are retracing my own data exploration journey!

Let’s get started!

  1. Go to www.bodkan.net/simgen
  2. Click on “Introduction to tidyverse” in the left panel
     • This session will focus on the metadata
     • The next session, “More tidyverse practice”, digs into the IBD data
  3. The “Cheatsheets and handouts” section in the left panel has a single-page version of these slides and the dplyr cheatsheet
  4. Open RStudio and start working!