library(slendr)
init_env()
## The interface to all required Python modules has been activated.
<- population("CHIMP", time = 7e6, N = 5000)
chimp <- population("AFR", parent = chimp, time = 6e6, N = 15000)
afr <- population("EUR", parent = afr, time = 60e3, N = 3000)
eur <- population("NEA", parent = afr, time = 600e3, N = 1000, remove = 40e3)
nea
<- gene_flow(from = nea, to = eur, rate = 0.03, start = 55000, end = 50000)
gf
<- compile_model(
model populations = list(chimp, nea, afr, eur),
gene_flow = gf,
generation_time = 30
)
# Here we scheduled the sampling of two Neanderthals at 70kya and 40kya
<- schedule_sampling(model, times = c(70000, 40000), list(nea, 1))
nea_samples
# Here we schedule one Chimpanzee sample, 5 African samples, and 10 European samples
<- schedule_sampling(model, times = 0, list(chimp, 1), list(afr, 5), list(eur, 10))
present_samples
# We also schedule the recording of one European sample between 50kya and 2kya,
# every 2000 years
<- seq(40000, 2000, by = -2000)
times <- schedule_sampling(model, times, list(eur, 1))
emh_samples
# Because those functions produce nothing but a data frame, we can bind
# individual sampling schedules together
<- rbind(nea_samples, present_samples, emh_samples)
schedule
# Simulate a tree sequence (with mutations) -- feel free to increase the amount
# of sequence from 20Mb to something like 100Mb (100e6 bp) if you want to reduce
# the statistical noise a bit
<-
ts msprime(model, sequence_length = 20e6, recombination_rate = 1e-8, samples = schedule, random_seed = 1269258439) %>%
ts_mutate(mutation_rate = 1e-8, random_seed = 1269258439)
ts## ╔═══════════════════════════╗
## ║TreeSequence ║
## ╠═══════════════╤═══════════╣
## ║Trees │ 58,891║
## ╟───────────────┼───────────╢
## ║Sequence Length│ 20,000,000║
## ╟───────────────┼───────────╢
## ║Time Units │generations║
## ╟───────────────┼───────────╢
## ║Sample Nodes │ 76║
## ╟───────────────┼───────────╢
## ║Total Size │ 17.1 MiB║
## ╚═══════════════╧═══════════╝
## ╔═══════════╤═══════╤═════════╤════════════╗
## ║Table │Rows │Size │Has Metadata║
## ╠═══════════╪═══════╪═════════╪════════════╣
## ║Edges │210,568│ 6.4 MiB│ No║
## ╟───────────┼───────┼─────────┼────────────╢
## ║Individuals│ 38│ 1.1 KiB│ No║
## ╟───────────┼───────┼─────────┼────────────╢
## ║Migrations │ 0│ 8 Bytes│ No║
## ╟───────────┼───────┼─────────┼────────────╢
## ║Mutations │132,951│ 4.7 MiB│ No║
## ╟───────────┼───────┼─────────┼────────────╢
## ║Nodes │ 44,368│ 1.2 MiB│ No║
## ╟───────────┼───────┼─────────┼────────────╢
## ║Populations│ 4│341 Bytes│ Yes║
## ╟───────────┼───────┼─────────┼────────────╢
## ║Provenances│ 2│ 5.7 KiB│ No║
## ╟───────────┼───────┼─────────┼────────────╢
## ║Sites │132,553│ 3.2 MiB│ No║
## ╚═══════════╧═══════╧═════════╧════════════╝
Tree-sequence statistics
In this exercise, you will build on top of the results from the exercise on programming demographic models. Specifically, we will learn how to compute popgen statistics on slendr-simulated tree sequences using slendr’s interface to the tskit Python module.
Note: This exercise session is split into individual “Parts” intended to be solved in order. Different tasks within each individual Part are highlighted in bold.
If you ever get totally stuck, take a look at the provided solution. The point of these exercises is to learn things, not fight through endless frustration. That said, try not to look at the solution automatically, always take a stab at solving each exercise on your own first.
First, create a new R script stats.R
and paste in the following code. This is one of the possible solutions to the Exercise 1, and it’s easier if we all use it to be on the same page from now on, starting from the same model and the same simulated tree sequence:
As a sanity check, let’s use a couple of tidyverse table-munging tricks to make sure the tree sequence does contain a set of sample which matches our intended sampling schedule (particularly the time series of European individuals and the two Neanderthals):
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# total number of recorded individuals in the tree sequence
ts_samples(ts) %>% nrow
[1] 38
# times of sampling of each recorded European individual
ts_samples(ts) %>% filter(pop == "EUR") %>% pull(time)
[1] 40000 38000 36000 34000 32000 30000 28000 26000 24000 22000 20000 18000
[13] 16000 14000 12000 10000 8000 6000 4000 2000 0 0 0 0
[25] 0 0 0 0 0 0
# times of sampling of each recorded Neanderthal
ts_samples(ts) %>% filter(pop == "NEA") %>% pull(time)
[1] 70000 40000
# count of individuals in each population
ts_samples(ts) %>%
group_by(pop, present_day = time == 0) %>%
%>%
tally select(pop, present_day, n) %>%
arrange(present_day)
# A tibble: 5 × 3
# Groups: pop [4]
pop present_day n
<chr> <lgl> <int>
1 EUR FALSE 20
2 NEA FALSE 2
3 AFR TRUE 5
4 CHIMP TRUE 1
5 EUR TRUE 10
Note: These bits of tidyverse code are extremely helpful when you’re working with large tree sequences with many individuals as sanity checks that your sampling worked as intended. I’m listing them here in case you’ve never worked with the tidyverse family of R packages before (such as the dplyr package where filter()
, group_by()
, tally()
, and pull()
come from).
Everything looks good! Having made sure that the ts
object contains the individuals we want, let’s move to the exercise.
Part 1: Computing nucleotide diversity
The toy model of ancient human history plotted above makes a fairly clear prediction of what would be the nucleotide diversity expected in the simulated populations. Compute the nucleotide diversity in all populations using the slendr function ts_diversity()
in your tree sequence ts
. Do you get numbers that (relatively between all populations) match what would expect from the model given the \(N_e\) that you programmed for each?
Hint: Nearly every slendr statistic function interfacing with tskit accepts a ts
tree-sequence object as its first argument, with further arguments being either a vector of individual names representing a group of samples to compute a statistic on, or a (named) list of such vectors (each element of that list for a group of samples) – these lists are intended to be equivalent to the sample_sets =
argument of many tskit Python methods (which you’ve learned about in your activity on tskit), except that they allow symbolic names of individuals, rather then integer indices of nodes in a tree sequence.
Although you can get all the above information by processing the table produced by the ts_samples()
function, slendr provides a useful helper function ts_names()
which only returns the names of individuals as a vector (or a named list of such vectors, one vector per population as shown below).
When you call it directly, you get a plain vector of individual names:
ts_names(ts)
[1] "NEA_1" "EUR_1" "NEA_2" "EUR_2" "EUR_3" "EUR_4" "EUR_5"
[8] "EUR_6" "EUR_7" "EUR_8" "EUR_9" "EUR_10" "EUR_11" "EUR_12"
[15] "EUR_13" "EUR_14" "EUR_15" "EUR_16" "EUR_17" "EUR_18" "EUR_19"
[22] "EUR_20" "AFR_1" "AFR_2" "AFR_3" "AFR_4" "AFR_5" "CHIMP_1"
[29] "EUR_21" "EUR_22" "EUR_23" "EUR_24" "EUR_25" "EUR_26" "EUR_27"
[36] "EUR_28" "EUR_29" "EUR_30"
This is not super helpful, unless we want to compute some statistic for everyone in the tree sequence, regardless of their population assignment. Perhaps a bit more useful is to call the function like this, because it will produce a result which can be immediately used as the sample_sets =
argument mentioned in the Hint above:
ts_names(ts, split = "pop")
$AFR
[1] "AFR_1" "AFR_2" "AFR_3" "AFR_4" "AFR_5"
$CHIMP
[1] "CHIMP_1"
$EUR
[1] "EUR_1" "EUR_2" "EUR_3" "EUR_4" "EUR_5" "EUR_6" "EUR_7" "EUR_8"
[9] "EUR_9" "EUR_10" "EUR_11" "EUR_12" "EUR_13" "EUR_14" "EUR_15" "EUR_16"
[17] "EUR_17" "EUR_18" "EUR_19" "EUR_20" "EUR_21" "EUR_22" "EUR_23" "EUR_24"
[25] "EUR_25" "EUR_26" "EUR_27" "EUR_28" "EUR_29" "EUR_30"
$NEA
[1] "NEA_1" "NEA_2"
As you can see, this gave us a normal R list, with each element containing a vector of individual names in a population. Note that we can use standard R list indexing to get subsets of individuals:
<- ts_names(ts, split = "pop")
names
"NEA"] names[
$NEA
[1] "NEA_1" "NEA_2"
c("EUR", "NEA")] names[
$EUR
[1] "EUR_1" "EUR_2" "EUR_3" "EUR_4" "EUR_5" "EUR_6" "EUR_7" "EUR_8"
[9] "EUR_9" "EUR_10" "EUR_11" "EUR_12" "EUR_13" "EUR_14" "EUR_15" "EUR_16"
[17] "EUR_17" "EUR_18" "EUR_19" "EUR_20" "EUR_21" "EUR_22" "EUR_23" "EUR_24"
[25] "EUR_25" "EUR_26" "EUR_27" "EUR_28" "EUR_29" "EUR_30"
$NEA
[1] "NEA_1" "NEA_2"
etc.
Many of the following exercises will use these kinds of tricks to instruct various slendr / tskit functions to compute statistics on subsets of all individuals sub-sampled in this way.
After you computed nucleotide diversity per-population, compute it for each individual separately using the same function ts_diversity()
(which, in this setting, gives you effectively the heterozygosity for each individual). If you are familiar with plotting in R, visualize the individual-based heterozygosities across all populations.
Hint: You can do this by giving a vector of names as sample_sets =
(so not an R list of vectors). You could also use the data frame produced by ts_samples(ts)
to get the names, just adding the heterozygosities to that data frame as a new column.
Part 2: Computing pairwise divergence
Use the function ts_divergence()
to compute genetic divergence between all pairs of populations. Again, do you get results compatible with our demographic model in terms of expectation given the split times between populations as you programmed them for your model?
Hint: Again, you can use the same concept of sample_sets =
we discussed in the previous part. In this case, the function computes pairwise divergence between each element of the list given as sample_sets =
(i.e., for each vector of individual names).
Part 3: Detecting Neanderthal admixture in Europeans
Let’s now pretend its about 2008, we’ve sequenced the first Neanderthal genome, and we are working on a project that will change human evolution research forever. We also have the genomes of a couple of people from Africa and Europe, which we want to use to answer the most burning question of all evolutionary anthropology: “Do some people living today carry Neanderthal ancestry?”
Earlier you’ve learned about \(f\)-statistics of various kinds. You have also heard that an \(f_4\) statistic (or its equivalent \(D\) statistic) can be used as a test of “treeness”. Simply speaking, for some “quartet” of individuals or population samples, they can be used as a hypothesis test of whether the history of those samples is compatible with there not having been an introgression.
Compute the \(f_4\) test of Neanderthal introgression in EUR individuals using the slendr function ts_f4()
. When you’re running it, you will have to provide individuals to compute the statistic using a slightly different format. Take a look at the help page available as ?ts_f4
for more information. When you’re computing the \(f_4\), make sure to set mode = "branch"
argument of the ts_f4()
function (we will get to why a bit later).
Note: By default, each slendr / tskit statistic function operates on mutations, and this will switch them to use branch length (as you might know, \(f\)-statistics are mathematically defined using branch lengths in trees and mode = "branch"
does exactly that).
Hint: If you haven’t learned this in your \(f\)-statistics lecture, you want to compute (and compare) the values of these two statistics using the slendr function ts_f4()
:
\(f_4\)(<some African>, <another African>; <Neanderthal>, <Chimp>)
\(f_4\)(<some African>, <a test European>; <Neanderthal>, <Chimp>),
here <individual>
can be the name of any individual recorded in your tree sequence, such as names you saw as name
column in the table returned by ts_samples(ts)
(i.e. "NEA_1"
could be used as a “representative” <Neanderthal> in those equations, similarly for "CHIMP_1"
as the fourth sample in the \(f_4\) quarted representing the outgroup).
To simplify things a lot, we can understand the above equations as comparing the counts of so-called BABA and ABBA allele patterns between the quarted of samples specified in the statistics:
\[ f_4(AFR, X; NEA, CHIMP) = \frac{\#BABA - \#ABBA}{\#SNPs} \]
The first \(f_4\) statistic above is not expected to give values “too different” from 0 (even in case of Neanderthal introgression into Europeans) because we don’t expect two African individuals to differ “significantly” in terms of how much alleles they share with a Neanderthal (because their ancestors never met Neanderthals!). The other should – if there was a Neanderthal introgression into Europeans some time in their history – be “significantly negative”.
Is the second of those two statistics “much more negative” than the first, as expected assuming introgression from Neanderthals into Europeans?
Why am I putting “significantly” and “much more negative” in quotes in the previous sentence? What are we missing here for this being a true hypothesis test as you might be accustomed to from computing \(f\)-statistics using a tool such as ADMIXTOOLS? (We will get to this again in the following part of this exercise.)
Part 4: Detecting Neanderthal admixture in Europeans v2.0
The fact that we don’t get something equivalent to a p-value in these kinds of simulations is generally not a problem, because we’re often interested in establishing a trend of a statistic under various conditions, and understanding when and how its expected value behaves in a certain way. If statistical noise is a problem, we work around this by computing a statistic on multiple simulation replicates or even increasing the sample sizes.
Note: To see this in practice, you can check out a paper in which I used this approach quite successfully on a related problem.
On top of that, p-value of something like an \(f\)-statistic (whether it’s significantly different from zero) is also strongly affected by quality of the data, sequencing errors, coverage, etc. (which can certainly be examined using simulations!). However, these are aspects of modeling which are quite orthogonal to the problem of investigating the expectations and trends of statistics given some underlying evolutionary model, which is what we’re after in these exercises.
That said, even in perfect simulated data, what exactly does “significantly different from zero compared to some baseline expectation” mean can be blurred by noise with simple single-individual comparisons that we did above. Let’s increase the sample size a bit to see if a statistical pattern expected in \(f_4\) statistic from our Neanderthal introgression model becomes more apparent.
Compute the first \(f_4\) statistic (the baseline expectation between a pair of Africans) and the second \(f_4\) statistic (comparing an African and a European), but this time on all recorded Africans and all recorded Europeans, respectively. Plot the distributions of those two sets of statistics. This should remove lots of the uncertainty and make a statistical trend stand out much more clearly.
Hint: Whenever you need to compute something for many things in sequence, looping is very useful. One way to do compute, say, an \(f_4\) statistic over many individuals is by using this kind of pattern using R’s looping function lapply()
:
# Loop over vector of individual names (variable x) and apply a given ts_f4()
# expression on each individual (note the ts_f4(..., X = x, ...) in the code)
<- lapply(
list_f4 c("ind_1", "ind_2", ...),
function(x) ts_f4(ts, W = "AFR_1", X = x, Y = "NEA_1", Z = "CHIMP_1", mode = "branch")
)
# The above gives us a list of data frames, so we need to bind them all into a
# single table for easier interpretation and visualization
<- do.call(rbind, list_f4) df_f4