library(slendr)
init_env(quiet = TRUE)
source(here::here("popgen_files/popgen_utils.R"))
<- population("popA", time = 3000, N = 5000)
popA <- population("popB", time = 1500, N = 5000, parent = popA)
popB <- population("popC", time = 1500, N = 5000, parent = popA)
popC
<- compile_model(list(popA, popB, popC), generation_time = 1)
model
<- schedule_sampling(model, times = seq(3000, 0, by = -200),
schedule list(popA, 10), list(popB, 10), list(popC, 10))
plot_model(model, proportions = TRUE, samples = schedule)
Exploring PCA patterns
Note: We’re unlikely to make it to this part, so this is extra bonus which can be ignored!
In earlier lectures you’ve learned about the Principal Component Analysis, a dimensional-reduction technique that’s a central piece of many (if not most) papers studying population history of many (hundreds) of individuals genotyped across many (millions) SNPs.
Although incredibly popular, PCA has recently been a topic of controversy, with a highly combative paper criticizing the application of PCA in all of population genetics, almost to a point of claiming that none of the results can be published. As an example, take a look at Figure 5 in the paper, demonstrating that an Indian population can be arbitrarily placed in proximity to Europeans, East Asians, and Africans, simply by wiggling the sample sizes of all groups before they applying PCA on their genotype data.
This activity session is not a place to debate the merit of PCA (or any other popgen method for that matter). What we can do (as we’ve done in previous exercises) is to show how slendr can be used in a very easy way to evaluate patterns, expected behavior, and overall dynamics of many popgen metrics in a controlled settings – and in this way, verify our intuition and check the robustness of a method in question – using PCA as another example. We’ll be specifically looking at patterns which arise in PCA depending on the exact spatio-temporal sampling from a demographic model.
Part 1: Create a model
Start a new script named exercise5.R
with the following setup of another toy demographic model:
As you can see, this model describes the demographic history of three populations: one ancestral population “popA” starting at 3000 generations ago, which splits into two populations “popB” and “popC” the same time at 1500 generations ago. We den instruct slendr to record 10 individuals from each of the three populations starting from 3000 generations ago all the way to 0 generations ago (i.e. the “present”), every 200 generations (remeber the seq()
R function!).
Part 2: Simulate a (mutated!) tree sequence
To be able to run PCA using the smartsnp R package (below), we will need to simulate data in the EIGENSTRAT file format. And to do that, we need our tree sequence with mutations.
Recall that all of our previous exercises managed to do away with mutations completely, owing to the amazing nature of the succint tree sequence data structure invented by the people behind the tskit project. However, all traditional popgen software and tools still rely on genotype data, which is why we now have to simulate mutations as well. Luckily, this is very easy – instead of the traditional
<- msprime(model, sequence_length = ..., recombination_rate = ..., samples = ...) ts
we will run this:
# First run a normal msprime simulation creating a tree-sequence object, then
# directly pipe it into a function which adds (neutral!) mutations to it
<- msprime(model, sequence_length = ..., recombination_rate = ..., samples = ...) %>%
ts ts_mutate(mutation_rate = ...)
which is equivalent to running this without the %>%
“pipe operator”:
# First run a normal msprime simulation creating a tree-sequence object...
<- msprime(model, sequence_length = ..., recombination_rate = ..., samples = ...)
ts_nomuts # ... then add (neutral!) mutations to it
<- ts_mutate(ts_nomuts, mutation_rate = ...) ts
With that out of the way, simulate a tree sequence from the popA/popB/popC model
above, which will be 50 Mb (50e6) long, with a recombination rate 1e-8, and overlay mutations on it at a rate 1e-8. Check that it has mutations either by typing out ts
in the console and looking for a “Mutations” section of the summary, or by using the function ts_table(ts, "mutations")
. Then count how many individuals you have recorded for each population using the table produced by ts_samples()
.
# First run a normal msprime simulation creating a tree-sequence object, then
# directly pipe it into a function which adds (neutral!) mutations to it
<- msprime(model, samples = schedule, sequence_length = 50e6, recombination_rate = 1e-8, random_seed = 1702182272)
ts_nomuts
# Notice we have no mutations on the tree sequence, just as before...
ts_nomuts
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 160,114║
╟───────────────┼───────────╢
║Sequence Length│ 50,000,000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 620║
╟───────────────┼───────────╢
║Total Size │ 26.1 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges │619,886│ 18.9 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│ 310│ 8.5 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes │ 92,209│ 2.5 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│ 3│303 Bytes│ Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│ 1│ 5.2 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧═══════╧═════════╧════════════╝
<- ts_mutate(ts_nomuts, mutation_rate = 1e-8)
ts
# ... but we have them now!
ts
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 160,114║
╟───────────────┼───────────╢
║Sequence Length│ 50,000,000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 620║
╟───────────────┼───────────╢
║Total Size │ 36.1 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges │619,886│ 18.9 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│ 310│ 8.5 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations │168,709│ 6.0 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes │ 92,209│ 2.5 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│ 3│303 Bytes│ Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│ 2│ 6.0 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Sites │168,423│ 4.0 MiB│ No║
╚═══════════╧═══════╧═════════╧════════════╝
# Get the table of individuals (and process it a bit for tidier plotting later)
<- ts_samples(ts) %>% mutate(pop = factor(pop, levels = c("popA", "popB", "popC")))
samples
# Count how many individuals do we have for each population
%>% group_by(pop) %>% count() samples
# A tibble: 3 × 2
# Groups: pop [3]
pop n
<fct> <int>
1 popA 150
2 popB 80
3 popC 80
Part 3: Converting a tree sequence into EIGENSTRAT
The function to use for converting a tree-sequence object we have in R (in our exercises the thing we usually had in the ts
variable) to disk in form of genotypes in the EIGENSTRAT format is called ts_eigenstrat()
. The standard way to call it (but see ?ts_eigenstrat
for more options) is like this:
ts_eigenstrat(ts, prefix = "path/to/a/desired/EIGENSTRAT/prefix")
Which creates three files .ind
, .snp
, and .geno
as: - path/to/a/desired/EIGENSTRAT/prefix.ind
, - path/to/a/desired/EIGENSTRAT/prefix.snp
, and - path/to/a/desired/EIGENSTRAT/prefix.geno
,
just as you would expect for any EIGENSTRAT file.
Take your tree sequence ts
just just simulated, and convert it to EIGENSTRAT format under the prefix popgen_files/ABC_all
.
ts_eigenstrat(ts, "popgen_files/ABC_all")
181 multiallelic sites (0.107% out of 168423 total) detected and removed
EIGENSTRAT object
=================
components:
ind file: popgen_files/ABC_all.ind
snp file: popgen_files/ABC_all.snp
geno file: popgen_files/ABC_all.geno
Check that the EIGENSTRAT files really appeared at the path that you specified (in the terminal).
Part 4: Inspect the EIGENSTRAT data produced by slendr
Years ago I developed a small R package to help me with \(f\)-statistics based projects using the ADMIXTOOLS software (which operates on data in the EIGENSTRAT file format), called admixr
Use the following code to examine one of the EIGENSTRAT data sets you’ve just created. Just look at the results and see if they make sense in terms of what you’ve learned about this in earlier lectures.
library(admixr)
<- eigenstrat("<prefix of a trio of EIGENSTRAT .ind/.snp/.geno files")
eigen
# Print out a summary of the EIGENSTRAT data
eigen
# Read the .ind file as a table into R
read_ind(eigen)
# Read the .snp file as a table into R
read_snp(eigen)
# Read the .geno file as a table into R
read_geno(eigen)
library(admixr)
<- eigenstrat("popgen_files/ABC_all")
eigen
# Print out a summary of the EIGENSTRAT data
eigen
# Read the .ind file as a table into R
read_ind(eigen)
# Read the .snp file as a table into R
read_snp(eigen)
# Read the .geno file as a table into R
read_geno(eigen)
Part 5: Principal Component Analysis on the entire simulated data set
Now, at long last, we have everything we need to be able to run ABC on the data generated by our slendr model. To avoid making this exercise even longer, I provided a helper function for you called plot_pca()
. But this function isn’t doing anything magical – it uses the smartsnp R package to compute the principal components and visualize the results using ggplot2. This is something many of you could do given enough time but we want to focus on simulations and PCA, not pure R coding. If you’re interested, take a look at my implementation of plot_pca()
here.
Here’s how you can use this function (remeber that you need to put source(here::here("popgen_files/popgen_utils.R"))
into your script!):
- Plot PCA while coloring each individual by their population assignment:
plot_pca("path/to/prefix", <tree sequence used to create EIGENSTRAT>, color_by = "pop")
- Plot PCA while coloring each individual by their time of sampling:
plot_pca("path/to/prefix", <tree sequence used to create EIGENSTRAT>, color_by = "time")
- By default, the function plots PC 1 vs PC 2, but you can customize things by providing an optional argument
pc =
like this:
plot_pca("path/to/prefix", <tree sequence used to create EIGENSTRAT>, color_by = "pop", pc = c(2, 3))
Use the provided plot_pca()
function to run PCA based on genotypes for all recorded individuals that you just converted as EIGENSTRAT "popgen_files/ABC_all"
from the ts
tree sequence. Visualize PC 1 vs PC 2 by first ccolor each individual by their population label (color_by = "pop"
) then by the time of their sampling (color_by = "time"
).
Does the PCA of PC 1 vs PC 2 capture the relationship between all individuals across the populations and across time?
plot_pca("popgen_files/ABC_all", ts, color_by = "pop", pc = c(1, 2))
PCA cache for the given EIGENSTRAT data was not found. Generating it now (this might take a moment)...
plot_pca("popgen_files/ABC_all", ts, color_by = "time", pc = c(1, 2))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
It looks like the PCA from PC 1 vs 2 cannot “partition out” the drift along the ancestral “popA” lineage prior to the population splits!
Use plot_pca()
to compute the PCAon this exact same data, but examine how does the shape of the PCA scatterplot change when you switch the pairs of PCs plotted (i.e., PC 2 vs PC 3, PC 3 vs PC 4, PC 4 vs PC 6, etc.). Which pair of PCs does the best job at recapitulating the demographic model?
Note: We’re doing this purely for educational purposes and for fun, using an extremely idealistic demographic model which is perfectly known (by definition, because we simulated it) and perfect sampling scheme. The point is to explore what does doing a PCA mean in practice, visually, and to built intuition into it.
# We can see that the overall shape of the demographic model tree is now nicely
# reflected in the PCA shape
plot_pca("popgen_files/ABC_all", ts, color_by = "pop", pc = c(2, 3))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
plot_pca("popgen_files/ABC_all", ts, color_by = "time", pc = c(2, 3))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
plot_pca("popgen_files/ABC_all", ts, color_by = "pop", pc = c(3, 4))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
plot_pca("popgen_files/ABC_all", ts, color_by = "time", pc = c(3, 4))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
# Things are getting progressively wilder!
plot_pca("popgen_files/ABC_all", ts, color_by = "pop", pc = c(4, 5))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
plot_pca("popgen_files/ABC_all", ts, color_by = "time", pc = c(4, 5))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
# ...
plot_pca("popgen_files/ABC_all", ts, color_by = "pop", pc = c(5, 6))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
plot_pca("popgen_files/ABC_all", ts, color_by = "time", pc = c(5, 6))
PCA cache for the given EIGENSTRAT data was found. Loading it to save computational time...
Bonus 1: Tree-sequence simplification and EIGENSTRAT conversion
One of our goals in this exercise was to investigate how does the shape of a PCA look like based on the sampling of individuals across populations and also across time – all of that from the same demographic history. In order to do that, we need to be able to select only a defined subset of individuals from a given tree sequence. Which brings us to the last tree-sequence processing function in slendr callsed ts_simplify()
. Implemented on top of the simplify()
method in tskit, it has a very simple interface:
<- ts_simplify(ts_big, simplify_to = c(<subset of individuals as a vector>)) ts_small
This function call creates a new tree sequence, which is smaller and only contains a those individuals whose names were specified in simplify_to =
(again, we’re talking about the “symbolic names” of individuals, such as “NEA_1”, “AFR_42”, etc., not integer numbers of tskit nodes).
Whenever you want to create smaller subsets of a large tree sequence, it is often helpful to work with the table of all individuals in the original tree sequence, because it contains every individual’s name
, pop
assignment and the time
in which it lived, so let’s save it for further use now:
<- ts_samples(ts)
samples
nrow(samples)
[1] 310
For instance, we can get only individuals from “popB” and “popC” sampled at the present using this code:
<- filter(samples, pop %in% c("popB", "popC"), time == 0)
subset
nrow(subset)
[1] 20
You know that the table of samples contains the name
of each individual, which you can access as subset$name
. Use the ts_simplify()
function to create a new tree sequence called ts_BC0
which contains only this subset of individuals. Check that it really does contain only the defined subset of individuals using ts_samples(ts_BC0)
.
<- ts_simplify(ts, simplify_to = subset$name)
ts_BC0
ts_samples(ts_BC0)
# A tibble: 20 × 3
name time pop
<chr> <dbl> <chr>
1 popB_71 0 popB
2 popB_72 0 popB
3 popB_73 0 popB
4 popB_74 0 popB
5 popB_75 0 popB
6 popB_76 0 popB
7 popB_77 0 popB
8 popB_78 0 popB
9 popB_79 0 popB
10 popB_80 0 popB
11 popC_71 0 popC
12 popC_72 0 popC
13 popC_73 0 popC
14 popC_74 0 popC
15 popC_75 0 popC
16 popC_76 0 popC
17 popC_77 0 popC
18 popC_78 0 popC
19 popC_79 0 popC
20 popC_80 0 popC
When you have a smaller tree sequence like this, you can convert it to an EIGENSTRAT file format using ts_eigenstrat()
just like we did above.
Part 2: PCA visualization on subsets of individuals
TODO: Try to replicate some features of the results discovered for EUR/ASIAN/AFR/Indian populations.