library(slendr)
init_env()
# <replace this with `population()` definitions like in the slides>
# <replace this with your gene-flow definition in variable `gf`>
<- compile_model(
model populations = list(...), # <put your list of populations here>
gene_flow = gf,
generation_time = 30
)
Demographic models
In this exercise, you will practice building demographic models from scratch using the programmable interface provided by the slendr R package. In this context, you can understand “demographic model” as “a tree-like topological structure encoding the relationships between populations and gene flows between them”. For the time being, these models will be always neutral and will conform to Wright-Fisher assumptions.
Part 1: Building a demographic model in slendr
Use functions such as population()
, gene_flow()
, and compile_model()
, which we discussed in the “slendr crash course”, to program the following toy model of human demographic history in slendr. (Apologies for my bad handwriting and the lack of any artistic skill.)
Note: You could easily program the model so that different ancestral populations are represented by separate population()
commands (i.e., your model would start with a population called “human_chimp_ancestor” from which a “CHIMP” and “hominin_ancestor” populations would split at 6 Mya, etc.) but generally this is too annoying to do and requires too much code.
Feel free to write the model so that “CHIMP” is the first population, then “AFR” population splits from it at 6 Mya, etc… Although it probably isn’t the most accurate way to describe the real evolutionary history, it simplifies the coding significantly.
[Mya = million years ago; kya = thousand years ago]
Hint: Create a new script models.R
in your RStudio session using the following “template”. Then add a sequence of appropriate population()
calls using the syntax from the introductory slides (using the parent = <pop>
argument for programming splits of daughter populations – which will be all except the CHIMP lineage in our example), etc.
Note: With slendr you can specify time in whatever format is more convenient or readable for your model. For instance here, because we’re dealing with historical events which are commonly expressed in times given as”years ago”, we can write them in a decreasing order – i.e. 7Mya → 6Mya → …, as shown above – or, in terms of R code, 7e6 (or 7000000), 6e6 (6000000), etc.
In a later example, you will see that you can also encode the events in the time direction going “forward” (i.e., the first event starting in generation 1, a following event in generation 42, and so on).
Hint: Remember that slendr is designed with interactivity in mind! When you write a chunk of code (such as a command to create a population through a population split, or model compilation to create a model
object), execute that bit of code in the R console and inspect the summary information printed by evaluating the respective R object you just created. You can either copy-pasted stuff from your script to the R console, or use a convenient RStudio shortcut like Ctrl+Enter (Linux and Windows), or Cmd+Enter (Mac).
library(slendr)
init_env()
## The interface to all required Python modules has been activated.
dir.create("popgen_files/introgression", recursive = TRUE, showWarnings = FALSE)
# Chimpanzee outgroup
<- population("CHIMP", time = 7e6, N = 5000)
chimp
# Two populations of anatomically modern humans: Africans and Europeans
<- population("AFR", parent = chimp, time = 6e6, N = 15000)
afr <- population("EUR", parent = afr, time = 60e3, N = 3000)
eur
# Neanderthal population splitting at 600 ky ago from modern humans
# (becomes extinct by 40 ky ago)
<- population("NEA", parent = afr, time = 600e3, N = 1000, remove = 40e3)
nea
# Neanderthal introgression event (3% admixture between 55-50 kya)
<- gene_flow(from = nea, to = eur, rate = 0.03, start = 55000, end = 50000)
gf
# Compile the entire model into a single slendr R object
<- compile_model(
model populations = list(chimp, nea, afr, eur),
gene_flow = gf,
generation_time = 30,
path = "popgen_files/introgression", # <--- don't worry about these two
overwrite = TRUE, force = TRUE # <--- lines of code (ask me if interested)
)
Part 2: Inspecting the model visually
To visualize a slendr model, you use the function plot_model()
. Plot your compiled model
to make sure you programmed it correctly! Your figure should roughly correspond to my doodle above.
Note: Plotting of models in slendr can be sometimes a little wonky, especially if many things are happening at once. When plotting your model, experiment with arguments log = TRUE
, proportions = TRUE
, gene_flow = TRUE
. Check ?plot_model
for more information on these.
plot_model(model)
plot_model(model, sizes = FALSE)
plot_model(model, sizes = FALSE, log = TRUE)
plot_model(model, log = TRUE, proportions = TRUE)
Part 3: Simulating genomic data
Once you have a compiled slendr model stored in an R variable (from now on, model
will always mean a variable containing a compiled slendr model object relevant for the given exercise, for simplicity), we can simulate data from it. By default, slendr models always produce a tree sequence.
Note: Tree sequence provides an extremely efficient means to store and work with genomic data at a massive scale. However, you can always get simulated data even in traditional file formats, such as VCF, EIGENSTRAT, or a plain old table of ancestral/derived genotypes.
In this activity we will be only working with tree sequences, because it’s much easier and faster to get interesting statistics from it directly in R.
There are two simulation engines built into slendr implemented by functions msprime()
and slim()
. For traditional, non-spatial, neutral demographic models, the engine provided by the msprime()
function is much more efficient, so we’ll be using that for the time being. However, from a popgen theoretical perspective, both simulation functions will give you the same results for any given compiled slendr model (up to some level of stochastic noise, of course).
Note: Yes, this means you don’t have to write any msprime (or SLiM) code to simulate data from a slendr model!
Here’s how you can use the function to simulate a tree sequence from the model you’ve just created using compile_model()
in your script:
<- msprime(
ts
model,sequence_length = <length of sequence to simulate [as bp]>,
recombination_rate = <uniform recombination rate [per bp per generation]>
)
You will be seeing this kind of pattern over and over again in this exercise, so it’s a good idea to keep it in mind.
Hint: The msprime()
function has also arguments debug
and run
which can be extremely useful for debugging.
Simulate a tree sequence from your compiled model
using the msprime()
engine, storing it to a variable ts
as shown right above. Use sequence_length = 1e6
(so 1 Mb of sequence) and recombination_rate = 1e-8
(crossover events per base pair per generation). Then experiment with setting debug = TRUE
(this prints out msprime’s own debugging summary which you might already be familiar with from your previous activity?) and then run = FALSE
(this prints out a raw command-line which can run a slendr simulation in the shell).
# This simulates a tskit tree sequence from a slendr model. Note that you didn't have
# to write any msprime or tskit Python code!
<- msprime(model, sequence_length = 1e6, recombination_rate = 1e-8)
ts
# Setting `debug = TRUE` instructs slendr's built-in msprime script to print
# out msprime's own debugger information. This can be very useful for debugging,
# in addition to the visualization of the model as shown above.
<- msprime(model, sequence_length = 1e6, recombination_rate = 1e-8, debug = TRUE)
ts ## DemographyDebugger
## ╠═════════════════════════════════════╗
## ║ Epoch[0]: [0, 1.67e+03) generations ║
## ╠═════════════════════════════════════╝
## ╟ Populations (total=4 active=4)
## ║ ┌─────────────────────────────────────────────────────────────────────┐
## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │ EUR │
## ║ ├─────────────────────────────────────────────────────────────────────┤
## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ EUR│ 3000.0│ 3000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ └─────────────────────────────────────────────────────────────────────┘
## ╟ Events @ generation 1.67e+03
## ║ ┌─────────────────────────────────────────────────────────────────────────────────────────┐
## ║ │ time│type │parameters │effect │
## ║ ├─────────────────────────────────────────────────────────────────────────────────────────┤
## ║ │ 1666│Migration rate │source=EUR, dest=NEA, │Backwards-time migration rate from EUR │
## ║ │ │change │rate=0.000179640718562 │to NEA → 0.00017964071856287425 │
## ║ │ │ │87425 │ │
## ║ └─────────────────────────────────────────────────────────────────────────────────────────┘
## ╠════════════════════════════════════════════╗
## ║ Epoch[1]: [1.67e+03, 1.83e+03) generations ║
## ╠════════════════════════════════════════════╝
## ╟ Populations (total=4 active=4)
## ║ ┌───────────────────────────────────────────────────────────────────────────┐
## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │ EUR │
## ║ ├───────────────────────────────────────────────────────────────────────────┤
## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ EUR│ 3000.0│ 3000.0│ 0 │ 0 │ 0 │ 0.0001796 │ 0 │
## ║ └───────────────────────────────────────────────────────────────────────────┘
## ╟ Events @ generation 1.83e+03
## ║ ┌────────────────────────────────────────────────────────────────────────────────────────┐
## ║ │ time│type │parameters │effect │
## ║ ├────────────────────────────────────────────────────────────────────────────────────────┤
## ║ │ 1833│Migration rate │source=EUR, dest=NEA, │Backwards-time migration rate from EUR │
## ║ │ │change │rate=0 │to NEA → 0 │
## ║ │┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈┈│
## ║ │ 1833│Census │ │Insert census nodes to record the │
## ║ │ │ │ │location of all lineages │
## ║ └────────────────────────────────────────────────────────────────────────────────────────┘
## ╠═════════════════════════════════════════╗
## ║ Epoch[2]: [1.83e+03, 2e+03) generations ║
## ╠═════════════════════════════════════════╝
## ╟ Populations (total=4 active=4)
## ║ ┌─────────────────────────────────────────────────────────────────────┐
## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │ EUR │
## ║ ├─────────────────────────────────────────────────────────────────────┤
## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ │ EUR│ 3000.0│ 3000.0│ 0 │ 0 │ 0 │ 0 │ 0 │
## ║ └─────────────────────────────────────────────────────────────────────┘
## ╟ Events @ generation 2e+03
## ║ ┌───────────────────────────────────────────────────────────────────────────┐
## ║ │ time│type │parameters │effect │
## ║ ├───────────────────────────────────────────────────────────────────────────┤
## ║ │ 2000│Population │derived=[EUR], │Moves all lineages from the 'EUR' │
## ║ │ │Split │ancestral=AFR │derived population to the ancestral │
## ║ │ │ │ │'AFR' population. Also set 'EUR' to │
## ║ │ │ │ │inactive, and all migration rates to │
## ║ │ │ │ │and from the derived population to │
## ║ │ │ │ │zero. │
## ║ └───────────────────────────────────────────────────────────────────────────┘
## ╠══════════════════════════════════════╗
## ║ Epoch[3]: [2e+03, 2e+04) generations ║
## ╠══════════════════════════════════════╝
## ╟ Populations (total=4 active=3)
## ║ ┌───────────────────────────────────────────────────────────────┐
## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │ NEA │
## ║ ├───────────────────────────────────────────────────────────────┤
## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │ 0 │
## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │ 0 │
## ║ │ NEA│ 1000.0│ 1000.0│ 0 │ 0 │ 0 │ 0 │
## ║ └───────────────────────────────────────────────────────────────┘
## ╟ Events @ generation 2e+04
## ║ ┌────────────────────────────────────────────────────────────────────────────┐
## ║ │ time│type │parameters │effect │
## ║ ├────────────────────────────────────────────────────────────────────────────┤
## ║ │ 2e+04│Population │derived=[NEA], │Moves all lineages from the 'NEA' │
## ║ │ │Split │ancestral=AFR │derived population to the ancestral │
## ║ │ │ │ │'AFR' population. Also set 'NEA' to │
## ║ │ │ │ │inactive, and all migration rates to │
## ║ │ │ │ │and from the derived population to │
## ║ │ │ │ │zero. │
## ║ └────────────────────────────────────────────────────────────────────────────┘
## ╠══════════════════════════════════════╗
## ║ Epoch[4]: [2e+04, 2e+05) generations ║
## ╠══════════════════════════════════════╝
## ╟ Populations (total=4 active=2)
## ║ ┌─────────────────────────────────────────────────────────┐
## ║ │ │ start│ end│growth_rate │ CHIMP │ AFR │
## ║ ├─────────────────────────────────────────────────────────┤
## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │ 0 │ 0 │
## ║ │ AFR│ 15000.0│ 15000.0│ 0 │ 0 │ 0 │
## ║ └─────────────────────────────────────────────────────────┘
## ╟ Events @ generation 2e+05
## ║ ┌──────────────────────────────────────────────────────────────────────────────┐
## ║ │ time│type │parameters │effect │
## ║ ├──────────────────────────────────────────────────────────────────────────────┤
## ║ │ 2e+05│Population │derived=[AFR], │Moves all lineages from the 'AFR' │
## ║ │ │Split │ancestral=CHIMP │derived population to the ancestral │
## ║ │ │ │ │'CHIMP' population. Also set 'AFR' to │
## ║ │ │ │ │inactive, and all migration rates to │
## ║ │ │ │ │and from the derived population to │
## ║ │ │ │ │zero. │
## ║ └──────────────────────────────────────────────────────────────────────────────┘
## ╠════════════════════════════════════╗
## ║ Epoch[5]: [2e+05, inf) generations ║
## ╠════════════════════════════════════╝
## ╟ Populations (total=4 active=1)
## ║ ┌─────────────────────────────────────────┐
## ║ │ │ start│ end│growth_rate │
## ║ ├─────────────────────────────────────────┤
## ║ │ CHIMP│ 5000.0│ 5000.0│ 0 │
## ║ └─────────────────────────────────────────┘
# For debugging of technical issues (with msprime, with slendr, or both), it is
# very useful to have the `msprime` function dump the "raw" command-line to
# run the simulation on the terminal using plain Python interpreter
msprime(model, sequence_length = 1e6, recombination_rate = 1e-8, run = FALSE)
## /Users/mp/Library/r-miniconda-arm64/envs/Python-3.12_msprime-1.3.4_tskit-0.6.4_pyslim-1.0.4_tspop-0.0.2/bin/python /Users/mp/Code/simgen/popgen_files/introgression/script.py --seed 1064927500 --model /Users/mp/Code/simgen/popgen_files/introgression --sequence-length 1000000 --recombination-rate 1e-08 --coalescent_only --path path_to_a_trees_file.trees
Part 4: Inspecting the tree-sequence object
As we will see later, slendr provides an R-friendly interface to accessing a subset of tskit’s functionality for working with tree sequences and for computing various popgen statistics.
For now, type out the ts
object in the terminal – what do you see? You should get a summary of a tree-sequence object that you’re familiar with from your msprime and tskit activity earlier in the day.
Note: This is a very important feature of slendr – when a simulation is concluded (doesn’t matter if it was a slim()
or msprime()
simulation), you will get a normal tskit object. In fact, the fact that slendr supports (so far, and likely always) only a “subset” of all of tskit’s functionality isn’t stopping you to write custom Python/tskit processing code of a tree sequence generated from a slendr model. Under the hood, a slendr simulation really is just an msprime (or SLiM) simulation! It’s just executed through a simplified interface.
# Typing out the object with the result shows that it's a good old tskit
# tree-sequence object
ts
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 9,548║
╟───────────────┼───────────╢
║Sequence Length│ 1,000,000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 48,000║
╟───────────────┼───────────╢
║Total Size │ 8.8 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges │133,898│ 4.1 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│ 24,000│656.3 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes │104,796│ 2.8 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│ 4│341 Bytes│ Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│ 1│ 2.7 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧═══════╧═════════╧════════════╝
The brilliance of the tree-sequence data structure rests on its elegant table-based implementation (much more information on that is here). slendr isn’t really designed to run very complex low-level manipulations of tree-sequence data (its strength lies in the convenient interface to popgen statistical functions implemented by tskit), but it does contain a couple of functions which can be useful for inspecting the lower-level nature of a tree sequence. Let’s look at a couple of them now.
Use the ts_table
function to inspect the low-level table-based representation of a tree sequence. For instance, you can get the table of nodes with ts_table(ts, "nodes")
, edges with ts_table(ts, "edges")
, and do the same thing for “individuals”, “mutations”, and “sites”. Does your tree sequence contain any mutations? If not, why, and how can we even do any popgen with data without any mutations? As you’re doing this, take a look at at the following figure (this was made from a different tree sequence than you have, but that’s OK) to help you relate the information in the tables to a tree sequence which those tables (particularly tables of nodes and edges) implicitly encode.
This should convince you that the final product of a slendr simulation really is the same kind of tree-sequence object that you learned about in the previous activities today. You don’t have to study these tables in detail!
# slendr provides a helper function which allows access to all the low-level
# components of every tree-sequence object
ts_table(ts, "nodes")
# A tibble: 104,796 × 5
node_id ind_id pop_id time time_tskit
<int> <int[1d]> <int> <dbl> <dbl>
1 0 0 0 0 0
2 1 0 0 0 0
3 2 1 0 0 0
4 3 1 0 0 0
5 4 2 0 0 0
6 5 2 0 0 0
7 6 3 0 0 0
8 7 3 0 0 0
9 8 4 0 0 0
10 9 4 0 0 0
# ℹ 104,786 more rows
ts_table(ts, "edges")
# A tibble: 133,898 × 5
id child parent left right
<dbl> <int> <int> <dbl> <dbl>
1 0 25833 48000 0 1000000
2 1 29734 48000 0 1000000
3 2 26211 48001 0 1000000
4 3 29328 48001 0 1000000
5 4 30938 48002 0 1000000
6 5 36473 48002 0 1000000
7 6 19987 48003 0 1000000
8 7 20993 48003 0 1000000
9 8 40933 48004 0 1000000
10 9 41298 48004 0 1000000
# ℹ 133,888 more rows
ts_table(ts, "individuals")
# A tibble: 24,000 × 5
ind_id time pop_id sampled time_tskit
<dbl> <dbl> <int[1d]> <lgl[1d]> <dbl[1d]>
1 0 0 0 TRUE 0
2 1 0 0 TRUE 0
3 2 0 0 TRUE 0
4 3 0 0 TRUE 0
5 4 0 0 TRUE 0
6 5 0 0 TRUE 0
7 6 0 0 TRUE 0
8 7 0 0 TRUE 0
9 8 0 0 TRUE 0
10 9 0 0 TRUE 0
# ℹ 23,990 more rows
# We didn't simulate any mutations, so we only have genealogies for now.
ts_table(ts, "mutations")
# A tibble: 0 × 5
# ℹ 5 variables: id <dbl>, site <int>, node <int>, time <dbl>, time_tskit <dbl>
ts_table(ts, "sites")
# A tibble: 0 × 2
# ℹ 2 variables: id <dbl>, position <dbl>
There are also two slendr-specific functions called ts_samples()
(which retrieves the “symbolic names” and dates of all recorded individuals at the end of a simulation) and ts_nodes()
. You can run them simply as ts_samples(ts)
and ts_nodes(ts)
. How many individuals (samples) are in your tree sequence as you simulated it? How is the result of ts_nodes()
different from ts_samples()
?
# slendr provides a convenient function `ts_samples()` which allows us to
# inspect the contents of a simulated tree sequence in a more human-readable,
# simplified way. We can see that our tree sequence contains a massive number
# of individuals. Too many, in fact -- we recorded every single individual alive
# at the end of our simulation, which is something we're unlikely to be ever lucky
# enough to have, regardless of which species we study.
ts_samples(ts)
# A tibble: 24,000 × 3
name time pop
<chr> <dbl> <chr>
1 CHIMP_1 0 AFR
2 CHIMP_2 0 AFR
3 CHIMP_3 0 AFR
4 CHIMP_4 0 AFR
5 CHIMP_5 0 AFR
6 CHIMP_6 0 AFR
7 CHIMP_7 0 AFR
8 CHIMP_8 0 AFR
9 CHIMP_9 0 AFR
10 CHIMP_10 0 AFR
# ℹ 23,990 more rows
ts_samples(ts) %>% nrow()
[1] 24000
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
ts_samples(ts) %>% group_by(pop) %>% tally
# A tibble: 4 × 2
pop n
<chr> <int>
1 AFR 15000
2 CHIMP 5000
3 EUR 3000
4 NEA 1000
# This function returns a table similar to the one produced by `ts_table(ts, "nodes")`
# above, except that it contains additional slendr metadata (names of individuals
# belonging to each node, spatial coordinates of nodes for spatial models, etc.).
# It's a bit more useful for analyzing tree-sequence data than the "low-level" functions.
ts_nodes(ts) %>% head(5)
# A tibble: 5 × 8
name pop ind_id node_id time time_tskit sampled pop_id
<chr> <fct> <dbl> <int> <dbl> <dbl> <lgl> <int>
1 CHIMP_1 AFR 5000 10000 0 0 TRUE 1
2 CHIMP_1 AFR 5000 10001 0 0 TRUE 1
3 CHIMP_2 AFR 5001 10002 0 0 TRUE 1
4 CHIMP_2 AFR 5001 10003 0 0 TRUE 1
5 CHIMP_3 AFR 5002 10004 0 0 TRUE 1
ts_nodes(ts) %>% tail(5)
# A tibble: 5 × 8
name pop ind_id node_id time time_tskit sampled pop_id
<chr> <fct> <dbl> <int> <dbl> <dbl> <lgl> <int>
1 <NA> CHIMP NA 104791 7485409. 249514. FALSE 0
2 <NA> CHIMP NA 104792 7521789. 250726. FALSE 0
3 <NA> CHIMP NA 104793 7544096. 251470. FALSE 0
4 <NA> CHIMP NA 104794 7763627. 258788. FALSE 0
5 <NA> CHIMP NA 104795 7879666. 262656. FALSE 0
Part 5: Scheduling sampling events
In the table produced by the ts_samples()
function you saw that the tree sequence we simulated recorded everyone. It’s very unlikely, unless we’re extremely lucky, that we’ll ever have a sequence of every single individual in a population that we study. To get a little closer to the scale of the genomic data that we usually work with on a day-to-day basis, we can restrict our simulation to only record a subset of individuals.
We can precisely define which individuals (from which populations, and at which times) should be recorded in a tree sequence using the slendr function schedule_sampling()
. For instance, if we have a model
with some slendr populations in variables eur
and afr
, we can schedule the recording of 5 individuals from each at times 10000 (years ago) and 0 (present-day) (using the “years before present” direction of time in our current model of Neanderthal introgression) with the following code:
<- schedule_sampling(model, times = c(10000, 0), list(eur, 5), list(afr, 5)) pop_schedule
This function simply returns a data frame. As such, we can create multiple of such schedules (of arbitrary complexity and granularity), and then bind them together into a single sampling schedule with a single line of code, like this:
# Note that the `times =` argument of the `schedule_sampling()` function can be
# a vector of times like here...
<- c(40000, 30000, 20000, 10000)
ancient_times <- schedule_sampling(model, times = ancient_times, list(eur, 1))
eur_samples
# ... but also just a single number like here
<- schedule_sampling(model, times = 0, list(afr, 1))
afr_samples <- schedule_sampling(model, time = 60000, list(nea, 1))
nea_samples
# But whatever the means you create the individual sampling schedules with,
# you can always bind them all to a single table with the `rbind()` function
<- rbind(eur_samples, afr_samples, nea_samples)
schedule schedule
Using the function schedule_sampling
(and with the help of rbind
as shown in the previous code chunk), program the sampling of the following sample sets at given times, saving it to variable called schedule
:
time | population | # individuals |
---|---|---|
70000 | nea | 1 |
40000 | nea | 1 |
0 | chimp | 1 |
0 | afr | 5 |
0 | eur | 10 |
Additionally, schedule the sampling of a single eur
individual at the following times:
<- seq(40000, 2000, by = -2000) t
Note: You can provide a vector variable (such as t
in this example) as the times =
argument of schedule_sampling()
.
In total, you should schedule the recording of 38 individuals.
# Here we scheduled the sampling of two Neanderthals at 70kya and 40kya
<- schedule_sampling(model, times = c(70000, 40000), list(nea, 1))
nea_samples # (this function produces a plain old data frame!) nea_samples
# A tibble: 2 × 8
time pop n name y_orig x_orig y x
<int> <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl>
1 40000 NEA 1 NA NA NA NA NA
2 70000 NEA 1 NA NA NA NA NA
# Here we schedule one Chimpanzee sample, 5 African samples, and 10 European samples
<- schedule_sampling(model, times = 0, list(chimp, 1), list(afr, 5), list(eur, 10))
present_samples
# We also schedule the recording of one European sample between 50kya and 2kya,
# every 2000 years
<- seq(40000, 2000, by = -2000)
times <- schedule_sampling(model, times, list(eur, 1))
emh_samples
# Because those functions produce nothing but a data frame, we can bind
# individual sampling schedules together
<- rbind(nea_samples, present_samples, emh_samples)
schedule schedule
# A tibble: 25 × 8
time pop n name y_orig x_orig y x
<int> <chr> <int> <lgl> <lgl> <lgl> <lgl> <lgl>
1 40000 NEA 1 NA NA NA NA NA
2 70000 NEA 1 NA NA NA NA NA
3 0 CHIMP 1 NA NA NA NA NA
4 0 AFR 5 NA NA NA NA NA
5 0 EUR 10 NA NA NA NA NA
6 2000 EUR 1 NA NA NA NA NA
7 4000 EUR 1 NA NA NA NA NA
8 6000 EUR 1 NA NA NA NA NA
9 8000 EUR 1 NA NA NA NA NA
10 10000 EUR 1 NA NA NA NA NA
# ℹ 15 more rows
Then, verify the correctness of your overall sampling schedule
by visualizing it together with your model
like this:
Note: As you’ve seen above, the visualization is often a bit wonky and convoluted with overlapping elements and it can be even worse with samples added, but try to experiment with arguments to plot_model
described above to make the plot a bit more helpful for sanity checking.
plot_model(model, samples = schedule)
plot_model(model, sizes = FALSE, samples = schedule)
plot_model(model, sizes = FALSE, log = TRUE, samples = schedule)
Part 6: Simulating a defined set of individuals
You have now both a compiled slendr model
and a well-defined sampling schedule
.
Use your combined sampling schedule stored in the schedule
variable to run a new tree-sequence simulation from your model (again using the msprime()
function), this time restricted to just those individuals scheduled for recording. You can do this by providing the combined sampling schedule
as the samples = schedule
argument of the function msprime
you used above. Just replace the line(s) with your first msprime()
from the previous part of this exercise with the new one, which uses the schedule
for customized sampling.
Also, while you’re doing this, use the ts_mutate()
function to overlay neutral mutations on the simulated tree sequence right after the msprime()
call. (Take a look at the handounts for a reminder of the %>%
pipe patterns I showed you.)
# The command below will likely take a few minutes to run, so feel free to go
# down from 100 Mb sequence_length to even 10Mb (it doesn't matter much).
# (The `random_seed =` argument is there for reproducibility purposes.)
<-
ts msprime(model, sequence_length = 100e6, recombination_rate = 1e-8, samples = schedule, random_seed = 1269258439) %>%
ts_mutate(mutation_rate = 1e-8, random_seed = 1269258439)
# Time difference of 2.141642 mins
# If you're bothered by ho long this takes, feel free to call these two lines
# to 100% reproduce my results without any expensive computation:
<- read_model(here::here("popgen_files/introgression"))
model <- ts_read(here::here(file = "popgen_files/introgression.trees"), model = model)
ts
# We can save a tree sequence object using a slendr function `ts_write` (this
# can be useful if we want to save the results of a simulation for later use).
dir.create("data", showWarnings = FALSE)
ts_write(ts, here::here("popgen_files/introgression.trees"))
Inspect the tree-sequence object saved in the ts
variable by typing it into the R console again (this interactivity really helps with catching nasty bugs early during the programming of your script). You can also do a similar thing via the table produced by the ts_samples()
function. You should see a much smaller number of individuals being recorded, indicating that the simulation was much more efficient and produced genomic data for only the individuals of interest.
Note: When you think about it, it’s actually quite astonishing how fast msprime and tskit are when dealing with such a huge amount of sequence data from tens of thousands of individuals on a simple laptop!
# Inspect the (tskit/Python-based) summary of the new tree sequence
# (note the much smaller number of "sample nodes"!)
ts
╔═══════════════════════════╗
║TreeSequence ║
╠═══════════════╤═══════════╣
║Trees │ 10,070║
╟───────────────┼───────────╢
║Sequence Length│ 1,000,000║
╟───────────────┼───────────╢
║Time Units │generations║
╟───────────────┼───────────╢
║Sample Nodes │ 48,000║
╟───────────────┼───────────╢
║Total Size │ 8.9 MiB║
╚═══════════════╧═══════════╝
╔═══════════╤═══════╤═════════╤════════════╗
║Table │Rows │Size │Has Metadata║
╠═══════════╪═══════╪═════════╪════════════╣
║Edges │136,213│ 4.2 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Individuals│ 24,000│656.3 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Migrations │ 0│ 8 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Mutations │ 0│ 16 Bytes│ No║
╟───────────┼───────┼─────────┼────────────╢
║Nodes │105,367│ 2.8 MiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Populations│ 4│341 Bytes│ Yes║
╟───────────┼───────┼─────────┼────────────╢
║Provenances│ 1│ 2.7 KiB│ No║
╟───────────┼───────┼─────────┼────────────╢
║Sites │ 0│ 16 Bytes│ No║
╚═══════════╧═══════╧═════════╧════════════╝
# Get the table of all recorded samples in the tree sequence
ts_samples(ts)
# A tibble: 24,000 × 3
name time pop
<chr> <dbl> <chr>
1 CHIMP_1 0 AFR
2 CHIMP_2 0 AFR
3 CHIMP_3 0 AFR
4 CHIMP_4 0 AFR
5 CHIMP_5 0 AFR
6 CHIMP_6 0 AFR
7 CHIMP_7 0 AFR
8 CHIMP_8 0 AFR
9 CHIMP_9 0 AFR
10 CHIMP_10 0 AFR
# ℹ 23,990 more rows
# Compute the count of individuals in different time points
library(dplyr)
ts_samples(ts) %>% group_by(pop, present_day = time == 0) %>% tally %>% select(present_day, pop, n)
# A tibble: 4 × 3
# Groups: pop [4]
present_day pop n
<lgl> <chr> <int>
1 TRUE AFR 15000
2 TRUE CHIMP 5000
3 TRUE EUR 3000
4 TRUE NEA 1000