library(slendr)
init_env(quiet = TRUE)
# This line sources a script in which I provide a few useful helper functions
# which you can use in this exercise
source(here::here("files/popgen/popgen_utils.R"))
# African ancestral population
<- population("AFR", time = 65000, N = 5000)
afr
# First migrants out of Africa
<- population("OOA", parent = afr, time = 60000, N = 5000, remove = 27000) %>%
ooa resize(N = 2000, time = 40000, how = "step")
# Eastern hunter-gatherers
<- population("EHG", parent = ooa, time = 28000, N = 5000, remove = 6000)
ehg
# European population
<- population("EUR", parent = ehg, time = 25000, N = 5000) %>%
eur resize(N = 10000, how = "exponential", time = 5000, end = 0)
# Anatolian farmers
<- population("ANA", time = 28000, N = 5000, parent = ooa, remove = 4000)
ana
# Yamnaya steppe population
<- population("YAM", time = 8000, N = 5000, parent = ehg, remove = 2500)
yam
# Define gene-flow events
<- list(
gf gene_flow(from = ana, to = yam, rate = 0.75, start = 7500, end = 6000),
gene_flow(from = ana, to = eur, rate = 0.5, start = 6000, end = 5000),
gene_flow(from = yam, to = eur, rate = 0.6, start = 4000, end = 3500)
)
# Compile all populations into a single slendr model object
<- compile_model(
model populations = list(afr, ooa, ehg, eur, ana, yam),
gene_flow = gf, generation_time = 30
)
# Schedule the sampling from four European populations roughly before their
# disappearance (or before the end of the simulation)
<- rbind(
schedule schedule_sampling(model, times = 0, list(eur, 50)),
schedule_sampling(model, times = 6000, list(ehg, 50)),
schedule_sampling(model, times = 4000, list(ana, 50)),
schedule_sampling(model, times = 2500, list(yam, 50))
)
Natural selection
Note: Unfortunately, the most recent SLiM v4.3 is a little broken, largely sabotaging the second half of this exercise in which slendr uses SLiM for non-neutral simulations. That said, the first half of the exercise works perfectly fine and, once SLiM v5.0 comes out (soon), everything from the first half will apply to the selection simulations. Alternatively, if you have the option, use SLiM 4.2.x.
The primary motivation for designing slendr was to make demographic modelling in R as trivially easy and fast as possible, focusing exclusively on neutral models. However, as slendr became popular, people have been asking for the possibility of simulating natural selection. After all, a large part of slendr’s functionality deals with population genetic models across geographical landscapes, which requires SLiM. So why not support selection simulations using slendr as well?
In December 2024 I caved in and added support for modifying slendr demographic models with bits of SLiM code, which allows simulating pretty much any arbitrary selection scenario you might be interested in.
This exercise is a quick demonstration of how this works and how you might simulate selection using slendr. We will do this using another toy model of ancient human history, which we will first use as a basis for simulating the frequency trajectory of an allele under positive selection, and then implementing a toy selection scan using Tajima’s D.
To speed things up, create a new selection.R
script and copy the following code as a starting point for this exercise:
Next, visualize the demographic model. If you did a bit of work in human population genetics, you might recognize it as a very simplified model of demographic history of Europe over the past 50 thousand years or so. As you can see, we are recording 50 individuals from four populations – for Europeans we sample 50 individuals at “present-day”, for the remaining populations we’re recording 50 individuals just before their disappearance. Also note that there’s quite a bit of gene-flow! This was an important thing we’ve learned about human history in the past 10 years or so – everyone is mixed with pretty much everyone, there isn’t (and never was) anything as a “pure population”.
Note: We didn’t discuss it earlier, but slendr also provides the option to specify a remove =
argument in a population()
call which instructs the simulation engine to delete a population from a simulation at a given point. For our msprime()
simulations in earlier examples it wasn’t really important, but for the slim()
simulation we will be running below, we want to make a population extinct at a certain timepoint. Which is why our ancient populations in the starting script model have the remove =
parameter specified.
plot_model(model, proportions = TRUE, samples = schedule)
Part 1: Simulating a tree sequence and computing Tajima’s D
Although the point of this exercise is to simulate selection, let’s first simulate a normal neutral model using slendr’s msprime()
engine as a sanity check. Simulate 10 Mb of sequence with a recombination rate 1e-8
and a sampling schedule
defined above. Let’s not worry about adding any mutations, just to change things up a little bit. We’ll be working with branch-based statistics here (which means adding mode = "branch"
whenever we will be computing a statistic, such as Tajima’s D).
Inspect the table of all individuals recorded in our tree sequence using the function ts_samples()
, making sure we have all the individuals scheduled for tree-sequence recording. (Again, there’s no such a thing as too many sanity checks when doing research!)
As you’ve already learned in an earlier exercise, tskit functions in slendr generally operate on vectors (or lists) of individual names, like those produced by ts_names()
above. Get a vector of names of individuals in every population recorded in the tree sequence, then use this to compute Tajima’s D using the slendr function ts_tajima()
. (Use the same approach as you have with ts_diversity()
or ts_divergence()
above, using the list of names of individuals as the sample_sets =
argument for ts_tajima()
). Do you see any striking differences in the Tajima’s D values across populations? Check this for some general guidance.
Part 2: Computing Tajima’s D in windows
Let’s take this one step forward. Even if there is a locus under positive selection somewhere along our chromosome, it might be quite unlikely that we would find a Tajima’s D value significant enough for the entire chromosome (which is basically what we did in Part 1 now). Fortunately, thanks to the flexibility of the tskit module, the slendr function ts_tajima()
has an argument windows =
, which allows us to specify the coordinates of windows into which a sequence should be broken into, with Tajima’s D computed separately for each window. Perhaps this will allow us to see the impact of positive selection after we get to adding selection to our model. So let’s first built some code towards that.
Define a variable windows
which will contain a vector of coordinates of 100 windows, starting at position 0
, and ending at position 10e6
(i.e., the end of our chromosome). Then provide this variable as the windows =
argument of ts_tajima()
on a new, separate line of your script. Save the result of ts_tajima()
into the variable tajima_wins
, and inspect its contents in the R console.
Hint: You can use the R function seq()
and its argument length.out = 100
, to create the coordinates of window boundaries very easily.
The default output format of ts_tajima()
is not super user-friendly. Process the result using a helper function process_tajima(tajima_wins)
that I provided for you (perhaps save it as tajima_df
), and visualize it using another of my helper functions plot_tajima(tajima_df)
.
Note: Making the process_tajima()
and plot_tajima()
function available in your R code is the purpose of the source(here::here("files/popgen/popgen_utils.R"))
command at the beginning of your script for this exercise.
Part 3: Adding positive selection to the base demographic model
Although primarily designed for neutral demographic models, slendr allows optional simulation of natural selection by providing a “SLiM extension code snippet” with customization SLiM code as an optional argument extension =
of compile_model()
(a function you’re closely familiar with at this point).
Unfortunately we don’t have any space to explain SLiM here (and I have no idea, at the time of writing, whether or not you will have worked with SLiM earlier in this workshop). Suffice to say that SLiM is another very popular population genetic simulator software which allows simulation of selection, and which requires you to write custom code in a different programming language called Eidos.
Take a look at the file slim_extension.txt
provided in your working directory (it’s also part of the GitHub repository here). If you worked with SLiM before, glance through the script casually and see if it makes any sense to you. If you have not worked with SLiM before, look for the strange {elements}
in curly brackets in the first ten lines of the script. Those are the parameters of the selection model we will be customizing the standard neutral demographic model we started with in the next step.
Specifically, when you inspect the slim_extension.txt
file, you can see that this “SLiM extension script” I provided for you has three parameters:
origin_pop
– in which population should a beneficial allele appear,s
– what should be the selection coefficient of the beneficial allele, andonset_time
– at which time should the allele appear in theorigin_pop
.
However, at the start, the SLiM extension snippet doesn’t contain any concrete values of those parameters, but only their {origin_pop}
, {s}
, and {onset_time}
placeholders.
Use the slendr function substitute_values()
to substitute concrete values for those parameters like this:
<- substitute_values(
extension template = here::here("files/popgen/slim_extension.txt"),
origin_pop = "EUR",
s = 0.15,
onset_time = 12000
) extension
[1] "/var/folders/h2/qs0z_44x2vn2sskqc0cct7540000gn/T//RtmpGQb8NR/file101692e2dd50e"
You can see that substitute_values()
returned a path to a file. Take a look at that file in your terminal – you should see each of the three {placeholder}
parameters replaced with a concrete given value.
And that’s all the extra work we need to turn our purely neutral demographic slendr model into a model which includes natural selection! (In this case, only a simple selection acting on a single locus, as you’ll see later, but this can be generalized to any imaginable selection scenario.)
How do we use the SLiM extension for our simulation? It’s very simple – we just have to provide the extension
variable as an additional argument of good old compile_model()
. This will compile a new slendr model which will now include the new functionality for simulating natural selection:
Compile a new model
of the history of populations afr
, ooa
, ehg
, etc., by following the instructions above, providing a new extension =
argument to the compile_model()
function.
Part 4: Running a selection simulation using slim()
Now we can finally run our selection simulation!
There are two modifications to our previous simulation workflows:
- Because we need to run a non-neutral simulation, we have to switch from using the
msprime()
slendr engine toslim()
. The latter can still interpret the same demographic model we programmed in R, just like themsprime()
engine can, but will run the model using SLiM (and thus leveraging the new SLiM extension code that we have customized usingsubstitute_values()
above). We simply do this by switching from this:
<- msprime(model, sequence_length = 10e6, recombination_rate = 1e-8, samples = schedule) ts
to this:
<- slim(model, sequence_length = 10e6, recombination_rate = 1e-8, samples = schedule) ts
As you can see, you don’t have to modify anything in your model code, just switching from msprime
to slim
in the line of code which produces the simulation result.
- The customized model will not only produce a tree sequence, but will also generate a table of allele frequencies in each population (SLiM experts might have noticed the revelant SLiM code when they were inspecting
slim_extension.txt
). We need to be able to load both of these files after the simulation and thus need a path to a location we can find those files. We can do this by calling theslim()
function aspath <- slim(..., path = TRUE)
(so with the extrapath =
argument). This will return a path to where theslim()
engine saved all files with our desired results.
Run a simulation from the modified model of selection with the slim()
engine as instructed in points number 1. and 2. above, then use the list.files(path)
function in R to take a look in the directory. Which files were produced by the simulation?
Part 5: Investigating allele frequency trajectories
Use another helper function read_trajectory(path)
which I provided for this exercise to read the simulated frequency trajectories of the positively selected mutation in all of our populations into a variable traj_df
. Then run a second helper function plot_trajectory(traj_df)
to inspect the trajectories visually.
Recall that you used the function substitute_values()
to parametrize your selection model so that the allele under selection occurs in Europeans 15 thousand years ago, and is programmed to be under very strong selection of \(s = 0.15\). Do the trajectories visualized by plot_trajectory()
make sense given the demographic model of European prehistory plotted above?
Part 6: Tajima’s D (genome-wide and window-based) from the selection model
Recall that your simulation run saved results in the location stored in the path
variable:
list.files(path)
[1] "slim.trees" "trajectory.tsv"
From this path
, we’ve already successfuly investigated the frequency trajectories.
Now let’s compute Tajima’s D on the tree sequence simulated from our selection model. Hopefully we should see an interesting pattern in our selection scan? For instance, we don’t know yet where in the genome is the putative locus under selection!
To read a tree sequence simulated with slim()
by our customized selection setup, we need to do a bit of work. To simplify things a bit, here’s the R code which makes it possible to do. Just copy it in your selection.R
script as it is:
# Let's use my own saved simulation results, so that we're all on the
# same page going forward
<- here::here("files/popgen/selection")
path
<-
ts file.path(path, "slim.trees") %>% # 1. compose full path to the slim.trees file
ts_read(model) %>% # 2. read the tree sequence file into R
ts_recapitate(Ne = 5000, recombination_rate = 1e-8) # 3. perform recapitation
Very briefly, because our tree sequence was generated by SLiM, it’s very likely that not all genealogies along the simulated genome will be fully coalesced (i.e., not all tree will have a single root). To explain why this is the case is out of the scope of this session, but read here if you’re interested in learning more. For the time being, it suffices to say that we can pass the (uncoalesced) tree sequence into the ts_recapitate()
function, which then takes a SLiM tree sequence and simulates all necessary “ancestral history” that was missing on the uncoalesced trees, thus ensuring that the entire tree sequence is fully coalesced and can be correctly computed on.
Now that you have a ts
tree sequence object resulting from a new selection simulation run, repeat the analyses of genome-wide and window-based Tajima’s D from Part 1 and Part 2 of this exercise, again using the provided helper functions process_tajima()
and plot_tajima()
. Can you identify which locus has been the likely focal point of the positive selection? Which population shows evidence of selection? Which doesn’t and why (look again at the visualization of the demographic model above)?
Bonus 1: Examine the pattern of ancestry tracts along the simulated genome
Bonus 2: Investigate the impact of recombination around the selected locus
Vary the uniform recombination rate and observe what happens with Tajima’s D in windows along the genome.
Bonus 3: Simulate origin of the allele in EHG
Simulate the origin of the beneficial allele in the EHG population – what do the trajectories look like now? How does that change the Tajima’s D distribution along the genome in our European populations?
Bonus 4: Other statistics in windows
As a practice of your newly acquired tree-sequence computation skills with slendr, calculate some other statistics in the same windows along the simulated genome, visualize them yourself, and compare the results to the window-based Tajima’s D pattern. For instance, ts_diversity()
, ts_divergence()
, or ts_segregating()
might be quite interesting to look at.