⚠️⚠️⚠️
The demografr R package is still under active development!
⚠️⚠️⚠️
Although Approximate Bayesian Computation is arguably the most popular method for simulation-based inference, it’s certainly not the only one. Another useful technique for exploring the behavior of statistics under given combination of model parameters, and perhaps even very simple model fitting are grid simulations. Briefly, we define a set of parameters of a model, run simulation replicates for each combination of parameters, computing summary statistics of interest for each replicate, and then—for instance—inspect which parameter combinations lead to the values of summary statistics closes to those we observed in the data, or simply inspect how do different parameter values affect the statistics in general.
To demonstrate the features of demografr for grid-based simulation inference, let’s try to reproduce the result from Petr et al. (PNAS 2018). Briefly, this paper made an attempt to explain discrepancies between two different ways to estimate the proportion of Neanderthal ancestry in Europe over tens of thousands of years.
The question was, why are these two measures of Neanderthal ancestry leading to different conclusions? Could some violations of their assumptions (such as an unaccounted-for gene-flow events between modern human populations) break one of them?
The way the PNAS 2018 study approached the problem was to run many simulation replicates across a large parameter grid. This required writing custom coalescent simulations, orchestrating running them across the parameter grid, collecting the genotypes the simulations produced, and computing population genetic statistics (here direct and indirect -ratios)—all in all, a couple of hundred lines of Python, R, and bash.
This vignette shows how to use demografr to perform these kind of workflows in a much more straightforward and reproducible manner, entirely in R.
As in any standard ABC pipeline using demografr, we leverage the slendr R package to first define a demografic model, encoding it as a simple R function. Again, as with standard ABC, the arguments of this function correspond to model parameters. Note that those will come from the parameter grid, not via sampling from priors as with ABC.
In our example, because the research question of interest involved various gene-flow rates between populations in the model, we fix split times and values because they are not that interesting or important for this simple inference workflow (we ignore the Eurasia to Africa “backflow” for the same reason):
model <- function(rate_ea, rate_aa) {
# create populations
chimp <- population("chimp", time = 7e6, N = 10000)
anc <- population("anc", time = 6e6, N = 10000, parent = chimp)
nea <- population("nea", time = 600e3, N = 2000, parent = anc)
afr1 <- population("afr1", time = 600e3, N = 15000, parent = anc)
afr2 <- population("afr2", time = 150e3, N = 15000, parent = afr1)
eur <- population("eur", time = 75e3, N = 5000, parent = afr2)
# define gene-flow events
gf <- list(
# back flow from Eurasia into Africa -- one parameter of interest
gene_flow(from = eur, to = afr1, start = 5e3, end = 0, rate = rate_ea),
gene_flow(from = eur, to = afr2, start = 5e3, end = 0, rate = rate_ea),
# gene flow within Africa -- another parameter of interest
gene_flow(from = afr1, to = afr2, start = 50e3, end = 0, rate = rate_aa),
gene_flow(from = afr2, to = afr1, start = 50e3, end = 0, rate = rate_aa),
# Neanderthal introgression (fixed to 3%)
gene_flow(from = nea, to = eur, start = 60e3, end = 50e3, rate = 0.03)
)
# generate a slendr model
model <- compile_model(
populations = list(chimp, anc, nea, afr1, afr2, eur), gene_flow = gf,
generation_time = 30, serialize = FALSE
)
# specify sampling events
samples <- rbind(
# Altai (70 kya) and Vindija (40 kya) Neanderthals
schedule_sampling(model, times = c(70e3, 40e3), list(nea, 1)),
# Europeans from 40 kya to the present
schedule_sampling(model, times = seq(40000, 0, -2000), list(eur, 1)),
# two African populations
schedule_sampling(model, times = 0, list(afr1, 1), list(afr2, 1)),
# Chimpanzee outgroup
schedule_sampling(model, times = 0, list(chimp, 1))
)
return(list(model, samples))
}
Let’s check the model function by visualizing it (we set the gene-flow rates to arbitrary values, because this is purely a visualization test):
test_model <- model(rate_ea = 0.42, rate_aa = 0.42)[[1]]
plot_grid(
plot_model(test_model, gene_flow = FALSE),
plot_model(test_model, log = TRUE, gene_flow = TRUE)
)
The model looks OK, so in the next step we define summary statistics statistics we are interested in (again, just as we would for an ABC pipeline).
Recall that we want to explore the impact of modern-human gene flow on Neanderthal ancestry as inferred using two ways of computing it: “indirect” and “direct” -ratio statistics.
Because our model samples 21 European individuals across the last 40 thousand years, we will compute the -ratio statistics on all of these individuals in sequence.
demografr uses the R package slendr not just for
encoding demographic models and popgen simulations, but also for
tree-sequences computation via its interface to the powerful library tskit.
One advantage of slendr is that it allows referring to
simulated samples via their “symbolic names”. This means that, for
instance, the first individual sampled from a population “eur” is named
“eur_1”, the second is named as “eur_2”, etc. Therefore, we can capture
the names of every European individual in a vector X
by
running:
X <- paste0("eur_", 1:21)
X
#> [1] "eur_1" "eur_2" "eur_3" "eur_4" "eur_5" "eur_6" "eur_7" "eur_8"
#> [9] "eur_9" "eur_10" "eur_11" "eur_12" "eur_13" "eur_14" "eur_15" "eur_16"
#> [17] "eur_17" "eur_18" "eur_19" "eur_20" "eur_21"
and use this vector in any tree-sequence function of slendr/tskit. Specifically, we can define the “indirect” and “direct” -ratio statistics in this way (using the equations defined in the figure above):
# compute the "indirect" f4-ratio statistic via slendr/tskit function ts_f4ratio,
# returning its data frame output as it is
indirect <- function(ts) {
X <- paste0("eur_", 1:21)
ts_f4ratio(ts, X = X, A = "afr1_1", B = "afr2_1", C = "nea_2", O = "chimp_1")
}
# compute the "direct" f4-ratio statistic via slendr/tskit function ts_f4ratio,
# returning its data frame output as it is
direct <- function(ts) {
X <- paste0("eur_", 1:21)
ts_f4ratio(ts, X = X, A = "nea_2", B = "nea_1", C = "afr1_1", O = "chimp_1")
}
Exactly as it’s the case for ABC, for the purposes of grid-based simulations we have to define functions which will be used to compute summary statistics whose behavior across a parameter grid we want to investigate:
# bind both functions in a named list
functions <- list(direct = direct, indirect = indirect)
This looks quite neat and tidy, but how can we make sure that our
tree-sequence summary functions really work? We need a simulated
tree-sequence object first! demografr provides an easy way to
get an arbitrary tree sequence from a model function via its helper
function simulate_model
. The simples way to use it is like
this:
# we don't care about the parameter values, as we only want to get a tree sequence
# (also, note we did not specify sequence length, recombination rate, etc.)
ts <- simulate_model(
model, parameters = list(rate_ea = 0.1, rate_aa = 0.15),
sequence_length = 1e6, recombination_rate = 1e-8, mutation_rate = 1e-8
)
With the tiny tree sequence for testing, we can now check if our tree-sequence functions work as they’re supposed to:
functions$direct(ts)
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.106
#> 2 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.00455
#> 3 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.00455
#> 4 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.0121
#> 5 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0766
#> 6 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.0842
#> 7 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.0197
#> 8 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.191
#> 9 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0485
#> 10 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.0106
#> # ℹ 11 more rows
functions$indirect(ts)
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 afr1_1 afr2_1 nea_2 chimp_1 0.704
#> 2 eur_2 afr1_1 afr2_1 nea_2 chimp_1 1.02
#> 3 eur_3 afr1_1 afr2_1 nea_2 chimp_1 0.959
#> 4 eur_4 afr1_1 afr2_1 nea_2 chimp_1 1.06
#> 5 eur_5 afr1_1 afr2_1 nea_2 chimp_1 0.878
#> 6 eur_6 afr1_1 afr2_1 nea_2 chimp_1 0.819
#> 7 eur_7 afr1_1 afr2_1 nea_2 chimp_1 0.920
#> 8 eur_8 afr1_1 afr2_1 nea_2 chimp_1 0.788
#> 9 eur_9 afr1_1 afr2_1 nea_2 chimp_1 0.922
#> 10 eur_10 afr1_1 afr2_1 nea_2 chimp_1 0.884
#> # ℹ 11 more rows
Excellent, we get the output tables from ts_f4ratio
functions as we’re supposed to (and get one estimate for each of the
21):
functions$direct(ts) %>% nrow
#> [1] 21
functions$indirect(ts) %>% nrow
#> [1] 21
Finally (and yet again, analogously to an equivalent ABC pipeline) we
define parameters to simulate over. Specifically, for a grid simulation,
we define a data frame with one column for each parameter of the
slendr model function (the names of parameter columns must
match the model function arguments!). For this purpose, the function
crossing
from the tidyr R package is particularly
helpful, so let’s use it here:
library(tidyr)
grid <- crossing(
rate_aa = seq(0, 0.25, 0.025),
rate_ea = seq(0, 0.25, 0.025)
)
head(grid)
#> # A tibble: 6 × 2
#> rate_aa rate_ea
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 0.025
#> 3 0 0.05
#> 4 0 0.075
#> 5 0 0.1
#> 6 0 0.125
In this way, we defined 121 parameter combinations and can finally proceed to the simulation inference.
In a final analogy to the ABC inference with demografr, we
have a function simulate_grid()
which works a little
similarly to simulate_abc()
introduced in
the first vignette. In fact, the function has a nearly identical
interface—we plug in the model function, our parameter grid, the list of
summary tree-sequence functions, the number of simulation replicates to
perform for each parameter combination (and a few other obvious
parameters shown here).
First, we will set up an automated parallelization scheme using the future R package:
library(future)
# parallelize simulations across all CPUs of the machine
plan(multisession, workers = availableCores())
(See much more detail on parallelization of demografr inference pipelines in this vignette).
results <- simulate_grid(
model, grid, functions, replicates = 50,
sequence_length = 25e6, mutation_rate = 1e-8, recombination_rate = 1e-8
)
The total runtime for the grid simulations was 1 hours 11 minutes 40 seconds parallelized across 96 CPUs.
The function simulate_grid()
returns all results in a
data frame object in the following format:
print(results, n = 10)
#> # A tibble: 6,050 × 5
#> rep rate_aa rate_ea direct indirect
#> <int> <dbl> <dbl> <list> <list>
#> 1 1 0 0 <tibble [21 × 6]> <tibble [21 × 6]>
#> 2 1 0 0.025 <tibble [21 × 6]> <tibble [21 × 6]>
#> 3 1 0 0.05 <tibble [21 × 6]> <tibble [21 × 6]>
#> 4 1 0 0.075 <tibble [21 × 6]> <tibble [21 × 6]>
#> 5 1 0 0.1 <tibble [21 × 6]> <tibble [21 × 6]>
#> 6 1 0 0.125 <tibble [21 × 6]> <tibble [21 × 6]>
#> 7 1 0 0.15 <tibble [21 × 6]> <tibble [21 × 6]>
#> 8 1 0 0.175 <tibble [21 × 6]> <tibble [21 × 6]>
#> 9 1 0 0.2 <tibble [21 × 6]> <tibble [21 × 6]>
#> 10 1 0 0.225 <tibble [21 × 6]> <tibble [21 × 6]>
#> # ℹ 6,040 more rows
The first column indicates a replicate number, the following few
columns are the values of the parameter grid. Most important are the
last columns, though, each named after a single summary statistics
(i.e., their symbolic names are taken from the named-list of summary
functions above). For conciseness, the values of these summary
statistics are stored as so-called “list
columns”— for each replicate and each parameter combination, these
columns store the summary statistics as “nested” data frames. For
instance, here are the results for the “direct”
statistic across the first three simulation runs. You can see that the
list-column "direct"
contains a list of data frames, one
data frame for each simulation replicate:
results$direct[1:3]
#> [[1]]
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.0251
#> 2 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.0455
#> 3 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.0214
#> 4 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.00815
#> 5 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0527
#> 6 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.00988
#> 7 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.0542
#> 8 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.0184
#> 9 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0320
#> 10 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.0439
#> # ℹ 11 more rows
#>
#> [[2]]
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.00374
#> 2 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.00157
#> 3 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.00775
#> 4 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.00576
#> 5 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0341
#> 6 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.0333
#> 7 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.00683
#> 8 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.0167
#> 9 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0114
#> 10 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.00341
#> # ℹ 11 more rows
#>
#> [[3]]
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.0864
#> 2 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.0504
#> 3 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.0486
#> 4 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.0525
#> 5 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0343
#> 6 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.0486
#> 7 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.0537
#> 8 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.0495
#> 9 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0527
#> 10 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.0416
#> # ℹ 11 more rows
List columns are very efficient and concise for storing complex data
like this, but in order to be able to analyze data stored in them, we
need a function unnest()
from tidyverse, which “unpacks” a list-column with stored data
frames into a wider data frame which stores the actual values of a given
statistic as normal numerical column. In this example:
Direct f4-ratio:
#> # A tibble: 127,050 × 10
#> rep rate_aa rate_ea X A B C O alpha indirect
#> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <list>
#> 1 1 0 0 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.0251 <tibble>
#> 2 1 0 0 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.0455 <tibble>
#> 3 1 0 0 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.0214 <tibble>
#> 4 1 0 0 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.00815 <tibble>
#> 5 1 0 0 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0527 <tibble>
#> 6 1 0 0 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.00988 <tibble>
#> 7 1 0 0 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.0542 <tibble>
#> 8 1 0 0 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.0184 <tibble>
#> 9 1 0 0 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0320 <tibble>
#> 10 1 0 0 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.0439 <tibble>
#> # ℹ 127,040 more rows
Indirect f4-ratio:
#> # A tibble: 127,050 × 10
#> rep rate_aa rate_ea direct X A B C O alpha
#> <int> <dbl> <dbl> <list> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 0 0 <tibble [21 × 6]> eur_1 afr1_1 afr2… nea_2 chim… 1.00
#> 2 1 0 0 <tibble [21 × 6]> eur_2 afr1_1 afr2… nea_2 chim… 0.949
#> 3 1 0 0 <tibble [21 × 6]> eur_3 afr1_1 afr2… nea_2 chim… 1.03
#> 4 1 0 0 <tibble [21 × 6]> eur_4 afr1_1 afr2… nea_2 chim… 1.00
#> 5 1 0 0 <tibble [21 × 6]> eur_5 afr1_1 afr2… nea_2 chim… 0.955
#> 6 1 0 0 <tibble [21 × 6]> eur_6 afr1_1 afr2… nea_2 chim… 0.994
#> 7 1 0 0 <tibble [21 × 6]> eur_7 afr1_1 afr2… nea_2 chim… 0.978
#> 8 1 0 0 <tibble [21 × 6]> eur_8 afr1_1 afr2… nea_2 chim… 0.967
#> 9 1 0 0 <tibble [21 × 6]> eur_9 afr1_1 afr2… nea_2 chim… 0.961
#> 10 1 0 0 <tibble [21 × 6]> eur_10 afr1_1 afr2… nea_2 chim… 0.916
#> # ℹ 127,040 more rows
Because we are interested in the behavior of trajectories of indirect and direct estimates of Neanderthal ancestry over time assuming different values of some demographic parameters, we will first create a small data frame indicating the times of sampling of each simulated European individual. We will use this small data frame for join operations for the purpose of plotting below:
Let’s look at the values of indirect and direct -ratio statistics assuming different values of gene flow between the two African subpopulations:
results %>%
unnest(indirect) %>%
filter(rate_ea == 0) %>% # ignore models with Eurasian -> African gene flow
inner_join(samples, by = "X") %>% # join the samples table to add time information
ggplot(aes(time, 1 - alpha, color = rate_aa, group = rate_aa)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dotted") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of within-Africa gene flow (afr1 <-> afr2) on 'indirect f4-ratio'",
"Red line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
results %>%
unnest(direct) %>%
filter(rate_ea == 0) %>%
inner_join(samples, by = "X") %>%
ggplot(aes(time, alpha, color = rate_aa, group = rate_aa)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dotted") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of within-Africa gene flow (afr1 <-> afr2) on 'direct f4-ratio'",
"Dashed line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
We can see that increasing rates of gene flow within Africa shifts apparent Neanderthal ancestry proportions inferred using “indirect” -ratio increasingly upwards (compare the estimated trajectories to the true proportion shown as dashed line). However, the “direct” -ratio remains stable, and relatively closer to the true admixture proportion indicated by the dashed line.
Now let’s investigate the behavior of both statistic in models in which there’s some proportion of backflow from Eurasia back into Africa:
results %>%
unnest(indirect) %>%
filter(rate_aa == 0.2) %>%
inner_join(samples, by = "X") %>%
ggplot(aes(time, 1 - alpha, color = rate_ea, group = rate_ea)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dotted") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of Eurasia -> Africa gene flow on 'indirect f4-ratio'",
"Dashed line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
results %>%
unnest(direct) %>%
filter(rate_aa == 0.2) %>%
inner_join(samples, by = "X") %>%
ggplot(aes(time, alpha, color = rate_ea, group = rate_ea)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dotted") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of Eurasia -> Africa gene flow on 'direct f4-ratio'",
"Dashed line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
We can see that gene flow from Eurasia back into Africa creates an artificial decline of Neanderthal ancestry if measured via “indirect” -ratio statistic! However, the “direct” -ratio remains stable and, again, relatively close to the true admixture proportion.
As we can see, the very simple simulation-based investigation described in this vignette (in total just a couple of dozen lines of easy-to-read R code) can qualitatively replicate the much original result from the ppaerpaper (which required a couple of hundreds of lines of Python / shell / R scripts):