⚠️⚠️⚠️
Note: The demografr R package is still under active development. As a result, its documentation is in a draft stage at best. Typos, inconsistencies, and other issues are unfortunately expected.
⚠️⚠️⚠️
Although Approximate Bayesian Computation is arguably the most popular method for simulation-based inference, it’s certainly not the only one. Another useful method for exploring the behavior of statistics under given combination of model parameters, and perhaps even very simple model parameter fitting are grid simulations. Briefly, we define a set of parameters of a model, run simulation replicates for each combination of parameters, computing summary statistics of interest for each replicat, and then—for instance—inspect which parameter combinations lead to the values of summary statistics closes to those we observed in the data, or simply inspect how do different parameter values affect the statistics in general.
To demonstrate the features of demografr for grid-based simulation inference, let’s try to reproduce (at least qualitatively) the result from Petr et al. (PNAS 2018). Briefly, this paper made an attempt to explain discrepancies between two different ways to estimate the proportion of Neanderthal ancestry in Europe over tens of thousands of years.
The question was, why are these two measures of Neanderthal ancestry leading to different conclusions? Could some violations of their assumptions (such as unnacounted-for gene-flow events between modern human populations) break one of them?
The way the PNAS 2018 study approached the problem was tu run many simulation replicates across a large parameter grid. This required writing custom coalescent simulations, orchestrating running them across the parameter grid, collecting outputs, and computing population genetic statistics (here direct and indirect \(f_4\)-ratios)—all in all, a couple of hundred lines of Python, R, and bash.
This vignette shows how to use demografr to perform this kind of analyses in a much more straightforward and reproducible manner, entirely in R.
As in any standard ABC pipeline using demografr, we leverage the slendr R package to first define a demografic model, encoding it as a simple R function. Again, as with standard ABC, the arguments of this function correspond to model parameters. Note that those will come from the parameter grid, not via sampling from priors as with ABC!
In our example, because the research question of interest involved various gene-flow rates between populations in the model, we fix split times and \(N_e\) values because they are not that interesting or important for this simple analysis (we ignore the Eurasia to Africa “backflow” for the same reason):
model <- function(rate_ea, rate_aa) {
# create populations
chimp <- population("chimp", time = 7e6, N = 10000)
anc <- population("anc", time = 6e6, N = 10000, parent = chimp)
nea <- population("nea", time = 600e3, N = 2000, parent = anc)
afr1 <- population("afr1", time = 600e3, N = 15000, parent = anc)
afr2 <- population("afr2", time = 150e3, N = 15000, parent = afr1)
eur <- population("eur", time = 75e3, N = 5000, parent = afr2)
# define gene-flow events
gf <- list(
# back flow from Eurasia into Africa -- one parameter of interest
gene_flow(from = eur, to = afr1, start = 5e3, end = 0, rate = rate_ea),
gene_flow(from = eur, to = afr2, start = 5e3, end = 0, rate = rate_ea),
# gene flow within Africa -- another parameter of interest
gene_flow(from = afr1, to = afr2, start = 50e3, end = 0, rate = rate_aa),
gene_flow(from = afr2, to = afr1, start = 50e3, end = 0, rate = rate_aa),
# Neanderthal introgression (fixed to 3%)
gene_flow(from = nea, to = eur, start = 60e3, end = 50e3, rate = 0.03)
)
# generate a slendr model
model <- compile_model(
populations = list(chimp, anc, nea, afr1, afr2, eur), gene_flow = gf,
generation_time = 30, serialize = FALSE
)
# specify sampling events
samples <- rbind(
# Altai (70 kya) and Vindija (40 kya) Neanderthals
schedule_sampling(model, times = c(70e3, 40e3), list(nea, 1)),
# Europeans from 40 kya to the present
schedule_sampling(model, times = seq(40000, 0, -2000), list(eur, 1)),
# two African populations
schedule_sampling(model, times = 0, list(afr1, 1), list(afr2, 1)),
# Chimpanzee outgroup
schedule_sampling(model, times = 0, list(chimp, 1))
)
return(list(model, samples))
}
Let’s check the model function by visualizing it (we set the gene-flow rates to arbitrary values, because this is purely a visualization test):
test_model <- model(rate_ea = 0.42, rate_aa = 0.42)[[1]]
# we don't plot the gene flow arrows here for clarity
cowplot::plot_grid(
plot_model(test_model, gene_flow = FALSE),
plot_model(test_model, log = TRUE, gene_flow = FALSE)
)
The model looks OK, so in the next step we define tree-sequence popgen summary statistics we are interested in (again, just as we would for an ABC pipeline).
Recall that we want to explore the impact of modern-human gene flow on Neanderthal ancestry as inferred using two ways of computing it: “indirect” and “direct” \(f_4\)-ratio statistics.
Because our model samples 21 European individuals across the last 40 thousand years, we will compute the \(f_4\)-ratio statistics on all of these individuals in sequence.
demografr uses the R package slendr not just for
encoding demographic models and popgen simulations, but also for
analysis of tree sequences via its interface to the powerful library tskit.
One advantage of slendr is that it allows referring to
simulated samples via their “symbolic names”. This means that, for
instance, the first individual sampled from a population “eur” is named
“eur_1”, the second is named as “eur_2”, etc. Therefore, we can capture
the names of every European individual in a vector X
by
running:
X <- paste0("eur_", 1:21)
X
#> [1] "eur_1" "eur_2" "eur_3" "eur_4" "eur_5" "eur_6" "eur_7" "eur_8"
#> [9] "eur_9" "eur_10" "eur_11" "eur_12" "eur_13" "eur_14" "eur_15" "eur_16"
#> [17] "eur_17" "eur_18" "eur_19" "eur_20" "eur_21"
and use this vector as input for any tree-sequence function of slendr/tskit. Specifically, we can define the “indirect” and “direct” \(f_4\)-ratio statistics in this way (using the equations defined in the figure above):
# compute the "indirect" f4-ratio statistic via slendr/tskit function ts_f4ratio,
# returning its data frame output as it is
indirect <- function(ts) {
X <- paste0("eur_", 1:21)
ts_f4ratio(ts, X = X, A = "afr1_1", B = "afr2_1", C = "nea_2", O = "chimp_1")
}
# compute the "direct" f4-ratio statistic via slendr/tskit function ts_f4ratio,
# returning its data frame output as it is
direct <- function(ts) {
X <- paste0("eur_", 1:21)
ts_f4ratio(ts, X = X, A = "nea_2", B = "nea_1", C = "afr1_1", O = "chimp_1")
}
Exactly as it’s the case for ABC, for the purposes of grid-based simulations we have to
# bind both functions in a named list
functions <- list(direct = direct, indirect = indirect)
This looks quite neat and tidy, but how can we make sure that our
tree-sequence summary functions really work? We need a tree-sequence
object as an input first! demografr provides an easy way to get
an arbitrary tree sequence from a model function via its helper function
simulate_ts
. The simples way to use it is like this:
# we don't care about the parameter values, as we only want to get *some* tree sequence
# (also, note we did not specify sequence length, recombination rate, etc.)
ts <- simulate_ts(model, parameters = list(rate_ea = 0.1, rate_aa = 0.15), mutation_rate = 1e-8)
With the tiny tree sequence for testing, we can now check if our tree-sequence functions work as they’re supposed to:
functions$direct(ts)
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.472
#> 2 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.475
#> 3 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.475
#> 4 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.505
#> 5 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.534
#> 6 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.534
#> 7 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.475
#> 8 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.473
#> 9 eur_9 nea_2 nea_1 afr1_1 chimp_1 -0.00329
#> 10 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.475
#> # ℹ 11 more rows
functions$indirect(ts)
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 afr1_1 afr2_1 nea_2 chimp_1 0.759
#> 2 eur_2 afr1_1 afr2_1 nea_2 chimp_1 0.771
#> 3 eur_3 afr1_1 afr2_1 nea_2 chimp_1 0.771
#> 4 eur_4 afr1_1 afr2_1 nea_2 chimp_1 0.392
#> 5 eur_5 afr1_1 afr2_1 nea_2 chimp_1 0.0125
#> 6 eur_6 afr1_1 afr2_1 nea_2 chimp_1 0.0125
#> 7 eur_7 afr1_1 afr2_1 nea_2 chimp_1 0.771
#> 8 eur_8 afr1_1 afr2_1 nea_2 chimp_1 0.765
#> 9 eur_9 afr1_1 afr2_1 nea_2 chimp_1 1.04
#> 10 eur_10 afr1_1 afr2_1 nea_2 chimp_1 0.771
#> # ℹ 11 more rows
Excellent, we get the output tables from ts_f4ratio
functions as we’re supposed to (and get one estimate for each of the
21):
functions$direct(ts) %>% nrow
#> [1] 21
functions$indirect(ts) %>% nrow
#> [1] 21
Finally (and yet again, analogously to an equivalent ABC pipeline) we
define parameters. Specifically, for a grid simulation, we define a data
frame with one column for each parameter of the slendr model
function (the names of parameter columns must match the model function
arguments!). For this purpose, the function crossing
from
the tidyr R package is particularly helpful, so let’s use it
here:
#> # A tibble: 6 × 2
#> rate_aa rate_ea
#> <dbl> <dbl>
#> 1 0 0
#> 2 0 0.025
#> 3 0 0.05
#> 4 0 0.075
#> 5 0 0.1
#> 6 0 0.125
In this way, we defined 121 parameter combinations and can finally proceed to the simulation inference.
In a final analogy to the ABC inference with demografr, we
have a function simulate_grid()
which performs the
equivalent to simulate_abc()
. In fact, the function has a
nearly identical interface—we plug in the model function, our parameter
grid, the list of summary tree-sequence functions, the number of
simulation replicates to perform for each parameter combination (and a
few other obvious parameters shown here).
First, we will set up an automated parallelization scheme using the future R package:
library(future)
# parallelize simulations across all CPUs of the machine
plan(multisession, workers = availableCores())
(See much more detail on parallelization of demografr inference pipelines in this vignette).
results <- simulate_grid(
model, grid, functions, replicates = 50,
sequence_length = 25e6, mutation_rate = 1e-8, recombination_rate = 1e-8
)
The total runtime for the grid simulations was 1 hours 3 minutes 19 seconds parallelized across 96 CPUs.
The function simulate_grid()
returns all results in a
data frame object in the following format:
print(results, n = 10)
#> # A tibble: 6,050 × 5
#> rep rate_aa rate_ea direct indirect
#> <int> <dbl> <dbl> <list> <list>
#> 1 1 0 0 <tibble [21 × 6]> <tibble [21 × 6]>
#> 2 1 0 0.025 <tibble [21 × 6]> <tibble [21 × 6]>
#> 3 1 0 0.05 <tibble [21 × 6]> <tibble [21 × 6]>
#> 4 1 0 0.075 <tibble [21 × 6]> <tibble [21 × 6]>
#> 5 1 0 0.1 <tibble [21 × 6]> <tibble [21 × 6]>
#> 6 1 0 0.125 <tibble [21 × 6]> <tibble [21 × 6]>
#> 7 1 0 0.15 <tibble [21 × 6]> <tibble [21 × 6]>
#> 8 1 0 0.175 <tibble [21 × 6]> <tibble [21 × 6]>
#> 9 1 0 0.2 <tibble [21 × 6]> <tibble [21 × 6]>
#> 10 1 0 0.225 <tibble [21 × 6]> <tibble [21 × 6]>
#> # ℹ 6,040 more rows
The first column indicates a replicate number, the following few columns are the values of the parameter grid. Most important are the last columns, though, each named after a single summary statistics. For conciseness, these summary statistics are stored as so-called “list columns”—for each replicate and each parameter combination, these columns store the summary statistics as “nested” data frames like this:
results$direct[1:2]
#> [[1]]
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.0169
#> 2 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.0237
#> 3 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.00826
#> 4 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.0136
#> 5 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0183
#> 6 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.0199
#> 7 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.0370
#> 8 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.0266
#> 9 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0277
#> 10 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.0314
#> # ℹ 11 more rows
#>
#> [[2]]
#> # A tibble: 21 × 6
#> X A B C O alpha
#> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.0450
#> 2 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.0278
#> 3 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.0552
#> 4 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.0444
#> 5 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0330
#> 6 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.0345
#> 7 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.0237
#> 8 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.0342
#> 9 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0189
#> 10 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.0535
#> # ℹ 11 more rows
List columns are very efficient and concise for storing complex data
like this, but in order to be able to analyze data stored in them, we
need a function unnest()
from tidyverse, which “explodes” a list-column with stored data
frames into a wider data frame which can be immediately analyzed. In
this example:
Direct f4-ratio:
results %>% unnest(direct)
#> # A tibble: 127,050 × 10
#> rep rate_aa rate_ea X A B C O alpha indirect
#> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <list>
#> 1 1 0 0 eur_1 nea_2 nea_1 afr1_1 chimp_1 0.0169 <tibble>
#> 2 1 0 0 eur_2 nea_2 nea_1 afr1_1 chimp_1 0.0237 <tibble>
#> 3 1 0 0 eur_3 nea_2 nea_1 afr1_1 chimp_1 0.00826 <tibble>
#> 4 1 0 0 eur_4 nea_2 nea_1 afr1_1 chimp_1 0.0136 <tibble>
#> 5 1 0 0 eur_5 nea_2 nea_1 afr1_1 chimp_1 0.0183 <tibble>
#> 6 1 0 0 eur_6 nea_2 nea_1 afr1_1 chimp_1 0.0199 <tibble>
#> 7 1 0 0 eur_7 nea_2 nea_1 afr1_1 chimp_1 0.0370 <tibble>
#> 8 1 0 0 eur_8 nea_2 nea_1 afr1_1 chimp_1 0.0266 <tibble>
#> 9 1 0 0 eur_9 nea_2 nea_1 afr1_1 chimp_1 0.0277 <tibble>
#> 10 1 0 0 eur_10 nea_2 nea_1 afr1_1 chimp_1 0.0314 <tibble>
#> # ℹ 127,040 more rows
Indirect f4-ratio:
results %>% unnest(indirect)
#> # A tibble: 127,050 × 10
#> rep rate_aa rate_ea direct X A B C O alpha
#> <int> <dbl> <dbl> <list> <chr> <chr> <chr> <chr> <chr> <dbl>
#> 1 1 0 0 <tibble [21 × 6]> eur_1 afr1_1 afr2… nea_2 chim… 0.956
#> 2 1 0 0 <tibble [21 × 6]> eur_2 afr1_1 afr2… nea_2 chim… 0.924
#> 3 1 0 0 <tibble [21 × 6]> eur_3 afr1_1 afr2… nea_2 chim… 0.945
#> 4 1 0 0 <tibble [21 × 6]> eur_4 afr1_1 afr2… nea_2 chim… 0.955
#> 5 1 0 0 <tibble [21 × 6]> eur_5 afr1_1 afr2… nea_2 chim… 0.927
#> 6 1 0 0 <tibble [21 × 6]> eur_6 afr1_1 afr2… nea_2 chim… 0.941
#> 7 1 0 0 <tibble [21 × 6]> eur_7 afr1_1 afr2… nea_2 chim… 0.927
#> 8 1 0 0 <tibble [21 × 6]> eur_8 afr1_1 afr2… nea_2 chim… 0.970
#> 9 1 0 0 <tibble [21 × 6]> eur_9 afr1_1 afr2… nea_2 chim… 0.959
#> 10 1 0 0 <tibble [21 × 6]> eur_10 afr1_1 afr2… nea_2 chim… 0.952
#> # ℹ 127,040 more rows
Note that unnesting each results column creates a plain old data frame which can be analyzed by whatever means you find convenient.
Because we were interested in the behavior of trajectories of indirect and direct estimates of Neanderthal ancestry over time, we will first create a small data frame indicating the times of sampling of each simulated European individual. We will use this small data frame for join operations for the purpose of plotting below:
results %>%
unnest(indirect) %>%
filter(rate_ea == 0) %>% # ignore models with Eurasian -> African gene flow
inner_join(samples, by = "X") %>% # join the samples table to add time information
ggplot(aes(time, 1 - alpha, color = rate_aa, group = rate_aa)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of within-Africa gene flow (afr1 <-> afr2) on 'indirect f4-ratio'",
"Red line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
results %>%
unnest(direct) %>%
filter(rate_ea == 0) %>%
inner_join(samples, by = "X") %>%
ggplot(aes(time, alpha, color = rate_aa, group = rate_aa)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of within-Africa gene flow (afr1 <-> afr2) on 'direct f4-ratio'",
"Red line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
We can see that increasing rates of gene flows within Africa shifts apparent Neanderthal ancestry proportions inferred using “indirect” \(f_4\)-ratio increasingly upwards (compare the estimated trajectories to the true proportion shown as dashed line). However, the “direct” \(f_4\)-ratio remains stable.
results %>%
unnest(indirect) %>%
filter(rate_aa == 0.2) %>%
inner_join(samples, by = "X") %>%
ggplot(aes(time, 1 - alpha, color = rate_ea, group = rate_ea)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of Eurasia -> Africa gene flow on 'indirect f4-ratio'",
"Red line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
results %>%
unnest(direct) %>%
filter(rate_aa == 0.2) %>%
inner_join(samples, by = "X") %>%
ggplot(aes(time, alpha, color = rate_ea, group = rate_ea)) +
geom_line(stat = "smooth", se = FALSE) +
geom_hline(yintercept = 0.03, color = "red", linetype = "dashed") +
geom_hline(yintercept = 0, color = "black", linetype = "dashed") +
labs(x = "years before present", y = "Neanderthal ancestry proportion") +
xlim(40000, 0) +
coord_cartesian(y = c(0, 0.1)) +
ggtitle("The effect of Eurasia -> Africa gene flow on 'direct f4-ratio'",
"Red line indicates the true Neanderthal ancestry proportion")
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Even more importantly, gene flow from Eurasia back into Africa creates an artificial decline of Neanderthal ancestry if measured via “indirect” \(f_4\)-ratio statistic! However, the “direct” \(f_4\)-ratio remains stable.