⚠️⚠️⚠️
The demografr R package is still under active development!
⚠️️⚠️⚠️
This vignette contains an expanded version of the basic ABC inference example from the homepage of demografr. It walks through the process of setting up a demografr ABC inference pipeline step by step.
Imagine that we sequenced genomes of individuals from populations “A”, “B”, “C”, and “D”.
Let’s also assume that we know that the populations are phylogenetically related in the following way but we don’t know anything else (i.e., we have no idea about the values of , split times, or gene-flow proportions):
After sequencing the genomes of individuals from these populations, we computed nucleotide diversity in these populations as well as their pairwise genetic divergence and a one statistic, observing the following values of these summary statistics (which we saved in standard R data frames—perhaps saved by the software we used for computing these from empirical sequence data):
observed_diversity <- read.table(system.file("examples/basics_diversity.tsv", package = "demografr"), header = TRUE)
observed_diversity
#> set diversity
#> 1 A 8.030512e-05
#> 2 B 3.288576e-05
#> 3 C 1.013804e-04
#> 4 D 8.910909e-05
observed_divergence <- read.table(system.file("examples/basics_divergence.tsv", package = "demografr"), header = TRUE)
observed_divergence
#> x y divergence
#> 1 A B 0.0002378613
#> 2 A C 0.0002375761
#> 3 A D 0.0002379385
#> 4 B C 0.0001088217
#> 5 B D 0.0001157056
#> 6 C D 0.0001100633
observed_f4 <- read.table(system.file("examples/basics_f4.tsv", package = "demografr"), header = TRUE)
observed_f4
#> W X Y Z f4
#> 1 A B C D -3.262146e-06
Now let’s develop a simple ABC pipeline which will infer the posterior distributions of two sets of parameters we are interested in: of each population lineage, as well as split times between our populations of interest, and the gene-flow proportion.
Let’s begin by loading demografr together with the R package slendr on which demografr relies on for building and simulating demographic models.
library(demografr)
library(slendr)
# we also have to activate slendr's internal environment for tree sequences
# simulation and analysis
init_env()
# setup parallelization across all CPUs
library(future)
plan(multisession, workers = availableCores())
For the purpose of our ABC inference, we will bind all statistics in an R list, naming them appropriately. The names of each statistic (here “diversity”, “divergence”, and “f4”) have a meaning and are quite important for later steps:
observed <- list(
diversity = observed_diversity,
divergence = observed_divergence,
f4 = observed_f4
)
The first step in a demografr ABC analysis is setting up a “scaffold” model—a slendr function which will produce a compiled slendr model object, and which will accept the model parameters in form of normal R function arguments. In our simple examples, we will define the following function:
model <- function(Ne_A, Ne_B, Ne_C, Ne_D, T_AB, T_BC, T_CD, gf_BC) {
A <- population("A", time = 1, N = Ne_A)
B <- population("B", time = T_AB, N = Ne_B, parent = A)
C <- population("C", time = T_BC, N = Ne_C, parent = B)
D <- population("D", time = T_CD, N = Ne_D, parent = C)
gf <- gene_flow(from = B, to = C, start = 9000, end = 9301, rate = gf_BC)
model <- compile_model(
populations = list(A, B, C, D), gene_flow = gf,
generation_time = 1, simulation_length = 10000,
direction = "forward", serialize = FALSE
)
samples <- schedule_sampling(
model, times = 10000,
list(A, 25), list(B, 25), list(C, 25), list(D, 25),
strict = TRUE
)
# when a specific sampling schedule is to be used, both model and samples
# must be returned by the function
return(list(model, samples))
}
We are interested in estimating the of all populations, their split times, and gene-flow proportion which means we need to specify priors. demografr makes this very easy using its own readable syntax:
priors <- list(
Ne_A ~ runif(1000, 3000),
Ne_B ~ runif(100, 1500),
Ne_C ~ runif(5000, 10000),
Ne_D ~ runif(2000, 7000),
T_AB ~ runif(1, 4000),
T_BC ~ runif(3000, 9000),
T_CD ~ runif(5000, 10000),
gf_BC ~ runif(0, 0.3)
)
In an ABC simulation step below, the formulas are used to draw the values of each parameter from specified distributions (in this case, all uniform distributions).
For more detail into how demografr’s prior sampling formulas work (and why), take a look at this vignette.
Each run of a demografr ABC simulation internally produces a tree sequence. Because tree sequence represents an efficient, succint representation of the complete genealogical history of a set of samples, it is possible to compute population genetic statistics directly on the tree sequence without having to first save the outcome of each simulation to disk for computation in different software. Thanks to slendr’s library of tree-sequence functions serving as an R interface to the tskit module, we can specify summary statistics to be computed for ABC using normal R code.
In our example, because we computed nucleotide diversity, pairwise divergence, and from the sequence data, we will define the following summary statistic functions. Crucially, when run on a tree-sequence object, they will produce a data frame in the format analogous to the empirical statistics shown in data frames with the observed summary statistics above. This is a very important point: each observed statistic must have the same format (and dimension) as the data frames produced by the simulation summary functions. This minor inconvenience during ABC setup saves us the headache of having to match values of statistics between observed and simulated data during ABC inference itself.
compute_diversity <- function(ts) {
samples <- ts_names(ts, split = "pop")
ts_diversity(ts, sample_sets = samples)
}
compute_divergence <- function(ts) {
samples <- ts_names(ts, split = "pop")
ts_divergence(ts, sample_sets = samples)
}
compute_f4 <- function(ts) {
samples <- ts_names(ts, split = "pop")
A <- samples["A"]; B <- samples["B"]
C <- samples["C"]; D <- samples["D"]
ts_f4(ts, A, B, C, D)
}
functions <- list(
diversity = compute_diversity,
divergence = compute_divergence,
f4 = compute_f4
)
Of course, having to run ABC simulations in order to check that the
summary functions have been correctly defined would be slow and
potentially costly in terms of wasted computation. To speed this process
up, demografr provides a helper function
simulate_model()
which allows us to simulate a single
tree-sequence object from the specified model. We can then use that tree
sequence to develop (and test) our summary functions before proceeding
further, like this:
ts <- simulate_model(model, priors, sequence_length = 1e6, recombination_rate = 1e-8)
ts
#> ╔═══════════════════════════╗
#> ║TreeSequence ║
#> ╠═══════════════╤═══════════╣
#> ║Trees │ 1782║
#> ╟───────────────┼───────────╢
#> ║Sequence Length│ 1000000║
#> ╟───────────────┼───────────╢
#> ║Time Units │generations║
#> ╟───────────────┼───────────╢
#> ║Sample Nodes │ 200║
#> ╟───────────────┼───────────╢
#> ║Total Size │ 383.1 KiB║
#> ╚═══════════════╧═══════════╝
#> ╔═══════════╤════╤═════════╤════════════╗
#> ║Table │Rows│Size │Has Metadata║
#> ╠═══════════╪════╪═════════╪════════════╣
#> ║Edges │7981│249.4 KiB│ No║
#> ╟───────────┼────┼─────────┼────────────╢
#> ║Individuals│ 100│ 2.8 KiB│ No║
#> ╟───────────┼────┼─────────┼────────────╢
#> ║Migrations │ 0│ 8 Bytes│ No║
#> ╟───────────┼────┼─────────┼────────────╢
#> ║Mutations │ 0│ 16 Bytes│ No║
#> ╟───────────┼────┼─────────┼────────────╢
#> ║Nodes │2360│ 64.5 KiB│ No║
#> ╟───────────┼────┼─────────┼────────────╢
#> ║Populations│ 4│331 Bytes│ Yes║
#> ╟───────────┼────┼─────────┼────────────╢
#> ║Provenances│ 1│ 2.6 KiB│ No║
#> ╟───────────┼────┼─────────┼────────────╢
#> ║Sites │ 0│ 16 Bytes│ No║
#> ╚═══════════╧════╧═════════╧════════════╝
With this ts
object, we can, for instance, test that our
compute_f4
summary function is defined correctly (meaning
that it returns
`
statistics table formatted in exactly the same way as the observed table
above):
functions$f4(ts)
#> # A tibble: 1 × 5
#> W X Y Z f4
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 A B C D 0
Looks good! (We have zero
values because we didn’t specify mutation rate in
simulate_model
, as we were only interested in the
compatibility of the formats and dimensions of both simulated and
observed
tables). Of course, in your pipeline, you might (should!) check that all
of your summary functions are set up correctly.
Having defined the scaffold model, a set of priors for our parameters
of interest
(,
split times, gene-flow proportion), as well as two summary statistic
functions, we can integrate all this information into the function
simulate_abc()
.
Before we run a potentially computationally costly simulations, it is
a good idea to again validate the ABC components we have so far
assembled using the function validate_abc()
. This provides
much more elaborate correctness checks beyond the simple testing of the
summary functions as we did above.
validate_abc(model, priors, functions, observed)
#> ======================================================================
#> Testing sampling of each prior parameter:
#> - Ne_A ✅
#> - Ne_B ✅
#> - Ne_C ✅
#> - Ne_D ✅
#> - T_AB ✅
#> - T_BC ✅
#> - T_CD ✅
#> - gf_BC ✅
#> ---------------------------------------------------------------------
#> The model is a slendr function
#> ---------------------------------------------------------------------
#> Checking the return statement of the model function... ✅
#> ---------------------------------------------------------------------
#> Checking the presence of required model function arguments... ✅
#> ---------------------------------------------------------------------
#> Simulating tree sequence from the given model... ✅
#> ---------------------------------------------------------------------
#> Computing user-defined summary functions:
#> - diversity ✅
#> - divergence ✅
#> - f4 ✅
#> ---------------------------------------------------------------------
#> Checking the format of simulated summary statistics:
#> - diversity [data frame] ✅
#> - divergence [data frame] ✅
#> - f4 [data frame] ✅
#> ======================================================================
#> No issues have been found in the ABC setup!
Having verified that all model components are set up correctly, we
can proceed to the ABC simulations themselves, using
demografr’s function simulate_abc()
:
data <- simulate_abc(
model, priors, functions, observed, iterations = 10000,
sequence_length = 10e6, recombination_rate = 1e-8, mutation_rate = 1e-8
)
The total runtime for the ABC simulations was 0 hours 26 minutes 3 seconds parallelized across 96 CPUs.
At this point we have generated summary statistics for simulations of models using parameters drawn from our priors. In the next step, we can finally do inference of our parameters.
Having all the information about observed and simulated data bound in
a single R object data
, we can finally infer the posterior
distribution of our model parameters using ABC. demografr
includes a convenient function run_abc()
which reformats
the simulated and observed data in a format required by the R package abc and
internally calls its own function abc()
.
Note that run_abc
is just convenience wrapper around the
abc()
function in the package
abc, saving us a lot of work with the reformatting of data that's otherwise necessary. As such, all parameters of the function
abc()can be provided to
run_abc()`,
which will then pass them on appropriately.
Before we proceed with inferring values of the model parameters from their posterior distributions, we should check whether our model can actually produce summary statistics which match the statistics observed in the empirical data.
In order to do this, we can use demografr’s
predict()
method which accepts two mandatory function
arguments: the first is the abc
object generated by the
run_abc()
function (containing, among other things, the
information about the inferred posteriors), and the number of draws to
take from the posterior distribution of the parameters:
# because we set up a parallelization plan() above, the predictions are
# computed in parallel across all available CPUs
predictions <- predict(abc, samples = 1000)
If we take a closer look at the produced result, we see that it’s a
data frame object with several special columns (so-called “list
columns”). Those columns contain nested data frames with the values
of tree-sequence summary statistics computed for each combination of the
parameter values. Because this data has the same format as the output of
the function simulate_grid()
(in fact, internally it
is produced by this function), see the vignette on grid simulations for more detail how
to analyse this.
In order to examine the posterior predictive check results for
individual (or all) summary statistics, we can use the function
extract_prediction()
. A convenient alternative to check the
results visually is provided by the function
plot_prediction()
. For instance, we can take a look at the
distributions of divergences produced from the parameter posterior like
this:
plot_prediction(predictions, "diversity")
plot_prediction(predictions, "divergence")
plot_prediction(predictions, "f4")
Nice! It appears that the posterior from our simplistic model captures the observed summary statistics (dashed vertical lines) quite well.
(To make evaluation a little bit easier, there’s an option
file =
which instructs plot_prediction
to save
a figure to disk instead of plotting it using a graphical device—useful
for work on remote servers!).
For more details on additional downstream validation and troubleshooting options, please see this vignette.
Now that we have the ABC output object ready and have more or less convinced ourselves that our model can capture the summary statistics we’ve chosen via posterior predictive checks, we can proceed with parameter inference.
First, we can get a data frame with summary statistics of the posterior distributions of our parameters. For instance, we can easily read the maximum a posteriori probability (MAP) of the parameters in the row labelled “Mode:”:
extract_summary(abc)
#> Ne_A Ne_B Ne_C Ne_D T_AB
#> Min.: 1609.744 682.6193 7389.026 1846.029 -96.82908
#> Weighted 2.5 % Perc.: 1812.239 729.5846 7585.374 2691.913 1242.73702
#> Weighted Median: 2030.741 837.2673 8588.420 3785.385 1907.90411
#> Weighted Mean: 2019.638 841.6564 8674.207 3805.943 1949.07359
#> Weighted Mode: 2049.985 847.0022 8469.365 3642.722 1911.88919
#> Weighted 97.5 % Perc.: 2204.442 979.8260 9543.851 4742.401 2690.25909
#> Max.: 2281.020 1038.0898 10965.790 5915.021 2921.08324
#> T_BC T_CD gf_BC
#> Min.: 5272.336 6608.244 -0.02143034
#> Weighted 2.5 % Perc.: 5630.022 7105.185 0.03373237
#> Weighted Median: 6133.605 7829.842 0.09864958
#> Weighted Mean: 6110.468 7814.700 0.10241672
#> Weighted Mode: 6236.457 7855.537 0.09655264
#> Weighted 97.5 % Perc.: 6603.022 8379.513 0.17629759
#> Max.: 6838.022 8551.676 0.21388432
Because large tables can get a little hard to read, it is possible to subset to only a specific type of parameter:
extract_summary(abc, param = "Ne")
#> Ne_A Ne_B Ne_C Ne_D
#> Min.: 1609.744 682.6193 7389.026 1846.029
#> Weighted 2.5 % Perc.: 1812.239 729.5846 7585.374 2691.913
#> Weighted Median: 2030.741 837.2673 8588.420 3785.385
#> Weighted Mean: 2019.638 841.6564 8674.207 3805.943
#> Weighted Mode: 2049.985 847.0022 8469.365 3642.722
#> Weighted 97.5 % Perc.: 2204.442 979.8260 9543.851 4742.401
#> Max.: 2281.020 1038.0898 10965.790 5915.021
extract_summary(abc, param = "T")
#> T_AB T_BC T_CD
#> Min.: -96.82908 5272.336 6608.244
#> Weighted 2.5 % Perc.: 1242.73702 5630.022 7105.185
#> Weighted Median: 1907.90411 6133.605 7829.842
#> Weighted Mean: 1949.07359 6110.468 7814.700
#> Weighted Mode: 1911.88919 6236.457 7855.537
#> Weighted 97.5 % Perc.: 2690.25909 6603.022 8379.513
#> Max.: 2921.08324 6838.022 8551.676
Alternatively, we can also extract the posterior summary for a single model parameter like this:
extract_summary(abc, param = "gf_BC")
#> gf_BC
#> Min.: -0.02143034
#> Weighted 2.5 % Perc.: 0.03373237
#> Weighted Median: 0.09864958
#> Weighted Mean: 0.10241672
#> Weighted Mode: 0.09655264
#> Weighted 97.5 % Perc.: 0.17629759
#> Max.: 0.21388432
Because a chart is always more informative than a table, we can
easily get a visualization of our posteriors using the function
plot_posterior()
:
plot_posterior(abc, param = "Ne") + ggplot2::coord_cartesian(xlim = c(0, 10000))
Excellent! It looks like we got really nice and informative posterior distributions of values!
Similarly, we get nice and informative posterior distributions of split times:
plot_posterior(abc, param = "T") + ggplot2::coord_cartesian(xlim = c(0, 10000))
And the same is true for the gene-flow proportion:
plot_posterior(abc, param = "gf") + ggplot2::coord_cartesian(xlim = c(0, 1))
Because the internals of demografr ABC objects are represented by standard objects created by the abc package, we have all of the standard diagnostics functions of the abc R package at our disposal.
plot(abc, param = "T_BC")
plot(abc, param = "gf_BC")
Again, there are numerous tools for additional diagnostics, which you can learn about here.