⚠️⚠️⚠️
Please note that until a peer-reviewed publication describing the demografr package appears, the software should be regarded as experimental and potentially unstable.
⚠️⚠️⚠️
The goal of demografr is to simplify and streamline the development of simulation-based inference pipelines in population genetics and evolutionary biology, such as Approximate Bayesian Computation (ABC) or parameter grid inferences, and make them more reproducible. demografr also aims to make these inferences orders of magnitude faster and more efficient by leveraging the tree sequence as an internal data structure and computation engine.
Unlike traditional ABC and other simulation-based approaches, which generally involve custom-built pipelines and scripts for population genetic simulation and computation of summary statistics, demografr makes it possible to perform simulation, computation of summary statistics, and the inference itself within a single reproducible R script. By removing the need to write custom simulation code and glue scripts integrating various population genetic tools for computing summary statistics, it eliminates many common sources of bugs, lowering the barrier for new users and facilitating reproducibility for everyone regardless of their level of experience.
How does demografr help with ABC?
demografr streamlines every step of a typical ABC pipeline by leveraging the slendr framework as a building block for simulation and data analysis, making it possible to write complete simulation-based workflows entirely in R. Specifically:
slendr’s intuitive, interactive interface for defining population genetic models makes it easy to encode even complex demographic models with only a bare minimum of R knowledge needed.
demografr makes it possible to encode prior distributions of parameters using a familiar R interface resembling standard probabilistic statements, and provides an automated function which simulates ABC replicates by drawing parameters from those priors in a trivial, one-step manner. Automated routines for exploring model parameter grids in settings other than ABC are also provided.
Because slendr embraces the tree sequence as its default internal data structure, most population genetic statistics can be computed directly on such tree sequences using R functions which are part of slendr’s statistical library. A tree sequence is never saved to disk and no conversion between file formats is required, which significantly speeds up every workflow. That said, using custom simulation scripts, as well as computing summary statistics in third-party software, is also possible (while still leveraging the consistent demografr workflow approach and most of its standard simulation and inference functions).
demografr facilitates tight integration with the powerful R package abc by automatically feeding it simulation data for inference and diagnostics. At the moment, the abc package represents a core inference engine of demografr. Other inference engines and statistical approaches will be developed in the future.
Installation
Once the demografr package appears on CRAN, you will be able to install the latest released version with the following command:
install.packages("demografr")

However, especially early after an initial release, it might be worth keeping an eye on the changelog for bugfixes and improvements, which you can always obtain by installing the development version of demografr with:
devtools::install_github("bodkan/demografr")

Note that this requires the R package devtools, which you can install simply by running install.packages("devtools").
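Whichever route you take, you can confirm which version ended up installed (plain base-R commands, nothing demografr-specific) and compare it against the changelog:

```r
# load the package and print the installed version
library(demografr)
packageVersion("demografr")
```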
Testing the R package in an online RStudio session
You can open an RStudio session and test examples from the vignettes directly in your web browser by clicking this button (no installation is needed!):
In case the RStudio instance appears to be starting very slowly, please be patient (Binder is a freely available service with limited computational resources). If Binder crashes, try reloading the web page, which will restart the cloud session.
Once you get a browser-based RStudio session, you can navigate to the vignettes/ directory and test the examples on your own!
An example ABC pipeline
Note: A much more detailed explanation of this toy example can be found in the following vignette, or in the demografr manuscript available on bioRxiv. Please take the code below as a bare-minimum demonstration of some basic functionality of the package.
Imagine that we sequenced genomes of individuals from populations “A”, “B”, “C”, and “D”.
Let’s also assume that we know that the populations are phylogenetically related in the following way, with an indicated gene-flow event at a certain time in the past, but that we don’t know anything else (i.e., we have no idea about the Ne of the populations, their split times, or the proportion of gene flow):

After sequencing the genomes of individuals from these populations, we computed the nucleotide diversity in each of them, their pairwise genetic divergence, and an f4 statistic, and observed the following values (which we saved in three standard R data frames). In a traditional ABC setting, this computation and table formatting would typically be done in some third-party software like PLINK, ADMIXTOOLS, etc.
- Nucleotide diversity in each population:
observed_diversity <- read.table(system.file("examples/basics_diversity.tsv", package = "demografr"), header = TRUE)
observed_diversity
#> set diversity
#> 1 A 8.030512e-05
#> 2 B 3.288576e-05
#> 3 C 1.013804e-04
#> 4 D 8.910909e-05

- Pairwise divergence between populations X and Y:
observed_divergence <- read.table(system.file("examples/basics_divergence.tsv", package = "demografr"), header = TRUE)
observed_divergence
#> x y divergence
#> 1 A B 0.0002378613
#> 2 A C 0.0002375761
#> 3 A D 0.0002379385
#> 4 B C 0.0001088217
#> 5 B D 0.0001157056
#> 6 C D 0.0001100633

- Value of the following f4 statistic:
observed_f4 <- read.table(system.file("examples/basics_f4.tsv", package = "demografr"), header = TRUE)
observed_f4
#> W X Y Z f4
#> 1 A B C D -3.262146e-06

Note that the value of each statistic is given in the last column of each data frame, with additional columns providing relevant “metadata”. Keeping consistent naming throughout an inference pipeline is extremely important for demografr, and goes a long way towards reproducibility and towards avoiding sneaky bugs! For this reason, demografr often loudly complains if something looks inconsistent. Better safe than sorry!
A complete ABC pipeline in a single R script
This is how we would use demografr to estimate the Ne and split times of all populations, as well as the rate of the indicated gene-flow event, with Approximate Bayesian Computation in a single R script. Again, a detailed description of each component can be found throughout our vignettes, and the motivation for various design decisions is also detailed in the demografr manuscript.
library(demografr)
library(slendr)
# running setup_env() first might be necessary to set up slendr's internal
# simulation environment
init_env()
# set up parallelization across all CPUs on the current machine
library(future)
plan(multisession, workers = availableCores())
#--------------------------------------------------------------------------------
# bind data frames with empirical summary statistics into a named list
observed <- list(
diversity = observed_diversity,
divergence = observed_divergence,
f4 = observed_f4
)
#--------------------------------------------------------------------------------
# define a model generating function using the slendr interface
# (each function parameter corresponds to a model parameter we want to infer)
model <- function(Ne_A, Ne_B, Ne_C, Ne_D, T_AB, T_BC, T_CD, gf_BC) {
A <- population("A", time = 1, N = Ne_A)
B <- population("B", time = T_AB, N = Ne_B, parent = A)
C <- population("C", time = T_BC, N = Ne_C, parent = B)
D <- population("D", time = T_CD, N = Ne_D, parent = C)
gf <- gene_flow(from = B, to = C, start = 9000, end = 9301, rate = gf_BC)
model <- compile_model(
populations = list(A, B, C, D), gene_flow = gf,
generation_time = 1, simulation_length = 10000,
direction = "forward", serialize = FALSE
)
samples <- schedule_sampling(
model, times = 10000,
list(A, 25), list(B, 25), list(C, 25), list(D, 25),
strict = TRUE
)
# when a specific sampling schedule is to be used, both model and samples
# must be returned by the function
return(list(model, samples))
}
#--------------------------------------------------------------------------------
# setup priors for model parameters
priors <- list(
Ne_A ~ runif(1000, 3000),
Ne_B ~ runif(100, 1500),
Ne_C ~ runif(5000, 10000),
Ne_D ~ runif(2000, 7000),
T_AB ~ runif(1, 4000),
T_BC ~ runif(3000, 9000),
T_CD ~ runif(5000, 10000),
gf_BC ~ runif(0, 0.3)
)
#--------------------------------------------------------------------------------
# define summary functions to be computed on simulated data (must be of the
# same format as the summary statistics computed on empirical data)
compute_diversity <- function(ts) {
samples <- ts_names(ts, split = "pop")
ts_diversity(ts, sample_sets = samples)
}
compute_divergence <- function(ts) {
samples <- ts_names(ts, split = "pop")
ts_divergence(ts, sample_sets = samples)
}
compute_f4 <- function(ts) {
samples <- ts_names(ts, split = "pop")
A <- samples["A"]; B <- samples["B"]
C <- samples["C"]; D <- samples["D"]
ts_f4(ts, A, B, C, D)
}
# the summary functions must also be bound to an R list named in the same
# way as the empirical summary statistics
functions <- list(
diversity = compute_diversity,
divergence = compute_divergence,
f4 = compute_f4
)
#--------------------------------------------------------------------------------
# validate the individual ABC components for correctness and consistency
validate_abc(model, priors, functions, observed,
sequence_length = 1e6, recombination_rate = 1e-8)
#--------------------------------------------------------------------------------
# run ABC simulations
data <- simulate_abc(
model, priors, functions, observed, iterations = 10000,
sequence_length = 50e6, recombination_rate = 1e-8, mutation_rate = 1e-8
)
#--------------------------------------------------------------------------------
# infer posterior distributions of parameters using the abc R package
abc <- run_abc(data, engine = "abc", tol = 0.01, method = "neuralnet")

Examining posterior distributions of parameters
After we run this R script, we end up with an object called abc (see the last line of the script). This object contains the complete information about the results of our inference. In particular, it carries the posterior samples for our parameters of interest (Ne of populations and their split times).
What can we do with this to get an idea about the most likely parameters of the assumed evolutionary history of our populations of interest?
For instance, we can get a summary table of all parameter posteriors with the function extract_summary():
extract_summary(abc)
#> Ne_A Ne_B Ne_C Ne_D T_AB T_BC
#> Min.: 1492.557 526.6231 6373.061 2254.591 859.5749 5131.473
#> Weighted 2.5 % Perc.: 1774.758 672.5853 7344.040 2895.795 1318.3159 5595.637
#> Weighted Median: 2032.148 848.1467 8553.144 3814.660 1934.0008 6136.188
#> Weighted Mean: 2021.722 838.5594 8678.066 3804.614 1954.3343 6112.305
#> Weighted Mode: 2054.933 861.3408 8428.777 3721.917 1933.3694 6230.341
#> Weighted 97.5 % Perc.: 2270.334 1003.8421 10162.320 4530.273 2522.5919 6611.315
#> Max.: 2438.657 1047.0935 12703.479 5764.139 2650.7248 6851.800
#> T_CD gf_BC
#> Min.: 6752.669 -0.04681798
#> Weighted 2.5 % Perc.: 7125.349 0.02469992
#> Weighted Median: 7835.810 0.09867650
#> Weighted Mean: 7824.532 0.10282631
#> Weighted Mode: 7851.617 0.09804401
#> Weighted 97.5 % Perc.: 8426.377 0.18467366
#> Max.: 8476.157 0.21801675

We can also specify a subset of model parameters to select, or provide a regular expression for this subsetting:
extract_summary(abc, param = "gf_BC")
#> gf_BC
#> Min.: -0.04681798
#> Weighted 2.5 % Perc.: 0.02469992
#> Weighted Median: 0.09867650
#> Weighted Mean: 0.10282631
#> Weighted Mode: 0.09804401
#> Weighted 97.5 % Perc.: 0.18467366
#> Max.: 0.21801675

Of course, we can also visualize the posterior distributions. Rather than plotting many different distributions at once, let’s first check out the posterior distributions of the inferred Ne values:
plot_posterior(abc, param = "Ne")
Similarly, we can take a look at the inferred posteriors of the split times:
plot_posterior(abc, param = "T")
And, finally, the rate of gene flow:
plot_posterior(abc, param = "gf") + ggplot2::coord_cartesian(xlim = c(0, 1))
Additionally, we have the full diagnostic functionality of the abc R package at our disposal:
plot(abc, param = "Ne_C")
Many diagnostic and model selection functions implemented by abc are also supported by demografr. For more information, see this vignette.
Additional functionality
demografr also provides a couple of functions designed to make development (and troubleshooting) a little easier.
For instance, assuming we have priors set up as above, we can visualize the prior distribution(s) like this:
plot_prior(priors, "Ne")
To make developing complete pipelines more efficient, demografr also provides means to test and evaluate their individual components even further. For instance, the function simulate_model() simulates data from a single simulation run:
ts <- simulate_model(model, priors, sequence_length = 1e6, recombination_rate = 1e-8, mutation_rate = 1e-8)
ts
#> ╔═══════════════════════════╗
#> ║TreeSequence ║
#> ╠═══════════════╤═══════════╣
#> ║Trees │ 1,751║
#> ╟───────────────┼───────────╢
#> ║Sequence Length│ 1,000,000║
#> ╟───────────────┼───────────╢
#> ║Time Units │generations║
#> ╟───────────────┼───────────╢
#> ║Sample Nodes │ 200║
#> ╟───────────────┼───────────╢
#> ║Total Size │ 492.1 KiB║
#> ╚═══════════════╧═══════════╝
#> ╔═══════════╤═════╤═════════╤════════════╗
#> ║Table │Rows │Size │Has Metadata║
#> ╠═══════════╪═════╪═════════╪════════════╣
#> ║Edges │7,718│241.2 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Individuals│ 100│ 2.8 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Migrations │ 0│ 8 Bytes│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Mutations │2,020│ 73.0 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Nodes │2,219│ 60.7 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Populations│ 4│331 Bytes│ Yes║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Provenances│ 2│ 3.3 KiB│ No║
#> ╟───────────┼─────┼─────────┼────────────╢
#> ║Sites │2,020│ 49.3 KiB│ No║
#> ╚═══════════╧═════╧═════════╧════════════╝

With this one simulation data instance ts (in this case, a tree-sequence object, of which the above-mentioned function simulate_abc() produces thousands, making troubleshooting inside a full ABC run challenging and slow), we can apply individual summary statistic functions using another helper function summarise_data():
summarise_data(ts, functions)
#> $diversity
#> # A tibble: 4 × 2
#> set diversity
#> <chr> <dbl>
#> 1 A 0.0000696
#> 2 B 0.0000563
#> 3 C 0.000169
#> 4 D 0.000154
#>
#> $divergence
#> # A tibble: 6 × 3
#> x y divergence
#> <chr> <chr> <dbl>
#> 1 A B 0.000268
#> 2 A C 0.000275
#> 3 A D 0.000284
#> 4 B C 0.000193
#> 5 B D 0.000209
#> 6 C D 0.000188
#>
#> $f4
#> # A tibble: 1 × 5
#> W X Y Z f4
#> <chr> <chr> <chr> <chr> <dbl>
#> 1 A B C D -0.00000352

By comparing the format of this result to the observed data (given in the list observed at the beginning of this page), we can make sure that the simulated and observed summary statistics are mutually compatible, and can thus be directly compared in later steps of the ABC workflow.
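One quick way to perform such a format comparison programmatically (a purely illustrative snippet using only base-R functions, assuming the ts, functions, and observed objects defined above):

```r
simulated <- summarise_data(ts, functions)

# the simulated summaries must carry the same names as the observed list...
stopifnot(identical(sort(names(simulated)), sort(names(observed))))

# ...and each corresponding table must have the same columns
stopifnot(identical(colnames(simulated$diversity),
                    colnames(observed$diversity)))
```

If either check fails, the summary functions and the observed data frames are named or structured inconsistently, which validate_abc() would also flag.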
See the reference for a complete overview of the functionality of this package.
