vignettes/vignette-04-parallelization.Rmd
vignette-04-parallelization.Rmd
⚠️⚠️⚠️
Note: The demografr R package is still under active development. As a result, its documentation is in a draft stage at best. Typos, inconsistencies, and other issues are unfortunately expected.
⚠️⚠️⚠️
One of the main selling points of demografr is computational efficiency. Although R isn’t exactly known for computational speed, by utilizing msprime (and, optionally, SLiM) for simulation and the tskit tree-sequence module for population genetic computation without having to rely on saving files to disk, demografr is surprisingly fast and efficient for an R package.
Still, optimized means of simulation and computation can only get us so far. Especially for computationally very heavy analyses such an ABC, parallelization is absolutely crucial. To this end, demografr provides two ways of parallelizing analyses. One is more sophisticated and easier to do, but it does require a little bit of setup. Another is more straightforward but a little less convenient.
In this vignette, we will briefly describe both methods, using the
simulate_grid()
function for simplicity and this trivial
population genetic model:
model <- function(Ne_p1, Ne_p2) {
p1 <- population("p1", time = 10000, N = Ne_p1)
p2 <- population("p2", time = 5000, N = Ne_p2, parent = p1)
model <- compile_model(populations = list(p1, p2), generation_time = 1)
return(model)
}
Let’s say we want to explore the levels of genetic diversity in both populations across the following parameter grid (note that this grid is extremely small in order to make this example easy and quick to run):
#> # A tibble: 9 × 2
#> Ne_p1 Ne_p2
#> <dbl> <dbl>
#> 1 10 10
#> 2 10 100
#> 3 10 1000
#> 4 100 10
#> 5 100 100
#> 6 100 1000
#> 7 1000 10
#> 8 1000 100
#> 9 1000 1000
Here is our summary function:
functions <- list(
diversity = function(ts) {
ts_diversity(ts, sample_sets = ts_names(ts, split = "pop"))
}
)
Just to verify we have set up the components of our analysis correctly, let’s quickly simulate a tree sequence and try to compute genetic diversity in both populations using our summary function:
ts <- simulate_ts(model, parameters = list(Ne_p1 = 100, Ne_p2 = 1000), mutation_rate = 1e-8)
functions$diversity(ts)
#> # A tibble: 2 × 2
#> set diversity
#> <chr> <dbl>
#> 1 p1 0.00000479
#> 2 p2 0.0000319
Everything works!
Now, how can we parallelize the computation of diversity across all parameters of the grid?
The simplest way to parallelize demografr analyses is via the R package future.
The magic of future lies in its ability to set up parallelized schemes of arbitrary topologies by automatic scheduling of jobs across a given number of CPUs distributed even across several remote compute servers.
Going into detail in setting up future-based parallelization would be too far out of scope of this simple vignette (or demografr’s documentation in general). Luckily, once future parallelization is set up, it doesn’t matter whether the computation is done by demografr or anything else, as the future architecture is completely general.
In order to use future effectively, you should study its documentation and tutorials (a good start would be the vignette here), but here is a brief introduction focused on parallelizing a toy demografr analysis set up above:
By default, if we would run simulate_grid()
, all
simulations would be run sequentially:
data <- simulate_grid(model, grid, functions, replicates = 10, sequence_length = 10000, recombination_rate = 1e-8)
data
#> # A tibble: 90 × 4
#> rep Ne_p1 Ne_p2 diversity
#> <int> <dbl> <dbl> <list>
#> 1 1 10 10 <tibble [2 × 2]>
#> 2 1 10 100 <tibble [2 × 2]>
#> 3 1 10 1000 <tibble [2 × 2]>
#> 4 1 100 10 <tibble [2 × 2]>
#> 5 1 100 100 <tibble [2 × 2]>
#> 6 1 100 1000 <tibble [2 × 2]>
#> 7 1 1000 10 <tibble [2 × 2]>
#> 8 1 1000 100 <tibble [2 × 2]>
#> 9 1 1000 1000 <tibble [2 × 2]>
#> 10 2 10 10 <tibble [2 × 2]>
#> # ℹ 80 more rows
The total runtime for the grid simulations was 0 hours 2 minutes 35 seconds parallelized across 96 CPUs.
In order to distribute simulations across all (or some) available CPU
cores of your machine, you only need to put the following bit of R code
before your call to simulate_grid()
. For instance,
on my personal laptop which has 10 CPUs, I can run the above analysis
using all available CPUs like this:
First we setup future:
Then we can run the simulations without having to do anything
else! This last point is important: demografr can
automatically leverage whatever parallelization scheme you set up via
future’s function plan()
.
data <- simulate_grid(model, grid, functions, replicates = 10, sequence_length = 100000, recombination_rate = 1e-8)
data
#> # A tibble: 90 × 4
#> rep Ne_p1 Ne_p2 diversity
#> <int> <dbl> <dbl> <list>
#> 1 1 10 10 <tibble [2 × 2]>
#> 2 1 10 100 <tibble [2 × 2]>
#> 3 1 10 1000 <tibble [2 × 2]>
#> 4 1 100 10 <tibble [2 × 2]>
#> 5 1 100 100 <tibble [2 × 2]>
#> 6 1 100 1000 <tibble [2 × 2]>
#> 7 1 1000 10 <tibble [2 × 2]>
#> 8 1 1000 100 <tibble [2 × 2]>
#> 9 1 1000 1000 <tibble [2 × 2]>
#> 10 2 10 10 <tibble [2 × 2]>
#> # ℹ 80 more rows
The total runtime for the grid simulations was 0 hours 0 minutes 42 seconds parallelized across 96 CPUs.
Notice the speed up even for this trivial toy analysis!
If you read the documentation of the future package (which you really should), you will discover many more options for parallelization giving you near limitless options for scheduling parallel simulations.
For instance, imagine you have access to remote machines
server01
, server02
, server03
, all
with available R as well as demografr and slendr R
packages, with set up slendr environments. In other words, you
can start an R interpreter on each of these machines and eventually
call:
#> The interface to all required Python modules has been activated.
Then, you can distribute the simulate_grid()
run (but
also simulate_abc()
, of course!) simply by adjusting the
plan()
setup like this:
We won’t be running this example here in practice because the “server01”, etc., don’t really exist, but you should get an idea.
In short, the future R package (and it’s support provided by
demografr) makes it very easy to set up arbitrary
parallelization schemes without doing additional work beyond setting up
the parallelization plan()
.
By default, if you don’t set up any plan()
at all, the
sequential simulation runs of demografr are implicitly
organized by:
plan(sequential)
But this is just for completeness. It’s not something that makes much sense to do in practice, perhaps beyond debugging parallelization issues.
In personal experience, it can be a bit of a hassle to distribute jobs across different machines, despite the conveniences of the future R package. If you’re having such problems, I first highly encourage you to read the basic overview vignette and look up troubleshooting advice on Google.
Still, demografr provides rudimentary support for “manual” paralelization across multiple machines (each of those machines parallelizing on multiple CPUs), followed by merging simulation runs into a single data set.
To demonstrate this, let’s pretend we ran five simulation grid runs across two different computers like this (we were originally interested in distributing 10 replicates, so we’ll run two times five replicates individually):
library(future)
plan(multisession, workers = 5)
run1_file <- tempfile(fileext = ".rds")
run2_file <- tempfile(fileext = ".rds")
# on machine #1
simulate_grid(model, grid, functions, replicates = 5, sequence_length = 100000, recombination_rate = 1e-8, file = run1_file)
# on machine #2
simulate_grid(model, grid, functions, replicates = 5, sequence_length = 100000, recombination_rate = 1e-8, file = run2_file)
Then, once we collected the individual .rds
files on a
single computer from the remote machines (perhaps using something like
the scp
unix command), we can use another
demografr function combine_data()
to merge all
individual runs into a single result:
data <- combine_data(run1_file, run2_file)
data
#> # A tibble: 90 × 4
#> rep Ne_p1 Ne_p2 diversity
#> <int> <dbl> <dbl> <list>
#> 1 1 10 10 <tibble [2 × 2]>
#> 2 1 10 100 <tibble [2 × 2]>
#> 3 1 10 1000 <tibble [2 × 2]>
#> 4 1 100 10 <tibble [2 × 2]>
#> 5 1 100 100 <tibble [2 × 2]>
#> 6 1 100 1000 <tibble [2 × 2]>
#> 7 1 1000 10 <tibble [2 × 2]>
#> 8 1 1000 100 <tibble [2 × 2]>
#> 9 1 1000 1000 <tibble [2 × 2]>
#> 10 2 10 10 <tibble [2 × 2]>
#> # ℹ 80 more rows
For each combination of parameters we ran five replicates across “two different machines”, totalling 9 x 5 x 2 = 90 simulations, so we should get the same number of rows in the merged data frame:
nrow(data)
#> [1] 90
One thing to note that simulate_grid()
as well as
simulate_abc()
really do save standard RDS files as one
normally would using function saveRDS()
(in fact, they use
this function internally). An additional behefit of using this approach
combined with combine_data()
is that these functions
provide additional sanity checks making sure that the simulation data
sets being merged are consistent (coming from the same model function,
with the same priors or summary statistics, etc.).
For completeness, combine_data()
not only accepts file
paths as individual function arguments but also file names as vectors,
as well as individual data frame objects generated by
simulate_grid()
and simulate_abc()
. In other
words, the following approaches will all result in the same combined
data set:
Merging data from serialized data files:
combined1 <- combine_data(run1_file, run2_file)
combined2 <- combine_data(list(run1_file, run2_file))
Merging data objects themselves directly:
run1 <- simulate_grid(model, grid, functions, replicates = 5, sequence_length = 100000, recombination_rate = 1e-8)
run2 <- simulate_grid(model, grid, functions, replicates = 5, sequence_length = 100000, recombination_rate = 1e-8)
combined3 <- combine_data(run1, run2)
combined4 <- combine_data(list(run1, run2))
All combine_data()
approaches give the same
result:
all.equal(combined1, combined2)
#> [1] TRUE
all.equal(combined3, combined4)
#> [1] TRUE
all.equal(combined2, combined3)
#> [1] TRUE