\(N_e\) inference with AFS

So far we’ve learned how slendr provides an easy way to define demographic models in R and simulate (even very large!) tree sequences from them. This lets us quickly verify our intuition about some popgen problem (things like “Hmmm, I wonder what an \(f_4\) statistic would look like if my model included this particular gene-flow event?”) in just a few lines of R. There have been instances in which we’ve been able to answer questions like this directly in a meeting, pretty much on the spot! This makes slendr a very powerful “popgen calculator”.

Now let’s take things one step further. Imagine you gathered some empirical data, like an allele frequency spectrum (AFS) from a population that you study. That data was, in the real world, produced by some (hidden) biological process (demographic history) that we want to learn about. For instance, the population we study had some \(N_e\) whose value we don’t know (the only thing we have is the observed AFS), yet that value is exactly what we want to infer.

Simulations can be a great tool for estimating the most likely value of such an unknown parameter. Briefly, in this toy example we can simulate a large number of AFS vectors (each under a different assumed \(N_e\) value) and then pick the \(N_e\) value (or values) whose simulated AFS is closest to the observed AFS.
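In code, the idea amounts to a comparison loop like this minimal sketch (the vectors and the helper names obs_toy and sim_toy are made up purely for illustration; you will build the real simulation machinery in the exercise):

# toy "observed" AFS and toy "simulated" AFS vectors, one per candidate Ne
# (the numbers are made up just to illustrate the logic)
obs_toy <- c(10, 6, 3)
sim_toy <- list("1000"  = c(3, 2, 1),
                "10000" = c(9, 5, 3),
                "30000" = c(25, 14, 8))

# distance of each simulated AFS from the observed one (mean squared error)
errors_toy <- sapply(sim_toy, function(afs) mean((afs - obs_toy)^2))

# the candidate Ne whose simulated AFS sits closest to the observed AFS
names(which.min(errors_toy))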

This is exactly what you’ll be doing just now in Exercise 3.

Part 1: A self-contained slendr function of \(N_e \rightarrow \textrm{AFS}\)

In a new script afs.R, write a custom R function called simulate_afs() which takes Ne as its only parameter. Use this function to compute (and return) AFS vectors for a couple of Ne values of your choosing, staying between Ne = 1000 and Ne = 30000. Plot those AFS vectors and observe how (and why?) they differ depending on the Ne value you used in each simulation.

Hint: The function should create a one-population forward-time model (the population starting at time = 1, with simulation_length = 100000 and generation_time = 1 in compile_model()), simulate a 10 Mb tree sequence using msprime() (recombination rate 1e-8), overlay neutral mutations on it (mutation_rate = 1e-8), compute the AFS for 10 samples, and return that AFS vector as the result of the function.

Hint: If you’ve never programmed before, the concept of a “custom function” might be very alien to you. Again, if you need help, feel free to start building your afs.R solution based on this “template” (just fill in missing relevant bits of slendr code that you should be already familiar with):

library(slendr)
init_env()

simulate_afs <- function(Ne) {
  # In here you should write code which will:
  #   1. create one population with a given Ne (provided as a function argument)
  #   2. compile a model using `simulation_length =` and `generation_time =`
  #   3. simulate a tree sequence
  #   4. select names of 10 samples (doesn't matter which, "pop_1", "pop_2", ...)
  #   5. compute AFS vector from those 10 individuals using `ts_afs()`
  
  # `result` is a variable with your 10-sample AFS vector (we remove the
  # first element because it's not meaningful for our example)
  return(result[-1]) 
}

afs_1 <- simulate_afs(Ne = 1000) # simulate an AFS from an Ne = 1000 model...
plot(afs_1, type = "o")          # ... and plot it

Note: Remember that you should drop the first element of the AFS vector produced by ts_afs() (for instance with something like result[-1], if result contains the output of ts_afs()) for technical reasons related to tskit. You don’t have to worry about why here, but you can read this for more detail.
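In case R’s negative indexing is new to you, dropping the first element of a vector is all that result[-1] does:

x <- c(5, 10, 20, 40)
x[-1]   # returns 10 20 40, i.e. everything except the first element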

Hint: If the above still doesn’t make any sense to you, feel free to copy-paste the function from the solution below into your script and work with that function instead!

When used in R, your custom function should work like this (the simulation is stochastic, so your numbers will be different, of course):

# This gives us a vector of singletons, doubletons, etc., etc., all the way
# to the number of fixed mutations in our sample of 10 individuals
simulate_afs(Ne = 1000)
 [1] 378 203 138 111  84  69  41  42  48  33  37  33  22  21  22  41  25  29  15
[20]  11

A function can be understood as an independent unit of a computer program which executes the block of code inside the {…} brackets given some values of its parameters. In our example, we programmed a function simulate_afs() which accepts a single parameter, Ne.
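As a trivial example unrelated to slendr, here is a tiny function add_one() (a made-up name, just for illustration) which accepts a single parameter x and returns its value increased by one:

add_one <- function(x) {
  x + 1   # the value of the last expression is what the function returns
}

add_one(41)   # returns 42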

simulate_afs <- function(Ne) {
  # create a slendr model with a single population of the given size Ne
  pop <- population("pop", N = Ne, time = 1)
  model <- compile_model(pop, generation_time = 1, simulation_length = 100000)

  # simulate a tree sequence
  ts <-
    msprime(model, sequence_length = 10e6, recombination_rate = 1e-8) %>%
    ts_mutate(mutation_rate = 1e-8)

  # get a random sample of names of 10 individuals
  samples <- ts_names(ts) %>% sample(10)

  # compute the AFS vector (dropping the 0-th element added by tskit)
  afs <- ts_afs(ts, sample_sets = list(samples))[-1]

  afs
}

Our function is supposed to produce an AFS vector of counts of alleles observed at each frequency in the population sample.

Let’s use our custom function to simulate AFS vectors for Ne = 1k, 10k, and 30k:

afs_1k <- simulate_afs(1000)
afs_10k <- simulate_afs(10000)
afs_30k <- simulate_afs(30000)

Here’s one of those vectors. We can see that the function does, indeed, produce a result of the correct format:

afs_1k
 [1] 403 200 149 103 112  47  62  41  40  41  53  36  23  15  36  16  21  23  29
[20]  13

To see the results of this function in a clearer context, let’s visualize all three vectors in the same plot. Because the amount of neutral diversity scales with \(N_e\) (the expected number of segregating sites is proportional to \(4 N_e \mu\)), larger \(N_e\) values should give higher counts across the whole spectrum:

plot(afs_30k, type = "o", main = "AFS, Ne = 30000", col = "cyan",)
lines(afs_10k, type = "o", main = "AFS, Ne = 10000", col = "purple")
lines(afs_1k, type = "o", main = "AFS, Ne = 1000", col = "blue")
legend("topright", legend = c("Ne = 1k", "Ne = 10k", "Ne = 30k"),
       fill = c("blue", "purple", "cyan"))

Part 2: Estimating unknown \(N_e\) from empirical AFS

Imagine you sequenced 10 samples from a population and computed the following AFS vector (which contains, sequentially, the number of singletons, doubletons, etc. in your sample):

afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
                  264, 218,  133, 173, 159, 142, 167, 129, 125, 143)

You know (maybe from some fossil evidence) that the population probably had a constant \(N_e\) somewhere between 1000 and 30000 for the past 100,000 generations, and had mutation and recombination rates of 1e-8 (i.e., parameters already implemented by your simulate_afs() function – how convenient!).

Use slendr simulations to guess the true (and hidden!) \(N_e\) given the observed AFS: run simulations for a range of \(N_e\) values and find out which \(N_e\) produces the AFS vector closest to the afs_observed vector above, using one of the following two approaches.

  • Option 1 [easy]: Plot AFS vectors for various \(N_e\) values (i.e. simulate several of them using your function simulate_afs()), then eyeball which looks closest to the observed AFS based on the figures alone. (This is, of course, not how proper statistical inference is done, but it will be good enough for this exercise!)

  • Option 2 [hard]: Simulate AFS vectors across a grid of possible \(N_e\) values (maybe with lapply()?), and find the \(N_e\) which gives the closest AFS to the observed AFS based on the mean squared error (see the short sketch below).
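If you go for the harder option, the mean squared error between a simulated AFS vector and the observed one can be computed with a small helper like this one (mse() is just one possible way to measure how far two vectors are from each other):

# mean squared error between a simulated and an observed AFS vector
mse <- function(simulated, observed) {
  mean((simulated - observed)^2)
}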

Solution to Option 1 [easy]. As a reminder, this is the observed AFS with which we want to compare our simulated AFS vectors:

afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
                  264, 218,  133, 173, 159, 142, 167, 129, 125, 143)

We know that the true \(N_e\) is supposed to be between 1000 and 30000, so let’s simulate a bunch of AFS vectors for different \(N_e\) values using our new AFS simulation function:

afs_Ne1k <- simulate_afs(Ne = 1000)
afs_Ne5k <- simulate_afs(Ne = 5000)
afs_Ne6k <- simulate_afs(Ne = 6000)
afs_Ne10k <- simulate_afs(Ne = 10000)
afs_Ne20k <- simulate_afs(Ne = 20000)
afs_Ne30k <- simulate_afs(Ne = 30000)

Now let’s plot our simulated AFS vectors together with the observed AFS (highlighting it in black):

plot(afs_observed, type = "b", col = "black", lwd = 3,
     xlab = "allele count bin", ylab = "count", ylim = c(0, 13000))
lines(afs_Ne1k, lwd = 2, col = "blue")
lines(afs_Ne5k, lwd = 2, col = "green")
lines(afs_Ne6k, lwd = 2, col = "pink")
lines(afs_Ne10k, lwd = 2, col = "purple")
lines(afs_Ne20k, lwd = 2, col = "orange")
lines(afs_Ne30k, lwd = 2, col = "cyan")
legend("topright",
       legend = c("observed AFS", "Ne = 1000", "Ne = 5000",
                  "Ne = 6000", "Ne = 10000", "Ne = 20000", "Ne = 30000"),
       fill = c("black", "blue", "green", "pink", "purple", "orange", "cyan"))

The true \(N_e\) was 6543!

Solution to Option 2 [hard]. Again, this is the observed AFS with which we want to compare our simulated AFS vectors:

afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
                  264, 218,  133, 173, 159, 142, 167, 129, 125, 143)

We know that the true \(N_e\) is supposed to be between 1000 and 30000. Let’s generate a regularly spaced sequence of candidate \(N_e\) values whose AFS we want to simulate and compare to the observed AFS (our parameter grid):

Ne_grid <- seq(from = 1000, to = 30000, by = 500)
Ne_grid
 [1]  1000  1500  2000  2500  3000  3500  4000  4500  5000  5500  6000  6500
[13]  7000  7500  8000  8500  9000  9500 10000 10500 11000 11500 12000 12500
[25] 13000 13500 14000 14500 15000 15500 16000 16500 17000 17500 18000 18500
[37] 19000 19500 20000 20500 21000 21500 22000 22500 23000 23500 24000 24500
[49] 25000 25500 26000 26500 27000 27500 28000 28500 29000 29500 30000

With the parameter grid Ne_grid set up, let’s simulate an AFS from each \(N_e\) model:

library(parallel)

afs_grid <- mclapply(Ne_grid, simulate_afs, mc.cores = detectCores())
names(afs_grid) <- Ne_grid
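# note: mclapply() parallelizes via forking, which is not available on Windows --
# on such systems a plain (but slower) lapply(Ne_grid, simulate_afs) does the same job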

# show the first five simulated AFS vectors, for brevity, just to demonstrate
# what the output of the grid simulations is supposed to look like
afs_grid[1:5]
$`1000`
 [1] 400 163 156  91  77  54  60  59  44  61  47  24  22  17  28  20  21   9  17
[20]  29

$`1500`
 [1] 677 313 194 138 100 152  85  65  72  68  38  50  47  59  40  32  33  30  32
[20]  37

$`2000`
 [1] 753 383 245 216 149 160 118 106 104  84  85  79  77  61  72  45  41  46  36
[20]  39

$`2500`
 [1] 1107  521  292  242  218  144  130  135  100  104   96  125   82   60   94
[16]   73   47   59   50   41

$`3000`
 [1] 1170  596  363  344  286  221  173  178  112  109   96   91  100   84  105
[16]   89   53   40   41   65

Plot the observed AFS and overlay the simulated AFS vectors on top of it:

plot(afs_observed, type = "b", col = "black", lwd = 3, xlab = "allele count bin", ylab = "count")
for (i in seq_along(Ne_grid)) {
  lines(afs_grid[[i]], lwd = 0.5)
}
legend("topright", legend = c("observed AFS", "simulated AFS"), fill = c("black", "gray"))

Compute the mean squared error between the observed AFS and the AFS produced by each \(N_e\) value across the grid:

errors <- sapply(afs_grid, function(sim_afs) {
  sum((sim_afs - afs_observed)^2) / length(sim_afs)
})

plot(Ne_grid, errors, ylab = "error")
abline(v = Ne_grid[which.min(errors)], col = "red")
legend("topright", legend = paste("minimum error Ne =", Ne_grid[which.min(errors)]), fill = "red")

Plot the AFS again, but this time highlight the best-fitting spectrum (i.e. the one which gave the lowest mean squared error):

plot(afs_observed, type = "b", col = "black", lwd = 3, xlab = "allele count bin", ylab = "count")
for (i in seq_along(Ne_grid)) {
  color <- if (i == which.min(errors)) "red" else "gray"
  width <- if (i == which.min(errors)) 2 else 0.75
  lines(afs_grid[[i]], lwd = width, col = color)
}
legend("topright", legend = c("observed AFS", paste("best fitting Ne =", Ne_grid[which.min(errors)])),
       fill = c("black", "red"))

The true \(N_e\) was 6543!

Congratulations, you now know how to infer parameters of evolutionary models using simulations! What you just did is very similar to how simulation-based inference is done in practice (even with methods such as ABC). Hopefully you can also see how much slendr simplifies this (normally rather laborious) process.

This kind of approach can be used to infer all sorts of demographic parameters using any of the summary statistics you’ve learned to compute, and even selection parameters, which we delve into in another exercise.