\(N_e\) inference with AFS

So far we’ve learned how slendr provides an easy way to define demographic models in R and simulate (even very large!) tree sequences from them. This allows us to quickly verify our intuition about some popgen problem (things like “Hmmm, I wonder what an \(f_4\) statistic would look like if my model included this particular gene-flow event?”) in just a few lines of R. There have been instances in which we’ve been able to answer questions like this directly in a meeting, pretty much on the spot! This makes slendr a very powerful “popgen calculator”.

Now let’s take things one step further. Imagine you gathered some empirical data, like an allele frequency spectrum (AFS) from a population that you study. That data was, in the real world, produced by some (hidden) biological process (demographic history) that we want to learn about. For instance, the population we study had some \(N_e\), which we don’t know the value of (the only thing we have is the observed AFS) but we want to infer that value.

Simulations can be a great tool for estimating the most likely value of such an unknown parameter. In this particular toy example, we can simulate a large number of AFS vectors (each resulting from a different assumed \(N_e\) value) and then pick the \(N_e\) value (or values) which produced a simulated AFS closest to the observed AFS.

This is exactly what you’ll be doing just now in Exercise 3.

Part 1: A self-contained slendr function of \(N_e \rightarrow \textrm{AFS}\)

In a new script afs.R, write a custom R function called simulate_afs(), which will take Ne as its only parameter. Use this function to compute (and return) AFS vectors for a couple of Ne values of your choosing, staying between Ne = 1000 and Ne = 30000. Plot those AFS vectors and observe how (and why?) they differ depending on the Ne value you used in each respective simulation.

Hint: The function should create a one-population forward-time model (our population starting at time = 1, with simulation_length = 100000 and generation_time = 1 in compile_model()), simulate a 10 Mb tree sequence using msprime() (recombination rate 1e-8), overlay neutral mutations on it (mutation_rate = 1e-8), compute the AFS for 10 samples, and return that AFS vector as the result of this custom function.

Hint: If you’ve never programmed before, the concept of a “custom function” might be very alien to you. Again, if you need help, feel free to start building your afs.R solution from this “template” (just fill in the missing bits of slendr code that you should already be familiar with):

library(slendr)
init_env()

simulate_afs <- function(Ne) {
  # In here you should write code which will:
  #   1. create one population with a given Ne (provided as a function argument)
  #   2. compile a model using `simulation_length =` and `generation_time =`
  #   3. simulate a tree sequence
  #   4. select names of 10 samples (doesn't matter which: "pop_1", "pop_2", ...)
  #   5. compute AFS vector from those 10 individuals using `ts_afs()`
  
  # `result` is a variable with your 10-sample AFS vector (we remove the
  # first element because it's not meaningful for our example)
  return(result[-1]) 
}

afs_1 <- simulate_afs(Ne = 1000) # simulate AFS from a Ne = 1000 model...
plot(afs_1, type = "o")          # ... and plot it

Note: Remember that you should drop the first element of the AFS vector produced by ts_afs() (for instance with something like result[-1], if result contains the output of ts_afs()) for technical reasons related to tskit. You don’t have to worry about that here, but you can read this for more detail.
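If the negative indexing itself is new to you, this is all that result[-1] does (x below is just a made-up example vector, nothing to do with slendr):

x <- c(5, 10, 15)
x[-1]  # a negative index drops that element, leaving 10 and 15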

Hint: If the above still doesn’t make any sense to you, feel free to copy-paste the function from the solution below into your script and work with that function instead!

When used in R, your custom function should work like this (the simulation is stochastic, so your numbers will be different, of course):

# This gives us a vector of singletons, doubletons, etc., etc., all the way
# to the number of fixed mutations in our sample of 10 individuals
simulate_afs(Ne = 1000)
 [1] 397 144 122 102  72  81  55  42  50  33  45  36  24  28  25   7  20  23  21
[20]  13

A function can be understood as an independent unit of a computer program which executes the block of code inside the {…} brackets, given some values of its parameters. In our example, we programmed a function simulate_afs() which accepts a single parameter, Ne.
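If the idea of writing your own function is still new, here is a tiny, made-up example unrelated to slendr (the name square_plus_one() and its argument x are purely hypothetical):

square_plus_one <- function(x) {
  result <- x^2 + 1  # compute something from the argument...
  result             # ...the last expression becomes the function's return value
}

square_plus_one(3)  # returns 10

Our simulate_afs() works exactly the same way, just with more slendr code inside the {…} brackets: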

simulate_afs <- function(Ne) {
  # create a slendr model with a single population of size N = Ne
  pop <- population("pop", N = Ne, time = 1)
  model <- compile_model(pop, generation_time = 1, simulation_length = 100000)

  # simulate a tree sequence
  ts <-
    msprime(model, sequence_length = 10e6, recombination_rate = 1e-8) %>%
    ts_mutate(mutation_rate = 1e-8)

  # get a random sample of names of 10 individuals
  samples <- ts_names(ts) %>% sample(10)

  # compute the AFS vector (dropping the 0-th element added by tskit)
  afs <- ts_afs(ts, sample_sets = list(samples))[-1]

  afs
}

Our function is supposed to produce an AFS vector of counts of alleles observed at each frequency in the population sample.

Let’s use our custom function to simulate AFS vectors for Ne = 1k, 10k, and 30k:

afs_1k <- simulate_afs(1000)
afs_10k <- simulate_afs(10000)
afs_30k <- simulate_afs(30000)

Here’s one of those vectors. We can see that the function does, indeed, produce a result of the correct format:

afs_1k
 [1] 418 178 129  67  91  67  76  61  42  48  31  44  40  26  25  16  36  21  29
[20]  17

To see the results of this function in a clearer context, let’s visualize the three vectors in the same plot. Notice that a larger Ne produces deeper genealogies with more mutations overall, which is why the three spectra end up on such different scales:

plot(afs_30k, type = "o", main = "AFS for different values of Ne", col = "cyan")
lines(afs_10k, type = "o", col = "purple")
lines(afs_1k, type = "o", col = "blue")
legend("topright", legend = c("Ne = 1k", "Ne = 10k", "Ne = 30k"),
       fill = c("blue", "purple", "cyan"))

Part 2: Estimating unknown \(N_e\) from empirical AFS

Imagine you sequenced 10 samples from a population and computed the following AFS vector (which contains, sequentially, the number of singletons, doubletons, etc., in your sample):

afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
                  264, 218,  133, 173, 159, 142, 167, 129, 125, 143)

You know (maybe from some fossil evidence) that the population probably had a constant \(N_e\) somewhere between 1000 and 30000 for the past 100,000 generations, and had mutation and recombination rates of 1e-8 (i.e., parameters already implemented by your simulate_afs() function – how convenient!).

Use slendr simulations to guess the true (and hidden!) \(N_e\) given the observed AFS: run simulations for a range of \(N_e\) values and find out which \(N_e\) produces the AFS vector closest to the afs_observed vector above, using one of the following two approaches.

  • Option 1 [easy]: Plot AFS vectors for various \(N_e\) values (i.e. simulate several of them using your function simulate_afs()), then eyeball which one looks closest to the observed AFS based on the figures alone. (This is, of course, not how proper statistical inference is done, but it will be good enough for this exercise!)

  • Option 2 [hard]: Simulate AFS vectors across a grid of possible \(N_e\) values (maybe with lapply()?), and find the \(N_e\) which gives the AFS closest to the observed AFS based on the mean squared error.

Let’s start with Option 1, the visual comparison. This is the observed AFS with which we want to compare our simulated AFS vectors:

afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
                  264, 218,  133, 173, 159, 142, 167, 129, 125, 143)

We know that the true \(N_e\) is supposed to be between 1000 and 30000, so let’s simulate a bunch of AFS vectors for different \(N_e\) values using our new AFS simulation function:

afs_Ne1k <- simulate_afs(Ne = 1000)
afs_Ne5k <- simulate_afs(Ne = 5000)
afs_Ne6k <- simulate_afs(Ne = 6000)
afs_Ne10k <- simulate_afs(Ne = 10000)
afs_Ne20k <- simulate_afs(Ne = 20000)
afs_Ne30k <- simulate_afs(Ne = 30000)

Now let’s plot our simulated AFS vectors together with the observed AFS (highlighting it in black):

plot(afs_observed, type = "b", col = "black", lwd = 3,
     xlab = "allele count bin", ylab = "count", ylim = c(0, 13000))
lines(afs_Ne1k, lwd = 2, col = "blue")
lines(afs_Ne5k, lwd = 2, col = "green")
lines(afs_Ne6k, lwd = 2, col = "pink")
lines(afs_Ne10k, lwd = 2, col = "purple")
lines(afs_Ne20k, lwd = 2, col = "orange")
lines(afs_Ne30k, lwd = 2, col = "cyan")
legend("topright",
       legend = c("observed AFS", "Ne = 1000", "Ne = 5000",
                  "Ne = 6000", "Ne = 10000", "Ne = 20000", "Ne = 30000"),
       fill = c("black", "blue", "green", "pink", "purple", "orange", "cyan"))

The true \(N_e\) was 6543!

Now let’s tackle Option 2, the grid-search approach. Again, this is the observed AFS with which we want to compare our simulated AFS vectors:

afs_observed <- c(2520, 1449, 855, 622, 530, 446, 365, 334, 349, 244,
                  264, 218,  133, 173, 159, 142, 167, 129, 125, 143)

We know that the true \(N_e\) is supposed to be between 1000 and 30000. Let’s generate regularly spaced candidate \(N_e\) values whose AFS we want to simulate and compare to the observed AFS (our parameter grid):

Ne_grid <- seq(from = 1000, to = 30000, by = 500)
Ne_grid
 [1]  1000  1500  2000  2500  3000  3500  4000  4500  5000  5500  6000  6500
[13]  7000  7500  8000  8500  9000  9500 10000 10500 11000 11500 12000 12500
[25] 13000 13500 14000 14500 15000 15500 16000 16500 17000 17500 18000 18500
[37] 19000 19500 20000 20500 21000 21500 22000 22500 23000 23500 24000 24500
[49] 25000 25500 26000 26500 27000 27500 28000 28500 29000 29500 30000

With the parameter grid Ne_grid set up, let’s simulate an AFS from each \(N_e\) model:

library(parallel)

afs_grid <- mclapply(Ne_grid, simulate_afs, mc.cores = detectCores())
names(afs_grid) <- Ne_grid

# show the first five simulated AFS vectors, for brevity, just to demonstrate
# what the output of the grid simulations is supposed to look like
afs_grid[1:5]
$`1000`
 [1] 429 197 114  72  78  78  84  58  40  37  32  44  30  27  32   8  12  19  31
[20]  32

$`1500`
 [1] 554 261 273 150 115  83  83  78  71  62  62  75  30  42  45  57  21  22  22
[20]  35

$`2000`
 [1] 828 447 291 192 164 145 113  99  97  72  55  59  59  50  55  56  48  49  46
[20]  17

$`2500`
 [1] 938 516 315 230 251 167 137 118 104 120  98 100 106  66  82  56  56  74  50
[20]  47

$`3000`
 [1] 1184  486  443  300  234  228  186  146  161  119  124  109  131  111   80
[16]   68   73   47   57   42
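Note: mclapply() parallelizes by forking the R process, which is not available on Windows. If the parallel call gives you trouble (or you simply want to keep things simple), a plain serial lapply() produces the same kind of result, just more slowly:

# serial alternative to the mclapply() call above -- slower, but works everywhere
afs_grid <- lapply(Ne_grid, simulate_afs)
names(afs_grid) <- Ne_grid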

Plot the observed AFS and overlay the simulated AFS vectors on top of it:

plot(afs_observed, type = "b", col = "black", lwd = 3, xlab = "allele count bin", ylab = "count")
for (i in seq_along(Ne_grid)) {
  lines(afs_grid[[i]], lwd = 0.5, col = "gray")
}
legend("topright", legend = c("observed AFS", "simulated AFS"), fill = c("black", "gray"))

Next, let’s compute the mean squared error (MSE) between the AFS produced by each \(N_e\) value across the grid and the observed AFS.
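For a simulated AFS vector \(s\) and the observed AFS \(o\), both of length \(n = 20\) here, this error is

\[ \mathrm{MSE}(s, o) = \frac{1}{n} \sum_{i=1}^{n} (s_i - o_i)^2, \]

which is exactly what the sapply() call below computes for each simulated spectrum in afs_grid: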

errors <- sapply(afs_grid, function(sim_afs) {
  sum((sim_afs - afs_observed)^2) / length(sim_afs)
})

plot(Ne_grid, errors, ylab = "error")
abline(v = Ne_grid[which.min(errors)], col = "red")
legend("topright", legend = paste("minimum error Ne =", Ne_grid[which.min(errors)]), fill = "red")

Plot the AFS again, but this time highlight the best-fitting spectrum (i.e. the one which gave the lowest MSE value):

plot(afs_observed, type = "b", col = "black", lwd = 3, xlab = "allele count bin", ylab = "count")
for (i in seq_along(Ne_grid)) {
  color <- if (i == which.min(errors)) "red" else "gray"
  width <- if (i == which.min(errors)) 2 else 0.75
  lines(afs_grid[[i]], lwd = width, col = color)
}
legend("topright", legend = c("observed AFS", paste("best fitting Ne =", Ne_grid[which.min(errors)])),
       fill = c("black", "red"))

The true \(N_e\) was 6543!

Congratulations, you now know how to infer parameters of evolutionary models using simulations! What you just did is very similar to how simulation-based inference is done in practice (even with methods such as Approximate Bayesian Computation, ABC). Hopefully you now also see how easy slendr makes this (normally rather laborious) process.

This kind of approach can be used to infer all sorts of demographic parameters, using any of the other summary statistics you’ve learned to compute… even selection parameters, which we delve into in another exercise.