Simulation-Based Population Genomics and Inference in R

A course on the fundamentals of population genomics and statistical inference using the R packages slendr and demografr

Author

Martin Petr

Published

August 1, 2025

Preface

This online workbook contains materials for a work-in-progress course on the fundamentals of population genomics and statistical inference, with a strong focus on good practices of reproducible research. The general structure of the final product (to be developed over the course of 2025-2026) will involve a large set of inter-linked tutorials together with worksheets containing practical exercises (and solutions).

The git repository containing the sources of all materials for the entire course is available on GitHub at https://github.com/bodkan/simgen. Ultimately, the workbook will provide resources that will form the basis of a 1-2 week course on population genomics and statistical inference using computer simulations, with a particular focus on the R packages slendr and demografr. While introducing the fundamentals of population genomics, it will also emphasize the most important tools for facilitating reproducible research (such as git and renv) and demonstrate the most useful applications of selected R packages from the tidyverse data science toolkit, along with other R packages useful for computational genomics.

The intended audience is novice researchers who have just started (or are about to start) their careers in population and evolutionary genomics, primarily senior master's students or doctoral students early in their PhD. That said, the more advanced later parts of the book, which focus on simulation-based inference of demography and selection, will also benefit more seasoned researchers looking for more efficient ways to fit models using newly developed inference tools.


The work-in-progress rendering of the book is available at https://bodkan.github.io/simgen.


Currently planned outline

A draft of some of the planned content is available in the menu on the left (most of it adapted from various workshops and practical tutorial sessions). However, many parts are still missing. Here’s an overview of some of the things the final course will include:

  • R
    • Introduction to R
      • Basic data types, vectors, lists, data frames
      • Plotting with built-in base R functions
    • Reproducible computing in R
      • What makes a good project structure
      • Creating self-contained R command-line scripts
      • Using renv and venv for reproducible projects
    • Version control with git and GitHub
    • Basics of data science with tidyverse
      • tibble, dplyr, tidyr, ggplot2
    • Most useful R packages for computational genomics
      • GenomicRanges and friends
  • slendr
    • Introduction to the slendr R package
    • Building traditional demographic models with slendr
    • Simulating genomic data
      • What is a tree sequence?
      • VCF files, EIGENSTRAT file format
  • Fundamentals of population genetics with slendr
    • Computing tree sequence summary statistics
    • diversity, divergence, AFS
    • \(f\)-statistics, \(f_4\)-ratio statistics
    • \(F_{st}\)
    • PCA
    • Identity-by-descent (IBD)
    • Ancestry tracts / chromosome painting
    • Admixture dating
  • Natural selection with slendr
    • Natural selection theory
    • Simple one-locus simulation
    • Useful selection summary statistics
    • More complex epistatic selection
  • Simulation-based inference with demografr
    • Toy grid-based inference of \(N_e\) with AFS
    • Grid-based inference with demografr (\(f_4\) and \(f_4\)-ratio)
    • Grid-based admixture tract dating
    • Approximate Bayesian Computation (ABC)
    • Inference of selection using simulations
  • Introducing the workhorses of applied population genetics
    • MDS / PCA
    • ADMIXTOOLS - \(f\)-statistics, qpAdm
    • ADMIXTURE / STRUCTURE
    • IBD
  • Spatio-temporal demographic models
    • Spatial R packages
    • Simulations of spatio-temporal population genetic data
    • Visualisation of IBD networks in space

All content is available under the CC BY-SA 4.0 license.
