Simulation-Based Population Genomics

A course on fundamentals of population genomics and statistical inference with R

Author

Martin Petr

Published

September 5, 2025

Introduction

This workbook contains materials for a work-in-progress course on the fundamentals of population genomics and statistical inference in R, with a strong focus on good practices of reproducible research. The final product (to be developed over the course of 2025-2026) will be a set of interlinked tutorials and exercises that students can work through during an interactive classroom course and later use as a reference in their own projects.

The git repository containing the sources of all materials for the entire course is available on GitHub at https://github.com/bodkan/simgen.

The intended audience is novice researchers who have just started (or are about to start) their careers in population genomics and evolutionary genomics, primarily senior master's students or doctoral students in the early stages of their PhD. That said, the more advanced later parts of the book, which focus on simulation-based inference of demography and selection, will also benefit more seasoned researchers who are looking for more efficient ways to fit models using novel inference tools in the R ecosystem.


The work-in-progress rendering of the book is available at https://bodkan.github.io/simgen.


Currently planned outline

A draft of a subset of the planned content is available in the menu on the left. However, there are still many parts missing, even in the chapters already present. That said, here’s an overview of some of the things the final course will include:

  • R
    • Introduction to R
      • Basic data types and container types
      • Data frames, functions, iteration
      • The absolute minimum on manipulating and plotting data with base R
    • Reproducible computing in R
      • What makes a good project structure
      • What is algorithmic thinking?
      • Creating self-contained R command-line scripts
      • Reproducible reports and presentations with Quarto
    • Data science with tidyverse
      • Filtering, subsetting, modifying, and manipulating tabular data
      • Data visualization with ggplot2
      • Basics of spatial data science using sf
    • Computational genomics with R
      • GenomicRanges and friends
  • slendr
    • Introduction to the slendr R package
    • Building traditional demographic models with slendr
    • Simulating genomic data
      • What is a tree sequence?
      • VCF files, EIGENSTRAT file format, genotype tables
  • Fundamentals of population genetics with slendr
    • Computing tree sequence summary statistics
    • Diversity, divergence, AFS
    • \(f\)-statistics, \(f_4\)-ratio statistics
    • \(F_{st}\)
    • PCA
    • Identity-by-descent (IBD)
    • Ancestry tracts / chromosome painting
    • Admixture dating
  • Natural selection with slendr
    • Natural selection theory
    • Simple one-locus simulation
    • Useful selection summary statistics
    • More complex epistatic selection
  • Simulation-based inference with demografr
    • Toy grid-based inference of \(N_e\) with AFS
    • Grid-based inference with demografr (\(f_4\) and \(f_4\)-ratio)
    • Grid-based admixture tract dating
    • Approximate Bayesian Computation (ABC)
  • Introducing the workhorses of applied population genetics
    • MDS / PCA
    • ADMIXTOOLS - \(f\)-statistics, qpAdm
    • ADMIXTURE / STRUCTURE
    • IBD
    • Selection scans

All content is available under the CC BY-SA 4.0 license.
