Computational Population Genetics and Statistical Inference in R
Learning the fundamentals of population genetics through computer simulations
Introduction
This workbook contains materials for a work-in-progress course on the fundamentals of data science, computational population genomics and statistical inference in R, with a strong focus on good practices of reproducible research.
The final product (to be developed over the course of 2025-2026) will be a set of interlinked worksheets with exercises. Students will be able to work through them during practical sessions in the classroom or in self-study, and the materials are also designed to serve as a practical reference during research work.
The git repository containing the sources of all materials for the entire course is available on GitHub at https://github.com/bodkan/simgen.
The intended audience is novice researchers who have just started (or are about to start) their careers in population genomics and evolutionary genomics, primarily master's students or doctoral students in the early stages of their PhD journey. However, the materials have also been used in successful training workshops for researchers who have extensive experience with population genetics but are not up to date on the latest practices in modern reproducible data science or on some of the novel inference tools in the R ecosystem.
Planned outline
Status of chapters:
- ✅: Chapter is finished and part of an existing course (link available on the left)
- 🚧: Section under construction (a link to a draft may be available on the left)
- ❌: Completely missing materials
- ✅ Introduction to R
  - ✅ Basic data types and container types
  - ✅ Data frames, functions, iteration
  - ✅ Minimum background on data manipulation using base R
  - ✅ Minimum background on plotting data with base R
- 🚧 Reproducible computing in R
  - ✅ What makes a good project structure
  - ✅ Transforming disorganized scripts into pipelines
  - ✅ Reproducible reports and presentations with Quarto
  - 🚧 Building command-line interfaces for scripts
- ✅ tidyverse
  - Filtering, subsetting, modifying, and manipulation of data
  - Group-based operations and summary statistics
  - Reading and saving data
  - Data visualization using ggplot2
- 🚧 Working with spatial data
  - 🚧 sf spatial data format
  - 🚧 Plotting spatial data with sf and ggplot2
- ✅ Simulations with slendr
  - ✅ Introduction to the slendr R package
  - ✅ Building demographic models with slendr
  - ✅ Simulating genomic data (tree sequences)
- 🚧 Fundamentals of population genetics
  - ✅ Computing tree sequence summary statistics
    - ✅ diversity, divergence, AFS
    - ✅ \(f\)-statistics, \(f_4\)-ratio statistics
    - ✅ \(F_{st}\), etc.
  - 🚧 PCA
  - 🚧 Identity-by-descent (IBD)
  - 🚧 Ancestry tracts / chromosome painting
  - 🚧 Admixture dating
- 🚧 Natural selection with slendr
  - ❌ Natural selection theory
  - 🚧 Simple one-locus simulation
  - ❌ Selection tests and summary statistics
- 🚧 Simulation-based inference with slendr and demografr
  - ✅ Toy grid-based inference of \(N_e\) with AFS
  - 🚧 Grid-based inference (\(f_4\) and \(f_4\)-ratio), sourced from demografr
  - 🚧 Grid-based admixture tract dating, based on a slendr tutorial
  - 🚧 Approximate Bayesian Computation (ABC), sourced from demografr
- 🚧 Workhorses of applied population genetics
  - 🚧 MDS / PCA: extending the PCA chapters (part 1 and part 2) to real-world scenarios
  - 🚧 ADMIXTOOLS (\(f\)-statistics, qpAdm): extending the \(f\)-statistics chapter to other real-world scenarios
  - ❌ ADMIXTURE / STRUCTURE: focusing on the non-identifiability problem
  - ❌ IBD
  - ❌ Selection scans and summary statistics
All content is available under the CC BY-SA 4.0 license.
