Structuring (R) projects

Introduction

You’ve done a huge amount of programming already. Processing data, filtering data, annotating data, summarizing data, but also plotting data using the most powerful tool for scientific visualization, ggplot2! I said it before but it’s worth repeating—you have now learned 70% of all the tools you might need to do data science on a day-to-day basis using R. This is not an exaggeration.

Sure, you might not remember everything, but remembering comes from repetition and practice. You’ve been exposed to the most important concepts, and you know where to come back for more information when needed later.


In this session on reproducibility, we will take things a bit easier than in the earlier, programming-heavy sessions.

Instead of focusing on specific programming techniques and R functions, we’ll go through practical guidelines on how to organize (and run) computational projects—taking what you’ve learned and applying it in practice.

In this session, your task will be to take bits of code from previous exercises on IBD and other data (currently scattered over multiple separate scripts and files, or maybe even nowhere!) and learn how to organize them in a structure that maximizes the reproducibility and effectiveness of your computational research.

Exercise 1: Creating a formal R project

So far, you’ve worked in an “anonymous” RStudio session, with scripts saved somewhere on your filesystem, disconnected from the data, or with ad hoc interactions with an R console (i.e., with commands not saved anywhere). This is not how reproducible research should be done, and it isn’t sustainable over weeks or months of working on a project.

To move forward, set up a formal R project and a proper project structure by following these steps:

  1. Click File -> New Project...

  2. In the new window, click on New Directory -> New Project

  3. Under "Directory name" type in "simgen-project" (the name of our course, and the directory where all project files will be stored). Pick where you would like to save that directory (this doesn’t matter, just put the project somewhere you can later find it). Check the "Open in new session" box, leave the other options related to “git” and “renv” unchecked.

  4. Then click on the “Create Project” button.

This will create a new RStudio window. Your original RStudio window (where you worked so far) is still open. The task for the following exercises will be to convert the (probably disorganized) bits of code into a “proper reproducible R project”.

Notice the new simgen-project.Rproj file that is currently the only file in your project directory! We will come back to it soon.

Exercise 2: Setting up a project file structure

What makes for a good computational project structure? There are endless possibilities, but some good guidelines are:

  1. Be consistent – put files that “belong together” in the same place (i.e., in the same directory). For example, an Excel table shouldn’t be saved in a directory with scripts, a PDF with a plot shouldn’t be in a directory with tables.

  2. Be predictable – even a person unfamiliar with your project should be able to guess what is where just by looking at your folders. Remember, when you publish your paper or your thesis, you will have to provide all your data and scripts as supplementary materials, so others should be able to understand all of them!

  3. Document everything – write a lot of # comments in code, describing what different pieces of code do, and why you wrote things in a certain way. Your future self will thank you!


Now let’s set up an example computational project structure for our IBD data and our ancient DNA metadata (we’ll leave the \(f_4\)-ratio Neanderthal estimates for the next chapter), just like you would do for a real study.

Note: This is all just an example! Again, as long as you follow the guidelines numbered above, any structure will work!

In the Files pane of RStudio, click on "New Folder" and create the following directories in the root of your project directory simgen-project/:

  1. code/
  2. data/, and within it create directories raw/ and processed/
  3. figures/
  4. results/
  5. reports/

Note: You can do all this manually, or you can play around with doing it in code using the incredibly useful function ?dir.create! For instance, the commands to do all of the above could be the following. (If you’re curious about the recursive = TRUE, look at the help page of ?dir.create.)

dir.create("code/")
dir.create("data/raw", recursive = TRUE)
dir.create("data/processed", recursive = TRUE)
dir.create("figures/")
dir.create("results/")
dir.create("reports/")

When you hit the "Refresh file listing" circular arrow button in the top-right of the "Files" pane, you should now see all the directories you just created. Make sure you see them, and not just the simgen-project.Rproj file like before!

Exercise 3: Building an example reproducible pipeline

Let’s get something out of the way first. Of course, if there are a million possible ways to properly structure a computational project, there are infinite ways a “reproducible research pipeline” can actually be organized.

Obviously, all research projects are different, they focus on different kinds of data (archaeological data, ancient DNA sequences, linguistic data, field observation data, etc.), so naturally they will require different code which will have to be structured in different ways.

Still, there are some common steps that practically every single computational scientific research study goes through:

  1. Data gathering – in our IBD and metadata examples, this involved downloading data from the internet.

  2. Data processing – in our case, this involved filtering the data, processing it to bin individuals based on their ages, joining IBD data with metadata, etc.

  3. Data analysis – this involves computing summary statistics, creating figures, etc.

The dirty secret of many scientific research studies (especially in the “old days”) is that all of these steps are often clumped together in huge scripts, and it’s sometimes very difficult (even for the author) to tell what is where. This can be a big problem, especially if a bug needs to be fixed, a new step of a processing pipeline needs to be added, etc.

It gets even worse when you come back to a project after a long time and need to remember where in your code that one line is that does the thing that needs changing!

Let’s demonstrate how you could organize your code in practice, and hopefully you’ll see how investing a bit more time and thinking into properly organizing code in your research project makes your life a lot easier in the long run (and much easier for everyone who might pick up your project later too).


1. Data gathering

Create a standalone R script (File -> New File -> R Script). Paste the following code to that script, and then save it as 01_download_data.R in the root of your simgen-project project directory.

Don’t copy it without skimming through it and understanding it! You should recognize these commands from our earlier exercises on the tidyverse; ideally, you’ve written those commands before yourself. That example code was a little messy and random, because it was structured as an ad hoc tutorial. What we’re doing here is showing how to organize computational research properly.

You can find the complete script here.
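
To give you an idea of the overall shape such a script might take, here is a rough sketch. The URLs below are placeholders, not the actual course data locations; use the linked script for the real thing.

# 01_download_data.R -- sketch of the general structure (placeholder URLs!)

cat("Downloading IBD segment table...\n")
download.file(
  url      = "https://example.com/ibd_segments.tsv",  # placeholder URL
  destfile = "data/raw/ibd_segments.tsv"
)

cat("Downloading sample metadata...\n")
download.file(
  url      = "https://example.com/metadata.tsv",      # placeholder URL
  destfile = "data/raw/metadata.tsv"
)

cat("All raw data saved to data/raw/\n")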


When you have your script 01_download_data.R, run it by calling source("01_download_data.R") in your R console and observe what happens. You can also press Cmd/Ctrl + Shift + S — a very useful shortcut!

Then, click through your "Files" panel and look in data/raw — do you see files that your new script should create?

Note: Notice the cat("some kind of message\n") command in the script. This is extremely useful for printing log information about a (potentially) long-running process. If you’re confused about what it does, write this into your R console: cat("Hello friend!\n").


Creating 01_download_data.R is a first step towards reproducibility. Downloading all data sets now happens in a dedicated script, which means:

  1. We only have to run it once, and have all data available in data/raw for later use.
  2. If a new data set needs to be included, we just edit the script 01_download_data.R and run it again!
  3. Except for running the script top to bottom, we don’t do any “manual work”.

This doesn’t sound like much, but it’s absolutely crucial. Automation is the most important aspect of reproducible research. Unless something is fully automated, it’s not reproducible.

Now let’s move to the next step!


2. Data processing

Create a standalone R script (File -> New File -> R Script). Add the following code to that script, and then save it as 02_process_data.R in the root of your simgen-project project directory.

You can find the complete script here.

Again, please don’t just blindly copy it! Look at the script and try to recognize the commands from our earlier exercises on tidyverse. You ran all of this yourself, manually, command-by-command, in the R console.
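
As a rough sketch of the structure such a script might have (the exact column names, age bins, and join keys below are assumptions; the linked script is the authoritative version):

# 02_process_data.R -- sketch only; column names and join keys are assumptions

library(dplyr)
library(readr)

metadata <- read_tsv("data/raw/metadata.tsv")
ibd      <- read_tsv("data/raw/ibd_segments.tsv")

cat("Binning individuals by age...\n")
metadata <- metadata %>%
  mutate(age_bin = cut(age, breaks = c(-Inf, 0, 5000, 10000, Inf)))

cat("Joining IBD segments with metadata...\n")
ibd_merged <- ibd %>%
  inner_join(metadata, by = "sample_id")   # hypothetical join column

write_tsv(metadata,   "data/processed/metadata.tsv")
write_tsv(ibd_merged, "data/processed/ibd_merged.tsv")

cat("Processed tables saved to data/processed/\n")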

When you have your script, close RStudio.

Let’s pretend that some time passed, you’re done for the day and went home.

Find the location of your simgen-project directory on your computer, go to that directory, and then double-click on the R project file simgen-project.Rproj. You’ll get your old session back, just like you left it (including history!).


Now press Cmd/Ctrl + Shift + S to run the processing script, and observe what happens! Then click through your "Files" panel and look in data/processed — do you see your files, processed and tidy, ready for analysis?


3. Data analysis

Hopefully you’re now starting to get an idea about what a well-organized, reproducible pipeline means. It’s about structuring the directory where your project lives, and about partitioning your code into scripts which represent logical components of your project — from downloading data (and saving it to a particular storage location), to processing it (and again saving it to a proper storage location), and, finally, to answering research questions based on this data.

Your project structure should now look something like this. Notice that there are already data files present, not just directories:

├── 01_download_data.R           # data fetching code
├── 02_process_data.R            # data processing code
├── code                         # directory for future custom functions
├── data
│   ├── processed                # processed / filtered / cleaned data
│   │   ├── ibd_segments.tsv
│   │   ├── ibd_sum.tsv
│   │   └── metadata.tsv
│   └── raw                      # raw unprocessed data
│       ├── ibd_segments.tsv
│       └── metadata.tsv
├── figures                      # location for PDF figures
├── results                      # various other output data (tables, cache files, etc.)
├── reports                      # location for slides and reports (next section!)
└── simgen-project.Rproj

You have implemented the first two steps of a data processing pipeline. Now it’s your job to implement additional components of the project pipeline, again using a couple of examples of what we’ve already done before, but this time properly integrating them into a nice and tidy project.


Generating figures

Create a standalone R script called 03_plot_metadata.R, which will:

  • Read the table in data/processed/metadata.tsv using the read_tsv() function.

  • Create the following figures using the ggplot2 package and the ggsave() function you learned about earlier.

  1. Count of samples in each age_bin of metadata (use geom_bar()).

  2. A histogram of coverage across different age_bin categories, only for individuals with age > 0 (i.e., ancient individuals), with each facet showing the histogram for one age_bin category (use geom_histogram() and facet_wrap()).

  • Save the figure(s) in the figures/ directory (whether you save individual PDF for each figure, or put multiple figures into one PDF is up to you).

Note: Remember that you can use the function ggsave() to save ggplot2 figures stored in normal R variables, or the pdf() & dev.off() trick!

Hint: If you need help, we solved all of these in previous exercises.

You can find my solution here.
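
For reference, one possible sketch of such a script (not necessarily identical to the linked solution, and with file names of the output PDFs chosen arbitrarily) could look like this:

# 03_plot_metadata.R -- one possible sketch, not necessarily the linked solution

library(dplyr)
library(readr)
library(ggplot2)

metadata <- read_tsv("data/processed/metadata.tsv")

# figure 1: number of samples in each age bin
p_counts <- ggplot(metadata, aes(age_bin)) +
  geom_bar()

# figure 2: coverage histograms, one facet per age bin (ancient individuals only)
p_coverage <- metadata %>%
  filter(age > 0) %>%
  ggplot(aes(coverage)) +
  geom_histogram() +
  facet_wrap(~ age_bin)

ggsave("figures/age_bin_counts.pdf", p_counts)
ggsave("figures/coverage_histograms.pdf", p_coverage)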


Create a standalone R script called 04_plot_ibd_sharing.R, which will:

  • Read the table in data/processed/ibd_merged.tsv using the read_tsv() function.

  • Filter the table to only length > 5 (segments longer than 5 cM) and time_pair == "present-day:present-day" (only present-day humans).

  • Create the following figures using the ggplot2 package and the ggsave() function you learned about earlier, saving all of these figures into a single file figures/ibd_sharing.pdf:

    • For each pair of continents region_pair == "Europe:Europe", region_pair == "America:America", region_pair == "Asia:Asia", and region_pair == "Africa:Africa", plot a histogram of lengths of IBD longer than 5 cM.

    • Use facet_wrap(~ country_pair) to visualize the histogram for each country pair in its own facet.

    • Therefore, your PDF should have four plots (one plot per page), with each plot having a number of histograms, one histogram per facet.

You can find my solution here.
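
Again, a rough sketch of one way to do this (the real script is the one linked above). Note how pdf() and dev.off() collect all four plots into a single multi-page file:

# 04_plot_ibd_sharing.R -- a sketch; see the linked solution for the real script

library(dplyr)
library(readr)
library(ggplot2)

ibd <- read_tsv("data/processed/ibd_merged.tsv") %>%
  filter(length > 5, time_pair == "present-day:present-day")

region_pairs <- c("Europe:Europe", "America:America", "Asia:Asia", "Africa:Africa")

pdf("figures/ibd_sharing.pdf", width = 10, height = 8)
for (pair in region_pairs) {
  p <- ibd %>%
    filter(region_pair == pair) %>%
    ggplot(aes(length)) +
    geom_histogram() +
    facet_wrap(~ country_pair) +
    ggtitle(pair)
  print(p)  # inside a loop, ggplot objects must be print()-ed explicitly
}
dev.off()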


Create a standalone R script called 05_plot_ibd_relatedness.R, which will:

  • Read the table in data/processed/ibd_sum.tsv using the read_tsv() function.

  • Recreate that beautiful figure we made, showing a scatterplot of total_ibd against n_ibd, and highlighting which individuals are duplicated, who is a 1st degree relative, 2nd degree relative, etc.

  • Save that figure to figures/ibd_relatedness.pdf.

You can find my solution here.
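
A bare-bones sketch of the scatterplot part might look like this. The column holding the relatedness category (called kinship below) is hypothetical and will likely be named differently in your data, so treat this only as a reminder of the general shape of the script:

# 05_plot_ibd_relatedness.R -- sketch; the "kinship" column name is hypothetical

library(readr)
library(ggplot2)

ibd_sum <- read_tsv("data/processed/ibd_sum.tsv")

p <- ggplot(ibd_sum, aes(total_ibd, n_ibd, color = kinship)) +
  geom_point()

ggsave("figures/ibd_relatedness.pdf", p)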

Exercise 4: Generating other results

Of course, the main outputs of our research are visualizations. But sometimes a good old table is also very useful! Our papers often have both plots and tables.

Just to practice a little bit more, create a standalone R script called 06_summarize_coverage.R, which will:

  • Read the table in data/processed/metadata.tsv using the read_tsv() function.

  • filter() the table to only age > 0 (only rows for ancient individuals) and continent == "Europe" (i.e., only aDNA European samples).

  • summarize() the rows, grouping them by country and age_bin (remember group_by()?), to compute:

    • avg_coverage as a mean() of the coverage column,
    • n_samples as the n() number of rows.
  • The rows should be sorted in descending order by the column n_samples.

  • Store the result of the filter / group_by / summarize %>% pipeline as df_summary, and save the table using write_tsv() in a file called results/coverage_summary.tsv.

You can find my solution here.
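
One possible version of this pipeline (not necessarily identical to the linked solution) might be:

# 06_summarize_coverage.R -- one possible version of the script described above

library(dplyr)
library(readr)

metadata <- read_tsv("data/processed/metadata.tsv")

df_summary <- metadata %>%
  filter(age > 0, continent == "Europe") %>%    # ancient European samples only
  group_by(country, age_bin) %>%
  summarize(
    avg_coverage = mean(coverage),
    n_samples    = n()
  ) %>%
  arrange(desc(n_samples))

write_tsv(df_summary, "results/coverage_summary.tsv")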

Exercise 5: Adding a README file

It’s a good idea for every project to have a README file at the root of the directory. Traditionally, this file is written in Markdown format, which makes it easy to read as plain text, but also easy to render on the web (many websites such as GitHub and other data repositories render Markdown text beautifully).

Look up the basic syntax of Markdown formatting, then create a new file via File -> New File -> Markdown File (not R Markdown file!). Then write a short list (again, take a look at the Markdown syntax and read how to create a list) which will describe, in one sentence each:

  1. Which directories are in your project, and what they contain (briefly!)
  2. Which scripts you created and what they do.
  3. Which results are created (so far just figures), and which scripts created those results.

Hint: Try clicking the “Visual” button on top of your editing window. I personally don’t use it because I like to write Markdown in plain text, but it can be very helpful, because you can create lists and other formatting by clicking with the mouse, instead of having to type Markdown syntax manually.
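
If you’ve never written Markdown before, a minimal README.md for this project might look something like this (content abbreviated; adapt it to what you actually created):

# simgen-project

Analyses of IBD sharing and ancient DNA metadata for the simgen course.

- `data/raw/` -- unprocessed downloaded data
- `data/processed/` -- filtered and joined tables
- `figures/` -- PDF figures produced by the plotting scripts
- `results/` -- summary tables
- `01_download_data.R` -- downloads raw data into `data/raw/`
- `02_process_data.R` -- filters and joins the raw data into `data/processed/`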


Creating a README is an important part of documenting computational projects. If anyone comes back to your project later (even you!), they will have a summary of what the project contains, what code is in it, and what that code does.

This is infinitely easier than having to read code and decipher what it does!

Always have a README file in your project root directory, especially when you publish it in a thesis or in a paper.

What we’re still missing (theoretically)

Here are a couple of topics which can be extremely helpful for pushing reproducibility even further, but that we don’t have bandwidth to go through, unfortunately. I’m mentioning them in case you’d like to do more studying on your own, perhaps as you get further into your own research projects.

I use all of these in my own work, but I consider them much more advanced topics. Our workshop focuses on the fundamentals of reproducible research, and you can get very far with those fundamentals alone.

  1. Backups. Unless you use git actively (which solves the backup problem as a side-effect—see below), you need to back up regularly. Even right-clicking your simgen-project directory, archiving it in a zip file named like "yyyy-mm-dd-simgen-project.zip" regularly will get you through any trouble.

  2. git – an incredibly powerful version control system, which also underlies the online code repository service GitHub. This is the big one, and probably the most important tool I’ve ever learned as a programmer. It would require an entire workshop on its own, unfortunately.

  3. renv – an R package which allows you to lock the exact versions of R packages used in your project (and have anyone else restore those exact versions of packages at a later point).

  4. targets – an R package for building and orchestrating truly complex computational pipelines. For instance, it can track whether or not a particular function or piece of data has changed and, therefore, whether other components of a pipeline downstream from it need to be re-run (unlike our 01_download_data.R or 02_process_data.R scripts which need to be re-run in their entirety after each modification).

Closing remarks

Our workshop focuses primarily on the R programming language and various R packages useful for data analysis and plotting. Of course, real-world computational projects often rely on other programs and scripts written in a traditional shell (like BASH) or in Python. However, the same ideas apply regardless of the exact language. In fact, in the sessions on population genomics and inference, we will see how everything fits very naturally into the overall structure we’ve established here.