Computational social scientists rely on code as the main instrument for doing their work. Code is a powerful tool, but even the most seasoned coders can make mistakes that might call the results of their analysis into question.
Having a good process for code review is critical to ensuring that what we report is correct. In this post, we’ll discuss a key piece of our code review process at Pew Research Center: reproducibility.
When we talk about reproducibility, we often think of it as a framework for improving research collaboration and the broader research process. It gives others the ability to replicate our results using the same data and methods, and to reuse and build on that work.
But reproducibility can also help facilitate collaboration among researchers on the same team, which is what we’ll focus on here. Our broad philosophy for facilitating collaboration through reproducibility is twofold:
- We want to make it as easy as possible for our researchers to check each other’s code, even if someone hasn’t worked on that project before. Reviewers should be able to easily build the computational environment that the original researchers used for their analysis – the core tenet of computational reproducibility. Being able to work in a replica of that environment also simplifies things like figuring out how the project is organized and documented.
- We want to provide as much flexibility as possible to individual researchers and minimize the amount of engineering resources needed to support each project. Our team uses R and Python; stores data in files and databases; and works within shared environments and on personal computers. We want to avoid overly specific technical recommendations that may not travel well from project to project.
Our reproducibility toolkit
The core piece of our internal reproducibility process is a consistent, shared project structure for all our team’s active projects. This structure lets researchers move between projects as contributors or reviewers because they’re already familiar with where things live and can easily find all the information they need.
Our specific tool is a customized cookiecutter template that includes a standardized directory structure and guidelines for how to structure code and store data for every project.
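To give a sense of what that looks like in practice, here is a simplified, hypothetical sketch of a project laid out along these lines (the specific names are illustrative, not our actual template):

```
project-name/
├── README.md           <- how to set up and run the project
├── environment.yml     <- declaration of the computational environment
├── Snakefile           <- the pipeline that runs the scripts in order
├── data/
│   ├── raw/            <- original inputs, treated as read-only
│   └── intermediate/   <- reproducible outputs that can be discarded
├── scripts/            <- batch scripts, numbered in execution order
└── output/             <- figures, tables and other final products
```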
One of our template’s guidelines is that we prefer code to be structured as batch scripts rather than interactive notebooks. To ensure that the order in which scripts are executed is as transparent as possible, we use Snakemake, which helps users figure out what will happen when they run a piece of code. It can identify which inputs the code will consume and which outputs, like figures or other datasets, it will produce.
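As a minimal illustration, a Snakefile sketch like the one below (with hypothetical file and script names) declares the inputs and outputs of each step, which is what lets Snakemake work out the execution order:

```
# Snakefile -- a minimal, hypothetical sketch, not an actual project pipeline

rule all:
    input:
        "output/figure_1.png"  # the final artifact the pipeline should produce

rule clean_data:
    input:
        "data/raw/survey.csv"
    output:
        "data/intermediate/survey_clean.csv"
    script:
        "scripts/01_clean_data.py"  # a batch script run by Snakemake

rule make_figure:
    input:
        "data/intermediate/survey_clean.csv"
    output:
        "output/figure_1.png"
    script:
        "scripts/02_make_figure.py"
```

Running `snakemake --cores 1` from the project root would then execute only the steps whose outputs are missing or out of date.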
Even a well-documented set of scripts will fail if those scripts are run in a new environment that does not include the same versions of the same packages that the original researchers used. To address that problem, we adopted conda, a package and environment management system that helps ensure each project has a way of declaring the computational environment used to produce its results. We chose conda because it can manage R and Python installations and handle the integrity of the environment down to system-level libraries. For R-only projects, we sometimes use renv to either replace or complement conda.
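For instance, a conda environment file for a mixed R and Python project might look something like the sketch below; the package names and versions are illustrative:

```
# environment.yml -- an illustrative sketch, not an actual project specification
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - pandas=2.1
  - r-base=4.3
  - r-tidyverse
  - snakemake-minimal
```

A reviewer can then recreate the same environment with `conda env create -f environment.yml` before running any of the project’s code.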
While we want to ensure our code can run on any machine, we also think it’s useful to have a shared platform that everyone can access. Our data science platform, a JupyterHub instance, is a shared environment that gives researchers flexible computational resources and serves as the reference architecture on which every project needs to be replicable.
Finally, we standardize the locations where our project assets – raw data, as well as the code to work with it – are stored. We do this so they are accessible to all team members. We use GitHub for our code, and we typically store data in our AWS infrastructure.
We classify data assets into two categories: those that need to be preserved for the future because they cannot be reproduced, such as original raw data we collected; and intermediate datasets that eventually can be discarded. This separation means that at the end of a project, all relevant code will live in GitHub with references to a fixed, shared, “read-only” location in a private AWS S3 bucket or database. This bucket contains all the data that we cannot or should not reproduce.
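In practice, a script at the start of a pipeline might pull the preserved raw data from that read-only location. Here is a minimal Python sketch using boto3; the bucket and key names are hypothetical:

```
# A minimal sketch of fetching preserved raw data from a read-only S3 bucket.
# The bucket and key names below are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.download_file(
    Bucket="example-team-archive",       # read-only bucket holding raw data
    Key="project-name/raw/survey.csv",   # canonical version of the input
    Filename="data/raw/survey.csv",      # local copy used by the pipeline
)
```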
Reproducibility as a workflow problem
With the workflow described above, anyone on our team can easily rerun any project on their own – from accessing the canonical version of the underlying data to recreating the computational environment in which the analysis was run.
The downside is that this adds work for the original researchers. They now need to worry not only about making sure that things work for them, but also about ensuring that things will work for others. Put simply: Following these reproducibility policies takes time and attention from individual researchers in the short term, in exchange for long-term benefits to the team as a whole.
In this way, reproducibility is not all that different from writing documentation. As end users, we all welcome code that has been nicely documented because it makes our lives easier. But in the race to complete a project, we often feel we could be doing something more productive with our time. When developing this workflow, we wanted to balance the end goal of reproducibility with the demands of the process itself.
We’ve also learned that reproducibility tools are not simply a neutral layer on top of code. These tools force researchers to structure their workflow in a particular way. For instance, researchers can find Snakemake challenging if they’re accustomed to keeping all of their data in memory throughout a single session. Similarly, researchers working in any virtual environment are expected to keep their environment in sync by deleting packages that are no longer needed or by documenting all new dependencies. That’s hard to do, especially during project phases when researchers are trying out many different packages in rapid succession.
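One habit that can lighten this burden is exporting only the packages that were explicitly requested, which conda supports, so the environment file stays short enough to review by hand:

```
# Record only explicitly installed packages, not every transitive dependency
conda env export --from-history > environment.yml
```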
Ultimately, improving reproducibility means adding new tools and changing the workflows that researchers are used to – and who likes that? Even beyond the tools themselves, this effort requires a determined, ongoing culture of reproducibility to help overcome these obvious costs.
At every level of our team, we try to promote a mindset that views research as a process. In this mindset, how a repository is structured and documented is just as important to doing good research as the results that appear in the final publication.