Written Evidence Submitted by Dr Becky Arnold
Who am I?
I am a postdoctoral researcher at Keele University studying astrophysics. My main area of focus is understanding how star clusters form and evolve. I have also done a large amount of work investigating and advocating for methods to improve the reproducibility of research that relies significantly on computers to analyse data and produce results, which nowadays describes a very large fraction of research.
My experience on this topic:
● Lead collaborator on the Alan Turing Institute project The Turing Way: a how-to guide for reproducible research
● Software Sustainability Institute Fellow advocating for the preservation of the software used to conduct research in order to improve reusability, consistency, and reproducibility
● Member of the Sheffield Reproducibility Network whilst working at the University of Sheffield
● Invited speaker on the Sustainable Software Practice for Reproducible Research panel at the Collaborations Workshop Conference 2019
● Given numerous talks on research reproducibility, for example to the Data Science Network at Keele University
In this evidence I will outline common areas of weakness in research reproducibility, and propose strategies to support their improvement.
Data distribution and curation
A key barrier preventing the widespread reproduction of scientific results is that it is often difficult, or even impossible, to obtain the original data that those scientific results are based upon. Researchers commonly store and analyse their data on private or institutionally-owned computers, rendering it inaccessible to the wider community. Requests for the data may go unanswered for a variety of reasons, such as time constraints on the original researcher, or their departure from the institution the research was conducted at.
Without access to the raw data a research finding was based upon, reproducing it is often a non-starter. This barrier can be mitigated by incentivising researchers to upload their datasets to publicly available repositories. Such incentives could take the form of making data deposition a requirement for receiving government funding, or of including dataset publication in research guidelines. The possibility of funding subject-specific repositories for researchers to upload their data to could also be explored, as currently available general-purpose repositories, such as Zenodo, are somewhat limited.
It is not always feasible to share raw data, for example where the data is private (e.g. medical data) or commercially sensitive. However, it may be possible to share data in some obscured form, for example by anonymising it. It is important to emphasise the value of even obscured data in enabling at least partial reproducibility.
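As a hedged sketch of what sharing data "in some obscured form" might look like in practice, the snippet below pseudonymises an identifying field before a dataset is shared. The field names, records, and salt are all hypothetical, and genuine anonymisation of medical data requires far more care than this; the point is only that obscured data can still support partial reproduction.

```python
import hashlib

# Hypothetical salt; in practice this would be kept secret by the
# original researchers so identifiers cannot be reversed by brute force.
SALT = "project-specific-secret"

def pseudonymise(identifier):
    """Replace an identifier with a salted, truncated hash."""
    return hashlib.sha256((SALT + identifier).encode()).hexdigest()[:12]

# Hypothetical records: a direct identifier plus an analytically
# useful field.
records = [{"patient_id": "NHS-0001", "age": 54},
           {"patient_id": "NHS-0002", "age": 61}]

# The shared copy keeps the useful field (age) but obscures the
# direct identifier.
shared = [{"patient_id": pseudonymise(r["patient_id"]), "age": r["age"]}
          for r in records]
print(shared)
```

A dataset treated this way cannot support every reanalysis, but it allows others to rerun the published statistical pipeline end to end.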
An additional barrier to research reproducibility is that even when data is publicly available it is often so poorly curated as to be unusable. Examples of such poor curation can include:
● No information on what kind of data is held in each row/column
● No information on what units the data is in
● In datasets with multiple files, no information on what data is held in what file
● No licence (e.g. a Creative Commons licence) giving others the legal right to make use of the dataset
Some level of guidance or formal education (for example embedded within PhD programmes) on basic data curation would be extremely helpful for ensuring that when data is shared its usefulness is maximised.
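One simple curation practice that addresses the first two weaknesses listed above is to ship a machine-readable "data dictionary" alongside a dataset, documenting what each column holds and in what units. The sketch below illustrates the idea; the column names and units are hypothetical.

```python
import csv
import io

# A hypothetical data dictionary: one entry per column of the
# dataset being shared, recording its meaning and units.
data_dictionary = [
    {"column": "cluster_mass",
     "description": "Total stellar mass of the cluster",
     "units": "solar masses"},
    {"column": "age",
     "description": "Time since cluster formation",
     "units": "Myr"},
]

# Write the dictionary as a small CSV file that would sit next to
# the dataset itself in the repository.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["column", "description", "units"])
writer.writeheader()
writer.writerows(data_dictionary)
print(buffer.getvalue())
```

Even a file this small removes most of the guesswork for a researcher attempting to reuse the data.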
In summary, I recommend:
● Funding/construction of repositories where researchers can make their data publicly available.
● Incentivisation of researchers to make their data publicly available where possible.
● Guidance and training on data curation to maximise the usefulness of publicly-available datasets.
Code distribution and curation
Conventional wisdom states that one of the core purposes of academic papers is to describe how a piece of research was performed in sufficient detail that it could be reproduced and its findings verified. However, the traditional format of these research outputs, typically at most a few tens of pages of A4, is increasingly insufficient for this task. There is simply no way for such a format to describe months or years of work by even one researcher (let alone that of large collaborations, which are increasingly common) in sufficient detail that the research could be meaningfully reproduced.
An outline can be given in such a format, but the “meat” of scientific analysis in many fields is often carried out by computers via code or other software written by the original researchers, and existing only on their personal or institutional computers. Without access to those files, which perform the analysis and output the published results, reproducing the research is often difficult or impossible.
For the reasons outlined above, traditional academic papers are simply no longer adequate to the task of ensuring reproducibility in the modern era of large collaborations and computationally-intensive analysis. In the long term it will be necessary to completely reconsider how this task is approached, but in the short and medium term, improving access to the files used to perform analyses and produce scientific results is imperative.
Strategies to improve access to such files are virtually identical to the strategies for improving data accessibility discussed in the previous section. Researchers must be incentivised, and guidance provided on how to properly curate what is shared; analysis files without any documentation or instructions on how to run them are often difficult or impossible to decipher. There are already publicly available services, such as GitHub, which are widely used and suitable for sharing code and other analysis files. Further, such files are often much smaller than scientific datasets, making their storage and distribution less challenging. As such, funding or creating publicly available repositories for analysis files is not a high priority.
As previously mentioned, the paradigm of distributing scientific findings via academic papers will likely need to be, at the very least, reconsidered and evolved for the modern research environment. One avenue for conducting and distributing research that has been gaining traction is interactive notebooks. A simple example of such a notebook is shown in Fig. 1. As can be seen from this figure, such notebooks can contain both regular text and analysis code, enabling researchers to explain what is being done alongside the analysis itself. If such a notebook is made publicly available, then anyone with an internet connection can run it for themselves and reproduce the findings of the original researchers; Fig. 2 shows the notebook presented in Fig. 1 after it has been run. This exemplifies how the widespread adoption and distribution of interactive notebooks would, in the long term, represent a vast step forward for research reproducibility.
Fig. 1: Example of an interactive notebook performing a simple analysis. Such a notebook can be made publicly available, so anyone can run it.
Fig. 2: The example interactive notebook from Fig. 1 after it has been run.
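To give a flavour of the kind of analysis cell such a notebook contains, the sketch below combines a comment explaining the step, the analysis code itself, and a printed result that any reader re-running the cell can verify. The measurements are made up purely for illustration.

```python
# Hypothetical example data: five repeated measurements of some quantity.
measurements = [4.2, 3.9, 4.5, 4.1, 4.3]

# Compute and report the mean, so a reader re-running this cell can
# check they obtain the same value as the published result.
mean = sum(measurements) / len(measurements)
print(f"Mean measurement: {mean:.2f}")  # prints "Mean measurement: 4.20"
```

Because the explanation, the code, and the output live in one shareable document, there is no gap between what the paper claims and what the analysis actually does.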
In summary, I recommend:
● Incentivisation of researchers to make their analysis files publicly available where possible.
● Guidance and training on how to curate analysis files so that it is clear how they are to be used to reproduce published research results.
● Re-evaluation of academic papers as the dominant format for distributing research findings, perhaps in favour of interactive notebooks.
Computational environments
Note: The term computational environment means the operating system, the software installed, the versions of that software, and the files present on a computer.
Even with access to all the raw data and the code used to analyse it, a researcher trying to reproduce another's work may face significant difficulties in running that analysis to reproduce the original results. This is largely due to differences between their computational environment and the one the original research was performed in. As a simple example, the reproducing researcher may be unable to perform the analysis because they do not have a key piece of software installed that is necessary to run it.
Failure to document the versions of software used represents a more dangerous example of how differences in computational environment can inhibit the reproducibility of research. As software is updated, bugs are fixed (and sometimes unintentionally introduced), functionality is added or removed, and so on. As such, even if a researcher has the necessary software to reproduce an analysis, if the version they have differs from that in the original environment they may well get a different output.
In order to maximise the reproducibility of research, the computational environment it was conducted in should, at a minimum, be documented. Other avenues for ensuring the reproducibility of computational environments include the distribution of virtual machines, containers, and interactive notebooks. The last of these was discussed in the previous section, and the example notebook shown in Figs. 1 and 2 demonstrates how notebooks can preserve the computational environment a piece of research was conducted in: in this case by specifying that version 1.1.4 of the pandas software package is used for the analysis.
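As a minimal sketch of what documenting an environment "at a minimum" could look like, the snippet below records some basic facts about the machine an analysis script ran on, using only the Python standard library. In practice researchers would more typically use tools such as `pip freeze`, a conda environment file, or a container image, which capture installed package versions as well.

```python
import platform
import sys

def capture_environment():
    """Return basic facts about the environment the analysis ran in."""
    return {
        "python_version": platform.python_version(),
        "operating_system": platform.platform(),
        "interpreter_path": sys.executable,
    }

# Printing (or saving to a file distributed with the results) records
# the environment the analysis was actually performed in.
for key, value in capture_environment().items():
    print(f"{key}: {value}")
```

Even this crude record lets a reproducing researcher spot the most common source of divergent outputs: a mismatched software version.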
In summary, I recommend:
● Guidance and training on the importance of capturing the computational environment research is conducted in.
● Guidance and training on how to preserve and distribute computational environments.
● Incentivisation towards at least minimum standards of computational environment curation, so that those environments can be reproduced.