The European Bioinformatics Institute (EBI) is part of the European Molecular Biology Laboratory (EMBL), Europe’s only intergovernmental organisation for the life sciences. EMBL-EBI is located on the Wellcome Genome Campus in Hinxton (Cambridge), and employs over 800 full-time equivalent staff. EMBL-EBI serves the scientific community by providing freely available bioinformatics resources, promoting basic research, providing training to scientists at all levels, and disseminating cutting-edge technologies to the academic community and industry. EMBL-EBI currently hosts over 40 online data resources for different types of biomolecular data, a number of which are recognised as ELIXIR Core Data Resources. EMBL-EBI data resources cover the entire range of biological sciences, from raw DNA sequences to curated proteins, chemicals, structures, systems, pathways, ontologies and literature. EMBL-EBI resources are used 81 million times daily by users throughout the world. Many EMBL-EBI data resources are deposition databases, where researchers deposit their experimental data to share it with the scientific community. The amount of data deposited in EMBL-EBI resources has grown rapidly over the years: in 2020 the raw storage capacity at EMBL-EBI was around 390 petabytes.
All EMBL-EBI resources are open and free for all to use. Through its services, EMBL-EBI is a trusted source for some of the most used data in the world, including reference genomes for humans and other globally important species (such as the bread wheat genome, the pig reference genome, and the SARS-CoV-2 reference genomes). EMBL-EBI’s service mission as a public institution is to receive, curate and share life sciences data on behalf of the research community. EMBL-EBI does not simply store data: its focus on data reuse requires the implementation of specialist standards and metadata to meet community needs, as well as the provision of analysis software. Data is provided to users via web pages, but also programmatically for large-scale data analysis and to support the development of AI tools and third-party services in both academia and industry. Investment in this portfolio of open services for over 25 years meant that EMBL-EBI was positioned to respond rapidly to data sharing needs in the COVID-19 pandemic, producing the infrastructure for the COVID-19 Data Portal (https://www.covid19dataportal.org) in a matter of weeks.
In this document, we address some of the points raised by the consultation and identify a number of initiatives that may have a positive impact on academia’s approach to reproducible research. We approach these issues from a life sciences perspective, a field in which there is a robust international research infrastructure.
The issues in academia that have led to the reproducibility crisis
There are two key issues that drive the reproducibility crisis: current research assessment practices and barriers to data sharing and reuse.
The hyper-competitive nature of research assessment practices based on journal impact factors drives a culture that is not compatible with reproducibility. It puts researchers under pressure to produce results that are suitable for journals with high impact factors and does not reward work that, for example, aims to replicate or validate previously published findings or to publish negative results.
The sharing of open FAIR data is also a critical requirement for reproducible research. The availability of raw/primary data, analysis tools, and processes is an essential prerequisite for researchers to scrutinize results and reuse data in new scientific contexts. The reuse of data by the community not only gives rise to new areas of research, but also acts as a built-in quality assurance mechanism for existing data. This approach requires investment in open, community-driven infrastructure as part of the wider research ecosystem.
The process of improving scientific practices in research assessment and data sharing will be a labour-intensive one. The work required to achieve a research culture that promotes reproducibility should be adequately supported and recognised.
What policies or schemes could have a positive impact on academia’s approach to reproducible research
Reproducibility issues could be addressed through two key approaches:
Some suggestions for specific policies that may drive these approaches are detailed below.
The San Francisco Declaration on Research Assessment (DORA) aims to improve the way in which research is evaluated and more specifically to remove the dependency on the journal impact factor in these processes.
In line with DORA (of which EMBL is a signatory), EMBL-EBI strongly supports the recognition of research outputs other than publications, such as open data and open software. Further, it provides infrastructure for open, FAIR data sharing in the life sciences, which other research institutions around the world can use as a tool for effective DORA implementation.
Recent new requirements from some funders to provide evidence of DORA implementation can help to drive change at leading research institutions. In 2021, EMBL set up a DORA working group to review current research assessment practices and develop best practices to help standardize, codify, and refine research assessment across all EMBL sites. Implementation of the practices agreed by the working group is expected to start by the end of 2021.
Open data is a fundamental tenet of reproducible research. Access to research data can enable a range of core scientific activities, including verification, discovery, and evidence synthesis. In fact, data availability is a critical feature of an efficient, progressive, and ultimately self-correcting scientific ecosystem that generates credible findings. Support for this idea has grown steadily over the past years (Ioannidis, 2014; Munafo et al, 2017), and the necessity of open science and open data for a functional scientific community is now largely recognised by researchers, funders, and publishers alike.
For open data to drive reproducibility, it needs to be not only publicly available but also Findable, Accessible, Interoperable and Reusable, or FAIR (Wilkinson et al, 2016). The open data resources at EMBL-EBI make the data they host FAIR by curating it and connecting it with other relevant datasets and literature, as well as by providing value-added services and programmatic access.
For many years, the evolving open access publications policies of funders have driven significant behavioural change in the publication habits of life sciences researchers. Therefore, open data and open science policies have the potential to increase and drive open data sharing best practices in research communities, which in turn supports reproducibility. Policy development needs to be aspirational, but also must be supported by practical and achievable guidelines for implementation.
We have identified three areas of open data policy development that drive research reproducibility: submission to structured public data repositories, deposition of raw data, and use of community-driven data standards and guidelines.
● Submission to structured public data repositories
Structured data repositories require rich scientific metadata that adhere to community-driven data standards. This adds rigour to the data submitted and makes the data more interoperable and reusable downstream. Public data repositories have a mission of public service and are often a part of the community they serve, making them responsive to changing scientific needs. A mission to share data as widely as possible means that programmatic access is encouraged and integration with related data resources and the research literature is all part of the service. Data integration is a quality assurance process in itself, while deep links between research articles and data are fundamental to reproducible research.
● Deposition of raw data
Accessing raw (or primary) data allows for a higher degree of scrutiny of scientific results than having access only to the processed end result (for example, a figure in a research article). Several EMBL-EBI data resources host raw data. Two examples are given below.
The European Nucleotide Archive (ENA) at EMBL-EBI provides a comprehensive record of the world’s nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation. Access to the raw data of many genomes helps scientists dig into data quality issues. Not all data-sharing platforms allow for the submission of raw sequences, which has caused especially significant issues in the current pandemic (Van Noorden, 2021). For instance, researchers at EMBL-EBI identified a number of issues with SARS-CoV-2 data that were harder to identify because those data were not openly shared and contextualised by other similar data (De Maio et al, 2020).
Biological imaging will also greatly benefit from structured databases to share raw images. In 2016, EMBL-EBI launched EMPIAR in response to the cryo-electron microscopy community's need for public archiving of raw 2D image data. In 2019, the BioImage Archive was launched to host raw images for light microscopy. In time, this resource will transform the ability to scrutinize imaging results in the same way as the ENA has done for sequence data.
● Use of community-driven data and metadata standards
First, data that is accompanied by standardised metadata is more easily reproducible, as the metadata provides the information that researchers need to attempt replication. Second, the same metadata standards make data more reusable, which greatly increases the potential for analysis and remixing of datasets, allowing data analysis to drive new discoveries. This reuse can incentivise researchers to produce reusable data themselves. A culture of habitual data reuse is one of regular scrutiny of research methods and analysis, which in turn will promote reproducibility.
EMBL-EBI acts as a coordinator for many research communities seeking to establish specific data standards. Recent examples include the Polygenic Risk Score Reporting Standards (PRS-RS) (Wand et al, 2021) and the Recommended Metadata for Biological Images (REMBI) (Sarkans et al, 2021). The BioModels Reproducibility Scorecard (Tiwari et al, 2021) was developed to address the issue of reproducibility in systems biology: out of the 455 mathematical models analysed by the BioModels team, almost half could not be replicated with the information provided by the authors in the original publication. To address the issue, the BioModels team developed an eight-point scorecard that modellers, reviewers and journals can use when publishing or reviewing a model. The scorecard includes criteria to identify systems biology models that are more likely to be reproducible, and will become part of the BioModels submission pipeline so that researchers using the resource can identify those models. A more established example of a reproducibility tool is the PDBe validation report, which provides depositors with detailed reports of the results of model and experimental data validation as part of the curation of all entries. wwPDB validation reports provide an assessment of structure quality using widely accepted standards and criteria.
The role of the following in addressing the reproducibility crisis
➢ Research funders, including public funding bodies
Funders hold a great amount of influence, which may be leveraged to drive change in research practices, as demonstrated by open access policies mandating that funded research be published open access to improve dissemination and impact. However, deposition to specialised databases is not always required by funders and publishers, even when there is a requirement to guarantee open access to the data. Therefore, research that complies with generic open data guidelines may still not be maximising its reproducibility potential if shared, for example, on platforms with only generic metadata and no programmatic access. Specialist metadata is essential for reproducibility as it typically contains key technical information on the generation of the data it describes.
Policies that require deposition in specialised databases would allow for greater discoverability and therefore scrutiny of scientific results. For example, the European Research Council (ERC) has already developed guidelines for data deposition that take community best practices into account. Moreover, funder support for community-based standard-setting initiatives would be welcome and positively influence how data infrastructures can improve services that support reproducibility. For instance, the recently developed REMBI guidelines (Sarkans et al, 2021) will be implemented in the BioImage Archive as part of its submission process.
➢ Publishers
Scientific articles are often the first way that researchers and clinicians around the world access information. However, publishers are not solely responsible for reproducibility; indeed, the point of publication, while a lever for implementing requirements, is not the ideal stage in the life cycle of data at which to promote it.
Publisher policies that mandate data deposition support reproducibility and can ensure that deposition occurs in structured, established public databases.
The introduction of a data availability statement - a distinct article section that contains guidelines on data access - by journal publishers can help boost reproducibility. In 2020, only approximately 18% of all full text publications available through Europe PMC contained a data availability section: even when a data availability section is present, it does not always contain links to open data. Nearly half of all publications with a data availability section in 2020 included the words “on request” - previous studies have shown that nearly 80% of such datasets become unavailable over time (Vines et al, 2014).
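As a rough illustration of the kind of screening behind such figures, the sketch below sorts data availability statements into those with a link to data, those that only offer data "on request", and those with neither. It is a minimal sketch under stated assumptions: the function name, category labels and patterns are hypothetical, and this is not the method used by Europe PMC or by Vines et al.

```python
import re
from collections import Counter

# Hypothetical patterns chosen for illustration only.
ON_REQUEST = re.compile(r"\b(?:up)?on\s+(?:reasonable\s+)?request\b", re.IGNORECASE)
LINK = re.compile(r"https?://\S+", re.IGNORECASE)


def classify_statement(statement):
    """Return a coarse category for one data availability statement."""
    if statement is None or not statement.strip():
        return "missing"      # the article has no data availability section
    if LINK.search(statement):
        return "linked"       # contains at least one URL to a resource
    if ON_REQUEST.search(statement):
        return "on_request"   # data offered only by contacting the authors
    return "unlinked"         # a statement is present but has no link


def summarise(statements):
    """Count how many statements fall into each category."""
    return Counter(classify_statement(s) for s in statements)
```

A statement that contains both a link and the words "on request" is counted as "linked" here; a real survey would need to decide how to treat such mixed cases and whether each link actually resolves to open data.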
➢ Research institutions and groups
Research institutions have the power to implement fairer research assessment practices in line with DORA. They also have the opportunity to develop and implement open science (data deposition) policies and ensure that these policies are communicated to research faculty and staff.
It would be helpful to align institutional policies with those of funders and journals, taking into account community best practices, in order to make compliance as simple as possible (and therefore increase the likelihood of the policies being actioned). For example, in the life sciences significant infrastructure for FAIR and open data sharing exists at organisations such as EMBL-EBI. Acknowledging this in institutional policies (rather than solely pointing to a local, generic institutional repository) can support reproducibility, as well as relieve individual institutes of the cost burden of managing infrastructure.
It is critical that research institutions build a culture of data sharing and expertise by supporting roles such as data stewards, who can become embedded locally to support best practices.
➢ Governments and the need for a unilateral response / action
The more that governments can agree on the response to the reproducibility crisis, the better. Almost all research has an international element, and a coordinated approach not only sends a strong signal, but also improves the likelihood of change as the requirements for researchers are simplified.
A coordinated approach could also result in shared costs. Our experience with Europe PMC, which supports the open access publication policies of over 30 international funders of life science research, is one of a sustainable infrastructure funded by a single award to which all funders contribute according to their research expenditure.