Written Evidence Submitted by
Professor Lewis Halsey, Professor of Environmental Physiology, University of Roehampton
I am writing in response to the recent call for evidence regarding ‘Reproducibility and Research Integrity’. I have taken an active interest in this issue since 2013, having recognised the concerns that a worrying number of scientific findings do not appear to replicate. In particular I have investigated and addressed this issue in terms of the effects of traditional statistical analyses on the robustness of study findings. I have published several peer-reviewed articles on the matter:
Halsey LG, Curran-Everett D, Vowler SL, Drummond GB (2015) The fickle P value generates irreproducible results. Nature Methods. 12: 179-185. 10.1038/nmeth.3288
Sneddon LU, Halsey LG, Bury NR (2017) Considering aspects of the 3Rs principles within experimental animal biology. Journal of Experimental Biology. 220: 3007-3016. 10.1242/jeb.147058
Halsey LG (2019) The reign of the P value is over: what alternative analyses could we employ to fill the power vacuum? Biology Letters. 15: 20190174. 10.1098/rsbl.2019.0174
If I may say so, all three of these papers are well cited, and the first, ‘The fickle P value’, ranks in the top 1% by citations of all scientific papers of its age. In response to the publication of these papers, I have been invited to speak at international conferences on the issue of suboptimal statistical practices in science and the role these play in the ‘reproducibility crisis’:
Amrhein V, Halsey LG, Stephens P (2021) The end of the reign of statistical significance. SORTEE Conference 2021 https://www.sortee.org/events/
Halsey LG* (2019) Unstable p values are causing lack of repeatability in animal experiments. Laboratory Animal Centre seminar ‘3R seminar: Study design’, University of Oulu, Oulu, Finland. [invited]
Halsey LG* (2018) The fickle p value does reproducibility no favours. Meta-Psychology Conference, Sheffield, UK. [invited]
Halsey LG* (2018) Unstable p values are causing lack of repeatability in animal experiments. Scand-LAS 2018 Symposium, Kristiansand, Norway. [invited]
Given my now long-standing expertise in this area, I am keen to take this opportunity to provide evidence to the Science and Technology Committee about issues concerning the reproducibility of science, including my thoughts on how best things might be improved.
Academia strongly encourages, often obliges, scientists to prioritise research that is novel and ground-breaking. At first, this might seem sensible – isn’t the very best science going to generate new findings and progress our understanding in leaps and bounds? In reality, though, this is only true if (a) our current understanding upon which we are basing new research is valid and (b) the supposedly ground-breaking science resulting from new research provides robust, i.e. reproducible, findings.
The big flaw in a system that focuses on research promising ‘big strides forward’ is that it produces a knowledge base made of sand. Eventually, that sand crumbles, but only many years later and after a huge amount of money, time, resources, participants and experimental animals have been deployed in generating false knowledge.
Why does this approach to research produce false knowledge? While we are taught at school that a key strength of the scientific process is that findings can be tested and verified by others, the pressures exerted by academia on researchers ensure that this doesn’t happen. The ‘big’ research papers (with the novel results suggesting those big strides forward in our knowledge) secure tenure and promotions for academic staff, money for UK institutions via the Research Excellence Framework (REF), and kudos and free marketing for the universities involved. In contrast, papers that check the veracity of previous publications, playing that vital role in the scientific process, do not. And for this reason, such work is rarely carried out – there are no incentives for scientists to do it. Thus, academia implicitly assumes, or hopes, that single studies reporting new findings are robust, that is, reproducible. Statisticians have always known this isn’t the case, that single studies cannot be taken as gospel, but most scientists aren’t statisticians, nor are most university deans or grant fund managers.
Moreover, because of the emphasis on exciting findings over the mundane, even some of the findings from highly novel research are not published. When the results of such work are deemed prosaic (typically a ‘non-significant finding’), researchers put this work on the back burner and instead prioritise publishing their more eye-catching findings. This is known as the ‘file drawer effect’ – non-significant results are left in the filing cabinet, never published, and thus this counter-evidence to previous research reporting significant results is never seen by the scientific community.
Presently, colleagues at the University of Stirling and I are investigating patterns in the analyses reported by papers published in the discipline of animal behaviour. While research into animal behaviour may not be considered as essential for the economy or society as, for example, medical research, there is little reason to believe that our findings will not generalise to many scientific fields. Our work thus far on this topic suggests there is considerable evidence that the statistical analyses being reported in papers are often skewed, and in a way that makes the results more exciting by ensuring they are ‘statistically significant’. This pattern can be seen simply by looking at the distribution of a key statistic used to assess whether the experimental conditions within a study have a statistically significant effect – the p value. P values less than 0.05 are generally interpreted as evidence for a ‘significant result’ while p values of 0.05 or greater are not. We are generating histograms of p value distributions across tens to hundreds of papers that show a marked excess of reported p values marginally less than 0.05, and in contrast very few p values marginally greater than 0.05. Put simply, the only explanations for this pattern of reporting are that a substantial number of researchers are (a) actively or inadvertently ‘p-hacking’, that is, tweaking their data or analysis to nudge down the resultant p value until it dips under the crucial 0.05 threshold, and/or (b) only publishing data where the p value is less than 0.05, leaving the rest in the file drawer. These behaviours badly distort the scientific process, again weakening our understanding and knowledge.
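To illustrate how the two behaviours described above could produce such a pattern, the following is a minimal simulation sketch. It is not the analysis being carried out with colleagues at Stirling, and it uses no real data. Every name and parameter is an assumption chosen for illustration: the group sizes, the use of a simple z-test with a known standard deviation, ‘p-hacking’ modelled as collecting a few more subjects and re-testing whenever the result is non-significant, and a 20% chance that a non-significant study is published at all. Every simulated study has no true effect, so without these behaviours the p values would be spread evenly between 0 and 1.

```python
import random
from statistics import NormalDist, mean


def two_sided_p(a, b, sd=1.0):
    """Two-sided z-test p value for a difference in means (SD assumed known)."""
    se = sd * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_published_p_values(n_studies=4000, n_start=20, n_extra=5,
                                max_looks=3, publish_nonsig=0.2, seed=1):
    """Simulate studies with NO true effect and return the 'published' p values.

    All parameter values are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    published = []
    for _ in range(n_studies):
        a = [rng.gauss(0, 1) for _ in range(n_start)]
        b = [rng.gauss(0, 1) for _ in range(n_start)]
        p = two_sided_p(a, b)
        # Modelled p-hacking: while the result is non-significant, add a few
        # more subjects per group and test again, up to max_looks times.
        looks = 0
        while p >= 0.05 and looks < max_looks:
            a += [rng.gauss(0, 1) for _ in range(n_extra)]
            b += [rng.gauss(0, 1) for _ in range(n_extra)]
            p = two_sided_p(a, b)
            looks += 1
        # Modelled file drawer: significant results are always published,
        # non-significant results only occasionally.
        if p < 0.05 or rng.random() < publish_nonsig:
            published.append(p)
    return published


def count_in(p_values, low, high):
    """Number of p values falling in the half-open interval [low, high)."""
    return sum(low <= p < high for p in p_values)


published = simulate_published_p_values()
just_under = count_in(published, 0.04, 0.05)
just_over = count_in(published, 0.05, 0.06)
print(f"published p values in [0.04, 0.05): {just_under}")
print(f"published p values in [0.05, 0.06): {just_over}")
```

Under these assumptions the published record should contain far more p values just below 0.05 than just above it, even though no study had a real effect. Setting max_looks=0 and publish_nonsig=1.0 switches both behaviours off and should leave the two counts roughly equal.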
I was recently involved in a session at an online conference (https://www.sortee.org/events/) discussing the ‘reproducibility crisis’ in science, specifically in ecology and evolutionary biology. My notes from that session remind me that there was a consensus among the delegates that the key driver of this crisis is misplaced institutional incentives coupled with a lack of realisation about the magnitude of uncertainty associated with the findings of single studies. Single studies, even those based on large samples, should be treated as offering only limited insights – few can be heralded as definitive, and most should instead be treated with healthy scepticism by default. Follow-up studies are just as vital as the initial one.
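The uncertainty attached to a single study can be made concrete with a short simulation sketch. It is an illustration under stated assumptions, not a reanalysis of any published work: the true effect size (0.5 standard deviations), the sample size (30 per group), the number of replicates and the use of a z-test with known standard deviation are all hypothetical choices. The same experiment, with the same real effect, is simply repeated many times and the p value recorded each time.

```python
import random
from statistics import NormalDist, mean


def replicate_p_values(true_effect=0.5, n_per_group=30,
                       n_replicates=1000, seed=2):
    """Repeat one identical experiment many times; return each p value.

    The effect is real in every replicate. All parameter values are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    # Standard error of the difference in means, SD assumed known and equal to 1.
    se = (2 / n_per_group) ** 0.5
    p_values = []
    for _ in range(n_replicates):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        z = (mean(treated) - mean(control)) / se
        p_values.append(2 * (1 - NormalDist().cdf(abs(z))))
    return p_values


p_values = replicate_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)
print(f"smallest p value: {min(p_values):.5f}")
print(f"largest p value:  {max(p_values):.3f}")
print(f"share of replicates with p < 0.05: {share_significant:.2f}")
```

With these settings the replicates should disagree widely, some returning very small p values and others p values well above 0.05, with roughly half reaching ‘significance’. Any one of them, taken alone, could be reported as either a clear finding or no finding at all.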
With these realisations, those institutions that drive the research incentives of scientists are in a position to make invaluable changes to the scientific process, by recognising the value of replicate studies and in turn encouraging them. Journals need to encourage the submission of reports of attempts to reproduce previous research; funders need to allot monies specifically for ‘non-novel’ research; in the UK, the REF needs to stop effectively requiring 3* and 4* papers (the ‘money makers’) to be novel; and universities need to reduce the focus of their staff assessments on research novelty.
I have tried to keep my written submission to the committee concise. I am happy to discuss any of the points I have made, and others, upon request. This includes my feeling that a national committee on research integrity under UKRI would probably not be particularly helpful; rather, it would take up resources that could be channelled into research. While there are some unscrupulous scientists, I believe they are rare and that many who indulge in scientific misconduct do so because of the aforementioned obligations on them to obtain significant results. The issue that needs to be contended with is the institutional incentives that drive scientists to publish significant results on highly novel, ‘ground-breaking’ research, and to rarely if ever run studies checking the results of previous research.