Written Evidence Submitted by Dr Peter Ridd, The Global Warming Policy Foundation
The Global Warming Policy Foundation (GWPF) is a non-partisan think tank and a registered educational charity that focuses on climate and energy policy. Our aim is to provide robust and reliable analysis of climate and energy issues, based on rigorous research, for policymakers and the general public.
Dr Peter Ridd was Professor of Physics at James Cook University specialising in geophysics. He has published over 100 articles in international journals, many relating to the Great Barrier Reef upon which he worked for over 35 years. His recent interest is improving the Quality Assurance processes for science used to make public policy.
- The reproducibility problem is well-recognised in the science community, although its potential implications, especially for government policy formulation, have been largely ignored.
- In badly-afflicted fields, such as the biomedical sciences and psychology, roughly half of peer-reviewed science reports and papers are likely to be faulty.
- Given the scale and ubiquity of the problem, it is inconceivable that irreproducible ‘science’ (i.e., pseudo-science) has not been used as evidence underpinning government policy.
- The reproducibility problem is, to a large extent, a consequence of the failure of the peer-review system. Peer review is not the rigorous process that is generally assumed by the public. It is often nothing more than a cursory examination of the work, for a few hours, by a couple of ‘peers’. Peer review is thus a very limited quality assurance process and far from the ‘gold-standard’ that is often claimed by science organisations.
- The reproducibility problem is only one of the issues caused by too much dependence on the peer-review process. The other major problem is that peer review of science publications, and science funding applications, is a system almost guaranteed to produce groupthink. Groupthink is the enemy of science.
- Science evidence used by industry is far less likely to be afflicted by reproducibility problems than evidence used by governments. Industrial and commercial entities will generally use quality assurance systems that go much further than peer review.
- The main concern of this inquiry should be the extent to which faulty research evidence has been used to underpin government policy and regulations, especially in fields where a degree of ideology may have formed under the façade of ‘science’. This includes policy relating to environmental issues, agriculture, education, social policy, criminology, and some aspects of public health.
- One possibility is that the government establish quality assurance systems to audit research evidence used to underpin decisions. An option would be the establishment of an Office of Science Review (OSR). The OSR would need to be established within the National Audit Office to ensure independence from the organisations it would audit. Another option would be to require the UKRI to set aside funds to undertake reproducibility studies.
- Since the beginnings of science, in the time of the ancient Greeks, the scientific method has revolutionised human existence. Science has progressed by constant checking, replication, argument and improvement. In some areas of science, such as Newton’s laws of motion, checks are effectively performed billions of times every day when people fly in a plane, drive a car or walk across a bridge. Newton’s laws of motion are so well tested, checked, reproduced and replicated that we stake our lives on them. But some scientific evidence is not massively validated in this way, and is thus not as reliable.
- In the last couple of decades, it has become well accepted in the scientific community that a large fraction, perhaps half, of peer-reviewed scientific papers and reports are wrong. When an attempt is made to reproduce or replicate the original work, an equivalent result or conclusion cannot be found. This is now called the replication, or reproducibility, problem. In essence, it is now accepted that peer review, conducted by a scientific journal, is not a useful predictor of whether a research report is accurate and reliable.
- This problem is now discussed in all the major science journals, and in the national institutions of science. Despite being well accepted within the scientific fraternity, the problem is almost unknown to the general community. This may be partly because science institutions are reluctant to publicise an unreliability rate approaching 50%.
- A measure of concern about the reproducibility problem in the science community can be gauged from the reaction to one of the first major publications on the issue, a paper by John Ioannidis, a Stanford University epidemiologist. Ioannidis’ paper, with the remarkable and illuminating title ‘Why most published research findings are false’, has been cited in over 10,000 other scientific publications since 2005, making it one of the most highly cited research papers of all time.
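Ioannidis’ central argument can be illustrated with a worked calculation of the positive predictive value (PPV) of a ‘statistically significant’ finding. The minimal sketch below uses the no-bias formula from the 2005 paper; the parameter values (statistical power, significance level, prior odds) are illustrative assumptions, not figures from this submission:

```python
def ppv(prior_odds, power=0.8, alpha=0.05):
    """Positive predictive value of a 'significant' finding.

    Ioannidis (2005), no-bias case: PPV = (1-beta)R / (R - beta*R + alpha),
    where R is the prior odds that a tested relationship is true,
    (1-beta) is the statistical power and alpha is the significance level.
    """
    beta = 1 - power
    return (power * prior_odds) / (prior_odds - beta * prior_odds + alpha)

# Illustrative exploratory field: one true relationship per ten false ones
# (R = 0.1), with conventional power 0.8 and alpha 0.05.
print(round(ppv(0.1), 2))             # 0.62: nearly 4 in 10 'positives' are false

# With the lower power typical of small studies, reliability drops further.
print(round(ppv(0.1, power=0.2), 2))  # 0.29: most 'positives' are now false
```

Even under these conventional assumptions, a substantial minority of published ‘positive’ findings would be false; Ioannidis showed that bias and underpowered studies push the proportion well past half.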
- Thus, the issue is not whether there is a problem with reproducibility in science research. The issue is about the scale of the problem and what areas of human endeavour it affects.
- Irreproducible ‘science’ is pseudo-science. It is wrong.
- But what are the consequences of this faulty research, and in particular, has it influenced government policy and regulations?
Which fields of science are most affected by the reproducibility problem?
- In the case of science used by industry, faulty scientific results are usually rapidly exposed by the cold hard reality of the commercial world, because of the extensive quality assurance systems deployed in business operations. In contrast, in the public policy arena, governments rely on the traditional self-correcting nature of science, which may take far too long – possibly many decades – to expose faulty research. This fundamental difference between the approaches of governments and industry is discussed below.
Science used by industry for commercial applications
- In the commercial world, it is the responsibility of those using the research to do additional checking to ensure its veracity. It was such checking by industry, of largely publicly-funded research, that revealed the full extent of the replication/reproducibility problem in the drug research and development industries (Prinz et al., 2011).
- The biomedical sciences have been responsible for great scientific advances, including vaccinations and medical drugs. Because mistakes might be life threatening to patients, the field has been a leader in improving science quality assurance processes, and has made many reforms over the last decade. It is thus useful to consider what caused concern over the peer review process in the biomedical sciences (and other fields).
- To take a drug from a promising discovery in a research laboratory, often at a university, to a commercial product, costs around one billion dollars, and usually takes over a decade. The first step in this process is to test if the original work was accurate. Prinz et al (2011), of the German drug company Bayer, found that 75% of the scientific findings used for potential drug discovery targets were unreliable. Work was thus abandoned on these candidate drugs. This issue has come to some international prominence:
A rule of thumb among biotechnology venture-capitalists is that half of published research cannot be replicated. Even that may be optimistic. Last year researchers at one biotech firm, Amgen, found they could reproduce just 6 of 53 ‘landmark’ studies in cancer research. (The Economist, 19/10/2013).
- Other authors have reported the frequency of irreproducibility at around 50% (Hartshorne and Schachner, 2012; Vasilevsky et al, 2013). It has also been suggested that false or exaggerated findings in the literature are partly responsible for up to 85% of research funding resources being wasted (Chalmers and Glasziou, 2009; Ioannidis, 2014; Macleod et al., 2014). Despite replication studies being fundamental to establishing science reliability, such studies are rarely funded, and are not generally seen as a way of advancing a scientific career (Ioannidis, 2014).
- Concern over reproducibility is shared by editors of major journals. Marcia Angell, a former editor of the New England Journal of Medicine, stated:
It is simply no longer possible to believe much of the clinical research that is published, or to rely on the judgment of trusted physicians or authoritative medical guidelines. I take no pleasure in this conclusion, which I reached slowly and reluctantly over my two decades as an editor of The New England Journal of Medicine. (Angell, 2009).
The editor of The Lancet famously stated that
...much of the scientific literature, perhaps half, may simply be untrue. Afflicted by studies with small sample sizes, tiny effects, invalid exploratory analyses, and flagrant conflicts of interest, together with an obsession for pursuing fashionable trends of dubious importance, science has taken a turn towards darkness. (Horton, 2015).
- The financial costs of irreproducible biomedical research are significant. Freedman et al. (2015) estimated that the cumulative prevalence of irreproducible preclinical research exceeds 50%, and, in the United States alone, results in approximately US$28 billion per annum spent on research that is not reproducible.
- Engineering is an area far less affected by the reproducibility problem than the sciences. Engineers have developed rigorous quality assurance processes – indeed, there are international standards that describe a vast range of processes in great detail. One reason that engineering is generally more reliable than the sciences is that, when an engineer makes a mistake, there is the possibility of loss of life or massive financial costs.
- Consider the hypothetical case of an engineer who claimed to have invented an advanced metal alloy that would improve aircraft jet engine performance. An aircraft engine manufacturer, such as Rolls Royce, would not regard a peer-reviewed publication on the properties of the new metal alloy as sufficient evidence to use the alloy in passenger aircraft immediately. The consequences of failure are disastrous and life threatening. Instead, the engine manufacturer would subject the alloy to very stringent quality tests, under the guidance of strict regulations, and after many years the metal might ultimately be certified for general use. This process is far more stringent and reliable than peer review.
- What is also evident is that many scientists have little idea of what constitutes robust quality assurance, possibly because few have worked in an industry where there is a financial or human-safety imperative to get things right. In a recent instance involving an eminent scientist, questions about the efficacy of peer review as a quality assurance system were countered with the assertion that they were ‘just ridiculous’ and the rhetorical claim that ‘…peer review is the same process we use when studying aeronautics, which produces planes that we travel on’ (Barry, 2016). Few people would feel safe travelling in an aircraft whose only safety check had been peer review, given a fault rate approaching 50 percent.
- The crucial point is that many scientists have limited experience of quality assurance (QA) and quality control (QC) systems, and are either sometimes ignorant or in denial about the implications of the replication/reproducibility problem and the deficiencies of peer review.
- Scientists certainly have much to learn about quality assurance from engineers.
Policy-science – science used by governments for developing policy
- In many fields of public policy development – including environmental issues, agriculture, education, social policy, criminology, and public health – policy is frequently guided by scientific research. This class of science has been termed ‘policy-science’ (Larcombe and Ridd, 2018). The critical distinction between policy-science and the rest of science is its active use by government, which then has an obligation to ensure its reliability. It is not the social ‘science’ of how to make good policy; it is the specific hard science upon which a particular policy may be based. Given the demonstrated limitations of peer review, governments should foster more rigorous quality assurance systems to ensure that the science they use is likely to be solid.
- At present, much policy-science has not been subject to meaningful quality assurance protocols, and is therefore not reliable. Governments are thus often making many decisions without sound, tested scientific foundations.
Peer review is a very cursory quality assurance system
- Peer review is the primary quality assurance system used for most science results including for policy-science. The public often believe that peer review involves a long and thorough process where, perhaps, a dozen scientists work for months to check another scientist’s results. But in reality, peer review may only be a quick read of the work by a couple of anonymous referees selected by a journal editor. This is clearly not sufficient quality control upon which to base decisions worth millions, or billions, of pounds.
- The cursory nature of peer review is at least partly responsible for the reproducibility problem, and some eminent scientists are very scathing about its effectiveness. For example, an editor of The Lancet, Horton (2000), commented that:
...the system of peer review is biased, unjust, unaccountable, incomplete, easily fixed, often insulting, usually ignorant, occasionally foolish, and frequently wrong.
- The Australian Nobel Laureate, Prof Peter Doherty, while being rightly supportive of peer review as a first pass check of scientific evidence, commented that:
...it’s not hard to get almost anything published at some level in what’s broadly styled as the peer-reviewed scientific literature, especially if it is well written and gives the appearance of having been done properly. (Doherty, 2015)
- Industry accepts that it is worth investing considerable resources in quality assurance where issues of fundamental importance are in play. In contrast, peer review of most scientific articles is usually done for free by an anonymous reviewer.
- There are obvious differences between the government’s innocent trust in peer review (and other similarly cursory quality assurance processes), and the far more rigorous systems of quality control used by industry and engineering. Given the level of government spending and significance of associated policy decisions, there can be no doubt that governments need to subject policy-science to greater scrutiny than is often the case at present.
Peer review and groupthink
- The reproducibility problem is not the only major quality assurance problem that stems from peer review. Equally important is that peer review, by its very nature, can in some instances have the unintended consequence of encouraging groupthink. The history of science is full of examples of dissenting scientists being excluded by their peers. Scientific papers by dissenting scientists will often fail to pass the peer-review process – a dissenter is almost by definition not one of the peer group.
- In addition, funding applications to organisations such as UKRI are to a large extent examined by a peer group. Therefore, a scientist’s work and funding applications are continuously judged by peer groups. It is hard to think of a system more likely to generate groupthink.
- In this submission, we certainly do not doubt that peer review is a worthy system that has stood the test of time. However, it is also beyond doubt that peer review has major faults, and unintended consequences that must be urgently addressed.
The solution: an extra layer of science quality control for policy-science
- The question must be asked: what can be done to improve the reliability and accuracy of science used for public policy decisions, and to reduce the chance of groupthink taking hold?
- The terms of reference for submissions for this inquiry ask for comments about the role of UK Research and Innovation (UKRI) in solving the reproducibility problem. The UKRI could certainly play a useful role in detecting irreproducible science evidence, but it is not an organisation that is well suited to detecting groupthink, or deciding which scientific evidence that is used to inform public policy should be subjected to reproducibility tests (see below).
- One possible way to improve the quality of evidence used to underpin government policy would be the establishment of an Office of Science Review (OSR). The function of the OSR would be to organise checking, testing, replication and general auditing of science upon which major government policy decisions were based. It is proposed that:
- The OSR would be established under the Auditor General’s organisation, the National Audit Office, NOT UKRI (see section below); and that
- UKRI be required to set aside a small percentage of its funds, perhaps five percent, for checking policy-science. The OSR could advise, but not direct, the UKRI on particular scientific questions that might receive preferential funding. The UKRI might also set aside funds for replication studies in other, more general, fields of science, similar to the system set up by The Royal Netherlands Academy of Arts and Sciences (KNAW, 2018).
- The precise procedures used by the OSR could take many different forms. However, there must be independence, openness, and transparency in all aspects. It must be adversarial. Its aim must be expressly to find things that are wrong, like a defence lawyer in a trial, or a financial auditor.
- Depending upon specific cases, auditing is likely to involve questioning scientists, commissioning replication of previous work, reanalysing data, checking experimental design and results, and ensuring alternative interpretations are thoroughly considered. The audit of a published scientific work would involve checking that:
- prior works described in the scientific paper were correctly referenced and paraphrased;
- appropriate study design and statistical methods had been employed;
- all sources of error were accounted for and considered in the interpretation of results;
- confounding influences and alternative interpretations were considered in the paper’s discussion;
- all data, and computer code, had been released;
- the results had been independently replicated, preferably on many occasions.
- If the work had not been independently replicated, the OSR would, where possible, commission a replication study. This is likely to be one of the OSR’s main functions, and also its costliest.
- The role of the OSR would not be to adjudicate or make any recommendations on policy. It would simply provide the results of its quality checks. Decisions must ultimately be taken by the relevant government, with appropriate advice from departments and other authorities in the field.
- The main government organisation that understands audits, and the importance of independence, is the National Audit Office (NAO). Obviously, the NAO currently lacks the scientific skills to do this job, but it understands the principles of auditing, in particular how to stay independent of the auditee.
- It would be impossible for an OSR to have the scientific personnel to cover all aspects of policy-science, which embraces environment, agriculture, education, criminology, public health and many other areas. Subcontracting specific quality audits to other scientists in universities and consulting companies with relevant scientific backgrounds would be essential.
- An OSR would not find all problems; its purpose is to create a high likelihood that defective evidence will be exposed, and thereby a powerful deterrent effect. If research used for policy were shown to be faulty, and a demonstrable lack of due diligence or systematic quality systems were revealed, there would be considerable incentive to prevent a repetition. It is notable that financial auditing rarely finds financial malpractice, largely because the certainty that an audit will occur is itself a very powerful deterrent. At present, errors in scientific research pose little risk to the reputation of scientists or their organisations.
Establishing the OSR to ensure independence
- The OSR must be established with a structure that makes it unlikely to be ‘captured’ by the scientists and science organisations it will audit. A risk would be that audits of important work would not be carried out, or would be superficial. Research from some of the major science institutions would be audited, and an adverse audit could have major reputational, and perhaps funding, implications. There may therefore be a risk that undue pressure could be applied to the OSR by such organisations. For this reason, the OSR must reside in the National Audit Office, not the bureaucratic structures overseeing science in general. The OSR must not reside within UKRI.
- The system of peer review has a tendency to favour groupthink, and one of the roles of the OSR is, effectively, to test if unquestioned groupthink has become established. Those with dissenting ideas are, by the definition of peer review, excluded by the peer group. This may be entirely appropriate if the dissenter is a crank, but the history of science has shown that many great advances have occurred by scientists who were initially excluded by the peer-group. The OSR must remain aloof from the peer group, and work it commissions must not be executed by the peer group.
- In order to stop the OSR being captured or diverted by an unreasonable peer group, the personnel in charge of the OSR need to be selected carefully. Ideally, they need to have an impeccable scientific background but, at some stage in their career, have experienced the negative aspects of the peer group and groupthink. Examples are the Australian Nobel Laureates Barry Marshall and Robin Warren, who demonstrated that stomach ulcers were caused by bacteria. Their work was originally ridiculed by major medical authorities and rated in the bottom ten percent by peer reviewers. Fortunately, Marshall and Warren’s tenacity ensured that the peer group ultimately recognised their mistake.
- Although the example of Marshall and Warren is very famous, many scientists have at some stage in their career experienced the negative influences of the peer group and groupthink, only to be ultimately vindicated. Such scientists, who have had first-hand experience of the inherent problems with peer review, but are well recognised scientists of high quality, are ideal candidates for involvement with the OSR.
Utilising industry expertise in total quality management systems
- As mentioned previously, engineering is far ahead of much of the scientific fraternity in developing and implementing quality assurance systems. Quality products can only be produced when a well-thought-out quality management system is implemented. The Japanese are often credited with introducing such systems on a large scale in the 1960s. Basing their quality systems on the pioneering work of the American W. Edwards Deming, the Japanese car and electronics industries led the world for decades, with companies in other countries scrambling to imitate them. The British car industry was one of the victims of the Japanese ability to use quality assurance systems.
- A modern factory is, with regards to its quality systems, unrecognisable compared with a 1960s’ equivalent. However, in many of the disciplines of science, the quality systems are little changed from the 1920s when peer review became common for scientific journals. The fact that we can do better today does not mean everything from the past is faulty. In the same way as there are many excellent examples of 1960s cars, it is beyond dispute that much of the science evidence governments have used over the last 50 years has been of great benefit to society. Equally true is that there is room to improve scientific quality systems and thus also improve policy-science.
- Today, so-called ‘Total Quality Management’ systems are ubiquitous in any major industrial operation, and in the service sector of the economy, such as banks and insurance. A large number of companies specialise in implementing and troubleshooting such systems. Personnel with experience in these companies could give invaluable assistance to the OSR.
- Like the Japanese in the 1960s, the UK can lead the world in quality systems for policy-science. The OSR, and requiring the UKRI to set aside funds to undertake reproducibility studies, would be a step in this direction.
Angell, M. (2009). Drug companies and doctors: a story of corruption. The New York Review of Books, 15 January 2009. http://www.nybooks.com/articles/archives/2009/jan/15/drug-companies-doctorsa-story-of-corruption/?pagination=false
Barry, P. (2016). Muddying the waters on the Great Barrier Reef. ABC Media Watch. http://www.abc.net.au/mediawatch/episodes/muddying-the-waters-on-the-great-barrier-reef/9972936 [Accessed 13 May 2020].
Chalmers, I. and Glasziou, P. (2009). Avoidable waste in the production and reporting of research evidence. Lancet, 374: 86–89.
Doherty, P. (2015). The Knowledge Wars. Melbourne University Press.
Freedman, L.P., Cockburn, I.M. and Simcoe, T.S. (2015). The economics of reproducibility in preclinical research. PLoS Biology, 13(6): e1002165. doi:10.1371/journal.pbio.1002165
Hartshorne, J.K. and Schachner, A. (2012). Tracking replicability as a method of post-publication open evaluation. Frontiers in Computational Neuroscience, 6: 1–13. doi:10.3389/fncom.2012.00001
Horton, R. (2000). Genetically modified food: consternation, confusion, and crack-up. Medical Journal of Australia, 172: 148–149.
Horton, R. (2015). Offline: What is medicine’s 5 sigma? Lancet, 385: 1380.
Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Medicine, 2(8): e124. doi:10.1371/journal.pmed.0020124
Ioannidis, J.P.A. (2014). How to make more published research true. PLoS Medicine, 11(10): e1001747. doi:10.1371/journal.pmed.1001747
KNAW (2018). Replication Studies: Improving Reproducibility in the Empirical Sciences. Amsterdam: KNAW.
Larcombe, P. and Ridd, P. (2018). The need for a formalised system of quality control for environmental policy-science. Marine Pollution Bulletin, 126: 449–461.
Macleod, M.R., Michie, S., Roberts, I., Dirnagl, U., Chalmers, I., et al. (2014). Biomedical research: increasing value, reducing waste. Lancet, 383: 101–104.
Prinz, F., Schlange, T. and Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery, 10: 712. doi:10.1038/nrd3439-c1
Vasilevsky, N.A., Brush, M.H., Paddock, H., Ponting, L., Tripathy, S.J., et al. (2013). On the reproducibility of science: unique identification of research resources in the biomedical literature. PeerJ, 1: e148. doi:10.7717/peerj.148