Written evidence submitted by Mike Hearn (C190121)
Summary of methodological issues in epidemiology
Abstract. Problematic practices within epidemiology are presented, along with suggestions for improvement.
Lack of public review. The Imperial College London Report 9 paper that largely drove UK public policy contained internally inconsistent/non-replicable numbers1, did not use what were arguably the best datasets then available, which indicated a 40% lower fatality rate2, and relied on unpublished model code that only its author understood3. These problems were caught after the work had already altered government policy. Whilst many researchers have embraced open access, preprints and public code/data, these practices are not a requirement for research relied on by the civil service. When external review from outside the field did occur, it was rejected with the justification that cross-discipline review is inherently illegitimate4.
Poor characterisation of statistical uncertainty. Policy was driven by modelling that used insufficiently large data sets to derive critical inputs5, and whose uncertainty bounds were either not reported at all6 or were extremely wide7. Uncertainty ranges were sometimes widened post-publication: for example, days after the release of ICL Report 9 its lead author revised his prediction to "could be 20,000 deaths or much lower"8, rendering the prediction unfalsifiable in one direction and adding a wide uncertainty bound after the fact.
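The effect of sample size on a critical input can be illustrated with a short calculation. The case counts and the 1% rate below are hypothetical, not figures from the cited papers; the point is only that a fatality rate estimated from a few hundred cases carries a confidence interval spanning several-fold differences, and it narrows only as observations accumulate.

    import math

    def fatality_rate_ci(deaths: int, cases: int, z: float = 1.96):
        """Normal-approximation 95% confidence interval for a proportion."""
        p = deaths / cases
        half_width = z * math.sqrt(p * (1 - p) / cases)
        return max(p - half_width, 0.0), p + half_width

    for cases in (100, 1_000, 100_000):
        deaths = round(cases * 0.01)   # assume a 1% observed fatality rate
        low, high = fatality_rate_ci(deaths, cases)
        print(f"{cases:>7} cases: estimate 1.0%, 95% CI {low:.2%} to {high:.2%}")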
Non-existent or circular model validation. Validation of epidemiological models is rare. Some scientists have argued that few healthcare models can ever be validated against reality, yet that such models should still be used to make decisions9. The COVID model produced by Imperial College London is derived from a flu model first published in 200510. Despite many outbreaks of seasonal influenza having occurred since then, no evidence was provided in Report 9 or its citations showing that the model accurately predicts epidemics. Models are frequently considered validated if their predictions match the results of other models11,12 rather than the actual course of an epidemic. This is invalid because testing model predictions only against other model predictions is circular reasoning.
Research papers may presuppose their own conclusions. For example, Nature published a modelling paper from ICL (Flaxman et al) which claimed lockdowns had saved 3.1 million lives13. In fact it used circular logic by pre-allocating all the reductions in R to government interventions (NPIs)14 and encoded the output conclusion in the input parameters via statistical forcing and parameter choice15. A related issue is how the reliability of COVID PCR testing is determined by calibrating the test against itself16. Peer review appears to prevent these kinds of problems only rarely.
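The circularity can be seen in a deliberately crude toy sketch (this is not the Flaxman et al model, and every number below is invented). If a model attributes the entire fall in R to intervention, then its "no intervention" counterfactual must hold R at its initial value, and an enormous "lives saved" figure follows from that modelling choice rather than from the data:

    # Toy generation-by-generation epidemic; all parameters are invented.
    def simulated_deaths(r_values, initial_infections=100.0, ifr=0.01,
                         population=60_000_000):
        susceptible = float(population)
        infections = initial_infections
        total_infected = 0.0
        for r in r_values:
            # Each generation's infections scale with R and the remaining
            # susceptible fraction, capped by the susceptible pool.
            infections = min(infections * r * (susceptible / population), susceptible)
            susceptible -= infections
            total_infected += infections
        return total_infected * ifr

    generations = 30
    # What is actually observed: R falls from 3.0 to 0.6 over ten generations,
    # for whatever reason (voluntary behaviour change, seasonality, NPIs or
    # anything else).
    observed_r = [max(3.0 - 0.24 * g, 0.6) for g in range(generations)]
    # Counterfactual implied by attributing the whole fall in R to intervention:
    # R stays at 3.0 throughout.
    no_intervention_r = [3.0] * generations

    saved = simulated_deaths(no_intervention_r) - simulated_deaths(observed_r)
    print(f"'Lives saved' implied purely by the attribution assumption: {saved:,.0f}")

The large figure printed comes entirely from the decision to hold R constant in the counterfactual; the sketch contains no information about what actually caused R to fall.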
Particularly concerning is the use in some papers of subjective Bayesian priors, which encode the scientist's pre-existing intuitive beliefs about the likelihood of certain answers as inputs. Because the output of the research is itself the evidence used to update those intuitive beliefs, this is another form of circular reasoning.
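A minimal sketch of how this plays out, with invented numbers: when data are sparse, the posterior estimate largely reflects whatever the scientist believed to begin with. Here a Beta prior on a rate is combined with 2 events observed in 20 trials; two scientists with different priors reach very different conclusions from the same data.

    # Conjugate Beta-Binomial update: a Beta(a, b) prior plus k events in n
    # trials gives a Beta(a + k, b + n - k) posterior; its mean is shown below.
    def posterior_mean(prior_a, prior_b, events, trials):
        return (prior_a + events) / (prior_a + prior_b + trials)

    data = (2, 20)               # 2 events in 20 trials: observed rate 10%
    sceptical_prior = (1, 99)    # prior belief that the rate is around 1%
    convinced_prior = (50, 50)   # prior belief that the rate is around 50%

    print("Sceptical prior -> posterior mean:",
          round(posterior_mean(*sceptical_prior, *data), 3))   # ~0.025
    print("Convinced prior -> posterior mean:",
          round(posterior_mean(*convinced_prior, *data), 3))   # ~0.433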
No code quality processes. Standard epidemiological practice is to peer review the intended assumptions and conclusions of a model, but not the implementation17. There are no academic processes that recognise the possibility of implementation error. Despite 15 years of continuous development, the code behind Report 9 was only made public in 2020, after public pressure and FOIA requests. Once public review became possible, bugs that affected the predictions were found18: for example, predictions depended on arbitrary factors such as what kind of computer was used to run the model19, the code contained data corruption bugs20,21,22, and predictions of bed demand changed between versions by more than the size of the Nightingale emergency hospital deployment18. No standard regression test system was in place. Although professional software engineers were brought in to work on the code, this occurred only after it had already altered government policy. The British Computer Society criticised the lack of code quality processes in academic modelling23.
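For comparison, the kind of regression test that was missing is short to write. The sketch below does not refer to anything in the actual COVID-Sim codebase: the model callable and the file names are placeholders. It runs the model with a fixed seed and checks the output against a stored baseline, so that any change in predictions between code versions is flagged rather than discovered after publication.

    import json

    def check_predictions_unchanged(run_model):
        """run_model is whatever callable produces the model's predictions; the
        file names below are placeholders for illustration only."""
        predictions = run_model(seed=42, parameter_file="baseline_params.json")
        with open("baseline_predictions.json") as f:
            baseline = json.load(f)
        for day, expected in enumerate(baseline["deaths_per_day"]):
            assert abs(predictions["deaths_per_day"][day] - expected) < 1e-9, (
                f"Prediction for day {day} changed from the stored baseline"
            )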
Misleading press statements. In their paper Flaxman et al stated that the claim of 3.1 million lives saved was "illustrative only", and that "in reality even in the absence of government interventions we would expect Rt to decrease and therefore [we] would overestimate deaths in the no-intervention model"13. But to the press Flaxman said, "Lockdown averted millions of deaths, those deaths would have been a tragedy"24. After software engineers raised concerns that the ICL COVID-Sim model did not produce the same predictions when run repeatedly, ICL published a press release25 in which a third party researcher stated "I was able to reproduce the results… from Report 9". Nature claimed "it dispels some misapprehensions about the code, and shows that others can repeat the original findings"26. Models generate predictions, not findings. In fact every prediction he obtained from the model was different, three of them showing "significant differences" of 10-25%27. The press release also stated that Report 9 was built "on code originally developed, published and peer-reviewed in 2005 and 2006", although the code had never been published or externally reviewed until 2020.
Excessive freedom in choosing input data. Researchers may freely select data and add assumptions without regard to quality. The Lancet published a modelling paper in August28 that used fatality rate data gathered in January29 (as did a paper modelling the impacts of contact tracing11), although observed CFRs at that time ranged between 2.8% (higher than the Spanish Flu) and 0.18%30. It has been known since 2012 that it can take several months of observation for fatality ratios to become accurate enough to be usable5. More recent data would have lowered predicted deaths significantly. The Lancet paper also claimed "the data are sparse" using a citation from March, although a month before publication, in July, a literature review by doctors had stated the opposite31. The ICL COVID-Sim model has over 200 user-specifiable parameters, many of which appear to be guesses32. As an example, it assumed that individuals hardly vary in their chances of catching COVID; the projected number of infections is far lower if the assumption is modified to allow non-uniform susceptibility33.
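Because predicted deaths scale almost linearly with the assumed fatality rate, the choice between the figures above changes the headline result by more than an order of magnitude. A back-of-envelope sketch, using an illustrative population and attack rate rather than figures from any cited paper:

    population = 60_000_000
    attack_rate = 0.60   # assumed fraction of the population eventually infected

    # The two fatality rates are the extremes quoted above; everything else is
    # an illustrative placeholder.
    for fatality_rate in (0.028, 0.0018):
        deaths = population * attack_rate * fatality_rate
        print(f"Assumed fatality rate {fatality_rate:.2%}: about {deaths:,.0f} deaths")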
Lack of cost/benefit analysis. The quality adjusted life year (QALY) is a standard metric used for analysis of healthcare interventions in the NHS34. NICE suggests a limit of about £20,000 - £30,000 spent per QALY gained35. However, QALY analysis in academic output is rare: none of the papers discussed in this report uses it. Although non-pharmaceutical interventions were a topic of the original ICL paper from 200510, modelling efforts then and since appear uninterested in the question of whether such interventions are cost effective. Nor are the physical and mental health losses caused by NPIs accounted for. One paper with "Modelling the health and economic impacts of … strategies for COVID-19" in the title declined to perform a cost/benefit analysis, on the grounds that the idea of a tradeoff between GDP and health outcomes would be contested36. Yet cost/benefit analysis is routine for pharmaceutical interventions and is especially critical for COVID-19 due to the high rate of comorbidities, the high average age of the victims and the high cost of lockdowns.
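The calculation the papers omit is not complicated. A sketch of the standard comparison against the NICE threshold, in which every figure is an arbitrary placeholder rather than an estimate of the cost or benefit of any real intervention:

    NICE_THRESHOLD_PER_QALY = 30_000       # upper end of the £20,000 - £30,000 range

    intervention_cost_gbp = 1_000_000_000  # placeholder total cost of some intervention
    qalys_gained = 40_000                  # placeholder QALYs gained by it

    cost_per_qaly = intervention_cost_gbp / qalys_gained
    verdict = "within" if cost_per_qaly <= NICE_THRESHOLD_PER_QALY else "above"
    print(f"Cost per QALY: £{cost_per_qaly:,.0f} ({verdict} the NICE threshold)")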
Silencing of disagreement. A model that calculated lower herd immunity thresholds (i.e. a quicker end to the epidemic) was rejected for publication because if people felt less at risk, government intervention might be reduced37. The journal Science considered rejecting a similar paper for similar reasons38. Journals have refused to publish a large-scale field study of whether masks are effective39; the author said it would be published "as soon as a journal is brave enough"40. A Nobel prize winner in biophysics was barred from speaking at an academic conference due to his anti-lockdown views41. A member of SAGE obtained pre-agreement from BBC Radio 4 that a debate between her and an opposing epidemiologist would be rigged42. A professor of epidemiology at Stanford had a paper rejected on the basis that "no infectious disease expert thinks this way"43.
Although this paper focuses on epidemiology, questionable research practices are widespread across many academic fields which inform public policy44. The following suggestions are therefore neutral with respect to field of study:
1. Before research is presented to ministers or the civil service it should be pre-vetted by a new Office of Research Integrity, which:
a. Seeks out disagreement both within and outside the academic community. Commissions Tenth Man45 / red team reports from those people so they can make their case directly to the government.
b. Is trained to critically review research papers under time pressure, using in-house statistical expertise. Papers found to use obsolete data, logical fallacies, questionable causal and/or statistical models, or insufficiently supported or biased assumptions should not be approved for use.
c. Requires evidence of model validation against reality. Validation studies should be performed by a third party outside the domain being validated (i.e. researchers in a field would not be allowed to validate, for government use, research produced by researchers in that same field).
d. Has the power to disbar researchers from being on projects that receive public money in case of detected research fraud.
2. Code quality controls:
a. Publishing anything about a model requires publishing, at the same time, all code and data utilised, with a clear explanation of all assumptions made. Exceptions may be made for datasets licensed from commercial organisations (universities may not sub-license data they collected themselves in order to circumvent this requirement).
b. Pre-registration of modelling efforts prior to publication, in which commitments to software engineering practices are made, e.g.
i. Minimum levels of unit test coverage (recommendation: >= 80%)
ii. Internal peer review of code changes
iii. Use of memory safe languages
c. Hiring or contracting of qualified software engineers to implement or review model code. Where engineers are hired to review, their comments and the consequent changes must be published alongside the code itself.
3. All modelling used to argue for or against specific policies must demonstrate rigorous cost/benefit analysis, backed by data collected outside the domain being studied (i.e. researchers in a field may not provide their own de novo figures for costs or benefits).
4. Prediction markets have proven successful at anticipating which papers will replicate. Similar markets may prove beneficial for estimating the accuracy of forecasts. The field of superforecasting may also have insights to contribute.
The author is indebted to Nicholas Lewis and Harrison Comfort for their careful review and analysis.
Mike Hearn has been programming computers since 1990. Between 2006 and 2014 he worked at Google as a senior software engineer on Maps, Gmail and account security. Since then he has been developing database and encryption technology, primarily for the finance and trade/shipping sectors. He has no connection with academia or the field of epidemiology.