Alan Turing Institute – Written evidence (LSI0080)
The Alan Turing Institute makes this submission as part of the inquiry lodged by the House of Lords Select Committee on Science and Technology. It has been prepared for the Committee by Kirstie Whitaker (Turing Research Fellow), with contributions from Adria Gascon (Turing Research Fellow) and in consultation with Helena Quinn (Policy Officer).
The Alan Turing Institute is the UK’s national institute for data science. Five founding universities – Cambridge, Edinburgh, Oxford, UCL and Warwick – and the UK Engineering and Physical Sciences Research Council created The Alan Turing Institute in 2015. Our goals are to undertake world-class research in data science, to apply our research to real-world problems in order to drive economic impact and societal good, to lead the training of a new generation of data scientists, and to shape the public conversation around data.
This response addresses questions 2 and 12 of the Call for Evidence, focusing on open data as a means to stimulate innovation in the life sciences sector, and the use of privacy-preserving analyses and harmonisation of metadata standards to improve collaboration between researchers and the NHS.
2. Why has the UK underperformed in turning basic research in the life sciences into intellectual property? What needs to be done to address this historic weakness in the UK and grow new companies to commercialise new research and related technologies in the life sciences?
In this question we focus on open data as an enabler of innovation and the conditions for commercialising new research and related technologies in the life sciences.
- In order for the UK to better innovate and, in turn, commercialise research from the life sciences sector, research outputs need to be more openly shared. This will stimulate the creation of faster and more efficient applications.
Open Data
- The UK Government is already committed to transparency and open data: there are 42,286 non-sensitive datasets available on data.gov.uk at the time of writing. We know that available data creates jobs (Capgemini, 2015), adds value to products and services across many sectors (Manyika et al, 2013) and benefits the users of data-rich companies (Stott, 2014). We are delighted that Research Councils UK (RCUK) have stated their commitment to sharing data generated by UK taxpayer-funded research. They have developed and signed the Concordat on Open Research Data along with HEFCE, Universities UK, Wellcome Trust, The Natural History Museum, Cancer Research UK, Sheffield Hallam University, Scottish Funding Council and The Higher Education Funding Council for Wales (HEFCW).
- The first principle of the Concordat on Open Research Data is that “Open access to research data is an enabler of high quality research, a facilitator of innovation and safeguards good research practice.” We recommend that the UK government provides the necessary investment to make sure that data from all taxpayer-funded and philanthropically funded research is made available under an open licence, particularly one that permits commercial use. Nationwide adoption of Creative Commons licences for datasets, rather than the creation of new guidance for the reuse of data, permits national and international researchers in academic, not-for-profit or commercial positions to use the data to stimulate innovative work while also clearly crediting the creators of the dataset.
- Transport for London (TfL) openly shares much of the data it collects and generates. The Citymapper smartphone app is one of the best-known open data success stories.[1] A recent report, commissioned by TfL, estimated the value of the time saved by passengers due to better access to information at between £15m and £58m in 2012.[2] Nesta and the Open Data Institute published an analysis of their Open Data Challenge Series. They predict that open data will provide a tenfold return on investment, generating up to £10.8 million for the UK economy.[3] In a separate analysis, McKinsey reports potential benefits of open data amounting to $3 trillion annually across seven sectors.[4] Public sector open data alone is estimated to return 0.5% of an economy’s GDP in value to the user.
- Data sharing improves reproducibility, meaning that it is easier to rely on published research if data is provided alongside it. In 2015 the Open Science Collaboration published their attempts to get the same results when repeating famous psychological studies. They found that only one third of the 100 results could be reproduced by independent researchers.[5] A similar project looking at 50 experiments in cancer biology is underway.[6] So far, five results have been published, and of those only two successfully gave the same results as the original studies. The life sciences within academia and industry will benefit from the increased faith in published results that sharing data provides. Open data can lower the cost of research and increase the efficiency of the life sciences as a whole to make world-leading breakthroughs.[7]
- Beyond repeating a previous study, open data permits new insights that can only be achieved when information from diverse sources is brought together. These insights could take the form of software tools or novel research questions. Given the UK’s strong position as a world leader in sharing government data, we recommend adding the wealth of knowledge in the academic life sciences to this collection, which will benefit the UK’s economy.
Open Access Publications
- Not only is it necessary for data and software generated by research into the life sciences to be reused, it is also imperative that the scholarly articles that are traditionally produced at the end of a project are openly available for all to read. Access to this knowledge allows businesses large and small to advance the “entrepreneurial state” (Mazzucato, 2011)[8] and stimulate innovation across the life sciences and beyond.[9]
- Making research outputs openly available has been associated with increased return on financial investment[10], along with new data science projects linking academic research to industry.[11] Globally, UK cancer research has gained 5.9 million quality adjusted life years (QALYs) and saved £124 billion as a result of openly available research, corresponding to an 8-fold return on investment.[12] In the area of environmental impact assessments, Vickery (2011)[13] has shown that open access (OA) to R&D results could yield recurring gains of around €6 billion per year.[14]
12. How can collaboration between researchers and the NHS be improved, particularly in light of increased fiscal pressures in the NHS? Will the NHS England research plan help in this regard? How can the ability of the NHS to contribute to the development of and adopting new technology be improved?
- Having a national health service means that UK researchers across the life sciences have significant potential to do impactful analyses that can feed back to benefit the NHS directly, and also stimulate the wider life sciences sector. These analyses are truly interdisciplinary and have the potential to invigorate the life sciences industry for academic and basic science, along with the invention of new technology and pharmaceuticals that will improve the treatment of people in the UK.
- For example, The Alan Turing Institute has an on-going project with the Cystic Fibrosis Trust to use machine learning techniques on UK Cystic Fibrosis Registry data, which may help to create a method of generating personalised risk scores for people with cystic fibrosis. These scores can then be used by people with cystic fibrosis and their clinical teams to tailor treatments to effectively manage the condition. Intelligent risk adjustment methods will also support clinical teams to monitor and improve the clinical care they provide.
- We have also recently run a week-long collaborative and non-competitive hackathon-style event on health, with 80 data scientists working on partners’ real-world problems and datasets. Projects included:
- NHS Scotland’s Information Services Division, looking at the methodology underlying the risk scores used to calculate patient hospital admission data;
- Queen’s Hospital A&E, assessing the severity in A&E patients;
- The Centre for Cancer Prevention, Queen Mary University of London with Cancer Research UK, investigating whether machine learning and computational statistics algorithms can be used to extract features from mammograms that are predictive of future breast cancer.
- We will focus on two key areas of data science that can improve collaboration between researchers and the NHS: privacy-preserving analyses and the development of standards for metadata.
Privacy-preserving analyses
- Privacy-preserving data analyses are those that allow researchers to extract population level understanding from a dataset without identifying any of the individuals represented in it.[15]
- This is particularly relevant in the medical domain. It is imperative that all people who use the NHS retain their right to privacy for all information that relates to their health and wellbeing. In most cases, the challenge faced by researchers studying medical data is not technological but ethical.
- Currently, health research is conducted by researchers who have signed lengthy non-disclosure agreements (NDAs). While these legal documents establish trust between NHS staff and academic scientists, they are not robust to security breaches or human error. More importantly, they are time-consuming and difficult to negotiate. This process cannot scale to harness the potential of the nation’s life sciences sector.
- The NHS England Research Plan does not address privacy-preserving analyses such as differential privacy, and therefore misses out on the potential of data science undertaken both in research settings and by companies in the life sciences sector. For example, the 100,000 Genomes Project has no plan in place for sharing its data whilst protecting the privacy of those involved. In addition, the current bureaucratic processes place an undue burden on NHS and departmental finances.
- One of the most powerful analyses that could provide valuable evidence for public policy would be to join together data available from two or more branches of the NHS, or to merge data from the health service with other information, such as financial wellbeing.
- A technically trivial task like joining two datasets together can be incredibly challenging in the presence of privacy concerns. Using privacy-enhancing technologies to enable such joint analyses is one of the goals of research in privacy-preserving data analysis. Several practical solutions already exist in this space, and we recommend that their utility be investigated further.
- The data available to researchers from the NHS runs to many terabytes. A further challenge faced by researchers wanting to ensure the privacy of their participants is the need to securely outsource large computations. Cloud computing has decreased the cost of computation and data storage, but is not currently fit for purpose for privacy-preserving analyses. We recommend further investment in the cryptographic techniques that can enable these analyses.
- The overarching goal of privacy-preserving data analysis involves several challenging subgoals, ranging from finding an effective, mathematically robust definition of privacy to deploying the analysis in a secure way. Hence, privacy-preserving data analysis is a truly interdisciplinary effort involving areas like cryptography, statistics, machine learning, systems/hardware security, and formal methods. Investment in each of these areas has the potential to transform the life sciences sector in the UK.
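The report cited in footnote 15 discusses differential privacy as one approach to the population-level analyses described in this section. The Python sketch below is a minimal illustration only, not a recommended implementation. It applies the Laplace mechanism to a counting query, assuming each person contributes at most one record, so that adding or removing one person changes the count by at most 1. The function names, the toy records and the value of epsilon are hypothetical, and a real deployment would need a vetted library and a cryptographically secure noise source rather than Python's `random` module.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Assumes each person appears in at most one record, so the count has
    sensitivity 1 and the Laplace noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Toy, made-up records: ages 0 to 99, one record per person.
toy_records = [{"age": age} for age in range(100)]
noisy = private_count(toy_records, lambda r: r["age"] >= 50, epsilon=1.0)
print(f"Noisy count of people aged 50 or over: {noisy:.1f}")
```

Smaller values of epsilon add more noise and give stronger privacy. Each query released in this way consumes part of an overall privacy budget, which this sketch does not track.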
Data standards
- In order to merge two datasets it is important that the metadata from both sets can be understood and matched up. Metadata are data that describe other data. For example, they explain how, when, where and why the data were collected, along with a description of how the information is organised.
- Metadata standards are rules for setting up a database and conventions for entering data. They ensure that metadata are consistent across different datasets. For example, a standard might specify fields for patient first names and last names, the date on which they visited their doctor and what treatment was recommended or carried out.
- One definition of data science is the analysis of data for a use other than the one for which it was originally collected. These are sometimes known as secondary analyses, and are hugely beneficial to researchers across the life sciences sector.
- We recommend investment in harmonising the metadata across different branches of the National Health Service. Not only will these datasets be easier to match up for researchers who have access to the individual patient information, but these standards will also benefit physicians and administrators within the health service itself.
- For example, the NHS England Research Plan has a commitment to contribute to the design of the NHS Choices website and the UK Clinical Trials Gateway as a linked service to improve public access. This linking can only be achieved with key metadata standards.
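As a rough illustration of what harmonised metadata makes possible, the Python sketch below maps records from two sources onto one shared set of fields, using the example fields mentioned above (first name, last name, visit date and treatment). Everything specific in it is an assumption made for illustration: the standard field names, the local column names, the date formats and the sample record are all invented and do not describe any real NHS system.

```python
from datetime import date, datetime

# Hypothetical shared standard: field name -> expected Python type.
STANDARD_FIELDS = {
    "first_name": str,
    "last_name": str,
    "visit_date": date,
    "treatment": str,
}

# Hypothetical mappings from each source's local column names to the standard.
GP_MAPPING = {"forename": "first_name", "surname": "last_name",
              "seen_on": "visit_date", "rx": "treatment"}
HOSPITAL_MAPPING = {"GivenName": "first_name", "FamilyName": "last_name",
                    "VisitDate": "visit_date", "Treatment": "treatment"}


def harmonise(record, mapping, date_format):
    """Rename a record's fields to the shared standard and parse its dates."""
    out = {}
    for local_name, value in record.items():
        standard_name = mapping.get(local_name)
        if standard_name is None:
            continue  # ignore fields that are outside the shared standard
        if STANDARD_FIELDS[standard_name] is date:
            value = datetime.strptime(value, date_format).date()
        out[standard_name] = value
    missing = set(STANDARD_FIELDS) - set(out)
    if missing:
        raise ValueError(f"missing standard fields: {sorted(missing)}")
    return out


# Made-up sample records describing the same visit in two local formats.
gp_record = {"forename": "Ada", "surname": "Example",
             "seen_on": "15/09/2017", "rx": "physiotherapy"}
hospital_record = {"GivenName": "Ada", "FamilyName": "Example",
                   "VisitDate": "2017-09-15", "Treatment": "physiotherapy"}

gp_std = harmonise(gp_record, GP_MAPPING, "%d/%m/%Y")
hospital_std = harmonise(hospital_record, HOSPITAL_MAPPING, "%Y-%m-%d")
print(gp_std == hospital_std)
```

Once both sources follow the same standard, records can be compared or linked field by field without bespoke translation for each pair of datasets. Linking on names as shown here is for illustration only and would raise exactly the privacy concerns discussed in the previous section.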
15 September 2017
[1] See https://theodi.org/news/citymapper-government-open-data-improve-cities
[2] See http://odimpact.org/case-united-kingdoms-transport-for-london.html
[3] See https://www.pwc.co.uk/assets/pdf/nesta-and-the-open-data-institute-pwc-report-october-2015.pdf
[4] See http://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/open-data-unlocking-innovation-and-performance-with-liquid-information
[5] See http://science.sciencemag.org/content/349/6251/aac4716
[6] See https://elifesciences.org/collections/9b1e83d1/reproducibility-project-cancer-biology
[7] See van Assen et al., 2014; https://dx.doi.org/10.1371/journal.pone.0084896
[8] See http://www.publicaffairsbooks.com/book/the-entrepreneurial-state/9781610396134
[9] See www.researchinfonet.org/publish/finch/
[10] Beagrie & Houghton, 2014 (http://repository.jisc.ac.uk/id/eprint/5568)
[11] For examples, see The Alan Turing Institute’s website: https://www.turing.ac.uk/category/research/projects/
[12] Glover et al, 2014 (https://dx.doi.org/10.1186/1741-7015-12-99)
[13] See https://ec.europa.eu/digital-single-market/en/news/review-recent-studies-psi-reuse-and-related-market-developments
[14] For further details see Tennant et al, 2016; (http://dx.doi.org/10.12688/f1000research.8460.3)
[15] See the report from the United States Commission on Evidence-Based Policymaking, which discusses differential privacy and multi-party computation: https://www.cep.gov/content/dam/cep/report/cep-final-report.pdf