Genomic medicine requires data infrastructure

Twenty years on from the sequencing of the first human genome, the price and speed of sequencing an individual’s genome are compatible with use in healthcare. However, our ability to interpret a genome for clinical care is still very limited. For this reason, the 100,000 Genomes Project focused on NHS patients with existing rare disease or cancer, where interpretation is possible.

The NHS’s clinical geneticists have been testing individual genes of patients to find the cause of rare diseases for more than 20 years. Sequencing the whole genome has the potential to simplify the search and increase the number of patients receiving a diagnosis, but there are challenges. It requires substantial data storage and compute infrastructure. For example, the data for just a few whole genome sequences would fill a modern laptop disk drive. The data also requires substantial pre-processing to identify the initial list of millions of differences in an individual’s genome, even before the interpretation process that identifies which difference is most likely to cause the rare disease.

Hence, a substantial component of enabling routine use of whole genomes in the NHS has been the assembly of a data infrastructure capable of processing and storing hundreds of new genomes per day.
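To give a feel for the scale involved, the following back-of-envelope sketch turns "hundreds of genomes per day" into storage volumes. The per-genome size, laptop drive size and daily throughput below are illustrative assumptions, not figures from Genomics England or the NHS; real values depend on sequencing depth, file formats and compression.

```python
# All three constants are hypothetical values chosen for illustration only.
GB_PER_GENOME = 150      # assumed raw + processed data per whole genome, in GB
LAPTOP_DRIVE_GB = 512    # assumed capacity of a modern laptop drive, in GB
GENOMES_PER_DAY = 300    # assumed daily throughput ("hundreds per day")


def genomes_per_drive(drive_gb, gb_per_genome):
    """Number of whole genomes that fit on a drive of the given size."""
    return drive_gb // gb_per_genome


def daily_storage_tb(genomes_per_day, gb_per_genome):
    """New storage needed per day, in terabytes (1 TB = 1000 GB)."""
    return genomes_per_day * gb_per_genome / 1000


if __name__ == "__main__":
    per_day = daily_storage_tb(GENOMES_PER_DAY, GB_PER_GENOME)
    print("Genomes per laptop drive:", genomes_per_drive(LAPTOP_DRIVE_GB, GB_PER_GENOME))
    print("New storage per day (TB):", per_day)
    print("New storage per year (TB):", per_day * 365)
```

Under these assumed values a laptop drive holds only a handful of genomes, while a service sequencing a few hundred genomes daily adds tens of terabytes per day, which is why dedicated infrastructure is needed.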

Data infrastructure that benefits both clinical care and research

Even with a whole genome, fewer than 50% of rare disease patients receive a diagnosis. However, from the outset of the 100,000 Genomes Project, its participants gave consent for additional analysis by researchers.

Given the sensitivity of health data, designing infrastructure to protect individual privacy while allowing research is critical. The Five Safes framework, developed by the UK Office for National Statistics, has long been used to enable safe research access to sensitive social science data. A central principle is that data is not distributed but only accessed within a secure setting, a bit like a reading room in a library, with strong oversight and transparency on use.

Genome datasets are more complex and orders of magnitude larger than administrative data. Nevertheless, a similar research environment was created by Genomics England in 2017, where researchers could securely analyse genomes alongside de-identified medical records. This allows academic and industry researchers to study the data to better understand disease, and also to propose possible individual diagnoses for those patients still without one. Following clinical review, more than 1,000 rare disease patients have been diagnosed in this way, showing the value of close alignment between data infrastructure supporting clinical care and research.

A generic approach to balancing privacy and research access to individual data

The potential to improve health care through analysis of health data has long been recognised in the UK, but use has been limited, largely due to concerns about privacy. However, in recent years several other infrastructures adopting the Five Safes framework were constructed to successfully provide access to subsets of health data from disparate sources, taking advantage of modern virtualised computing that allows analysis algorithms to be easily brought to the data.

Lessons from these examples were codified as Trusted Research Environments (TREs) in 2020 under the auspices of the national virtual institute Health Data Research UK. The benefits of broad access to health data using such approaches during COVID-19 were widely recognised, and the TRE approach was endorsed by the Goldacre Review in 2022.

This led NHS England to formally announce a transition to a data access model for analysis via Secure Data Environments (SDEs), recognising public concern that health data should always remain under NHS control. Regional SDEs are now being planned across England with public engagement and co-design. They are functionally equivalent to TREs but named SDEs to reflect their use in population health management as well as research.

From 100,000 genomes to NHS Genomic Medicine Service and beyond

Health systems whose data infrastructure enables research analysis require frameworks to evolve treatment pathways based on the resulting insights. Analysis of the outputs from the 100,000 Genomes Project provided the justification for NHS England’s integrated Genomic Medicine Service (GMS), which incorporates whole genome sequencing for a subset of conditions into the National Genomic Test Directory. The NHS is still the only health service in the world to offer this as part of standard care.

However, the NHS GMS goes beyond this with a system to update the test directory annually, allowing wider adoption of whole genome sequencing when evidence supports its use to enhance patient care. This provides a robust framework for the evolution of personalised medicine in the NHS, building on established infrastructure.

As well as incrementally improving interpretation pipelines such as for cancer, Genomics England is tasked with investigating additional uses such as newborn screening. The Newborn Genomes Programme – launched in December 2022 – will sequence 100,000 babies to explore the benefits, challenges, and practicalities of sequencing and analysing the genomes of newborns in the NHS.

Federation is the future

How much data do researchers need to develop algorithms, including artificial intelligence (AI), that lead to personalised medicine going beyond current relative risk prediction, which is generally too imprecise to be usefully actionable for an individual?

Where large numbers of genes and their variants are involved, more complete mechanistic models are likely to be required to accurately predict effects in individuals. However, with the effective solution of the protein folding problem by DeepMind’s AlphaFold in 2020, one huge bottleneck has been unblocked, which should over time lead to more predictive molecular models of biological processes.

More data from larger numbers of individuals will also help, particularly for rarer health conditions. Where data is held in multiple access-only TREs/SDEs, this requires federated analysis, such as when the same algorithm is run within each TRE/SDE and the summary statistical results are combined to increase the sensitivity of the analysis.
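The idea of running the same algorithm in each environment and combining only summary statistics can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method used by any particular TRE/SDE or pilot: the names SiteSummary, local_summary and combine are hypothetical, and the statistic (a pooled mean and variance of some measurement) is chosen only because it combines exactly from per-site aggregates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteSummary:
    """Aggregate statistics that leave a site; no individual-level rows."""
    n: int        # number of individuals at the site
    mean: float   # site mean of the measurement
    m2: float     # sum of squared deviations from the site mean


def local_summary(values):
    """Hypothetical step run inside one TRE/SDE on its own data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return SiteSummary(n, mean, m2)


def combine(summaries):
    """Pool site summaries into an overall count, mean and sample variance."""
    n_total = sum(s.n for s in summaries)
    mean_total = sum(s.n * s.mean for s in summaries) / n_total
    # Within-site spread plus the spread of site means around the pooled mean.
    m2_total = sum(s.m2 + s.n * (s.mean - mean_total) ** 2 for s in summaries)
    return n_total, mean_total, m2_total / (n_total - 1)


if __name__ == "__main__":
    # Made-up numbers standing in for data that never leaves each site.
    site_a = local_summary([1.0, 2.0, 3.0])
    site_b = local_summary([4.0, 5.0, 6.0, 7.0])
    print(combine([site_a, site_b]))
```

The combined result matches what would be computed if all values sat in one place, yet only three numbers per site are shared. Real federated analyses use richer statistics and add disclosure checks on what may leave each environment.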

UK Research and Innovation (UKRI) has been supporting the DARE UK programme to run pilots of infrastructure to support federation, such as across different collections of whole genomes. Other types of federated systems include the AI Centre for Value Based Healthcare’s FLIP platform for training a single AI from images held in different hospitals.

What is certain is that federation is the future. The European Union is developing proposals for a European Health Data Space (EHDS), which envisages enabling research across healthcare data from all EU citizens in all member states. Again, this would use an access-only model, but with data continuing to reside within each national boundary, federated analysis will be essential.

The distribution of anonymised health data has been discredited due to the risk of re-identification of individuals through the mass of social media and other public data. To enhance health through individual data, the access-only approach is the most trustworthy model, given the level of public concern about the privacy of health information. However, data infrastructures developed over the last 10 years, both to support the data scale of genomic medicine and to hold health-system-wide collections of patient records for COVID-19 management, show that trustworthy and life-saving research can be carried out with the right infrastructure in place.

The future of data-driven healthcare is bright.

Tim Hubbard is Professor of Bioinformatics at King’s College London, with roles at Genomics England and Health Data Research UK. He was a co-organiser of CASP 1996–2007.

On 5 July 2023, the NHS marked 75 years of service, and Genomics England celebrated a decade of life-changing discoveries in genomics.