Every three months, the HDR UK Impact Committee considers dozens of articles and selects the most impactful examples, ranked against core pillars of the HDR UK ethos: research quality, team science, scale, open science, patient and public involvement, patient impact and diversity.

The committee selected ‘The challenges of replication: A worked example of methods reproducibility using electronic health record data’ by Williams et al. as an example of high-impact work in the Partner category.

Overview

Being able to replicate the findings of others is an integral part of research. There is an element of chance in any study, no matter how well controlled the variables are. This is especially true for observational studies, which make up a large proportion of research using health data. If the same results are found repeatedly, they become more trustworthy and much less likely to be down to chance. Trust underpins science, making replication studies a vital part of discovery and progress in research.

The challenge

Replicating observational studies using electronic health record (EHR) data can be challenging for many reasons: there are often complexities in accessing the data, variations in EHR systems across institutions, and the potential for unaccounted-for confounding variables.

Previously, researchers at the University of Manchester explored the risk factors for hospitalisation following COVID-19 in individuals with diabetes using a regional database of 2.9 million people. The opportunity arose for the BHF Data Science Centre CVD-COVID-UK/COVID-IMPACT consortium to repeat the study – with the same team and methodology – in a much larger, national database covering the whole population of England (57 million).

The solution

Both studies were carried out within a Trusted Research Environment (TRE) or Secure Data Environment (SDE), providing secure access to sensitive data in a way that protects patient privacy while enabling vital research.

Despite having the same data engineers and analysts working on both studies, and access to the original code, reproducibility was not straightforward. Differences in the data, and in the environments themselves, proved a barrier to quick replication of existing studies.

By carrying out this study, the researchers were able not only to validate the findings of their previous work, but also to uncover important insights into the reproducibility of health data research.

The impact

Reflecting on the challenges they encountered, the team set out 12 recommendations for TREs and SDEs to improve the ease with which replication studies can be performed. These recommendations span data access and governance as well as how data is managed, curated and shared within different environments.

In summary, the main recommendations are:

  • A need for improved machine-readable metadata for EHR data
  • Standardisation of governance processes to facilitate federated analysis
  • Mandating of code sharing
  • Data environments providing a support structure for data engineers and analysts

The team also proposed a new research theme, ‘data reproducibility’: the ability to prepare, extract and clean data from a different database for a replication study.

Lead author Dr Richard Williams said: “Despite having in-depth knowledge of how we carried out the original study, and having the same team involved, replicating it in a new data environment proved to be both difficult and time-consuming. This work presents the challenges we experienced and what steps could be taken to solve these issues, making it easier and quicker to replicate studies.”