Day 1 of HDR UK Conference 2025
View the agenda for Day 1 of HDR UK Conference 2025, on Wednesday 15 October 2025.
View full agenda as pdf
View agenda for Day 2: Thursday 16 October
Day 1: Wednesday 15 October
08:45 – 09:30 (Hall 1 and 2, Exhibition)
Arrival, tea and coffee
09:30 – 09:45 (Lomond Auditorium)
Title: Welcome and introductory words
Speaker: Andrew Morris, Director of Health Data Research UK
09:45 – 10:00 (Lomond Auditorium)
Title: Welcome
Keynote speaker: Professor Dame Anna Dominiczak, Chief Scientific Advisor (Health) for the Scottish Government
10:00 – 10:45 (Lomond Auditorium)
Title: Insights from whole-population electronic health records in England, Northern Ireland, Scotland and Wales
Keynote speaker: Professor Angela Wood, NIHR Research Professor of Biostatistics and Health Data Science at University of Cambridge
Chair: Patsy Wilkinson, Deputy Chair at Health Data Research UK (HDR UK)
10:45 – 11:15 (Hall 1 and 2, Exhibition)
Coffee break
11:15 – 12:00 (Lomond Auditorium)
Title: European Health Data Environment
Keynote speaker: Gerrit Meijer, Chief Science Officer at Health-RI
Chair: Rhoswyn Walker, Director of Strategy at Health Data Research UK (HDR UK)
12:00 – 13:00 (Parallel sessions)
Stream 1: Transforming healthcare with AI
Location: Lomond Auditorium
Chair: Fatemeh Torabi, Assistant Professor in Healthcare Data Science, University of Cambridge
Speakers:
- Lewis Hotchkiss – Accelerating AI-driven neuroimaging research in dementia
View Lewis’ abstract
Lay Summary
Artificial Intelligence (AI) is increasingly being used in neuroimaging data to support earlier, more accurate diagnosis of neurological conditions like dementia. However, neuroimaging data is much more complex than typical tabular health data. It involves large files that require specialised processing, significantly more storage, and greater computing power. The Dementias Platform UK (DPUK) is tackling these issues by integrating processes to standardise, process and harmonise over 20 neuroimaging datasets, creating research-ready datasets and providing access to advanced computing infrastructure. This enables researchers to explore, analyse, and build AI tools using high-quality, research-ready neuroimaging datasets, securely and efficiently.
Background / Hypothesis
AI approaches in neuroimaging research are advancing rapidly. However, neuroimaging data is inherently more complex and resource-intensive to handle than typical structured datasets. It requires not only high storage and compute capabilities, but also consistent processing and harmonisation across sources to be usable at scale. Secure infrastructure offering standardisation, harmonisation, and computational resources within a Trusted Research Environment (TRE) can unlock the full potential of neuroimaging data for scalable AI research.
Objective
To create infrastructure for neuroimaging research within DPUK that supports the processing, analysis, and secure use of large-scale neuroimaging datasets for AI research.
Methods
We integrated automated processing pipelines that transform raw neuroimaging data into research-ready datasets, and standardised all scans into the BIDS format. We harmonised neuroimaging features across multiple cohorts and clinical datasets. A curated clinical imaging collection focused on dementia has also been created to support targeted AI model development. All resources are integrated into the DPUK Data Portal, which provides secure access to high-performance storage and compute power alongside established neuroimaging and AI tools.
Results
DPUK has created a collection of over 20 cohort and clinical neuroimaging datasets, all standardised to the BIDS format, and has integrated pipelines to generate research-ready, harmonised derived data. This harmonised imaging collection provides a rich resource for research into neurological conditions such as dementia. In parallel, we are collaborating with Dementias Platform Australia to federate our TREs, using shared harmonisation methods to align our respective imaging collections. This cross-national initiative is expanding the scale and diversity of available data for AI development while preserving privacy and governance standards.
Conclusion
Neuroimaging research demands more storage, compute, and complex processing than typical data workflows. By integrating processing pipelines, adopting BIDS standards, and harmonising imaging datasets across cohorts, DPUK provides a secure, scalable platform for advanced AI research.
- Chris Tomlinson – The world’s first population-scale generative AI model of electronic health records
View Chris’ abstract
Introduction
We present Foresight, the world’s first national-scale generative AI foundation model for medical event prediction, trained within NHS England’s Secure Data Environment (SDE) for COVID-related research. Prior studies have shown that next-token prediction transformer models, originally developed for natural language processing, can be adapted to electronic health records (EHRs) by representing patient histories as token sequences. These models offer transformative potential through their ability to predict across the full range of events seen during training.
Methods
Foresight was trained on eight linked, routinely collected de-identified national datasets covering primary care, secondary care, mortality, and COVID-19 testing/vaccination records. These data were harmonised into patient timelines of clinical codes (ICD-10, OPCS-4, SNOMED-CT) with day-level resolution. Each patient’s sequence was tokenised with additional tokens for age, sex, ethnicity, and inter-event timing. We trained a 243M parameter transformer model from scratch using next-token prediction, with masked labels for demographics.
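As a rough illustration of the tokenisation described above, the sketch below turns a small patient timeline into a token sequence. The token names, the time-gap buckets, the example codes and the reading of "masked labels for demographics" as excluding demographic tokens from the training targets are all assumptions made for illustration; the abstract does not specify them.

```python
from datetime import date

# Hypothetical prefixes for demographic tokens (not taken from the abstract).
DEMOGRAPHIC_PREFIXES = ("<SEX_", "<ETH_", "<AGE_")


def time_gap_token(days: int) -> str:
    """Bucket a gap of one or more days into a coarse timing token (assumed buckets)."""
    if days <= 7:
        return "<GAP_WEEK>"
    if days <= 30:
        return "<GAP_MONTH>"
    if days <= 365:
        return "<GAP_YEAR>"
    return "<GAP_LONG>"


def tokenise_patient(sex, ethnicity, birth_year, events):
    """Turn (date, coding_system, code) events into one flat token sequence."""
    events = sorted(events, key=lambda e: e[0])
    tokens = [f"<SEX_{sex}>", f"<ETH_{ethnicity}>"]
    previous_day = None
    for day, system, code in events:
        if previous_day is None:
            # Approximate age in years at the first recorded event.
            tokens.append(f"<AGE_{day.year - birth_year}>")
        else:
            gap = (day - previous_day).days
            if gap > 0:  # same-day events get no timing token
                tokens.append(time_gap_token(gap))
        tokens.append(f"{system}:{code}")
        previous_day = day
    return tokens


def label_mask(tokens):
    """True where the next-token target counts towards the loss.

    Demographic tokens are used as context only and never as targets.
    """
    return [not t.startswith(DEMOGRAPHIC_PREFIXES) for t in tokens[1:]]


# Illustrative timeline; the codes are placeholders, not real patient data.
events = [
    (date(2020, 3, 1), "SNOMED", "195967001"),
    (date(2020, 3, 20), "ICD10", "J45"),
    (date(2021, 6, 1), "OPCS4", "E85"),
]
tokens = tokenise_patient("F", "A", 1980, events)
mask = label_mask(tokens)
print(tokens)
```

A next-token model would be trained to predict `tokens[1:]` from `tokens[:-1]`, with `mask` deciding which positions contribute to the loss.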
Results
We designed and implemented an evaluation framework spanning 30-day COVID-19 hospitalisation and mortality using Brier scores and the area under the receiver operating characteristic (AUROC) and precision-recall (AUPRC) curves. We also tested Foresight-E on medical events from 2023, extending beyond its 2018–2022 training period, to assess how well it captured indirect pandemic effects in a real-world deployment-like setting.
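For readers unfamiliar with two of the metrics named above, here is a minimal pure-Python sketch of the Brier score and AUROC for a binary outcome. The labels and probabilities are toy values chosen for illustration only; they are not results from the study, and the function names are hypothetical.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def auroc(y_true, y_prob):
    """Chance that a random positive case is scored above a random negative case.

    Ties count as half a win. This pairwise form is quadratic in the number of
    cases, which is fine for a sketch but slow for large datasets.
    """
    positives = [p for y, p in zip(y_true, y_prob) if y == 1]
    negatives = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = event within 30 days, 0 = no event (illustrative values only).
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(f"Brier score: {brier_score(y_true, y_prob):.4f}")
print(f"AUROC: {auroc(y_true, y_prob):.2f}")
```

The Brier score rewards well-calibrated probabilities, while AUROC measures only how well cases are ranked, which is one reason to report both.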
Following concerns raised by the British Medical Association and the Royal College of General Practitioners’ Joint GP IT Committee, NHS England has paused access to data for the Foresight-E project while a governance review is carried out. That pause means quantitative results are not available pending the outcome of ongoing discussions.
Impact and Patient Involvement
Foresight-SDE represents a paradigm shift: from single-disease models to a population-scale foundation model capable of probabilistic prediction across the entire clinical spectrum. This advance was enabled through establishing a novel academic-industry-NHS collaboration, with compute resources provisioned securely by AWS and Databricks, and governance overseen by the CVD-COVID-UK/COVID-IMPACT consortium of the British Heart Foundation Data Science Centre. Patient and public involvement (PPI) has been central throughout, from shaping approvals to guiding dissemination, via ongoing engagement sessions coordinated by the BHF Data Science Centre.
Conclusion
Foresight is the world’s first population-scale EHR foundation model, built on national NHS data representative of England’s diversity of demographics and disease, to produce a model applicable to all individuals. We outline plans to evaluate its real-world applications and its fairness in predicting the direct and indirect effects of COVID-19, setting a new standard for population-scale health AI.
- Matthew Watson – Multimodal skin cancer detection: Leveraging free text and images
View Matthew’s abstract
Lay summary
Early detection of skin cancer is crucial to ensure the best possible treatment. Currently, skin concerns make up 11% of GP referrals and, due to a shortage of skin specialists, NHS wait times are longer than 40 weeks for skin issues.
Studies show that machine learning (ML) can predict skin cancer using clinical images. However, clinical free text may also contain useful information – e.g., history of skin cancer or changes over time. We have collected the first skin cancer dataset containing clinical images with free text and show that this new data greatly improves ML performance for skin cancer detection.
Background
Skin cancer is one of the most prevalent cancers, with incidence ever increasing. Patients are encouraged to perform self-examinations and present to primary care for any concerning skin lesions; suspicious lesions are then referred to dermatologists for further care. These dermatology referrals account for ~11% of primary care activity and, due to a dermatologist shortage, the average NHS dermatology wait time is now over 40 weeks.
We present the first study to use ML with dermatoscopic images, tabular metadata, and free text clinical notes to predict skin cancer. Understanding that clinical free text often contains ‘leading language’ (i.e., text that implies the clinician’s differential), we propose techniques to detect and remove leading language for analysis.
Methods
We curate a novel dataset collected from NHS Community Dermatology Services containing dermatoscopic images, patient metadata (age, sex, gender, Fitzpatrick skin score), clinical free text (sunlight exposure, family history of skin cancer, observations, surgery consultation notes), and a binary malignant/benign label. We combine feature embeddings from a ConvNeXt model for images, and BioClinicalBERT for metadata and free text, to make the final prediction.
To remove leading language, we prompt a Llama 3.1 model to detect and remove varying levels of leading language (conditions, diagnosis, and full filtering). We compare model performance with different parts of the free text included/removed.
Results
Our dataset contains 5,481 images from 4,538 patients (7% malignant). Our vision-only model achieved an area under the receiver operating characteristic curve (AUROC) of 91.6%. Images and metadata yielded a slight performance increase (92.06% AUROC). Free text and images combined achieved an AUROC of 97.03%; however, this text contains significant leading language. After removing leading language, our multimodal model achieves an AUROC of 95.44%.
Conclusion
Utilising multimodal data improves ML skin cancer detection over images alone. However, care must be taken to ensure free text does not unintentionally bias models via leading language. Our novel leading language removal technique successfully removes leading language, and this filtered text still provides performance gains when included in multimodal models. These models could support appropriate dermatology referrals, reducing false positive referrals and decreasing waitlists.
- Simon Thompson – Challenges of supporting artificial intelligence (AI) projects within Trusted Research Environments
View Simon’s abstract
Introduction
Artificial Intelligence (AI) is poised to transform healthcare by enhancing diagnostics, optimising treatment pathways, and accelerating drug discovery. Recognising this potential, the UK government has outlined its commitment in the AI Opportunities Action Plan. However, realising these benefits requires access to sensitive health data for training AI models—data that must be handled securely and responsibly.
Over the past decade, there has been a significant shift in data governance. Rather than allowing sensitive datasets to be transferred directly to research organisations, access is increasingly restricted to Trusted Research Environments (TREs) or Secure Data Environments (SDEs). These secure platforms have traditionally supported classical statistical research, but they must now adapt to the unique demands of AI research and development.
This talk outlines the key challenges TREs and SDEs face in enabling the training and deployment of AI models while maintaining data security and public trust:
- Assessing Disclosure Risk: Determining whether a trained AI model poses a re-identification risk if exported, especially in the event of a data breach.
- Model Controls: Developing guidelines for handling potentially disclosive models, such as query rate-limiting or output monitoring.
- Code and Pipeline Review: Evaluating external AI training code and pipelines before they are imported into a TRE/SDE.
- Computational Infrastructure: Providing the high-performance computing resources required for training large-scale models.
- Workforce Capability: Ensuring TRE/SDE staff have the expertise to address privacy risks specific to AI.
- Legal and Contractual Frameworks: Navigating complex legal agreements and data use contracts related to AI model development.
- Regulatory Compliance: Supporting AI systems that may qualify as medical devices, including requirements for audit trails, performance validation, and traceability.
- Industry Engagement: Aligning with industry partners who are already leveraging vast global datasets on commercial AI training platforms.
- Federated Learning Support: Enabling AI training across distributed datasets without centralising sensitive data.
- Public Trust and Engagement: Building and maintaining public confidence in the use of sensitive health data for AI research.
Successfully addressing these challenges is critical to ensuring that TREs and SDEs can facilitate cutting-edge AI research while upholding the highest standards of data security, ethics, and public accountability.
Stream 2: Data access and information governance
Location: Alsh 1
Chair: Cassie Smith, Director of Legal, Trust and Ethics
Speakers:
- Vishnu Vardhan Chandrabalan – Unified infrastructure and governance for direct care intelligence, population health and research
View Vishnu’s abstract
Lay Summary
Lancashire and South Cumbria (LSC) has built a unified analytics platform with strong governance to address operational, population health and research needs in a region with significant health inequalities. Our approach unites data from multiple healthcare organisations into a single architecture with daily refreshes, enabling real-time monitoring, operational efficiency and cohort discovery for clinical trials. Unlike conventional approaches that separate research from non-research operations, our solution supports both simultaneously within one system.
Background
Despite the NHS generating more data than ever before, its ability to leverage it remains limited due to data silos across clinical systems and organisations. While research and non-research uses are governed under distinct frameworks, national programs seeking to improve data use have resulted in an artificial separation between research and non-research uses at an infrastructure level.
We report on our “convergent infrastructure” approach to integrate multimodal data from several organisations into a single longitudinal patient record to serve both research and non-research uses while maintaining rigorous security, governance and public trust.
Implementation
We linked on-premises infrastructure at five NHS providers to a single cloud platform with two application zones: a data zone on Databricks and an analytics zone on Kubernetes, using JupyterHub for researcher workspaces. We developed open-source microservices for data orchestration using Research Object Crates (cr8tor) and for event-driven natural language processing (NLP) workloads (neulander). Kubernetes autoscaling and serverless compute support cost efficiencies.
We designed a modified Medallion architecture to support business intelligence, direct care, population health and research. Research-ready datasets are provisioned to LSC’s analytics environment and the Northwest Secure Data Environment. Data transformed to the Observational Medical Outcomes Partnership (OMOP) standard is fed to the TriNetX platform for cohort discovery for clinical trials.
We secured Confidentiality Advisory Group and Research Ethics Committee approval for structured data, unstructured text and imaging. Stakeholder and public engagement was undertaken through a 2-day conference, multiple regional meetings and PPIE workshops. Patient Advisory and Accountability Group and Data Access Committees for non-research and research were established. Data sharing agreements enabled provider organisations to flow data into this environment. We are working towards Standard Architecture for Trusted Research Environments (SATRE) and ISO27001 accreditation.
Conclusion
Our work provides a blueprint for building a unified infrastructure and governance model that supports multiple functions while maintaining security, public trust, and cost-effectiveness. This integrated system creates a holistic view of patients’ journeys across multiple providers, while the medallion architecture on OMOP accelerates translational real-world evidence (RWE) studies and clinical trials by using the same data standard for direct care intelligence, research and clinical trial feasibility. LSC’s innovations have resulted in a DARE-UK collaboration to build K8TRE, a vendor-agnostic TRE on Kubernetes; an EPSRC-funded collaboration to train research software engineers; and membership of the UK’s first RWE network.
- Jenny Johnston – Refreshing the Scottish Safe Haven Charter
View Jenny’s abstract
Lay Summary
The refreshed Scottish Safe Haven Charter sets clear standards for secure access to health and social care data for research. It ensures strong privacy and governance, following best practices. The Charter supports safe data use, strengthens public trust, and enables research that can improve health and care across Scotland.
Background
The Scottish Safe Haven Charter, originally published in 2015 and refreshed in 2025, establishes a cohesive framework of principles and standards to enhance data access, security, and governance within Scotland’s Safe Haven Network. The refresh was driven by the Scottish Government and COSLA’s Data Strategy (February 2023), which aims to improve research access to health and social care data while maintaining robust privacy and governance standards. The Charter adopts the internationally recognised Five Safes framework to ensure data protection across all stages of access and analysis.
Objective
To present the principles and standards defined in the refreshed Scottish Safe Haven Charter, and to illustrate their implementation in supporting research and innovation, while strengthening public trust through robust governance, transparency, and accountability.
Methods
The Charter was co-developed through extensive engagement with Safe Haven leads, policymakers, and stakeholders across Scotland. It sets out principles under key areas: Governance, Access to Data, Public Benefit, Data Security, and Data Privacy. Implementation is supported by structured guidance, adherence to ISO27001 standards, and a commitment to continuous improvement through annual reporting to the Scottish Government.
Results
The Charter promotes a unified, transparent approach across the Scottish Safe Haven Network (five Trusted Research Environments, TREs), strengthening collaboration among NHS Boards, universities, researchers, and public contributors. It supports a diverse range of research initiatives, including health informatics, clinical trials, and population health studies, by providing secure and trusted environments for data access and analysis. The Charter has been published and is publicly available, reinforcing its role in promoting best practice across the Network.
Conclusion
The refreshed Scottish Safe Haven Charter articulates the principles and standards necessary to support research and innovation across Scotland. Its practical application is strengthening the consistency and transparency in data access, while further developing public trust through robust governance, transparency, and accountability.
- Sam Neale – Developing a robust protocol for linking longitudinal studies with NHS England records
View Sam’s abstract
Background
Accurate linkage of Electronic Health Records (EHRs) with Longitudinal Population Study (LPS) participant data is critical to facilitate research that combines self-reported and linked records. The UK Longitudinal Linkage Collaboration (UK LLC) is the national Trusted Research Environment (TRE) for record linkage in longitudinal research and has secured permissions to link LPS data to NHS England records. An accurate and robust linkage protocol is thus fundamental to enable this type of research across the UK while complying with safe and secure data access principles.
Method
Our protocol was co-developed between UK LLC, Secure eResearch Platform (SeRP), Data Owners, partner studies and UK LLC public contributors. The protocol ensures accurate and reproducible linkage of participant self-reported data held in the UK LLC TRE with NHS England records, and makes the integrated data available to researchers in the TRE. Vitally, the protocol enables regular refreshes of data to be added into the UK LLC, and for changes in participant permissions to be respected. The fundamental principles of the protocol are: organisational separation between the use of participant identifiers and de-personalised data; a functionally anonymous approach to onward sharing of data; legal and regulatory compliance; and honouring commitments made to participants.
LPSs send participants’ personal identifiers to Digital Health and Care Wales (DHCW) (Step 1) for integration across studies and permission management (Step 2), before they are sent to NHS England for linking via the Master Person Service (Step 3). The de-identified linked EHRs are sent to the Population Data Science Group at Swansea University (Step 4), where loading rules are applied before ingestion into a staging schema of the UK LLC database. Finally, a data curation and management package developed by UK LLC processes and pushes the linked EHRs to the production schema, where they are available for provisioning to researchers (Step 5). This protocol is repeated quarterly.
Results
The overall results of the linkage process for LPS participants in the UK LLC are discussed. 299,976 participants across 17 LPSs have permission to link to NHS England records, with 250,535 participants successfully linked to EHRs, representing a linkage rate of 83.5%. Linkage results for individual LPSs are also highlighted and discussed, with linkage rates between 72.3% and 99.4% observed. Those not linked include participants living in devolved nations, those who have set a National Data Opt-Out (applicable in a sub-set of studies) and those failing to link using NHS standard algorithms.
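The headline linkage rate follows directly from the two counts reported above. A small check, using a hypothetical helper name, might look like this:

```python
def linkage_rate(linked: int, eligible: int) -> float:
    """Percentage of eligible participants who were successfully linked."""
    if eligible <= 0:
        raise ValueError("eligible must be a positive count")
    return 100.0 * linked / eligible


# Counts as reported in the abstract.
eligible = 299_976  # participants with permission to link to NHS England records
linked = 250_535    # participants successfully linked to EHRs

rate = linkage_rate(linked, eligible)
unlinked = eligible - linked
print(f"{rate:.1f}% linked, {unlinked:,} not linked")
```

The remainder covers the groups listed above: participants in devolved nations, those with a National Data Opt-Out, and those who failed to match.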
Conclusion
Overall, the developed protocol can successfully link records, is secure and scalable enough to allow onboarding of further LPSs, and achieves organisational separation and management of changing permissions. Owing to the generalisable principles the protocol is built upon, it can be adapted for further interdisciplinary and four-nation linkages of EHRs.
- Rizza Zahid – The Atlas of Longitudinal Datasets: A global open-access resource
View Rizza’s abstract
Background
Longitudinal studies are a powerful research tool that have generated a wealth of opportunities to answer pressing questions about population health and society. However, despite the considerable investment they represent, longitudinal datasets are often underused. This is partly because they can be difficult to find, and their data can be challenging to access, compare, and integrate. Improving the discoverability, accessibility, and interoperability of longitudinal data is an essential step towards increasing their reach, impact, and reuse.
About the Atlas
The Atlas of Longitudinal Datasets is an interactive, open-access platform that aims to maximise the uptake and reuse of longitudinal data by enhancing the discoverability of thousands of datasets from around the world. The Atlas provides metadata from over 2,000 longitudinal datasets globally, including birth cohorts, ageing studies, population-based cohorts, biobanks, data linkages, and mixed-methods studies. It captures key information about their location, sample and unique features, as well as detailed information about data collection methods, mental health, neuroimaging, linkage, qualitative data, and data access policies.
Interoperability and Features
To support interoperability, the Atlas also highlights the growing integration of longitudinal studies with routinely collected data such as electronic health records (EHRs) and contributes to metadata harmonisation efforts. Users can explore and search thousands of longitudinal datasets on the Atlas, filter their searches to narrow down relevant datasets, compare study features, and save their favourite datasets and searches to return to — all supporting efficient, scalable data discovery.
User Involvement
In addition, the Atlas has been created with the guidance of a global Lived Experience Expert (LEE) group. Their contributions to the platform’s design and accessibility have been pivotal in ensuring it meets the needs of diverse users. Through co-design and user testing sessions, the LEE group helped refine the presentation of data, making the Atlas more inclusive and user-centric. Their ongoing involvement ensures the Atlas continues to evolve in line with research and public needs.
Conclusion
The Atlas is designed as a resource for researchers planning new projects, for data custodians aiming to increase the visibility and impact of their assets, and for funders and infrastructure planners seeking to identify gaps and support strategic coordination. It can also foster the development of cross-study collaborations and global consortia. This presentation will introduce the Atlas, share insights into its development, and showcase how it can serve as a free and practical tool to strengthen the infrastructure around longitudinal data and promote wider data access, integration, and use.
Stream 3: Routine healthcare data in clinical trials
Chair: Xinchun Gu, University of Oxford
Speakers:
- Shamaila Anwar – Collecting and using patient demographic data in a National Research Network
View Shamaila’s abstract
Lay Summary
The National Institute for Health and Care Research (NIHR) funds and supports research to improve health and social care. A big gap has been not knowing much about the people who take part in research. Without basic information like age, gender or ethnicity it is hard to tell if everyone has a fair chance to get involved. To start to fix this the NIHR launched the Year of Birth (YoB) project – a simple process to test whether it is possible to collect the age of research participants recruited to NIHR supported studies.
Background / Hypothesis
Although the number of participants recruited to NIHR-supported portfolio studies is routinely reported via the Central Portfolio Management System (CPMS) from each Local Clinical Research Network (LCRN), no data on participant characteristics is currently collected. This limits the ability to assess inclusivity and equity in research participation. The hypothesis was that routinely capturing participants’ year of birth would be a feasible first step towards building a national dataset on research demographics.
Objective
The YoB pilot aimed to assess the feasibility of collecting participant age data through existing local systems and to understand the scale, variation, and challenges in doing so across England.
Methods
From 2021, the NIHR Clinical Research Network Coordinating Centre (CRNCC) collaborated with the 15 LCRNs to implement YoB data collection. NHS Trusts were asked to report the year of birth of participants via their Local Portfolio Management Systems (LPMS). This pilot examined data flows, system compatibility, and governance processes across different NHS Trusts.
Results
The YoB project has been running for over four years and reports the age of almost one million participants recruited to over 4,900 NIHR-supported studies across 204 NHS Trusts in England. The average age was 48.4 years (95% CI: 48.34 to 48.46), with the most common age group being 65–74, accounting for 15.5% of the total. In-depth analysis of recruitment activity within Cancer has shown how this data can help check whether recruitment reflects the age groups most affected by a condition. Challenges included different IT systems, confusion around GDPR rules on sharing birth year, and varying local governance processes.
Conclusion
The YoB project shows it’s possible to collect participant age data at scale across the NIHR portfolio. This provides an important foundation for understanding who takes part in research and where gaps remain. But to fully assess equity, other demographic data (such as ethnicity, gender, and socioeconomic status) must also be monitored. Looking ahead, accessing demographic data through NHS numbers offers a more complete picture of participation. Achieving this will need national coordination, cultural change, and better system interoperability. Monitoring participant demographics will directly support NIHR’s ambition to make research more inclusive, representative, and aligned to population health needs.
- Kate Boddy – Embedding PPIE in Data Analysis: Co-Producing Genetic Research on Multiple Long-Term Conditions
View Kate’s abstract
Background
Patient and Public Involvement and Engagement (PPIE) is widely recognised for enhancing the quality and relevance of health research. While PPIE can inform every stage of the research cycle, from inception to dissemination, it remains underutilised during the ‘doing’ phase, particularly in non-clinical fields such as data science, where the influence of patient experience is less immediately apparent. In such contexts, PPIE is not only less prevalent but also less thoroughly integrated across the research process.
GEMINI (Genetic Evaluation of Multimorbidity towards INdividualisation of Interventions) explores the causes of multimorbidity using databases of DNA sequence information linked to diseases from hundreds of thousands of people. GEMINI integrated PPIE into all stages of the research cycle. Notably, for a genetics data science study, this included two milestone points in the project’s data analysis.
Aim
To illustrate how PPIE can be incorporated into the ‘doing the research’ part of the research cycle within a genetics data science study.
Methods
PPIE in GEMINI employed interlocking layers to provide a robust and integrated approach. Two PPIE co-investigators were part of the research team, providing a constant patient perspective throughout the study and at all team meetings. This was supplemented by seven bespoke PPIE workshops (attended by 58 public contributors) at key stages to enable decision making directly informed by PPIE perspectives. PPIE within the study was co-ordinated by a PPIE lead.
Results
GEMINI was funded from November 2021-October 2025. PPIE was embedded throughout the research cycle and had a meaningful influence at every stage. This presentation highlights two illustrative examples from the data analysis phase, examining how PPIE perspectives were integrated and the impact they generated. The first example is the co-creation of a curated code list of 84 long term conditions. PPIE work showed that important conditions were missing from the initial list leading to the addition of 29 conditions.
The second example is the development of a criteria checklist to select pairs of conditions for further analysis from a long list of 2,546 specific pairs. This PPIE work enabled conditions to be selected according to patient priorities, which were quality of life, improved outcomes, novelty and mortality.
Conclusion
GEMINI research advances understanding of the biological pathways underlying the development of multiple long-term conditions. Crucially, it was co-produced with patients, carers, and the public, enhancing its relevance and applicability. It offers a valuable example of fully integrated PPIE across the research cycle of a genetic health data science study, including the data analysis phase.
- Kate Boddy – Embedding PPIE in Data Analysis: Co-Producing Genetic Research on Multiple Long-Term Conditions
-
- Daniel Iheanacho – Embedding Patient and Public Voices: A Multi-Institutional Approach To Transforming Data for Trials
View Daniel’s abstract
Lay Summary
Health Data Research UK’s Transforming Data for Trials (TDfT) programme brings together researchers from Cardiff, London, Dundee, and Oxford to improve how health data is used in clinical trials. A key focus is working with patients and the public to shape training materials and research resources. Public contributors across the UK have co-designed videos and modules on data privacy, ethics, and involvement. So far, four training modules and over ten videos are available on the HDR UK Futures platform. Public Advisory Group (PAG) members have also influenced the green paper draft and the development of a routemap. Through ongoing input from the PAG, the programme supports more transparent, trustworthy, and inclusive use of health data in trials.
Background
TDfT unites researchers at Cardiff University, University College London (UCL), the University of Dundee, and University of Oxford to transform access, utility, analysis and understanding of health systems data for clinical trials. Partnering with a diverse group of patients and public contributors, this initiative enhances governance, transparency, and ethical data use in clinical trials. The programme equips the trials community with skills and technical solutions needed to navigate ethical, legal, and practical challenges in real-world health data research. It centres on actively involving patients, trial participants, and public contributors from diverse backgrounds to guide priorities and shape programme outputs.
Aim
To collaborate with patients and the public across the programme to embed the public voice in training content, clinical trial resources, case studies and broader discussions with the trials community.
Methods
The TDfT Public Advisory Group (PAG) includes 22 public contributors, purposively recruited to reflect a broad range of ages, ethnicities, experiences, and regional perspectives across the UK. Through structured discussions, literature reviews, and case studies, the group co-develops training videos and written resources for trialists, delivery staff, data custodians, and public contributors. Training materials include scripted ‘talking head’ videos, available on the HDR UK Futures platform, covering key themes such as data governance, public involvement in trials, and equity in data use. Written resources align with TDfT Routemap stations (e.g., informed consent, pharmacovigilance). PAG members also contribute via the Stakeholder Prioritisation forum to shape programme outputs.
Results
The programme co-developed six training modules and over ten standalone videos with direct public involvement. Public contributors were key in refining content on data utility, recruitment using health data, and creating resources for other contributors. Their input shaped a routemap, due in early 2026, emphasising the ethical and practical issues of using routine health data in trials. The resources offer guidance, case studies, and best practices, supporting researchers and the public across learning styles. Beyond trials, these materials aid those working with administrative and health data governance and have boosted researchers’ confidence in handling complex data, fostering transparency and trust.
Conclusion
Co-producing resources with diverse public contributors enhances transparency and trust in data-enabled trials, ensuring relevance, broad applicability, and continued support for responsible health data use through ongoing multi-institutional collaboration.
- Macey Murray – Health systems data chemotherapy-related neutropenic events: STAMPEDE trial data utility comparisons
View Macey’s abstract
Background and Aim
Health systems data (HSD) are collected during healthcare interactions and have the potential to transform the design and conduct of late phase randomised controlled trials (RCTs). However, there are multiple challenges to widespread HSD use, and one important concern is whether HSD sufficiently capture (and can therefore replace) data that are traditionally recorded on case report forms, a property known as “utility”. Data utility comparison studies (DUCkS) evaluate this aspect by comparing HSD against trial-collected data for completeness and accuracy. Previous DUCkS have assessed specific cancer outcomes recorded in HSD, such as death (fact, date, and cause), major cardiovascular events, and new cancer diagnoses (colorectal, breast, ovarian). The aim of our study was to assess the recording of chemotherapy treatment and neutropenic admissions in NHS England datasets: Hospital Episode Statistics (HES; Admitted Patient Care [APC], Outpatients [OP]) and Systemic Anticancer Therapy (SACT) by comparison with data from the STAMPEDE RCT, the largest interventional prostate cancer trial worldwide.
Methods
Multi-site STAMPEDE trial-collected data (2005-2020) were securely linked using confidential patient identifiers to HES APC and OP and SACT. Data cleaning was undertaken, and events of interest were extracted and compared with associated trial-collected events. HES neutropenic admissions were compared with trial-collected grade ≥1 febrile neutropenia and neutropenic sepsis serious adverse events (SAEs) at both an individual sepsis event (N=213) and patient level (N=155). HES chemotherapy cycles were validated against SACT at an individual cycle event (N=15,376) and patient (N=2,697) level. Sensitivity, percentage agreement and cross-tabulation were used to assess the utility of HES.
Results
At an individual event level, HES identified 14,035/15,376 (91%) of SACT chemotherapy cycles with concordant administration dates. HES also identified 204/213 (96%) of STAMPEDE-recorded grade ≥1 febrile neutropenia or neutropenic sepsis SAEs. At a patient level, percentage agreement was 89.1% between HES and SACT for chemotherapy events, with HES identifying 2,689/2,697 patients experiencing chemotherapy (sensitivity: 100%). There was 87.8% agreement between HES and STAMPEDE-reported sepsis SAEs, with HES identifying 155/155 patients (sensitivity: 100%).
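As an illustration only, the sensitivity figures above can be reproduced from the reported counts with a short Python sketch. The function names are hypothetical and this is not the study's analysis code. The counts passed to `sensitivity` come from the abstract; the cross-tabulation passed to `percentage_agreement` is invented purely to show the calculation, because the abstract does not report the underlying 2x2 tables.

```python
def sensitivity(identified: int, reference_total: int) -> float:
    """Proportion of reference-standard events that the comparison source also identified."""
    if reference_total <= 0:
        raise ValueError("reference_total must be positive")
    return identified / reference_total


def percentage_agreement(both_yes: int, both_no: int,
                         only_comparison: int, only_reference: int) -> float:
    """Percentage of units on which the two sources agree, from a 2x2 cross-tabulation."""
    total = both_yes + both_no + only_comparison + only_reference
    if total <= 0:
        raise ValueError("cross-tabulation is empty")
    return 100 * (both_yes + both_no) / total


# Counts reported in the abstract.
cycle_sensitivity = sensitivity(14035, 15376)    # HES vs SACT chemotherapy cycles
sae_sensitivity = sensitivity(204, 213)          # HES vs STAMPEDE sepsis SAEs
patient_sensitivity = sensitivity(2689, 2697)    # HES vs SACT, patient level

for label, value in [("cycles", cycle_sensitivity),
                     ("SAEs", sae_sensitivity),
                     ("patients", patient_sensitivity)]:
    print(f"{label}: {value:.1%}")

# Hypothetical cross-tabulation, for illustration only.
example_agreement = percentage_agreement(both_yes=8, both_no=1,
                                         only_comparison=1, only_reference=0)
```

Under these counts, the patient-level value of 2,689/2,697 is about 99.7%, which rounds to the 100% reported.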
Conclusions
HES (APC and OP) and SACT successfully captured most of the chemotherapy and neutropenic sepsis events recorded in the trial data, demonstrating their potential as rich data resources for RCTs. Assessment through DUCkS is critical to provide evidence of where HSD can and cannot be used within RCTs; this builds confidence in using HSD among clinical trialists, health data researchers and the public, and will allow HSD to supplement and eventually replace traditional trial data collection methods.
13:00 – 14:00 (Hall 1 and 2, Exhibition)
Lunch
ECR Session (booking only, Online on platform)
14:00 – 14:30 (Lomond Auditorium)
Lightning Talks
Early career researchers will give a short (four-minute) presentation on their exciting new research.
Chair: Patrick Bidulka, Assistant Professor at London School of Hygiene and Tropical Medicine
Speakers:
- Raul Berrocal-Martin – Public involvement in shaping and interpreting analysis of intermediate care use in Grampian
View Raul’s abstract
Description
The Networked Data Lab examined changes in intermediate care use among older adults in NHS Grampian (2019-2023). Experiences and insights from the Biostatistics and Health Data Science (BHDS) Patient and Public Involvement and Engagement (PPIE) group were central to the project, shaping research design, interpretation and communication.
Early engagement introduced the concept of intermediate care and explored personal experiences of care transitions, delayed discharges, and unmet support needs…
This collaboration demonstrated how public involvement can guide research design, clarify interpretation, and improve the accessibility of outputs. Real-life stories provided a frame through which to understand the data and ongoing dialogue ensured that analysis remained relevant to those most affected. The approach taken in this study highlights the value of integrating public voices not only in what is studied, but also in how findings are understood, communicated and used.
- Catherine Bowden – Trustworthy health data research; insights from patient and public involvement
View Catherine’s abstract
Lay summary
Health data research has the potential to improve people’s lives. However, this can only be achieved if the public are able to trust that their data will be used appropriately…
Background
Despite widespread public support for health data research, many people in the UK choose not to share their data…
Objective
To explore how people’s concerns could be addressed to unlock the promise of health data.
Methods
In our GP Data Trust projects we conducted surveys, interviews, and focus groups with people who had opted out of sharing their primary care data…
Results
- Transparency around health data use; particularly its purposes and what benefits are delivered.
- The importance of choice and control in fostering trust.
- A health data literacy campaign to empower individuals…
- Support for a trusted intermediary to oversee the handling of health data…
- The need for high quality data curation.
Conclusion
Patient and public support for health data research must not be taken for granted. The preferences of patients over how and why their health data is used must be respected…
- Nazia Nafis – Mitigating representation bias in health data through AI-based synthetic data generation
View Nazia’s abstract
Description
Representation bias in health datasets disproportionately affects individuals with multimorbidity and frailty, especially from underrepresented and marginalised groups…
By addressing data bias at the source, this work offers a practical, scalable pathway to improve health outcomes for often-overlooked populations – supporting the broader vision of inclusive, data-driven healthcare innovation.
- Arina Tamborska – Open Code Counts: An R package and online tool for better electronic health records research
View Arina’s abstract
Background
Electronic health records (EHR) are among the UK’s most valuable data assets…
Methods
We accessed, processed, systematised and aggregated NHS data on the frequency of clinical code use in primary and secondary care in England between 2013 and 2024…
Results
Open Code Counts provides annual usage for 215,267 SNOMED, 12,380 ICD-10 and 9,857 OPCS-4 codes, with 41.7 billion, 1.3 billion and 408.2 million instances of recording, respectively…
Conclusion
Open Code Counts has already allowed researchers to assess feasibility of defining variables and populations in EHR and assisted in codelist curation and validation…
Plain English summary
Electronic health records (EHR) are digital versions of patients’ medical files…
Open Code Counts has already been helping researchers to plan their work in EHR…
- Robert Nagy – Real-world data based budget impact modelling for health technology appraisal: Oncology drug example
View Robert’s abstract
Background / hypothesis
The NHS faces unprecedented pressures due to resource scarcity and capacity constraints amidst proliferating new healthcare technologies entering the market…
Objective
Our goal is to enable the NHS to make better-informed decisions, based on real-world evidence, about how best to utilise its assets and resources…
Methods
The project linked regional and national routinely collected data to reconstruct and segment clinical pathways of CDK4/6 inhibitor-eligible ABC patients (N=74)…
Results
A prototype BIM was developed, revealing substantial discrepancies between the 5-year base-case BI estimates of the RWD-driven model (£4,045,092.27 for 121 simulated patients)…
Conclusion
BIA offers policymakers and NHS managers estimation of the implications of adopting new healthcare technologies… This pioneering study proposes a solution that demonstrates how RWD can be converted into evidence-based actionable insights…
14:30 – 15:30 (Parallel sessions)
Stream 4: Public health and health inequalities
Location: Lomond Auditorium
Chair: Ralph Akyea, University of Nottingham
Speakers:
- Kelly Fleetwood – The impact of mental illness and the COVID-19 pandemic on post-myocardial infarction mortality
View Kelly’s abstract
Background
People with a mental illness have an increased risk of heart attack and higher mortality post-heart attack. To our knowledge, no study has examined whether the COVID-19 pandemic and its disruption to healthcare has exacerbated existing mental illness disparities in post-heart attack mortality. There has also been little exploration of whether differences in the care people receive post-heart attack contribute to the mortality disparity.
Objective
We aimed to compare 30-day and 1-year mortality following a non-ST segment elevation myocardial infarction (NSTEMI) (the most common type of heart attack) among people with versus without mental illness (schizophrenia, bipolar disorder and depression). We also investigated whether disparities worsened during the COVID-19 pandemic and the extent to which differences in receipt of care might explain disparities in mortality.
Methods
We used linked electronic health records from England, available via the British Heart Foundation Data Science Centre’s CVD-COVID-UK/COVID-IMPACT Consortium. We used data from the Myocardial Infarction National Audit Programme to identify people hospitalised with an NSTEMI between November 2019 and February 2023. We ascertained pre-existing schizophrenia, bipolar disorder and depression from linked primary and secondary care data and mortality from national death records. We used logistic regression to estimate odds ratios (OR) with 95% confidence intervals (CI) for mortality in people with versus without mental illness, adjusting for sociodemographic and clinical characteristics and NSTEMI admission timing. Our study is supported by an advisory group including people with lived experience of mental illness. Our protocol and analysis code are available on GitHub (https://github.com/BHFDSC/CCU046_02).
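For readers less familiar with the method, the Python sketch below shows how an odds ratio and its 95% confidence interval relate to a logistic regression coefficient. It assumes a Wald-type interval (coefficient ± 1.96 standard errors on the log-odds scale), which the abstract does not confirm. The function names are hypothetical and the code is not taken from the study's GitHub repository. The worked example back-calculates a standard error from the schizophrenia 30-day estimate reported in the results (OR 1.73, 95% CI 1.20 to 2.49).

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def odds_ratio_with_ci(beta: float, se: float, z: float = Z_95):
    """Convert a log-odds coefficient and its standard error to (OR, lower, upper)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def se_from_reported_ci(lower: float, upper: float, z: float = Z_95) -> float:
    """Back-calculate the log-scale standard error implied by a reported Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Illustration using the reported schizophrenia 30-day mortality estimate.
beta = math.log(1.73)
se = se_from_reported_ci(1.20, 2.49)
odds_ratio, ci_lower, ci_upper = odds_ratio_with_ci(beta, se)
print(f"OR {odds_ratio:.2f} (95% CI {ci_lower:.2f} to {ci_upper:.2f})")
```

Because the interval is symmetric on the log scale, the reported point estimate should sit close to the geometric mean of the two limits, which is a quick consistency check on published odds ratios.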
Results
We included 107,550 people with NSTEMI, of whom 0.6% had schizophrenia, 0.6% had bipolar disorder and 26.3% had depression. Compared to people without mental illness, the adjusted odds of 30-day mortality were higher amongst people with schizophrenia (OR 1.73, 95% CI 1.20 to 2.49) and depression (OR 1.17, 95% CI 1.08 to 1.26). The adjusted odds of 1-year mortality were higher amongst people with schizophrenia (OR 1.75, 95% CI 1.39 to 2.21), bipolar disorder (OR 1.56, 95% CI 1.20 to 2.02) and depression (OR 1.17, 95% CI 1.11 to 1.23). There was no evidence that these disparities changed during the COVID-19 pandemic. Our models suggest that differences in receipt of care, including lower receipt of invasive coronary angiography within the target 72 hours, may explain some of the mortality disparity.
Conclusion
Mental illness was associated with higher risk of 30-day and 1-year mortality post-NSTEMI, with disparities unchanged by the COVID-19 pandemic. Differences in receipt of care may partly explain the excess mortality in people with mental illness. This study highlights the urgent need to investigate why people with severe mental illness (SMI) are less likely to receive guideline-indicated care for heart attack and to inform approaches to improve receipt of care and ultimately survival in this marginalised group.
- Ash Routen and Fatemeh Torabi – Enhancing diversity and quality in health data
View Ash and Fatemeh’s abstract
Ethnicity data are vital for tackling health inequalities, yet remain incomplete, inconsistent, or missing in many routine health records. As part of HDR UK’s cross-cutting theme on diversity, our UK-wide team – recognised as HDR UK’s Team of the Year for “Enhancing Diversity and Quality in Health Data” – has taken a major step forward in addressing this challenge. Drawing on 46 million records from 26 linked sources within the Secure Anonymised Information Linkage (SAIL) Databank, we created the first national ethnicity spine for Wales (2000–2021). Alongside this technical achievement, we convened experts to shape recommendations across six themes: the value of capturing ethnicity and wider determinants of health; the need for standardisation; communication and transparency in data use; training and guidance for collection; data linkage to enhance completeness; and international harmonisation. Together, these efforts demonstrate the power of collaboration in building robust, representative, and equitable health data systems.
- Isobel L Ward – The impact of an endometriosis diagnosis on employment status and earnings among women in England
View Isobel’s abstract
Background
Endometriosis has physical, psychological, social, and economic impacts; however, there has been no national population-based analysis of the labour market impacts of this condition in England. The Women’s Health Strategy for England has highlighted the lack of research into women’s health conditions, including endometriosis. We linked HM Revenue and Customs (HMRC) Pay-As-You-Earn Real Time Information to individual-level census data and endometriosis diagnoses (from Hospital Episode Statistics (HES)) to examine the impact of having an endometriosis diagnosis on earnings and the probability of being in employment.
Methods
We used de-identified monthly employee pay records from HMRC linked to HES Admitted Patient Care (APC) data and detailed sociodemographic information from Census 2011. Our study population included 55,920 women who had a primary diagnosis of endometriosis in an NHS hospital between April 2016 and December 2022, were aged 25 to 54 years at the time of diagnosis, and could be linked to HMRC, HES and Census 2011 data. We used fixed effects regression models to estimate the average changes in employee pay and employment status attributable to being diagnosed with endometriosis. Outcomes were estimated at different time periods after diagnosis, compared with the two-year period before the month of diagnosis. We adjusted for age, calendar time and births. Monthly pay was deflated to 2023 prices, and being a paid employee was defined as receiving a monthly pay greater than £0. During the study, we engaged with Endometriosis UK and women with lived experience of endometriosis.
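To give a feel for the fixed effects approach, the Python sketch below implements a minimal within-person estimator with a single regressor (a post-diagnosis indicator). It is a simplified illustration, not the study's model: it omits the adjustments for age, calendar time and births, and the data and function name are entirely synthetic and hypothetical. Demeaning pay within each person removes stable individual characteristics, so the estimate reflects only the change relative to each person's own pre-diagnosis level.

```python
from collections import defaultdict


def within_estimate(rows):
    """Fixed effects (within) slope for rows of (person_id, x, y).

    Each person's x and y are centred on that person's own means, then a
    single slope is fitted to the pooled, demeaned data.
    """
    by_person = defaultdict(list)
    for person_id, x, y in rows:
        by_person[person_id].append((x, y))

    numerator = 0.0
    denominator = 0.0
    for observations in by_person.values():
        mean_x = sum(x for x, _ in observations) / len(observations)
        mean_y = sum(y for _, y in observations) / len(observations)
        for x, y in observations:
            numerator += (x - mean_x) * (y - mean_y)
            denominator += (x - mean_x) ** 2

    if denominator == 0:
        raise ValueError("no within-person variation in x")
    return numerator / denominator


# Synthetic monthly pay (GBP): x = 1 after diagnosis, 0 before.
synthetic_rows = [
    ("A", 0, 2000), ("A", 0, 2000), ("A", 1, 1900), ("A", 1, 1900),
    ("B", 0, 3000), ("B", 0, 3000), ("B", 1, 2950), ("B", 1, 2950),
]
estimated_change = within_estimate(synthetic_rows)
print(f"Estimated average change in monthly pay: {estimated_change:.2f}")
```

In this toy example person A's pay falls by 100 and person B's by 50, so the within estimate is their average, even though the two people have very different pay levels.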
Results
Compared with the two-year period before being diagnosed with endometriosis, monthly pay initially dropped on average in the first three months post-diagnosis, then returned to pre-diagnosis levels from 4 to 12 months. From one to five years after diagnosis, there was a statistically significant average decrease in monthly earnings among women. Among women in paid work, the average decrease in monthly pay reached £56 (95% confidence interval £29 to £83) in the four to five years post-diagnosis. This suggests that, following a diagnosis, women in work may be taking lower-paying jobs or working fewer hours. The probability of being a paid employee decreased by 2.7 percentage points in the four to five years post-diagnosis, compared with the two years before diagnosis.
Conclusion
Our findings show that following an endometriosis diagnosis, women earn less and are less likely to be in paid employment compared with the two-year period before diagnosis. These results highlight a clear need for a review of appropriate healthcare services and policies around endometriosis and the workplace. This study has received coverage from media outlets including the BBC, Forbes and Women’s Health magazine, and National World recently cited this study in an open letter to UK health ministers calling for improvements in endometriosis treatment.
- Emma Whitfield – Emergency diagnoses in 18 conditions: a population-based study of linked electronic health records
View Emma’s abstract
Background
Emergency diagnoses (EDs) – the use of emergency hospital care in the 30 days before a new diagnosis – occur for a variety of complex reasons, including rapid deterioration of symptoms or disease progression, patient factors, and system delays. Emergency diagnosis of cancer is associated with advanced stage at diagnosis and poor prognosis; however, little is known about EDs in other conditions. This study aimed to evaluate the frequency and prognostic implications of emergency diagnoses in England for 18 diverse conditions.
Methods
We used linked records from CPRD, HES, and ONS to identify patients with first incident diagnoses of conditions of interest between 1 January 1999 and 31 December 2019. Eighteen conditions were examined, comprising infectious (Lyme disease, subacute bacterial endocarditis (SBE), tuberculosis), autoimmune (rheumatoid arthritis, ankylosing spondylitis, coeliac disease, inflammatory bowel disease), and neurological conditions (Parkinson’s disease, multiple sclerosis) alongside COPD, PCOS, coronary/ischaemic heart disease, and schizophrenia and other chronic psychoses. Five cancers (brain, colon, lung, ovarian, pancreatic) were included for benchmarking.
ED was defined as an emergency inpatient admission starting on, or up to 30 days before, the diagnosis date. For each condition we measured ED frequency and examined trends over time, age at diagnosis, and deprivation level, stratified by gender and CPRD population (Aurum or GOLD).
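The emergency diagnosis definition can be expressed as a simple date rule. The Python sketch below is an illustration under that stated definition only; the function and constant names are hypothetical, and the study's own implementation (including how admissions and diagnosis dates are derived from CPRD and HES) is not described in the abstract.

```python
from datetime import date

ED_WINDOW_DAYS = 30  # window stated in the abstract


def is_emergency_diagnosis(emergency_admission_start: date,
                           diagnosis_date: date,
                           window_days: int = ED_WINDOW_DAYS) -> bool:
    """True if the emergency admission started on, or up to `window_days` before, the diagnosis date."""
    days_before = (diagnosis_date - emergency_admission_start).days
    return 0 <= days_before <= window_days


# Hypothetical dates for illustration.
diagnosis = date(2019, 1, 31)
print(is_emergency_diagnosis(date(2019, 1, 1), diagnosis))    # admission 30 days before
print(is_emergency_diagnosis(date(2018, 12, 31), diagnosis))  # admission 31 days before
```

A patient would then be counted as an emergency diagnosis if any of their emergency admissions satisfies this rule.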
We estimated adjusted odds ratios (aORs) for the association between ED status and death in the year after diagnosis, and adjusted rate ratios (aRRs) for the association with total duration of time in hospital in the year after diagnosis, adjusting for age and year of diagnosis, deprivation level, comorbidity burden, and the diagnosis source.
There were limited opportunities for PPI during analysis. After discussing the study plans with PPI representatives it was agreed that multimedia summaries would be co-produced in order to disseminate findings to a wider audience.
Results
ED frequency ranged from 5.96% (95% CI: 5.18% to 6.84%; Lyme disease, women, Aurum) to 82.6% (81.6% to 83.6%; SBE, men, Aurum), was stable or increased slightly over time, and exhibited U- or J-shaped associations with age. aORs for 1-year mortality following an ED ranged from 1.04 (0.837 to 1.29; SBE, men, Aurum) to 7.53 (5.64 to 10.1; coeliac disease, women, Aurum), while aRRs for duration of time in hospital ranged from 1.16 (1.07 to 1.25; SBE, men, Aurum) to 10.3 (5.88 to 18.1; coeliac disease, men, GOLD).
Conclusion
Emergency diagnoses often comprise a substantial minority of cases or are the dominant diagnostic route and are associated with poor prognostic implications in a wide range of conditions. EDs and their predictors are not unique to cancer and are possible markers of delayed diagnosis in other conditions. Future research should consider whether, and how, EDs and related excess mortality in emergency-diagnosed patients could be prevented.
Stream 5: Innovating the data infrastructure
Location: Alsh 1
Chair: Gordon Milligan, Deputy Director of the Alleviate (Advanced Pain Discovery Platform) Hub
Speakers:
- Philip Couch – More than data are needed to save lives! Supporting the next generation of SDE RTPs
View Philip’s abstract
Lay summary
The Secure Data Environment Team Development Hub (https://sdertp.org) aims to increase the capacity and capability of the Secure Data Environment (SDE) Research Technical Professional (RTP) workforce by creating a strategic hub that fosters talent development at various career stages, co-develops a relevant training framework with the community, delivers flexible learning options and promotes an inclusive professional environment.
Background
The UK Government strongly recognises the need for NHS data to be made available for research. To realise this vision, recent reviews have highlighted the need to create teams/institutions in both academia and healthcare, often spanning both sectors to deliver infrastructure. However, there are some barriers to the recruitment and retention of highly skilled Research Technical Professionals in some research environments. This can be related to factors such as limited opportunities for career progression and difficulty keeping skills up to date. The SDE Team Development Hub has been established to directly address some of these key barriers.
The key objectives of the Hub are:
- Directly increasing capacity through recruitment of RTPs into existing teams developing SDEs in the Northwest of England.
- Supporting the career development of RTPs through the creation of a competency framework.
- Supporting the training of RTPs by developing a curriculum.
- Facilitating direct training and certification of new and existing RTPs.
- Developing an online resource to share our outputs and experience with the wider SDE community.
Approach
- Increasing SDE RTP capacity and capability. The Hub is offering six short-term placements to undergraduate or Master’s students over its lifetime. Additionally, nine new RTP trainees are being supported – three newly recruited and six existing staff working on practical projects aligned with real-world challenges.
- Co-developing a competency framework and curriculum. Input is being gathered through a series of workshops and surveys involving stakeholders from across the SDE ecosystem. These sessions are defining the required competencies for creating and maintaining SDEs. The resulting framework will inform a development pathway for RTPs.
- Training. A map detailing training and development related to the competency framework and curriculum is being produced and trialled with the short-term placement cohort. Technical training includes formal training leading to certification, university/NHS training and new bespoke training developed by the Hub where gaps are identified.
- Developing an inclusive professional environment. Team Science principles will be embedded to ensure RTPs’ contributions are recognised across disciplinary and organisational boundaries. Existing training on team dynamics will be applied.
Results
A protocol for the competency framework has been developed and workshops have been held to gain input from the community. The initial framework is available through our website. Some Master’s level projects have been completed and others have started. Adverts are about to go live for RTP placements.
- Simon Li – A standard architecture for TREs: why is it important in a federated world?
View Simon’s abstract
Background
The Standard Architecture for Trusted Research Environments (SATRE), version 1.0, provided a baseline set of best-practice capabilities relevant to Trusted Research Environments (TREs) in the UK, developed jointly with TRE operators and public representatives. It sets out requirements that TREs holding personal data should fulfil to demonstrate equivalence in a transparent manner. It was published in October 2023 and was quickly adopted by TREs across the UK and beyond.
The initial success of SATRE was due to two main factors: it was driven by open collaboration with a grassroots community, and it encapsulated the existing state of most TREs, as opposed to imposing a top-down view of what a TRE should be. This was primarily achieved through Collaboration Cafés – open events that anyone can attend – supported by other events to obtain input and feedback from the wider community.
Independent communities have used SATRE as a starting point for federation projects including the NHS regional SDEs, the Scottish Safe Haven network, and the European Open Science Cloud ENTRUST network of federated TREs.
Others have also asked for extensions to SATRE, such as for natural language processing, or data tiering and classification. Finally, several groups planning to deploy new TREs have requested guidance on how to implement a new SATRE-compliant TRE.
Objectives
Version 2 of SATRE, through the TREvolution Core Programme, incorporates guidance on TRE federation for the UK and beyond. Supporting these needs is critical to keeping the community involved in SATRE and to avoiding multiple competing or conflicting standards. Version 2 also establishes a long-term sustainable governance and accreditation model.
The new version also allows for infrastructure that was, and still is, under active development, with limited existing best practice. This has required a different approach to that taken for version 1.0, with a layered specification model providing not only architectural but also implementation guidance.
A significant outcome of the SATRE work is a specification for TREs based on Kubernetes, a popular, flexible and standard open-source platform for managing containerised workloads and services, which is already making inroads into the TRE ecosystem. Encouraging Kubernetes as a common, community-owned infrastructure baseline allows components to be shared across implementations where practical, and provides a solution for independent TREs to be part of a federated network.
Finally, we provide an initial version of a technical specification-compliant vendor-neutral reference TRE, called “K8TRE”. This is suitable for pilot deployments, and development continues to make it production ready. K8TRE is fully open-source and can be deployed on private infrastructure, or in the public cloud, with minimal adaptation.
- Lars Murdoch – ScottisH Medical Imaging Dataset with Evaluation of Linked Data for Cardiovascular Disease (SHIELD)
View Lars’s abstract
Background
Routinely collected imaging data is underutilised in research due to challenges in collecting, accessing and curating it. SHIELD-CVD (ScottisH medical Imaging dataset with Evaluation of Linked Data for CardioVascular Disease) is a collaborative initiative led by the British Heart Foundation Data Science Centre, in partnership with Public Health Scotland and experts from King’s College London. The project’s primary aim is to enhance the usability and accessibility of the Scottish Medical Imaging (SMI) dataset for cardiovascular research. SMI contains over 57 million radiological imaging studies that can be linked to other routinely collected healthcare datasets.
Objectives
- Enhance dataset transparency: Provide clear, detailed information on imaging modalities and their disease representations within the SMI dataset. This will include information on available imaging types, reconstructions and sequences.
- Streamline governance processes: Simplify the administrative procedures required for researchers to access and utilise the dataset.
- Develop reusable curation tools: Create and share code and methodologies that facilitate efficient data curation and analysis of the imaging and linked electronic health record. Both imaging based and disease based cohorts will be curated within SMI to facilitate research applications and accelerate research analyses.
- Apply FAIR principles: Ensure the dataset is Findable, Accessible, Interoperable, and Reusable, aligning with best practices in data management.
Impact and potential benefits
- Accelerated research: The curated SMI data could significantly expedite research studies. This will accelerate our understanding of cardiovascular diseases, potentially leading to new treatments or management strategies.
- Improved lives of patients in Scotland: By accelerating research and increasing the utility of this dataset, there is potential for researchers to identify mechanisms for preventing and treating disease across the population. The insights gained could be used to enhance the quality of life for thousands living with cardiovascular disease in Scotland, their families and carers.
- Early detection and intervention: The ability to identify high-risk individuals through analysis of electronic health data is a critical benefit. Early detection of cardiovascular disease can allow for mechanisms to slow the progression of disease, improve quality of life, and reduce the burden on caregivers and healthcare systems.
- Enhanced prediction: Similar to the management of cardiovascular diseases, accurate prediction and early detection of dementia could lead to improved management of risk factors. This proactive approach in healthcare can lead to more effective treatment strategies, tailored to individual needs.
Conclusion
SHIELD-CVD is committed to fostering a collaborative research environment. By improving access to and understanding of the SMI dataset, we aim to empower researchers to conduct impactful cardiovascular studies that ultimately benefit patients and the wider public.
- Reecha Sofat – UK CliC: Unlocking the power of clinical cohorts in health research
Background
Cohort research studies collect bespoke, detailed data, including clinical, genetic, biomarker, wearable and imaging data on individuals with a specific disease. However, because disease cohorts often collect individuals’ health data at the time of an event, and for a period prior to or following an event, causal and prognostic research can be challenging. Linking these rich cohort data with routine health records could transform and accelerate our understanding of disease causes, progression, and treatment.
Objectives
The BHF Data Science Centre aims to unlock the power of clinical cohorts by enabling:
- Secure linkage to routine NHS data in a cost- and time-efficient manner
- Integration of additional clinical, multi-omic, imaging, and wearable cohort data
- Cohort owners and approved researchers to securely access, analyse, and share enhanced multimodal cohort data via a privacy-preserving, trustworthy, and scalable platform
Infrastructure development
We have established the UK Clinical Cohorts (UK CliC) Trusted Research Environment (TRE) in collaboration with UK Secure Research Platform (UK SeRP) and the Secure Anonymised Information Linkage (SAIL) Databank. Pseudonymised cohort data linked to NHS data will be made available to researchers following the findable, accessible, interoperable, and reusable (FAIR) principles using established SAIL application processes after an embargo period.
A transparent governance, operational, and contract framework has been set up in collaboration with cohort study research teams, public contributors, HDR UK governance and legal specialists, and the platform provider.
The BHF Data Science Centre health data scientist team, who have established reusable algorithms to clean and curate routinely collected NHS data, will provide curation, guidance and expertise, along with SAIL.
Impact
- Streamlining data access applications and governance to access linked health data, leveraging the expertise of the HDR UK information governance and legal teams
- Reducing research costs by sharing data linkage expenses across studies
- Leveraging the expertise and services of the BHF Data Science Centre health data scientist team
- Ensuring robust data security through a TRE
UK CliC is now available to any UK cardiometabolic cohort study, with 24 cohorts in the process of being onboarded. We have engaged with the wider research community and developed a community of cohorts to fully realise the potential of linked cohort data.
Next steps
UK CliC has been developed for cardiometabolic cohort studies. However, the infrastructure and processes are applicable for delivering a disease-agnostic platform, which could bring even greater efficiencies. We are engaging with a range of funders to explore options for sustaining a disease-agnostic UK CliC service for UK-wide disease-cohort research. The availability of this data to the wider scientific community will facilitate and accelerate translational science across a range of disease domains.
We are also developing the processes and pipelines to extend UK CliC to support multi-modal data such as imaging, omics, and wearables data.
Stream 6: Sponsors’ talks
Location: Alsh 2
Chair: Edith Milanzi, IQVIA
Speakers:
- Andy Boyd – UK Longitudinal Linkage Collaboration: the Trusted Research Environment for data linkage in the longitudinal research community
Background
The UK Longitudinal Linkage Collaboration (UK LLC) partnership has been established to provide a centralised infrastructure to enhance Longitudinal Population Studies (LPS) by systematically linking study data with participants’ routine health, administrative and place-based records. By centralising these functions and offering them as a service, UK LLC will unlock new scientific opportunities, substantial cost efficiencies and form an expert hub for record linkage that drives innovation and knowledge exchange.
Method
We established a centralised Trusted Research Environment (TRE) with a bespoke governance framework tailored to the needs of longitudinal research and meeting the expectations of consented participants. Our TRE is hosted on the Secure eResearch Platform and is accredited to the ISO 27001 and Digital Economy Act standards. A single application process enables researchers to access the TRE and incorporates the needs and voices of many stakeholders, including data owners and public contributors. We have developed new data pipelines with the NHS, Administrative Data Research UK and the Office for National Statistics to link health and administrative records.
Results
UK LLC is a partnership of over 20 of the UK’s most established interdisciplinary LPS with over 400,000 participant records. Participants’ data are linked to NHS records and place-based exposures across the UK. This resource is now accessible to bona fide UK researchers for public good research. Administrative tax, work and pensions, and education records are being added to the resource throughout 2025–26. This data flow is enabled by:
- A model where a Trusted Third Party processes participant identifiers for many different data owners
- Creation of a novel longitudinal data pipeline, enabling linkage, data extraction and update of records over time
- An access framework where a Linked Data Access Panel considers applications on behalf of data owners (e.g., the NHS), with review by a public panel and distribution of applications to LPS for approval
- Mechanisms to ensure long-term sustainability and scalability, and to balance governance of consented research with inclusion of wider individuals where this promotes research equity
Conclusion
UK LLC provides a simple-to-access strategic research-ready platform for interdisciplinary longitudinal research. Our partnership of studies, data stewards and infrastructure experts is establishing multiple precedents for pan-UK and interdisciplinary linkages. These include extending linkages of LPS participants to previously inaccessible socio-economic datasets and developing new governance approaches such as reviewing the validity of consent. UK LLC is positioned to allow researchers to investigate cross-cutting themes and ingrained inequalities by enabling triangulation of routine records with partner studies’ uniquely granular data on lived experience, behaviours, aspirations and other data not included in official records. Pooling and systematic linkage of large-scale, diverse LPS will help provide the numbers to study rarer outcomes and seldom-reached populations.
- Adam Milward – Making Data FAIRer within the HDR UK partnership: A practical demonstration
Objectives
We aim to share the latest best practice on making data FAIRer within the HDR UK partnership, enhancing the data user journey, accelerating research and innovation, and ultimately improving patient care.
We will raise awareness of the tools and processes available to the partnership to manage data, including federating or combining datasets with other relevant organisations, and managing access requests. This allows users to improve efficiency and utility while harnessing the power of their data for good.
Method
An oral presentation including a technical demonstration focusing on enhancing the data user journey, and discussion of a live use case developed by MetadataWorks and KMS SDE.
We will demonstrate key features of the tool, live searching deidentified datasets on a user-friendly interface, requesting access and managing requests, from both the user and the data owner perspectives. We then outline how to implement similar projects with best practice notes to help attendees get more from their own data.
Results
We’ll showcase how to:
- Federate or share datasets with other suitable organisations, improving efficiency and widening data audiences
- Create user-friendly interfaces to help data users quickly find and access what they need while maintaining the highest security standards
- Promote datasets via integrated search engine optimisation
- Make existing datasets discoverable and improve user engagement once they are on the platform
These improvements increase visibility and utility of datasets and encourage the goal of using data for public good.
Conclusion
Attendees will leave the presentation with pragmatic tips and advice for upgrading tools and processes to meet best practice standards for FAIR data.
- Craig J. Currie – Automated analysis of routine healthcare data: Where are we, and where are we going?
Automating the complexities of analysing routine healthcare data (real-world data) is extremely challenging. Observational studies using these data can take from many months to a couple of years. These challenges are not just technical; they are cultural.
Access to anonymous data remains severely restricted, and the multidisciplinary skills necessary to curate and analyse these data remain scarce. At a time of fast-moving advances in computing, the barriers to analysis of NHS data to facilitate public health, healthcare management and other purposes remain restrictive, as they have always been. On the other hand, the mood music appears to have shifted, and the willingness to exploit the value of NHS data is growing.
Advances in computing, allied to the availability of generative AI models, offer an opportunity to liberate the promise of this precious resource while allaying fears among sceptics. This could also mean that the general public can engage in their own scientific investigations using these same data, while maintaining scientific integrity. Peer review may soon become obsolete in this field, as published materials can be immediately replicated and fact-checked.
Here we demonstrate how these tools have advanced in the UK using the example of Livingstone, and provide a glimpse of what may soon be possible if the doors—largely closed to innovation—are opened. Depending on perspective, these evolutionary changes may seem either daunting or exciting.
We are at an inflection point in healthcare data analysis. Embracing change offers a paradigm shift that should translate into considerable improvement in treatment value for money and accrual of healthcare benefits.
- Cassie Smith – The SafeGUARDS: a principles-led governance framework for using data about people for research
Background
Data governance models for research using data about people are complex and have given rise to significant levels of public distrust. Reflecting on the challenges of rapidly creating new combinations of data in the COVID-19 pandemic, the Science Academies of the “Group of Seven” nations called for the G7 to establish health data as a global public good by working together to adopt principle-based governance systems. Yet, we have lacked a comprehensive and easy-to-communicate governance framework to enable this.
Methods
Professional deliberative workshops led by the Pan-UK Data Governance Steering Group, including the Office for National Statistics, NHS and regulators, and at the International Population Data Linkage Network conference. Public involvement from the Health Data Research UK Public Advisory Board and consultation via an online survey.
Results
The GUARDS are a novel set of overarching principles to help data stewards consider and balance the factors and competing tensions that impact on our use of data. The GUARDS require that research data use is:
- Guided by public and professional involvement
- Understandable and transparent
- Aligned to common standards
- Responsible in meeting ethical requirements and delivering equitable outcomes
- Delivering in times of crisis and capable of informing policy needs
- Stewarded by a distinct profession with diverse skillsets, roles and responsibilities
The GUARDS sit at a high level and are designed to knit together a wide spectrum of connected requirements. They will help collate and promote governance best practice and highlight areas where further action is required. The principles act in synergy with the more operational ‘Five Safes’—conceptualised together as the SafeGUARDS framework. The GUARDS also interact with wider principles, such as the FAIR and CARE principles, and do not replace them. To support adoption, a series of aspirations have been identified for each GUARDS pillar, and HDR UK will support centralised collation of good practice and resources from across the data science community.
Conclusion
The SafeGUARDS provide an end-to-end governance framework to ensure population data science balances scientific utility with trustworthiness and the social licence to operate. They are designed to promote effective communication, collate best practice, and support knowledge exchange. The principles will evolve over time to reflect continuous improvement, respond to emerging issues, and invite wider international reflection.
15:30 – 16:00 (Hall 1 and 2, Exhibition)
Coffee break
16:00 – 16:50 (Lomond Auditorium)
Title: Mayo Clinic Platform: Harnessing the power of platform thinking to transform care
Keynote speaker: Dr John Halamka, Dwight and Dian Diercks President at Mayo Clinic Platform
Chair: Dame Janet Thornton, Health Data Research UK (HDR UK) Board member
16:50 – 17:00 (Lomond Auditorium)
Title: Summary remarks and close (including presenting Lightning Talk Award)
Keynote speaker: Katie Wilde, Co-Director of HDR UK Scotland and Grampian Safe Haven (DaSH) Director and Head of Digital Research, University of Aberdeen
17:00 – 18:00 (Hall 1 and 2, Exhibition)
Drinks reception and poster exhibition
18:00
Close