Assessing and improving ethnicity data in England health records
7 August 2024
Ethnicity data from general practice and hospital records of more than 60 million people in England has been studied in detail for the first time.
Each quarter, the HDR UK Impact Committee consider dozens of articles and select the most impactful examples, ranked against core pillars of the HDR UK ethos: research quality, team science, scale, open science, patient and public involvement, patient impact and diversity.
One of the papers selected by the committee was ‘Ethnicity data resource in population-wide health records: completeness, coverage and granularity of diversity’ by Pineda-Moncusí et al. as part of the CVD-COVID-UK/COVID-IMPACT Consortium supported by Health Data Research UK (HDR UK) and the British Heart Foundation (BHF) Data Science Centre.
The challenge
Health inequalities with respect to ethnicity and other sociodemographic factors were highlighted during the COVID-19 pandemic, and the same inequalities are seen across many other disease areas. Accurate and standardised collection of ethnicity data in electronic health records is a major challenge, but it is crucial for understanding and investigating health inequalities. UK health records categorise ethnicity in several different ways, and many of these schemes collapse detailed categories into broader ones, with some loss of information or accuracy. One example is the translation of the 20-category NHS coding system into the 6-category ONS system.
“Emerging AI-based healthcare technology depends on the data that is fed into it, therefore a lack of representative data can lead to biased models that ultimately produce incorrect health assessments. It is important that we focus on better understanding and improving the quality of ethnicity data, as well as the methodology to collect and use it for research, to ensure that any advancements are inclusive, free from bias, and benefit everyone,” says senior author Professor Sara Khalid, Associate Professor of Health Informatics and Biomedical Data Sciences.
The impact
Pineda-Moncusí et al. addressed this challenge by curating a population-wide ethnicity data source from the primary care and secondary care records of over 60 million individuals in England. They assessed the quality of the ethnicity data by looking at its completeness, its inconsistency across records, and its granularity.
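As a rough illustration of these three checks, the Python sketch below computes them on a handful of made-up records. The record layout, the `summarise` function and the labels are all hypothetical and are not taken from the paper or its published code. The study's own definitions of completeness and inconsistency may well differ.

```python
from collections import defaultdict

# Hypothetical toy records: (patient_id, ethnicity label, or None if not recorded).
records = [
    ("p1", "Indian"),
    ("p1", "Indian"),
    ("p2", None),
    ("p3", "Chinese"),
    ("p3", "Caribbean"),
    ("p4", None),
    ("p4", "African"),
]


def summarise(records):
    """Return simple completeness, inconsistency and granularity measures."""
    labels_by_patient = defaultdict(set)
    for patient_id, label in records:
        # Looking up the key registers the patient even if the label is missing.
        patient_labels = labels_by_patient[patient_id]
        if label is not None:
            patient_labels.add(label)

    n_patients = len(labels_by_patient)
    # Completeness: share of patients with at least one recorded ethnicity.
    n_with_label = sum(1 for s in labels_by_patient.values() if s)
    # Inconsistency: patients whose records carry more than one distinct label.
    n_conflicting = sum(1 for s in labels_by_patient.values() if len(s) > 1)
    # Granularity: number of distinct labels observed across all patients.
    distinct = set().union(*labels_by_patient.values())

    return {
        "completeness": n_with_label / n_patients if n_patients else 0.0,
        "n_conflicting": n_conflicting,
        "n_distinct_labels": len(distinct),
    }


summary = summarise(records)
print(summary)
```

In this toy data, three of the four patients have at least one recorded ethnicity, one patient has conflicting labels, and four distinct labels appear.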
They then reconciled the various coding schemes into a consistent hierarchy of categories, organising over 250 ethnicity subgroups that go beyond the standard, less nuanced categories. More accurate, accessible and useable ethnicity data is crucial for addressing important disparities in health across ethnicities, for understanding the diverse populations that make up electronic health records, and for influencing policy that can move towards better health for all.
“It is hoped that the data curated in this work can be used by future researchers including ethnicity in their work and ultimately contribute towards closing the gaps in healthcare,” adds Professor Khalid.
What the Impact Committee said
The Impact Committee scored this paper highly across all scoring criteria. In particular, the study was impressive in its scale, covering a near-whole-population dataset of health records in England. The study also showed an exemplary commitment to diversity, equality and inclusion through its focus on developing better data, and thus better research, for minority populations and health inequalities, and it has considerable potential for impact in this area. Open science was another standout feature, as all the analysis code for the project was published on GitHub, and the study was accomplished as part of a large team effort.