Experts warn of risk of 'data poverty' if data underpinning AI health technologies is not representative and inclusive

Researchers from INSIGHT, the Health Data Research Hub for eye health, and University Hospitals Birmingham NHS Foundation Trust are warning of the risk of reinforcing healthcare inequalities if new AI health technologies are based on unrepresentative datasets – a situation they describe as ‘health data poverty’.

In recent years, many academic and commercial organisations have been developing artificial intelligence (AI) and other digital health technologies based on publicly available datasets. But there is little information about how many datasets actually exist or the diversity of people and health conditions represented within them, which could lead to the development of technologies and products that only work for certain groups or countries.

Focusing on eye health – one of the leading areas of digital innovation – consultant eye specialist and director of INSIGHT Professor Alastair Denniston and his colleagues carried out a global search for publicly available datasets and assessed the extent to which they represented the diversity and needs of the world’s population.

They identified and analysed 94 datasets containing 507,724 clinical images and 125 videos of eyes gathered from at least 122,364 people. They then created a comprehensive catalogue detailing the source of each dataset, its accessibility, and the populations, diseases and types of images represented within it.

The team found that most images came from populations in Asia and Europe, with very few datasets from large parts of the world such as sub-Saharan Africa (one dataset) and South America (two datasets). Looking closer, they discovered that information about the people within each dataset was generally poor, with basic demographic information such as age, sex and ethnicity missing from more than one in five (20%) datasets.

This lack of geographical diversity and demographic data is concerning, because technologies developed using data from one population may not work effectively when applied in a different group of people or part of the world.

There were also significant disparities in the types of diseases covered by the datasets. The majority of images were relevant to diseases such as diabetic retinopathy, glaucoma and age-related macular degeneration, mainly because these images are routinely collected as part of healthcare and screening in countries with advanced modern health infrastructure.

However, cataracts, trachoma and refractive error – which have been designated as priority eye diseases by the World Health Organization and account for half of all global blindness – were significantly under-represented. These conditions are common in low- and middle-income countries, where digital technology could make a big difference in enabling access to healthcare.

Whilst these are areas that have not traditionally included imaging as part of their standard assessment, there is increasing evidence that such technologies could play a role. But the lack of relevant data for developing and training AI-based tools makes it less likely that researchers and companies will be able to develop products that could help.

Publishing their analysis in The Lancet Digital Health, the researchers highlight that this situation risks creating ‘health data poverty’ for populations and countries that are under-represented in current health datasets.

Professor Denniston said,

“We hope that our catalogue will raise awareness of more diverse datasets for the development of AI-based health technologies. We need to act now to encourage health systems and researchers to invest in publicly available datasets to support research and innovation in areas that are currently data poor. Otherwise, we risk perpetuating a growing digital divide where healthcare technologies are only developed to benefit diseases, populations and countries with advanced data infrastructure.”

Caroline Cake, CEO of Health Data Research UK, said,

“The coming generation of digital health technologies are only as good as the data we use to develop them, and this new study highlights the fact that datasets must be representative and inclusive if these tools are to be relevant and applicable to all. We are committed to working with our national and international partners to ensure that advances in digital healthcare bring benefits to everyone.”

The research team was drawn from the University of Birmingham; University Hospitals Birmingham NHS Foundation Trust; Moorfields Eye Hospital NHS Foundation Trust; the London School of Hygiene and Tropical Medicine; McGill University, Montreal, Canada; and Health Data Research UK.

Read more

A Global Review of Publicly Available Datasets for Ophthalmic Imaging: Recognising Barriers to Access, Usability and Generalisability (2020), Saad M. Khan et al., The Lancet Digital Health. DOI: 10.1016/S2589-7500(20)30240-5

Find out more at hdruk.org. Get involved on Twitter: @HDR_UK #HealthDataTogether
