Global mapping of publicly available datasets: understanding the barriers to visibility, accessibility and utility

The challenge

Many artificial intelligence algorithms are developed using large, publicly accessible datasets. But we know very little about what data is out there, where it comes from, and the people, settings and health conditions it represents. The risk is that new AI health technologies will be based on unrepresentative datasets and will therefore only work for some people in some contexts, leaving countless others behind – an issue known as ‘health data poverty’.

The solution

Eye health is one of the leading areas of digital innovation. Through global searches and analysis, this project is mapping which publicly available eye imaging datasets exist and reviewing the extent to which they represent the diversity and needs of the world population. By understanding and assessing the information used by AI algorithms, the team can identify gaps – such as a lack of basic, clinically important information about the people represented (like age, sex and ethnicity) or disparities in which people and conditions are represented. The team can then work to understand why this is the case – and look for possible solutions. For example, if data isn’t publicly available, why? Are there barriers to collecting it, and how can we overcome these barriers? Or are there particular challenges in making more representative data visible, accessible and usable?

Only 1 of the 98 eye health datasets we identified and assessed came from sub-Saharan Africa – most came from populations in Asia, North America and Europe – and none came from Australia and New Zealand. This means the people and diseases in these datasets represent only a small part of the global population, and others are left ‘off the map’.

—Xiao Liu

The impact and outcomes

The project has already uncovered major gaps in the publicly available data on eye health, highlighting a concerning lack of data on certain conditions, parts of the world and population groups.

The ambition is to extend these reviews to other health disciplines to understand the scale of the problem, and to make sure that new AI health technologies are based on representative datasets, so that everyone benefits from AI innovations and decision-making for better health and care.