A growing concern amongst the genetics community is the lack of ethnic diversity in health data. In particular, there is concern that polygenic risk scores (PRS) developed to inform clinical decision-making are frequently based on individuals from European ancestries.

This bias means that polygenic risk scores – which aim to quantify the cumulative effects of a number of genes to predict a person’s likelihood of displaying a trait with a genetic component – often show better performance in those with European ancestries. As a result, translating our understanding of the causes of human disease into the clinic can lead to substandard care for some people because of their ancestry.

An international collaboration led by Health Data Research UK researcher Dr Michael Inouye, from the University of Cambridge and the Cambridge Baker Systems Genomics Initiative, and Dr Brent Richards, from McGill University, Canada, aims to use more ethnically diverse datasets to improve the performance of PRS and address this inherent bias.


PRS are an emerging precision medicine tool based on multiple gene variants that, taken alone, have weak associations with disease risk but collectively may improve disease prediction in the population. However, the clinical benefit of these scores may not extend equally to non-European populations, which are under-represented in the genomic studies that serve as the basis for polygenic risk score development.
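
As a rough illustration of the idea, a polygenic score is commonly computed as a weighted sum: for each variant, the number of risk alleles a person carries (0, 1 or 2) is multiplied by that variant’s estimated effect size, and the products are added together. The short Python sketch below shows this under simplifying assumptions. The function name, the effect sizes and the two example genotypes are hypothetical and are not taken from the project.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum of risk-allele dosages (0, 1 or 2 copies per variant)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(g * w for g, w in zip(dosages, weights))


# Hypothetical per-variant effect sizes; each one is small on its own.
weights = [0.02, 0.05, -0.01, 0.03]

# Hypothetical risk-allele counts for two people at the same four variants.
person_a = [0, 1, 2, 1]
person_b = [2, 2, 0, 2]

score_a = polygenic_score(person_a, weights)
score_b = polygenic_score(person_b, weights)

# Person B carries more risk alleles at positively weighted variants,
# so their score is higher than person A's.
print(f"score A: {score_a:.2f}, score B: {score_b:.2f}")
```

Real scores typically combine thousands to millions of variants, and the weights are estimated in a particular study population, which is where the ancestry bias described in this article enters.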

A 2019 analysis of the first decade of polygenic scoring studies (2008–2017, inclusive) found that 67% of studies included exclusively European ancestry participants and another 19% included only East Asian ancestry participants. Only 3.8% of studies were conducted in cohorts of African, Hispanic, or Indigenous peoples. This means that polygenic risk scores won’t perform as well for most of the world’s population as they do for people of European ancestry.

Dr Michael Inouye, Director of the Cambridge Baker Systems Genomics Initiative and an Investigator at Health Data Research UK – Cambridge, said:

“The socio-economic issue of over-representation of participants of European ancestry in human genetics research has been around for a long time, as the funding and studies often take place in Europe and North America. This in-built disadvantage limits our understanding of etiological factors predisposing to disease risk, and is becoming a big clinical problem.”


Recent breakthroughs in genomics and machine learning have generated genetic risk scores with the potential to improve patient care and health maintenance. Population genetic methods exist to adjust for ancestry bias in polygenic score performance, but more sophisticated methods are possible based on advances in deep learning and transfer learning on current genotype cohorts. More equitable polygenic scores and more equitable risk prediction are likely to lead to better treatment and prevention.

The team is looking to develop and test new Artificial Intelligence (AI) approaches on more diverse and representative datasets. The aim is to increase the robustness of the PRSs generated and provide insights into how best to construct unbiased AI tools for genetic risk prediction.

Brent Richards, Professor of Medicine, Human Genetics, Epidemiology and Biostatistics, McGill University, said:

“One way to address this issue would be to recreate these cohorts but that would take time and money, so our aim is to create something algorithmic – by learning how the scores perform in Europeans and transferring it to the wider population. A major focus for us is an application based on the concept of transfer learning, much in the same way Google Translate allows you to translate languages, this algorithm will translate the genotype of two different ancestries.”

To address global diversity, the team is scaling the research internationally, drawing on datasets from across the UK, Canada, US, Japan, Bangladesh and China. The UK Health Data Research Alliance, an independent alliance of leading healthcare and research organisations, is providing secure access to its diverse datasets, including the UK Biobank of 500,000 volunteer participants, and Barts Health NHS Trust, a partner in the East London Genes and Health study, one of the world’s largest community-based genetics studies of people with Pakistani and Bangladeshi heritage.

They also have access to data from the Million Veteran Program in the US, which holds data on 825,000 veterans in one of the world’s largest programs on genetics and health; the China Kadoorie Biobank, set up to investigate the main genetic and environmental causes of common chronic diseases in the Chinese population; the Bangladesh Risk of Acute Vascular Events (BRAVE) Study; and Biobank Japan, the largest consortium of genotypes in the world.

In particular, they will focus on applying these new AI approaches to demonstrate the potential for improved risk prediction and screening of atherosclerotic coronary heart disease. This disease area was selected because of its health and societal burden and the availability of data on cholesterol, blood pressure, smoking and other risk factors.

Dr Inouye said:

“There are no off-the-shelf tools, everything has to be coded from scratch. One thing that can be done is working from the individual level genotype and phenotype data and each person’s subsequent health events and constructing a neural network around that data.

“Once we can scale that data, we need to understand how to take the models on one dataset and transfer it from one ethnicity to a new ethnicity. The primary challenge is around the structure of the genome and how, as far as we’re aware, the primary drivers of the variable performance are down to the frequencies and correlations of genetic variants in Europeans vs Africans vs South Asians, and so on.”


The team aims to highlight the treatment of linkage disequilibrium and variant frequencies when applying polygenic scoring to cohorts of non-European ancestry, and bolster the rationale for large-scale genome studies in diverse human populations. This increased representation will advance understanding of the factors associated with disease and efforts to prevent and treat disease.
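
To make the two quantities concrete, the Python sketch below computes an allele frequency and a simple measure of linkage disequilibrium (the squared correlation, r², between the allele counts at two variants) in two toy cohorts. It is an illustration only: the genotype values are invented, the function names are hypothetical, and r² calculated from allele counts is a common approximation rather than a full haplotype-based estimate.

```python
from statistics import mean


def allele_frequency(dosages):
    """Frequency of the counted allele: mean dosage divided by 2."""
    return mean(dosages) / 2


def ld_r2(x, y):
    """Squared Pearson correlation between the dosages of two variants."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r^2 is undefined for a variant with no variation")
    return cov * cov / (var_x * var_y)


# Invented allele counts (0, 1 or 2) for six people per cohort.
# "tag" is the variant measured in a study; "causal" is the variant
# assumed, for this example, to influence disease risk.
cohort_1 = {"tag": [0, 1, 2, 1, 0, 2], "causal": [0, 1, 2, 1, 0, 2]}
cohort_2 = {"tag": [0, 0, 1, 0, 1, 0], "causal": [0, 1, 0, 0, 1, 2]}

for name, cohort in (("cohort 1", cohort_1), ("cohort 2", cohort_2)):
    freq = allele_frequency(cohort["tag"])
    r2 = ld_r2(cohort["tag"], cohort["causal"])
    print(f"{name}: tag allele frequency = {freq:.2f}, r^2 with causal = {r2:.3f}")
```

In the first toy cohort the measured variant tracks the causal one perfectly, while in the second it barely tracks it at all and is also much rarer. A weight estimated for the measured variant in the first cohort would therefore carry little information in the second, which is the kind of mismatch that can reduce score performance across ancestries.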

The grant is methodological, with the team looking to put in place the methods to address this challenge and make them available at a global scale. The long-term vision is to have as effective and equitable care as possible for the whole population, reducing the number of health events for the population, without providing substandard care for those from non-European backgrounds.

Public Engagement

HDR UK has a patient advisory panel of eight people from across the country, a mix of patients, carers and the public, who act as a voice for the public and patients and advise broadly on what patients will and will not accept.

Ben Johnson, Health Data Research UK Public Advisory Board Member, who supported the project’s funding application, said:

“I am always cautious when researchers want to conduct a study of historically marginalised populations, as they have not been well served by medicine in the past. But this project passes the test in that the analysis will directly benefit those from black and minority ethnic backgrounds and it is focusing on heart disease which is particularly common in these groups.

“While I am an advocate for the use of technology in healthcare, machine learning is totally ineffective if it is only based on Europeans. It is crucial that we redevelop polygenic scores in people of non-European descent to understand human biology across our diverse population.”

Collaboration opportunities

This research collaboration between the UK and Canada is enabling the sharing of, and access to, data and expertise on an international scale. The UK and Canada have a shared culture, language, and population diversity, making for a successful partnership that can advance science in public healthcare around the globe.

The UK-Canada team is looking to collaborate with other research institutes and data banks, particularly those holding non-European ancestry genotype data.

Partners: John Danesh, Michael Inouye, Brent Richards

Contact: John Danesh