Understanding genetic diversity through RNA data to inform future research
22 January 2024 | Author: Aditya Kale, Clinical Research Fellow (AI and Digital Health)
This study from Fachrul et al was selected by the HDR UK Impact Committee.
Each quarter, the HDR UK Impact Committee consider dozens of articles and select the most impactful examples, ranked against core pillars of the HDR UK ethos: research quality, team science, scale, open science, patient and public involvement, patient impact and diversity.
One of the papers selected by the committee was ‘Direct inference and control of genetic population structure from RNA sequencing data’ by Fachrul et al.
The challenge
RNA sequencing is a scientific technique to detect and quantify the expression of genes (in the form of RNA molecules) from biological samples. By capturing the expression, we can identify how certain genes may play a role in complex diseases, deepening our understanding of how they are regulated.
However, many gene expression studies do not consider how the data varies between different populations. Genetic variation between populations is usually measured through a process called “genotyping”, which reveals differences in DNA. When this variation is not sufficiently captured, studies risk being biased, as their results may reflect the inherent differences between populations rather than the diseases of interest. In this study, the authors developed a software tool (RGStraP) to estimate the composition of genetic variation in understudied populations directly from RNA sequencing data.
The impact
To test their tool, researchers used data from blood samples collected from 376 individuals recruited in Nepal as part of the larger Strategic Typhoid Alliance across Africa and Asia (STRATAA) study. By running RGStraP on the RNA sequencing samples and comparing the output with genotyping results from the same samples, the authors demonstrated that their tool can capture and adjust for genetic variation using RNA data alone. A validation dataset from the Geuvadis consortium, consisting of genomes from diverse multinational populations, was also used.
By using this software, researchers can more accurately identify particular genes associated with disease risk, because the population structure of the samples is taken into consideration. The approach and software tool developed by the authors are freely available, enabling researchers to undertake more robust gene expression-based research and ultimately improving patient care.
This study has particular benefits for gene expression analyses where matched genomic data may not be readily available to adjust for bias, such as in low- and middle-income countries.
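To make the idea of population structure more concrete, the sketch below runs a principal component analysis on simulated genotype data. It is an illustration only and is not the RGStraP pipeline: the data are randomly generated, and the function name, group sizes and allele frequencies are all assumptions chosen for the example.

```python
import numpy as np


def top_principal_components(genotypes, n_components=2):
    """Return per-sample scores on the leading principal components.

    Hypothetical helper for illustration; not part of RGStraP.
    """
    # Centre each variant (column) so the analysis captures variation around the mean.
    centred = genotypes - genotypes.mean(axis=0)
    # Singular value decomposition: u * s gives the sample scores on each component.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
n_per_group, n_variants = 50, 200

# Assumed allele frequencies for two simulated populations.
freq_a = rng.uniform(0.1, 0.9, n_variants)
freq_b = np.clip(freq_a + rng.normal(0, 0.3, n_variants), 0.05, 0.95)

# Genotypes coded as 0, 1 or 2 copies of the alternate allele.
group_a = rng.binomial(2, freq_a, size=(n_per_group, n_variants))
group_b = rng.binomial(2, freq_b, size=(n_per_group, n_variants))
genotypes = np.vstack([group_a, group_b]).astype(float)
labels = np.array([0] * n_per_group + [1] * n_per_group)

pcs = top_principal_components(genotypes)

# How strongly does the first component track the simulated population label?
pc1_label_corr = abs(np.corrcoef(pcs[:, 0], labels)[0, 1])
print(f"|corr(PC1, population)| = {pc1_label_corr:.2f}")
```

In a gene expression analysis, component scores like these are commonly included as covariates so that differences between populations are not mistaken for effects of the disease being studied.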
What the Impact Committee said
The Impact Committee scored this paper highly against a range of criteria. The study was particularly impressive in four areas. Firstly, it used multiple large-scale datasets, covering a total of 4,921,472 genetic variants across all samples. Secondly, there was strong team science, with representation from the four nations of the UK and internationally, and with researchers at different stages of their careers.
Thirdly, the software tool developed in this study was made openly available, showing that the researchers are keen to support open science. The code for the software pipeline is freely available from GitHub. Finally, the ability of this tool to benefit understudied and diverse global populations in future RNA sequencing studies shows a commitment to equality, diversity and inclusion.