Case notes and other texts written by clinicians are an invaluable source of data about disease and treatment. This study presents new thinking on how they can be analysed to gain greater insights into patient phenotypes. It provides an analytic approach that shows similarities between different facets of patients’ illnesses as they are described in clinical texts.
Finding the semantic similarities between clinical texts is a valuable way to identify phenotypes. It allows researchers to use defined metrics to group patients whose conditions have observable similarities. A single score is often given to show how much two patients have in common. However, there are major limitations to the use of single similarity scores for comparing patient phenotype profiles.
Single similarity scores aggregate similarities rather than distinguishing between what is alike or unalike, obscuring multiple similarities and making it harder to identify relationships. The unmitigated use of text-derived phenotypes for semantic similarity may result in unintended bias on account of the large phenotype profiles they produce. This can be especially pronounced for clustering approaches since fewer differences between groups of entities can be discerned. Single scores also leave much of the potential for comparing phenotype profiles unrealised.
Using natural language processing techniques to calculate multiple semantic similarity scores (based on different facets of the phenotype manifestation) can offer more information on the interrelationships between patients, diseases and phenotypes. This paper (published in Computers in Biology and Medicine), by a team including Dr Luke Slater, Research Fellow at the University of Birmingham Institute of Cancer and Genomic Sciences, demonstrates how this can be done. Their approach involved creating text-derived phenotypic annotations for 1,000 patient admissions from the MIMIC-III dataset and splitting them into separate facets, revealing that annotations are widely spread across different facets of the Human Phenotype Ontology (HPO).
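The difference between a single score and a set of faceted scores can be illustrated with a minimal sketch. This is not the team's implementation: it swaps in simple Jaccard overlap for an ontology-based semantic similarity measure, and the term names, the `TERM_FACET` mapping and the function names are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical mapping from a phenotype term to the facet it belongs to.
# In practice a facet would correspond to a branch of an ontology such as HPO.
TERM_FACET = {
    "cough": "respiratory",
    "dyspnoea": "respiratory",
    "chest pain": "constitutional",
    "fever": "constitutional",
    "tachycardia": "cardiovascular",
}


def split_by_facet(terms):
    """Group a patient's phenotype terms by facet."""
    facets = defaultdict(set)
    for term in terms:
        facets[TERM_FACET[term]].add(term)
    return facets


def jaccard(a, b):
    """Jaccard overlap, used here as a stand-in for semantic similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overall_similarity(p1, p2):
    """One aggregate score across the whole phenotype profile."""
    return jaccard(set(p1), set(p2))


def faceted_similarity(p1, p2):
    """One score per facet, keeping agreement and disagreement separate."""
    f1, f2 = split_by_facet(p1), split_by_facet(p2)
    return {
        facet: jaccard(f1.get(facet, set()), f2.get(facet, set()))
        for facet in sorted(set(f1) | set(f2))
    }


# Toy profiles, invented for illustration only.
patient_a = {"cough", "dyspnoea", "chest pain", "fever"}
patient_b = {"cough", "dyspnoea", "tachycardia"}

print(overall_similarity(patient_a, patient_b))
print(faceted_similarity(patient_a, patient_b))
```

In this toy case the single score is 0.4, which hides the fact that the two profiles agree completely on the respiratory facet and not at all on the other two. The per-facet scores make that visible.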
Impact and outcomes
There are many substrata among patients. This research (supported by HDR UK) allows them to be identified and new phenotypes to be defined.
The algorithm developed by the team also allows researchers to explore new hypotheses by letting them create generalisations against which they can test an idea. This could have implications for diagnoses, for example by looking at whether there are specific symptoms reported for one form of an illness that are not present in another.
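As a rough illustration of that kind of comparison, the sketch below looks for terms that are reported much more often in one group of patients than in another. It is a simplified stand-in rather than the team's algorithm, and the function names, the `min_gap` threshold and the toy cohorts are assumptions made for the example.

```python
from collections import Counter


def term_prevalence(cohort):
    """Fraction of patients in a cohort annotated with each term."""
    if not cohort:
        return {}
    counts = Counter(term for patient in cohort for term in set(patient))
    return {term: n / len(cohort) for term, n in counts.items()}


def distinguishing_terms(cohort_a, cohort_b, min_gap=0.5):
    """Terms whose prevalence differs between cohorts by at least min_gap.

    Positive values are more common in cohort_a, negative in cohort_b.
    """
    prev_a = term_prevalence(cohort_a)
    prev_b = term_prevalence(cohort_b)
    gaps = {
        term: prev_a.get(term, 0.0) - prev_b.get(term, 0.0)
        for term in set(prev_a) | set(prev_b)
    }
    return {term: gap for term, gap in gaps.items() if abs(gap) >= min_gap}


# Toy cohorts, invented for illustration; each set is one patient's terms.
form_one = [{"cough", "pain"}, {"cough", "pain"}, {"cough"}]
form_two = [{"cough"}, {"cough", "tachycardia"}]

print(distinguishing_terms(form_one, form_two))
```

Here "cough" is dropped because it is equally common in both groups, while "pain" and "tachycardia" are kept as candidates for telling the two forms apart. A real analysis would need far larger cohorts and a proper statistical test.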
The team hopes to develop a tool to help frontline healthcare professionals with diagnoses. One use would be to help distinguish between illnesses that can present in the same way. For example, pneumonia and embolisms can have similar symptoms, but patients with the former more often describe themselves as being in pain.
The research represents a step towards empowering researchers and clinicians to be able to rapidly define cohorts and test ideas to gain strong data-based insights – something Dr Slater describes as “the holy grail of this kind of work”.
The HDR UK Impact Committee chose this paper for its excellence, significance, originality and rigour.