Case notes and other texts written by clinicians are an invaluable source of data about disease and treatment. This study presents new thinking on how they can be analysed to gain greater insights into patient phenotypes. It provides an analytic approach that shows similarities between different facets of patients’ illnesses as they are described in clinical texts.

The challenge

Finding the semantic similarities between clinical texts is a valuable way to identify phenotypes. It allows researchers to use defined metrics to group patients whose conditions have observable similarities. A single score is often given to show how much they have in common. However, there are major limitations to the use of singular similarity scores for comparing patient phenotype profiles.

A single similarity score aggregates every point of comparison into one number rather than distinguishing what is alike from what is not, which obscures multiple distinct similarities and makes relationships harder to identify. The unmitigated use of text-derived phenotypes for semantic similarity may also introduce unintended bias because of the large phenotype profiles they produce. This can be especially pronounced for clustering approaches, since fewer differences between groups of entities can be discerned. Single scores also leave much of the potential for comparing phenotype profiles unrealised.

The solution

Using natural language processing techniques to calculate multiple semantic similarity scores (each based on a different facet of the phenotype manifestation) can offer more information on the interrelationships between patients, diseases and phenotypes. This paper (published in Computers in Biology and Medicine), by a team including Dr Luke Slater, Research Fellow at the University of Birmingham Institute of Cancer and Genomic Sciences, demonstrates how this can be done. Their approach involved creating text-derived phenotypic annotations for 1,000 patient admissions from the MIMIC-III dataset and splitting them into separate facets, revealing that annotations are widely spread across different facets of the Human Phenotype Ontology (HPO).
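
The idea of scoring each facet separately can be illustrated with a minimal sketch. It is not the team's implementation: the tiny ontology, the term names and the two facets below are hypothetical stand-ins for the HPO, and a simple Jaccard overlap is swapped in for whatever semantic similarity measure the paper actually uses. It only shows how one overall score can hide the fact that two patients match closely in one facet and differ in another.

```python
# Hypothetical toy ontology (child -> parent). Real HPO terms and structure differ.
PARENT = {
    "resp_abnormality": "root",
    "cardio_abnormality": "root",
    "cough": "resp_abnormality",
    "dyspnea": "resp_abnormality",
    "chest_pain": "cardio_abnormality",
    "tachycardia": "cardio_abnormality",
}
# Assumption: each top-level branch of the ontology is treated as one facet.
FACETS = ("resp_abnormality", "cardio_abnormality")


def ancestors(term):
    """Return the term together with all of its ancestors, excluding the root."""
    found = set()
    while term != "root":
        found.add(term)
        term = PARENT[term]
    return found


def expand(profile):
    """Close a patient's annotation set under the ancestor relation."""
    closed = set()
    for term in profile:
        closed |= ancestors(term)
    return closed


def split_by_facet(profile):
    """Group the expanded annotations by the facet branch they fall under."""
    facets = {facet: set() for facet in FACETS}
    for term in expand(profile):
        for facet in FACETS:
            if facet in ancestors(term):
                facets[facet].add(term)
    return facets


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def faceted_similarity(p1, p2):
    """Return one similarity score per facet instead of a single number."""
    f1, f2 = split_by_facet(p1), split_by_facet(p2)
    return {facet: jaccard(f1[facet], f2[facet]) for facet in FACETS}


# Two hypothetical patients with identical respiratory findings
# but different cardiovascular findings.
patient_a = {"cough", "dyspnea", "chest_pain"}
patient_b = {"cough", "dyspnea", "tachycardia"}

overall = jaccard(expand(patient_a), expand(patient_b))
per_facet = faceted_similarity(patient_a, patient_b)
print(f"overall: {overall:.2f}")
for facet, score in per_facet.items():
    print(f"{facet}: {score:.2f}")
```

In this toy case the single score sits between the two facet scores and reveals neither the exact respiratory match nor the cardiovascular mismatch, which is the kind of detail the faceted approach is meant to surface.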

Impact and outcomes

There are many substrata among patients. This research (supported by HDR UK) allows them to be identified and new phenotypes to be defined.

The algorithm developed by the team also allows researchers to explore new hypotheses by creating generalisations against which they can test an idea. This could have implications for diagnosis, for example by examining whether specific symptoms reported for one form of an illness are absent in another.

The team hopes to develop a tool to help frontline healthcare professionals with diagnoses. One use would be to help distinguish between illnesses that can present in the same way. For example, pneumonia and embolisms can have similar symptoms, but patients with the former more often describe themselves as being in pain.

The research represents a step towards enabling researchers and clinicians to rapidly define cohorts and test ideas to gain strong data-based insights – something Dr Slater describes as “the holy grail of this kind of work”.

Impact committee

The HDR UK Impact Committee chose this paper for its excellence, significance, originality and rigour.


Contact: Dr Slater, l.slater.1@bham.ac.uk

Read the full paper