The paper describes high-quality research into the development of a new approach for deriving patient phenotype profiles using clinical narrative texts. Its aim was to find ways to use valuable, but often untapped, data stored in electronic health records (EHRs) in order to improve the diagnosis of common diseases.


Efficient and accurate diagnoses are essential in providing the best and most appropriate care. A team, led by Dr Luke T Slater, a Research Fellow in the Gkoutos Lab at the University of Birmingham Centre for Computational Biology, aimed to find new ways to enhance the diagnosis of common diseases by unlocking the value in data held in un-curated text written by doctors, nurses and others.


Their paper, published in the journal Computers in Biology and Medicine in June 2021, describes three approaches for predicting patient diagnoses. One was based on patient-to-patient comparisons, to see whether those with the greatest similarities could be confidently identified as having the same illness. Another was based on patient-to-disease comparisons: assessing data from patient records alongside phenotype-disease profiles contained in medical literature. The third was a combined approach that made use of literature-derived phenotypes but extended them using phenotypes derived from patient records.

The research (supported by HDR UK) made extensive use of the MIMIC-III (Medical Information Mart for Intensive Care) database. The first two approaches involved sampling data and texts associated with 1,000 patient visits recorded in MIMIC-III. The third method, which synthesised the first two, involved a further set of 500 patient visits, whose text-derived patient phenotypes were used to extend the literature-derived phenotypes.

Lessons learnt

The third approach made it clear that there is considerable potential for improving, and semi-automating, the diagnosis of common illnesses by making use of un-curated texts.

Impact and outcomes

The team created what Dr Slater describes as a “pipeline method” that could use comparative profiles to provide clinicians with a ranked and scored list of possible diagnoses.
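
As a rough illustration of what a ranked and scored list of diagnoses could look like, the sketch below compares a patient's phenotypes with disease profiles and sorts the candidates by score. It is not the team's pipeline: it swaps in simple Jaccard set overlap as a stand-in for the semantic similarity measures used in the paper, and the function names, phenotype labels and disease profiles are all hypothetical.

```python
def jaccard(a, b):
    """Overlap between two phenotype sets (stand-in for semantic similarity)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_diagnoses(patient_phenotypes, disease_profiles):
    """Return (disease, score) pairs, highest-scoring first."""
    scores = [
        (disease, jaccard(patient_phenotypes, profile))
        for disease, profile in disease_profiles.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical profiles; real ones would come from literature and patient records.
disease_profiles = {
    "pneumonia": {"fever", "cough", "shortness of breath"},
    "heart failure": {"shortness of breath", "oedema", "fatigue"},
}

# Hypothetical phenotypes extracted from one patient's clinical notes.
patient = {"fever", "cough", "fatigue"}

ranked = rank_diagnoses(patient, disease_profiles)
for disease, score in ranked:
    print(f"{disease}: {score:.2f}")
```

In this toy example the patient shares two of four distinct phenotypes with the pneumonia profile and one of five with the heart failure profile, so pneumonia is ranked first.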

They believe that their research could lead to powerful solutions, based on semantic similarity, for the differential diagnosis of common diseases or outcomes, as well as for text classification and cohort discovery tasks in a clinical context.

They also think that their methods, which were primarily based on the cases of patients needing critical care, could be applied more widely. The team is already working (in collaboration with the Centre for Rare Diseases at the Queen Elizabeth Hospital, Birmingham) with cardiologists to enhance the care of patients with congenital heart conditions.

It is hoped that their work can be used for many other tasks, including differential diagnosis, cohort discovery, and document and text classification. One potential application is in the better understanding of likely patient outcomes, allowing clinicians to optimise care.

Insights from the Impact Committee

The paper was selected by the HDR UK Impact Committee for the quality of its research and its key impacts: the translation of discoveries and research into clinical use, and the reach and reuse of its outputs.