Patient phenotype definitions are fundamental to much health data research: they allow teams to use descriptions of disease characteristics to search databases for the sets of records relevant to their studies. In the UK’s primary care databases these definitions are typically represented as flat lists of Read terms, which can be cumbersome and time-consuming to use. A recently published study suggests that SNOMED CT, which is widely used internationally, can significantly speed up the process. The study also provides researchers with sample code to help them adopt SNOMED CT.

The challenge

One key challenge was to explore whether SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) is as efficient for defining disease phenotypes as the standard approach, which relies on manually curated codelists.

A central question was whether the SNOMED CT knowledge model can simplify the process of selecting patient cohorts and outcomes in electronic health record (EHR) research databases.
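To make the contrast with a flat codelist concrete, here is a minimal sketch of how a hierarchical knowledge model can define a phenotype from a single parent concept. It assumes a toy set of "is a" relationships; the concept labels and the helper functions `build_children` and `descendants_and_self` are made up for illustration and are not real SNOMED CT identifiers or part of the study's R package.

```python
from collections import defaultdict

# Toy "is a" relationships as (child, parent) pairs. These labels are
# hypothetical stand-ins, not real SNOMED CT concept IDs.
IS_A = [
    ("type_1_diabetes", "diabetes_mellitus"),
    ("type_2_diabetes", "diabetes_mellitus"),
    ("type_2_diabetes_with_nephropathy", "type_2_diabetes"),
    ("allergic_asthma", "asthma"),
]


def build_children(edges):
    """Index the hierarchy as parent -> set of direct children."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    return children


def descendants_and_self(concept, children):
    """Return the concept plus everything below it in the hierarchy."""
    found = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in found:
            continue  # already visited (a concept may have several parents)
        found.add(current)
        stack.extend(children.get(current, ()))
    return found


children = build_children(IS_A)

# One parent concept expands into a full codelist, instead of every
# term being listed and curated by hand.
diabetes_codelist = descendants_and_self("diabetes_mellitus", children)
print(sorted(diabetes_codelist))
```

In this sketch the phenotype is written once, as a parent concept, and the codelist is derived from the hierarchy; a flat list would have to name each of the four diabetes terms individually.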

The solution

The team, led by Dr Anoop Shah, Associate Professor, UCL Institute of Health Informatics, developed SNOMED CT phenotype definitions for three exemplar diseases: diabetes mellitus, asthma, and heart failure, using three different methods.

The researchers also derived SNOMED CT codelists for 276 disease phenotypes. The cohorts selected using each codelist were compared with those selected using the “gold standard” of manually curated Read codelists, in a sample of 500,000 patients from the CPRD Aurum primary care database.
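A comparison of this kind can be sketched as a set overlap between two cohorts. The example below is illustrative only: the records, codes and the functions `select_cohort` and `compare_cohorts` are hypothetical, both codelists are assumed to be expressed in the same toy code system, and the summary measures shown (sensitivity and positive predictive value) are an assumption rather than the metrics reported in the paper.

```python
def select_cohort(records, codelist):
    """Return the IDs of patients with at least one record in the codelist."""
    return {patient_id for patient_id, code in records if code in codelist}


def compare_cohorts(candidate, gold):
    """Summarise agreement between a candidate cohort and a gold standard."""
    overlap = candidate & gold
    return {
        "both": len(overlap),
        "candidate_only": len(candidate - gold),
        "gold_only": len(gold - candidate),
        # Share of gold-standard patients that the candidate also finds.
        "sensitivity": len(overlap) / len(gold) if gold else None,
        # Share of candidate patients that are in the gold standard.
        "ppv": len(overlap) / len(candidate) if candidate else None,
    }


# Hypothetical (patient_id, code) records; codes are placeholder letters.
records = [(1, "A"), (2, "B"), (3, "C"), (4, "D"), (2, "A")]

gold_codelist = {"A", "B"}            # stands in for a curated codelist
candidate_codelist = {"A", "B", "C"}  # stands in for a derived codelist

gold_cohort = select_cohort(records, gold_codelist)
candidate_cohort = select_cohort(records, candidate_codelist)
summary = compare_cohorts(candidate_cohort, gold_cohort)
print(summary)
```

Here the candidate codelist finds every gold-standard patient plus one extra, so the two cohorts are similar but not identical, which is the kind of difference such a comparison is meant to surface.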

The research, which was supported by HDR UK and built on an MSc project by Musaab Elkheder, resulted in a paper entitled “Translating and evaluating historic phenotyping algorithms using SNOMED CT”, which was published in JAMIA.


The team found that SNOMED CT provides an efficient way to define disease phenotypes, resulting in similar patient populations to the currently used manually curated codelists. This can cut the time taken to find the records needed for a piece of research from hours to minutes.

Dr Shah said:

“Using SNOMED CT makes it quicker to define a set of patients or a patient cohort. Another advantage is that it uses the same definitions that are often used in clinical systems for clinical decision support. So the same definitions can be used across the research and clinical fields.”

The team have also taken practical steps to help other researchers by producing an R package with sample code. Researchers can use it to adopt the new methods and develop their own codelists in SNOMED CT, or to convert codelists from other terminologies such as Read.

It is hoped that the research will have even wider benefits: SNOMED CT is increasingly used internationally, so adopting it will help align UK and international approaches. A further advantage is that the terminology and definitions used in SNOMED CT are the same as those used by many clinicians, which should make it easier for research and clinical practice to work from the same definitions.

What the impact committee said:

The committee highlighted the quality of the study and believed that it has the potential for future impact on patient care.

For further information contact a.shah@ucl.ac.uk