The challenge

Electronic health records (EHRs) are the main source of data for medical cohort research studies, and phenotype libraries provide a set of keys to unlock the data they hold. The definitions they host describe the observable characteristics of a disease and thereby allow researchers to extract the data they need from EHRs. But the definitions come in many shapes and forms, which creates challenges. Some are just a piece of code that can be a struggle to understand; others are simply flow charts that are neither downloadable nor computable. To move forward, phenotype libraries need to ensure that definitions are portable, reproducible and clinically valid in design.

The solution

A team including Dr Martin Chapman, Research Fellow at King’s College London, and Dr Vasa Ćurčin, Head of King’s College Department of Population Health Sciences, consulted widely and carried out extensive research. This involved a literature review, an examination of existing libraries, and consultation with researchers from the HDR UK Phenomics Theme and the USA’s Mobilizing Computable Biomedical Knowledge (MCBK) and Phenotype Execution and Modelling Architecture (PhEMA). They also applied knowledge and experience from their own work on Phenoflow, a next-generation phenotype library.

Impact and outcomes

The research (published in GigaScience) describes 14 desiderata that provide a model of best practice which libraries and researchers can work towards. These encompass modelling, logging, implementation, validation, sharing and warehousing, and cover each stage of a phenotype definition’s lifecycle.

The team hopes this will encourage libraries to ensure that phenotypes are developed according to standard models and are easy to understand, which makes them more reproducible.

They emphasise the need for libraries to encourage computable definitions (by providing implementation tooling) and to improve portability, ensuring that definitions can be readily used by other researchers. They also want libraries to directly validate definitions through measures such as automated comparisons with gold standards.

Dr Chapman said:

“If you build a library that meets these properties, you are effectively creating high quality phenotype definitions.”

Dr Ćurčin said:

“Ultimately, it’s about trust, transparency and stimulating reproducibility of research. If a decision support system recommends a course of action, it is vital that the algorithms behind it are clearly documented as being based on published high quality phenotypes.”

Impact committee

The HDR UK Impact Committee described this paper as a great example of open science – with potential to improve health data research and downstream impacts on public health and health care.

Patient and Public Involvement and Engagement

The team has consulted with the public on the navigability of libraries and tested a patient subpage on Phenoflow.

