Enormous amounts of data are generated during clinical interactions across multiple healthcare settings in the form of structured and unstructured electronic health records (EHRs). The data contains rich, longitudinal information on diagnoses, symptoms, medications and tests, which can be used for research. However, EHR data is not primarily generated for research purposes: it is stored in disparate sources, often in different formats, and requires a significant amount of pre-processing.

Our Phenotype Library

The UK has established a Phenotype Library, which is one of the largest in the world. It is the only national, wholly open-access library of reproducible phenotyping algorithms for defining human disease, lifestyle risk factors and biomarkers using diverse electronic health records. For each phenotype, the library curates its metadata, implementation details, programmatic code and validation information. The Library enables the wider research and clinical community to carry out reproducible and transparent research using such complex data.

Researchers hoping to unlock the valuable data contained within EHRs need to spend considerable time writing the code needed to work with data that often contains inconsistencies and is of varying quality and detail.

The HDR UK Phenotype Library has been created to assist researchers working with EHRs, by creating an open access national library of validated phenotyping algorithms, definitions and methods. Routine use of the library by researchers will cut down on the duplication of effort by allowing re-use of algorithms, tools and methods and will ensure reproducibility of research by creating a national standard for creating, evaluating and representing phenotypes.  

Access the Library to find phenotyping algorithms, tools and information 

Are you a researcher who has developed a phenotyping algorithm that:

  • defines a disease, risk factor or biomarker,  
  • derives information from one or more EHR sources,  
  • is associated with one or more peer-reviewed outputs, and
  • is already validated? 

You can contribute to the improvement of health by depositing your algorithms in the Phenotype Library, enabling their dissemination, re-use, evaluation and citation to the benefit of the emerging phenomics research community.

Upload your phenotypes  

The phenome national priority is developing tools which will support the definition and creation of computable phenotypes, which can be used to interrogate EHR data to enable health research for patient benefit.

  • Phenoflow is a phenotype definition model, which can be used to define phenotypes from EHRs and export them. This allows phenotype definitions to be re-used across research institutes, improving reproducibility. Over 300 phenotypes are currently downloadable from Phenoflow and can be instantly used to interrogate local datasets. Phenoflow also allows researchers to author new phenotypes and enables their validation against multiple data sources.
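
As a rough illustration of what "computable phenotype" means in practice, the sketch below reduces a phenotype to a named set of diagnosis codes and applies it to a small table of coded events. It is not Phenoflow's model or export format: the `Phenotype` class, the `patients_with_phenotype` function, the `read2` label and the record layout are all hypothetical, and the records are made up. Real phenotype definitions are usually richer, for example combining several coding systems and data sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phenotype:
    """A computable phenotype reduced to a name and a set of diagnosis codes."""
    name: str
    coding_system: str
    codes: frozenset


def patients_with_phenotype(phenotype, records):
    """Return the sorted IDs of patients with at least one matching coded event.

    `records` is an iterable of (patient_id, coding_system, code) tuples,
    a hypothetical flat layout standing in for a local EHR extract.
    """
    matched = {
        patient_id
        for patient_id, system, code in records
        if system == phenotype.coding_system and code in phenotype.codes
    }
    return sorted(matched)


# Hypothetical definition; 'H33..' is the Read code for asthma.
asthma = Phenotype(name="asthma", coding_system="read2", codes=frozenset({"H33.."}))

# Made-up records for illustration only.
records = [
    ("p1", "read2", "H33.."),
    ("p2", "read2", "C10.."),
    ("p1", "read2", "C10.."),
    ("p3", "icd10", "J45"),
]

print(patients_with_phenotype(asthma, records))
```

Because the definition is plain data, the same object could be shared between institutes and applied unchanged to each local dataset, which is the re-use idea described above.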

National Medical Text Analytics 

The HDR UK Text Analytics Resource is the UK’s first repository of tools, methods and datasets for natural language processing (NLP) of the unstructured free text contained within electronic health records. The resource will help the clinical and research community to unlock the rich data contained within electronic health records to deliver improvements in healthcare.

There is much value in the information included in EHRs, e.g. symptoms, tests, investigations, diagnoses and treatments, which could help researchers and clinicians learn how to tailor treatments more accurately for individual patients and to offer better and safer healthcare. However, most of the information held within these records is in written form, sometimes referred to as unstructured text, which is difficult to use and is therefore currently under-used in research.

To access the data held within unstructured text, we need to develop special computerised tools to process these words, so that we have a full picture of all patient symptoms, experiences and diagnoses to use in research for patient benefit. The HDR UK Text Analytics Resource is building an NLP research community that will address the complexity of clinical text through the development of shared tools and standards.

A curated list of applications and datasets for healthcare text analytics can be found on HDR UK Text’s GitHub “resources” repository. Some examples are given below:

  • CogStack 

    CogStack allows the extraction of information from unstructured data (e.g. PDF/MS Word documents, images) contained within Electronic Health Records (EHRs). Once extracted and processed via CogStack, this data, which is usually inaccessible, can be analysed in multiple ways.

  • MedCAT, Medical Concept Annotation Tool

    MedCAT is a natural language processing tool which can be used to link the extracted EHR data to definitions of disease to answer research questions such as ‘what is the relationship between diseases and age?’ Over twelve million free text documents and over 250 million diagnostic results and reports have been processed within CogStack, which is being implemented across three NHS Foundation Trusts (South London and Maudsley, King’s College Hospital, and University College London Hospitals). CogStack was cited in the Secretary of State for Health and Social Care’s speech ‘Better tech: not a ‘nice to have’ but vital to have for the NHS’ (January 2020) and NHSX’s report ‘Artificial Intelligence: How to get it right’ (October 2019).

  • Freetext Matching Algorithm (FMA)

    FMA allows the extraction of information, including causes of death and other diagnoses, from free text in EHRs. To identify ‘medical’ words within the text, the algorithm makes use of Read Clinical Codes, in which each clinical term is assigned a code (e.g. ‘Asthma’ = ‘H33..’), and of the earlier OXMIS (OXford Medical Information System) codes. FMA facilitates research using free text in EHRs (e.g. those deposited in the UK General Practice Research Database), reducing the need for manual analysis.
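
The idea of using a code dictionary to spot ‘medical’ words can be sketched as a simple lookup over free text. The example below is not the FMA implementation, only a minimal dictionary matcher written for illustration: the function name `find_coded_terms` and the sample note are hypothetical, ‘H33..’ for asthma is taken from the description above, and the second dictionary entry is illustrative and should be checked against an authoritative Read code release before any real use.

```python
import re

# Term-to-code lookup. 'H33..' for asthma comes from the description above;
# the other entry is illustrative and should be verified before real use.
TERM_TO_READ_CODE = {
    "asthma": "H33..",
    "diabetes mellitus": "C10..",
}


def find_coded_terms(text, lookup=TERM_TO_READ_CODE):
    """Return (term, code, start_offset) for each dictionary term found in text.

    Matching is case-insensitive and on whole words only. Negation,
    misspellings and context ("no history of asthma") are not handled.
    """
    hits = []
    for term, code in lookup.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, code, match.start()))
    # Order hits by where they occur in the text.
    return sorted(hits, key=lambda hit: hit[2])


# Made-up clinical note for illustration only.
note = "Known asthma since childhood. Diabetes mellitus diagnosed last year."
for term, code, offset in find_coded_terms(note):
    print(f"{offset:3d}  {term!r} -> {code}")
```

A plain lookup like this would flag "no history of asthma" as a diagnosis, which is one reason purpose-built tools are needed for clinical text.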

Use our NLP resources, applications and datasets to unlock clinical text for your research.

Smartphones and wearable devices

The mHealth toolbox will enable researchers to rapidly set up population-level remote monitoring studies with data streams including active data (e.g. questionnaires and clinical assessments) as well as passively generated data from smartphones and wearable devices, linked to other data modalities such as EHRs. Using reproducible methods to analyse mHealth-generated data, researchers will be able to better understand the causes and consequences of disease.

Our mHealth community is developing open access tools and software which will support researchers undertaking studies using health data collected via smartphones and wearables, for example:  

  • Remote Assessment of Disease and Relapse (RADAR)-base

    RADAR-base is a remote data collection platform that enables health data collected from study participants via wearables and mobile technologies to be shared with and used by clinicians and researchers. The platform supports study design and set up, active (e.g. the use of questionnaires) and passive (e.g. real time monitoring of movement) remote data collection and secure data transmission to the research/clinical team.  

  • BiobankAccelerometerAnalysis

    BiobankAccelerometerAnalysis is a tool to extract health information from large accelerometer datasets (usually captured via a wrist-worn device that measures acceleration, i.e. a person’s activity). The software generates time-series and summary metrics useful for answering key questions, such as how much time is spent asleep, in sedentary behaviour or in physical activity, and what the health consequences are.
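
To make "summary metrics from a time-series" concrete, the sketch below turns a list of per-epoch mean accelerations into minutes spent in each activity category. It does not use BiobankAccelerometerAnalysis or reproduce its method: the epoch length, the cut-points, the category names and the sample values are all assumptions chosen for illustration. Sleep is left out because it cannot sensibly be inferred from a single threshold.

```python
from collections import Counter

EPOCH_SECONDS = 30  # assumed length of each epoch in the time-series

# Hypothetical cut-points in milli-g, chosen only for this illustration.
SEDENTARY_BELOW_MG = 45.0
MODERATE_FROM_MG = 100.0

CATEGORIES = ("sedentary", "light", "moderate_to_vigorous")


def classify_epoch(mean_acceleration_mg):
    """Assign one epoch to an activity category using fixed cut-points."""
    if mean_acceleration_mg < SEDENTARY_BELOW_MG:
        return "sedentary"
    if mean_acceleration_mg < MODERATE_FROM_MG:
        return "light"
    return "moderate_to_vigorous"


def summarise(epochs_mg):
    """Return minutes spent in each category for a list of epoch means."""
    counts = Counter(classify_epoch(value) for value in epochs_mg)
    return {label: counts[label] * EPOCH_SECONDS / 60 for label in CATEGORIES}


# Made-up epoch values (milli-g) for illustration only.
epochs = [10.0, 20.0, 60.0, 150.0, 30.0, 120.0]
print(summarise(epochs))
```

The same two-step shape, classifying each epoch and then aggregating per person, applies whatever classifier is used for the first step.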

Get involved 

To find out more and to get involved, contact Serina Hayes, Phenomics Programme Director; Spiros Denaxas, Phenomics National Resource Lead; or Richard Dobson or Angus Roberts, Text Analytics National Resource Co-Leads.