The challenge

Most of the information held within these records is in written form, sometimes referred to as unstructured text, which is difficult to use in research. For example: ‘the patient feels very tired and breathless, is losing weight, and says her heart is beating very fast’. We need to develop specialised computerised tools to process this text, so that research for patient benefit can draw on a full picture of patients’ symptoms, experiences and diagnoses.

The solution

We will establish a natural language processing (NLP) research community that will address the complexity of clinical text by developing shared tools and standards with inbuilt patient confidence and engagement, supporting joint working across industry, academia and the NHS. The community will be open and inclusive, and will develop capability for UK-wide NLP research at scale. It will also provide clear ‘quick wins’ through exemplar projects and through shared material and datasets for training and implementation, with the ultimate aim of integrating with other health data analytics. The project will lay the foundations for a sustainable model of collaborative working, thus attracting funding for the next 4 years and beyond.

Impact and outcomes

There is much value in the unstructured portion of the electronic health record (EHR), often in the form of rich narrative. This text is currently unused, yet it is important for understanding health interactions that are either not recorded in, or are less obvious from, the structured data (e.g. multimorbidity, cancer diagnosis). The subtleties of the patient journey and of patient characteristics are often recorded in the free-text component, a format that clinicians find more user-friendly.

Building on existing successful partner-led programmes, and drawing the wider community together, this project will enable a major shift in the UK’s capacity to make EHRs research-ready, actionable, real-time and usable at large scale. Shared tools will be made available across the NHS, creating richer, more useful clinical information to improve healthcare. Integrating NLP-derived phenotypes (digital descriptions of health characteristics) from EHRs with other rich records (e.g. educational and social information) and with other modalities, including imaging, mobile health and genomics, will help generate a more complete picture of patients and their health. Example projects will focus on stroke, lung cancer and serious mental illness.

Better use of unstructured text will help streamline the matching of patients to clinical trials and the stratification of patients for disease classification, outcome prediction, patient trajectories across the life-course and adverse drug reactions. It will also help identify drug-repurposing opportunities.


HDR UK Cambridge
HDR UK Scotland
HDR UK London
HDR UK Midlands
HDR UK Wales and Northern Ireland


Angus Roberts
Richard Dobson