Electronic health records contain a rich history of the patient journey, with huge potential for research and direct patient care. The HDR UK Text Analytics Project is helping to build the UK’s natural language processing (NLP) community for healthcare, developing text analytics capabilities that make use of the free text information recorded in electronic health records. This will help us learn how to tailor treatments more accurately for individual patients, in order to offer them better and safer healthcare.


Most of the information held within electronic health records is in written form, sometimes referred to as unstructured text. However, these records are often difficult to access, can contain incomplete information that is challenging to analyse, and require specialised computational tools to process. Despite progress in NLP, efforts to exploit unstructured health data are uncoordinated, siloed and fragmented, with little sharing of libraries, tools, datasets and models of governance.


The project, funded by HDR UK, brings together open source tools, datasets, APIs and standards developed by an open and inclusive NLP research community. These resources support clinicians and researchers in unlocking the rich data contained within electronic health records to deliver improvements in healthcare. The project also supports joint working across industry, academia and the NHS, and is building the foundations of a sustainable model for collaborative working in NLP.

Impact and Outcomes

The National Text Analytics project is enabling a major shift in the UK’s ability to make electronic health records research-ready and actionable, in real time and at large scale, by delivering data-driven systems with the potential to transform patient care. Nearly 100 NLP applications, datasets and models for healthcare text analytics have been developed and are shared by the HDR UK text community. Our work has received national recognition for its contribution towards improving the care of patients in areas such as mental health, cancer and COVID-19.

Our far-reaching impact spans key areas including:

Community building:

We are bringing together the siloed and fragmented NLP and health informatics community by partnering with universities and NHS organisations across the UK, the Healtex network (the UK healthcare text analytics network), data providers, government healthcare agencies and industry. We have undertaken the first national review of clinical NLP in the UK, mapping NLP activities over the last 5 years. The review provides valuable information on work being carried out across the country and facilitates new opportunities for collaboration.

Tools, APIs and standards:

We have built a Portal for sharing NLP tools, datasets, APIs and standards with the NLP community. Our tools have led to significant impacts on direct patient care (e.g. alerting to safety events and improving the safety of prescribing), delivery of electronic audit and feedback dashboards (e.g. more efficient coding of orthopaedic procedures led to a 20% increase in procedures identified), streamlining of trial recruitment (e.g. NLP tools enabled twice as much recruitment) and population research. They now have significant investment from NHSX for wider NHS roll-out and scaling.

Our community developed an open source healthcare analytics platform, combining enterprise search, NLP, analytics and visualisation technologies, that has been implemented across multiple hospitals internationally. CogStack, a multi-award-winning information retrieval and extraction framework, has facilitated the development of successful data-driven systems. One example is the Psychosis Clinical Academic Group (CAG) Population Health Management Platform, a more proactive and responsive system that is being used to coordinate care for people with non-affective psychosis. It provides innovative, interactive visualisation to support special monitoring and treatment of patients before and after discharge, and interacts with electronic health records to support the planning of integrated services with local General Practitioners.

The federated multi-trust analyses we have carried out have produced the largest ever AI-based NLP models trained on NHS documents (17 million documents to date). The software used for this, hosted on GitHub, has been widely adopted by the community. For example, the MedCAT NLP toolkit for medical concept detection, which links text to SNOMED or other terminologies, has multiple installs across the NHS and is used in academia and industry (>10 companies).
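To illustrate what "linking text to a terminology" means in practice, the sketch below shows a toy dictionary-based concept linker. It is not MedCAT's API or method, only a simplified illustration of the general idea. The vocabulary, the concept identifiers (which are placeholders, not real SNOMED CT codes), and the names `VOCAB`, `Mention` and `link_concepts` are all assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-vocabulary: surface form -> placeholder concept ID.
# These IDs are illustrative only and are NOT real SNOMED CT codes.
VOCAB = {
    "atrial fibrillation": "CONCEPT_AF",
    "af": "CONCEPT_AF",
    "breast cancer": "CONCEPT_BREAST_CANCER",
    "psychosis": "CONCEPT_PSYCHOSIS",
}


@dataclass(frozen=True)
class Mention:
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    text: str        # the matched text as written in the note
    concept_id: str  # placeholder ID of the linked concept


def link_concepts(text, vocab=VOCAB):
    """Find vocabulary terms in free text and attach their concept IDs."""
    # Try longer terms first so multi-word terms win over shorter ones.
    terms = sorted(vocab, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        Mention(m.start(), m.end(), m.group(0), vocab[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


note = "History of AF. Referred for breast cancer screening."
mentions = link_concepts(note)
for mention in mentions:
    print(mention)
```

A plain dictionary lookup like this cannot handle ambiguity, negation or misspellings, which is one reason trained models are needed for real clinical text.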

Our partner-led approach is driving forward efforts to improve access to unstructured data for research and to develop trusted models of governance and standards that can be adopted for new datasets. It focuses on challenging areas within the field, including temporality, de-identification, semantic annotation (with ICD or SNOMED, for example) and integration with linkage pipelines. We are also leading work with key stakeholders, including patients and the public, to develop governance models around access to data and to understand stakeholder views on the acceptability of establishing a free text databank for research.

Our tools are being adopted within HDR UK specialist communities (e.g. the DATAMIND Hub for Mental Health Informatics Research Development). To facilitate discoverability by the research community, tools are registered in the HDR UK Innovation Gateway and all code is made open source on GitHub.

Delivery of novel, high impact, multi-site exemplars:

We are testing our tools in three important clinical areas, with promising results: atrial fibrillation, breast cancer, and physical multimorbidities in patients with severe mental illness. In each area we are applying NLP tools to free text data to validate the methods and the performance of the tools, and exploring the potential of NLP methods to predict recurrence and identify important information about diseases.


To build capacity among health data scientists and NLP researchers, we are delivering national training programmes with the aim of developing capability for UK-wide NLP research at scale. These include new “bitesize” video training for the Health Data Research UK ‘Futures’ continued professional development curriculum.