Health data research explained

Health data research is a way of gathering, analysing and linking information about people and their health to improve healthcare for all.

Using health data for research helps us better understand diseases and health conditions, such as understanding their causes and symptoms or knowing how many people are affected. It provides new ways of identifying people most at risk of becoming ill, diagnosing diseases earlier, and providing better care and treatment. And it helps health services to run more efficiently and effectively, so everyone can get the care that they need.

Health data research has huge potential to transform healthcare, now and in the future. It has played a major role in the COVID-19 pandemic by helping scientists and doctors understand more about this new disease, and enabling the NHS and decision-makers to respond to the challenge.

Learn about health data research

What is health data?

Health data refers to data related to a person’s health: their health conditions, information relating to maternity and children, causes of death, and quality of life. Health data includes, for example: patient health records, studies about the health of groups of people, data from blood or tissue samples, imaging data, and data from health and fitness devices.

Every day, large amounts of health-related data (known as datasets) are generated by the NHS and other health and care services, and from other sources such as academic studies and research. Some of these datasets come from general information about local or national populations, like the number of babies being born or how many people are admitted to hospital on any day. Other types of data are very personal, such as individual medical records or scans.

What is health data research?

Health data research is an exciting and developing area that can make positive changes in health and care for everyone. It combines maths, statistics, and technology to manage and analyse very large amounts of different health datasets across our health and care systems. It is a way of gathering, analysing and linking information about people and their health so that we can make advances in healthcare and ultimately improve healthcare for all.

There are lots of examples already of how it has improved our knowledge and helped to solve challenging health problems:
Combatting the COVID-19 pandemic has depended upon the ability to collect, link, access and use health data for research. It has allowed the NHS to identify and protect millions of people at high risk of COVID-19, to deliver and monitor the safety and effectiveness of the COVID-19 vaccination programme, and to identify life-saving treatments for COVID-19, including dexamethasone.

These benefits must not stop with COVID-19. They must also extend to people living with other conditions such as mental illness, cancer, heart disease and diabetes. There is huge potential to make better use of information from people’s patient records. Data is vital for your individual care, and to improve health, care, and services across the NHS. If data from many different patients are linked up and pooled, researchers and doctors can look for patterns in the data, helping them develop new ways of predicting or diagnosing illness, and identify ways to improve clinical care.

Click the link to see a diagram from Understanding Patient Data which gives some examples of how health data can be used to make improvements to your individual care, in understanding and diagnosing diseases, improving treatments, evaluating healthcare policies, and for planning NHS services and ensuring patient safety.

Understanding Patient Data diagram showing how data can be used to improve people’s lives.

The information from health data is really valuable to help understand more about disease, to develop new treatments, to monitor safety, to plan services and to evaluate NHS policy.

Data really does save lives. To view a series of animations about how better use of health data has made improvements for people with various health conditions, please go to Understanding Patient Data’s animations page.
Researchers can apply to access these health datasets for research and innovation. There are strict controls on how researchers can access health data information. The purpose must be approved before anyone can use the data, and they are only given access to the minimum amount of data necessary. The types of organisations that may use health data include:
- NHS providers and commissioners: use data to monitor trends and patterns in hospital activity, to assess how care is provided, and to support local service planning
- University researchers: use data to understand more about the causes of disease, to develop new ways of diagnosing illness or to identify ways to develop new treatments
- Charities: use data to evaluate services and identify ways to improve care
- Companies: use data if they are partnered with the NHS to provide care and research. The NHS can’t do all of the analysis on its own and companies may have the best expertise and technologies for making sense of large and complex data from hospitals, or for developing new treatments. If you have more questions about how and why companies access health data, please go to this webpage https://understandingpatientdata.org.uk/companies.
Who are health data scientists?

Health data scientists come from a range of backgrounds and include health researchers, innovators, technology specialists, mathematicians, and statisticians. At Health Data Research UK our community includes:
- doctors working in the NHS with an interest in using data for research
- academic scientists working in universities who use data to understand human health
- colleagues from industry who are using data to develop new tests and treatments
Everyone who uses health data for research and innovation has to work within legal frameworks. This includes the strict parameters of the Codes of Practice and the standards set out by the National Data Guardian and regulatory bodies including the Information Commissioner’s Office. Health data research is for everyone, so it’s important to us that we know what people think about it. We work with patients and the public to find out what matters to them and the kinds of research that should be done with health data.
The organisation which currently holds your data (known as the data custodian) – your GP practice, community health services, mental health or hospital trust, medical research charity, UK Biobank (a large-scale biomedical database and research resource, containing in-depth genetic and health information from half a million UK participants) or disease registry – is responsible for keeping your data safe.

Every member of staff who works for these organisations has a legal obligation to keep information about you confidential. For example, in the NHS, organisations maintain a duty of confidentiality by conducting annual training and awareness, ensuring access to personal data is limited to the appropriate staff and information is only shared with organisations and individuals that have a legitimate and legal basis for access.

Where systems have been set up to collect data from GP practices, NHS trusts, Health Boards and so on – for instance by NHS Digital in England, the Information Services Division of NHS Scotland, NHS Wales, and Health and Social Care Northern Ireland – the organisations collecting the data have responsibility for it. It is collected and, in most cases, de-identified before being released for research.

Data privacy is incredibly important in health data research. In almost all cases, except where it is directly relevant for personal healthcare, any identifying information (such as people’s names) is removed from datasets used in research. You can find out more about this process in this guide from Understanding Patient Data.
The datasets used by health data scientists come from lots of different sources including:
- Patient data from the NHS and social care, including hospital and GP data such as dates and times of appointments, through to information about treatment, medical and diagnostic tests
- Studies about the health of groups of people, which may be based on a particular health condition, such as cancer, or issues that affect people’s health, like smoking
- Data from blood or tissue samples such as information about genetic makeup or biological markers of disease
- Data from images such as X-rays, MRI and CT scans that contain a huge amount of information
- Data from clinical trials investigating tests and treatments for health conditions
- Health and fitness devices, which provide data on things like heart rate, activity and calories
Diagram designed by HDR UK.

When a researcher or innovator (someone who can use health data to make discoveries that lead to patient benefit) wants to have access to a dataset, they have to apply by submitting a data access request. This is a form, and while the forms vary slightly between data custodians, they all serve to establish why a researcher is asking for the data.

The form will ask for details on why the user needs the dataset for their research and how, by gaining access, they will use the data to improve health and care services and generate public benefit.

The data access request is submitted to the data custodian (the organisation that holds the data). If a dataset is listed on the Innovation Gateway, researchers can submit their data access request through the Gateway, and this will be passed to the data custodian.

How is a data access request approved?

The data custodian will make the ultimate decision on whether access is granted to the researcher, and there will be a process in which they review the data access requests. For example, many data custodians have a Data Access Committee; this is a committee or an equivalent body such as a project steering committee, that is involved in access requests and overseeing the management and administration of data access.

They will meet and discuss the request for access and decide whether it should be approved or not, and they may make recommendations or ask questions of the researchers to clarify or improve their request. They may use set criteria to make their decision, and lay members may be members of these committees.
The Five Safes is a framework to help make decisions that will enable effective use of data which is confidential or sensitive. The Five Safes may be used when data custodians and these data access committees consider whether to approve or deny a data access request from a researcher. They will consider why the data is needed, who is accessing the data, and how the data will be protected.

It breaks down decisions surrounding data access and use into five separate dimensions, and these are likely to be used by data custodians through their data access approval processes.

Diagram designed by HDR UK from information on the UK Data Service webpage.
- Safe People: can the data user be trusted to use the data in an appropriate manner? Do the researchers have the knowledge and skills to act in accordance with the required standards of behaviour?
- Safe Projects: is the use of this data appropriate, lawful, ethical, and sensible? Is the project expected to deliver public benefit?
- Safe Settings: does the tool that the researcher is using to access the data prevent unauthorised use or mistakes? Are there controls on the way the data is accessed, both from a technology perspective and considering the physical environment?
- Safe Data: is there a disclosure risk in the data itself? Has the data been treated appropriately to minimise the potential for identification of individuals or organisations?
- Safe Outputs: do the results of the research using the data prevent someone from identifying individuals from the data? What can be done to minimise risk when releasing the findings of the project?
There are various ways that health data is protected: by removing identifying information, using an independent review process, ensuring strict legal contracts are in place before data is transferred, and implementing robust data security standards. When researchers are granted access to a dataset, their use of it will be monitored carefully by the data custodian to ensure they are using it appropriately.

A legal contract must be signed before data can be transferred or accessed. This sets out strict rules about what an organisation can do with the data and has clear restrictions on what is not allowed. This data sharing contract may include:
- what data will be provided, and how
- the purpose for which the data can be used
- when and how data must be destroyed after use
- the data security requirements that must be followed
- what an organisation must not do with the data:
  - data cannot be used in any way to re-identify an individual
  - data cannot be linked with any other data, unless explicitly approved in the application
  - data cannot be passed to any third parties, unless explicitly approved in the application
Data must also be stored securely, with controlled access and robust IT systems to keep it safe. The organisation can be audited to check that data is being used appropriately. Technology can be used to protect data, for example by restricting access (using passwords or swipe cards to control access to data) or using encryption so the data can only be read with a code.

IT systems must be kept up to date to protect against viruses and hacking. Anyone accessing data must have appropriate training and be approved by the organisation. Finally, there must be an audit trail that records every time that personally identifiable data is viewed or used.

In many cases it is safer when the data is not transferred between the data custodian and the researcher. This can be done inside a Trusted Research Environment (TRE), also known as a ‘Data Safe Haven’. TREs are highly secure computing environments that provide remote access to health data for approved researchers to use in research that can save and improve lives.

Click here to find out more about Trusted Research Environments
Hopefully the above information has helped your understanding of health data research, but if you would still like further information you may find the following links useful.
If you know of a resource that you think would be useful to include here, please get in touch with the Patient and Public Involvement and Engagement team at Involvement@hdruk.ac.uk.
We have created this glossary to help you understand key terms in health data research. We endeavour to update this regularly, but if you notice any key words or definitions that you think are missing, please get in touch with the Patient and Public Involvement and Engagement team at Involvement@hdruk.ac.uk.

Access Request/Data Access Request

Where a researcher submits a form to ask a data custodian for access to a dataset.

Artificial Intelligence

Artificial Intelligence (AI) is the theory and development of computer systems that are able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.

AI systems or machines can mimic human intelligence to perform tasks and can ‘learn’ and improve themselves based on the information that they collect.

The Caldicott Principles

The Caldicott Principles are a set of eight principles to ensure people’s information is kept confidential and used appropriately. These principles apply to the use of confidential information within health and social care organisations and when such information is shared with other organisations and between individuals, both for individual care and for other purposes.
- Principle 1: Justify the purpose(s) for using confidential information
- Principle 2: Use confidential information only when it is necessary
- Principle 3: Use the minimum necessary confidential information
- Principle 4: Access to confidential information should be on a strict need-to-know basis
- Principle 5: Everyone with access to confidential information should be aware of their responsibilities
- Principle 6: Comply with the law
- Principle 7: The duty to share information for individual care is as important as the duty to protect patient confidentiality
- Principle 8: Inform patients and service users about how their confidential information is used
To find out more, please go to this webpage The Caldicott Principles – GOV.UK (www.gov.uk).

Dataset

A dataset is a collection of data.

Data and Connectivity

A Health Data Research UK (HDR UK) National Core Study making data from all studies available and accessible to inform decision makers and catalyse COVID-19 research.

Data Access

Refers to the availability of data and the process of obtaining data for research. Not all data access is the same and researchers may need more or fewer types of data for specific projects. The conditions under which access to data is granted often vary by project, researcher, and data controller.

Data Custodian/Data Controller

A term used to describe an individual or organisation that controls why and how any health data is accessed and used for research. It is the responsibility of the Controller to ensure that any processing of personally identifiable data is safe and lawful.

Data linkage

Data linkage is a method of bringing together information from different sources about the same person or population to create a new, richer dataset. For example, a dataset with information about outcomes from COVID-19 in Yorkshire could be linked with a dataset about outcomes from COVID-19 in London, or information about health could be linked to educational information.

The linkage of information from different information sources enables us to understand how different factors may influence health outcomes, and is valuable in helping to inform policy and research into health and wellbeing.
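As a rough illustration of the idea, the sketch below links a small health dataset to a small education dataset using a shared identifier. Everything in it is hypothetical: the field names (`person_id`, `covid_outcome`, `highest_qualification`), the records, and the `link_datasets` helper are invented for this example, and real linkage is carried out by data custodians under the controls described on this page.

```python
# Hypothetical, made-up records; the identifiers are not real people.
health_records = [
    {"person_id": "p1", "covid_outcome": "recovered"},
    {"person_id": "p2", "covid_outcome": "hospitalised"},
    {"person_id": "p3", "covid_outcome": "recovered"},
]
education_records = [
    {"person_id": "p1", "highest_qualification": "degree"},
    {"person_id": "p3", "highest_qualification": "a_level"},
    {"person_id": "p4", "highest_qualification": "gcse"},
]


def link_datasets(left, right, key):
    """Join two lists of records on a shared key.

    Only people who appear in both datasets are kept in the result.
    """
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two records into one richer record.
            linked.append({**row, **match})
    return linked


linked = link_datasets(health_records, education_records, "person_id")
for row in linked:
    print(row)
```

In this sketch only the two people present in both datasets end up in the linked result, which is one reason linked datasets can be smaller than either source.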

Data Sharing

The disclosure of data from one or more organisations to another organisation or organisations, or the sending of data between different parts of a single organisation. This can take the form of routine data sharing, where the same data sets are shared between the same organisations for an on-going established purpose; and exceptional, one-off decisions to share data for a specific purpose.

Data Use Register

A data use register is a register or list of the data that a data custodian has approved for access for research. It is a public record of how data is being used, by whom and for what purpose.

Data Utility

A term used to refer to the usefulness of a dataset for a given purpose.

De-identified Data

De-identified data refers to data from which all personally identifiable information has been removed to protect individual identities and privacy. The information removed includes items such as your NHS number, name, address, and date of birth (except year).

Disclosive Data

Any information from which you can deduce information about someone or identify individuals.

Electronic Health Record

An Electronic Health Record (EHR) or Electronic Medical Record (EMR) is a digital version of a patient’s medical history. EHRs are real-time, patient-centred records that make information available instantly and securely to authorized users.

FAIR Principles

The FAIR principles contain guidelines for good data management practice that aim to make data FAIR: Findable, Accessible, Interoperable, and Reusable.
- Findable: this means that data can be discovered by both humans and machines, for instance by exposing meaningful machine-actionable metadata and keywords to search engines and research data catalogues.
- Accessible: this means that data are archived in long-term storage and can be made available using standard technical procedures. This does not mean that the data have to be openly available for everyone, but information on how the data could be retrieved (or not) has to be available.
- Interoperable: this means that the data can be exchanged and used across different applications and systems – also in the future, for example, by using different file formats. It also means that the data can be integrated with other data from the same research field or data from other research fields.
- Reusable: this means that the data are well documented and curated and provide rich information about the context of data creation. The data should conform to community standards and include clear terms and conditions on how the data may be accessed and reused. Find out more here.
Federated

In data, the term federated means that datasets are held separately in different locations.

Federated Analytics

In data, federated analytics is where various distributed datasets are analysed, and research with those datasets can be carried out as though they are a single data source. The datasets are never combined together, and always stay separated, but researchers can analyse them as though they were a combined dataset.
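A minimal sketch of the idea, assuming two hypothetical sites that each hold their own measurements: each site shares only a count and a total, and a coordinator combines those summaries to get the overall average. The site data and the function names `local_summary` and `combine_means` are invented for illustration; real federated analytics platforms are considerably more involved.

```python
def local_summary(values):
    """Run separately at each site: return aggregates only, never individual rows."""
    return {"n": len(values), "total": sum(values)}


def combine_means(summaries):
    """Run by the coordinator: combine per-site aggregates into one overall mean."""
    n = sum(s["n"] for s in summaries)
    total = sum(s["total"] for s in summaries)
    return total / n if n else None


# Hypothetical measurements held at two separate locations.
site_a = [120, 130, 110]
site_b = [140, 150]

# Only the summaries leave each site; the raw lists are never combined.
summaries = [local_summary(site_a), local_summary(site_b)]
federated_mean = combine_means(summaries)
print(federated_mean)
```

The result matches what a researcher would get from a single pooled dataset, even though the individual records never leave their original location.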

Health Data

Refers to data related to health conditions, reproductive outcomes, causes of death, and quality of life. Health data includes, for example: patient data, studies about the health of groups of people, data from blood or tissue samples, imaging data, and data from health and fitness devices.

Health data research

A growing area of work that combines maths, statistics and technology to manage and analyse very large amounts of different datasets across our health and care systems. The information we get from health data research will enable us to make advances in healthcare.

Health Data Research Innovation Gateway

The Health Data Research Innovation Gateway (‘Gateway’) provides a common entry point for researchers and innovators (anyone who can use health data to make discoveries that lead to patient benefit i.e., researchers, clinicians, health data scientists, industry researchers) to discover and request access to health data held within UK health datasets. It does not hold or store any health data. It allows researchers to see descriptions (metadata) of health datasets available in the UK so that they can request access to them for research.

Health Data Research Hubs

HDR UK centres of excellence with expertise, tools, knowledge and ways of working to maximise the insights and innovations developed from the health data.

Information Commissioner’s Office

The UK’s independent body set up to uphold information rights https://ico.org.uk/about-the-ico/who-we-are/.

Innovator

Anyone who can use health data to make discoveries that lead to patient benefit i.e., researchers, clinicians, health data scientists, industry researchers.

Longitudinal Population Study (including cohort and panel studies)

Longitudinal Population Studies (LPS) track the health of large groups of people over time. These include cohorts, panel surveys and biobanks. They are valuable resources for the scientific community as there is no other way of understanding how biological, social and environmental factors interact over time in a population to produce health outcomes. LPS become more valuable with time, and often develop in unexpected ways, yielding outcomes that could not have been predicted at the outset.

Machine Learning

Machine learning is a type of artificial intelligence that provides computer programs with the ability to automatically learn and improve from experience, without being explicitly programmed. It focuses on the development of computer algorithms that can access data and use it to learn for themselves.

Metadata

Descriptions and information about data, for example how many records, quality of certain details, or where further information can be found. Each dataset listed on the Gateway has metadata associated with it which can help users decide whether it would be of use to their work and whether they would be eligible to access the dataset.

National Core Studies

The National Core Studies (NCS) programme is enabling the UK to use health data and research to inform both our near and long-term responses to COVID-19, as well as accelerating progress to establish a world-leading health data and research infrastructure for the future.

Epidemiology and Surveillance – collecting data to inform safe levels of restrictions and protection against imminent outbreaks (led by Ian Diamond, UK National Statistician, Office for National Statistics).

Clinical Trials Infrastructure – building on established NIHR infrastructure (and equivalent in the devolved administrations) to accelerate delivery of large scale COVID trials for drugs and vaccines (led by Patrick Chinnery, Clinical Director, MRC and Divya Chadha Manek, Head of Business Development, Vaccines Task Force).

Transmission and Environment (also known as PROTECT) – improving our understanding of how the virus is transmitted in different settings and environments, including in workplaces, on transport and in other public places (led by Andrew Curran, Chief Scientific Adviser, Health and Safety Executive).

Immunity – understanding immunity against COVID to inform back-to-work policies (led by Paul Moss, Professor of Haematology, University of Birmingham).

Longitudinal Health and Wellbeing – using data from longitudinal studies to address the impact of COVID-19 and other associated viral suppression measures on health and wealth to inform mitigating strategies (led by Nishi Chaturvedi, Professor of Clinical Epidemiology, University College London).

Data and Connectivity – making data from all of the above studies (and wider) available and accessible to inform decision makers and catalyse COVID-19 research (led by Andrew Morris, Director, HDR UK).

National Data Guardian

The National Data Guardian (NDG) advises and challenges the health and care system to help ensure that citizens’ confidential information is safeguarded securely and used properly.

Office for National Statistics Framework

The Office for National Statistics (ONS) is committed to providing access to de-identified survey and administrative data for statistical research that delivers a public benefit for the UK. Access is granted to researchers in a safe and consistent way, in line with legislation and ONS policies. We believe that better statistics lead to better decisions and that the research community is pivotal to achieving our national goals. Their research helps the UK understand and make informed decisions regarding the economy, society and people’s quality of life.

This framework provides certainty and clarity on the governance arrangements adopted by the ONS for access to data for research purposes, including the conditions under which access to these data is granted and the framework for accrediting researchers, research projects and data processors. This will help to ensure a consistent and transparent approach for access to data for research purposes so that the benefits are more easily realised, and the confidentiality of the data is protected.

Patient Data

Data that is collected about a patient whenever they go to a doctor or receive social care. It may include details about the individual’s physical or mental health, such as height and weight or detail of any allergies, and their social care needs and services received. It may also include next of kin information.

Pseudonymisation

Pseudonymisation means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject or individual without the use of additional information. This process de-identifies data and replaces personally identifiable information fields within the data record with artificial identifiers. For example, the NHS numbers of the individuals whose data is represented will be removed from a dataset, and the exact date of birth and home postcode may be replaced with the year of birth and a larger postal area.
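The sketch below shows one way this could look in practice. It is an illustration under stated assumptions, not a description of how any particular data custodian works: the use of a keyed hash (HMAC-SHA256) to create the artificial identifier, the secret key, the field names and the example record are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data custodian. It plays the role of
# the "additional information" needed to attribute data back to a person.
SECRET_KEY = b"example-key-held-by-the-data-custodian"


def pseudonymise(record, key=SECRET_KEY):
    """Return a copy of the record with direct identifiers replaced or coarsened."""
    pseudo_id = hmac.new(
        key, record["nhs_number"].encode("utf-8"), hashlib.sha256
    ).hexdigest()[:16]
    return {
        "pseudo_id": pseudo_id,                           # artificial identifier
        "year_of_birth": record["date_of_birth"][:4],     # year only
        "postcode_area": record["postcode"].split()[0],   # larger postal area
        "diagnosis": record["diagnosis"],
    }


# Made-up record; the NHS number here is a placeholder, not a real one.
patient = {
    "nhs_number": "0000000000",
    "date_of_birth": "1980-05-17",
    "postcode": "LS1 4AP",
    "diagnosis": "asthma",
}

print(pseudonymise(patient))
```

Because the same NHS number always produces the same artificial identifier under the same key, records about one person can still be linked together without revealing who that person is.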

Public Contributor (Lay Member)

The term public contributor (or lay member) is widely used to refer to a member of the public or a patient who is involved in research or activities with an organisation, for example by being part of a board or group, to provide their personal experiences and views and be a ‘critical friend’ for the organisation. Lay members will have an expert understanding of what matters most for people and help to bring that perspective.

Research Priorities – Better Care

The HDR UK Better Care programme aims to improve people’s lives by equipping clinicians and patients in the UK with the best possible data-based information to make decisions about their care. Our vision is that by 2030, patients across the UK will benefit from healthcare decisions informed by large scale data and advanced analytics to identify what will work best for them.

Research Priorities – Clinical Trials

Getting enough people for a clinical trial is critical, as there is a level below which a trial’s results are just not robust enough. These days, a large portion of NHS records are electronic, so we are able to search the database for all patients that could benefit from a particular trial, and source participants from the entire population. The HDR UK Clinical Trials programme uses health data to ensure that every individual across the UK has access to the latest treatments and technologies through access to clinical trials.

Clinical trials track the long-term health of participants by having consented access to participants’ NHS medical records. This allows for more efficient and longer-term follow-up processes. We want to provide a system that all clinical triallists can use to identify large numbers of the right trial participants. We also work on new approaches to analysing trial results.

Research Priorities – Improving Public Health

The causes of poor health can be unexpected, long or short term, as well as local or global. We don’t always have the right data to know what effect a behaviour or intervention might have on public health. In many cases, the data we need already exists in GP surgeries and hospitals; the challenge is to create the infrastructure to link it all together. The HDR UK Improving Public Health programme aims to develop the capability to identify the things that cause ill-health in people, as well as the factors that result in improved health – no matter where they live or what their socioeconomic background – across the whole population of the UK.

We will enable data science to transform public health research through linking to data beyond health care, for example, to other government sectors, organisations, and data on environments that influence our health.

Research Priorities – Understanding the Causes of Diseases

The HDR UK Understanding the Causes of Disease programme aims to help advance understanding of disease prediction, causation, and progression through the integration of molecular data and other intermediate phenotypes with routine clinical data. The overarching hypothesis is that insights into biology and disease aetiology (the causes of a disease or condition) can be revealed by integration of information, at scale, on genomics, other biomolecular traits, and high-resolution electronic health records (EHRs).

The ultimate vision is to create new informatics infrastructures and data science methods that help achieve a deep integration of biology, biomedicine, and population health science.

Trusted Research Environment (Data Safe Haven)

Trusted Research Environments, also known as ‘Data Safe Havens’, are highly secure spaces for researchers to access sensitive data. They are based on the idea that researchers should access and use data within a single secure environment. In other words: users go to the data, the data doesn’t travel to them. Trusted Research Environments have multiple layers of security and safeguards in place, designed to minimise the risk of anyone’s data being misused.

UK Health Data Research Alliance

HDR UK’s Alliance of leading health, care and research organisations united to establish best practice around the ethical use of UK health data for research and innovation at scale. Alliance members have listed their datasets on the Gateway.

Involving and engaging patients and the public

We ensure our work is worthy of public trust and confidence by meeting people's needs. Visit our pages to see how you can get involved as a public member. If you're a researcher, take a look at our training and support.
