We have created this glossary to help you understand key terms in health data research. We endeavour to update this regularly, but if you notice any key words or definitions that you think are missing, please get in touch with the Patient and Public Involvement and Engagement team at Involvement@hdruk.ac.uk.
Access Request/Data Access Request
Where a researcher submits a form to ask a data custodian for access to a dataset.
Artificial Intelligence
Artificial Intelligence (AI) is the theory and development of computer systems that are able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
AI systems or machines can mimic human intelligence to perform tasks and can ‘learn’ and improve themselves based on the information that they collect.
The Caldicott Principles
The Caldicott Principles are a set of eight principles to ensure people’s information is kept confidential and used appropriately. These principles apply to the use of confidential information within health and social care organisations and when such information is shared with other organisations and between individuals, both for individual care and for other purposes.
- Principle 1: Justify the purpose(s) for using confidential information
- Principle 2: Use confidential information only when it is necessary
- Principle 3: Use the minimum necessary confidential information
- Principle 4: Access to confidential information should be on a strict need-to-know basis
- Principle 5: Everyone with access to confidential information should be aware of their responsibilities
- Principle 6: Comply with the law
- Principle 7: The duty to share information for individual care is as important as the duty to protect patient confidentiality
- Principle 8: Inform patients and service users about how their confidential information is used
To find out more, see The Caldicott Principles on GOV.UK (www.gov.uk).
Dataset
A dataset is a collection of data.
Data and Connectivity
A Health Data Research UK (HDR UK) National Core Study making data from all studies available and accessible to inform decision makers and catalyse COVID-19 research.
Data Access
Refers to the availability of data and the process of obtaining data for research. Not all data access is the same, and researchers may need more or fewer types of data for specific projects. The conditions under which access to data is granted often vary by project, researcher, and data controller.
Data Custodian/Data Controller
A term used to describe an individual or organisation that controls why and how any health data is accessed and used for research. It is the responsibility of the Controller to ensure that any processing of personally identifiable data is safe and lawful.
Data linkage
Data linkage is a method of bringing together information from different sources about the same person or population to create a new, richer dataset. For example, linking a dataset about outcomes from COVID-19 in Yorkshire with a dataset about outcomes from COVID-19 in London, or linking information about health to educational information.
The linkage of information from different information sources enables us to understand how different factors may influence health outcomes, and is valuable in helping to inform policy and research into health and wellbeing.
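As a rough illustration of the idea, the short Python sketch below links a health dataset to an education dataset using a shared identifier. The records, the field names and the `link_records` function are all made up for this example; real linkage is carried out by specialist teams under strict safeguards and usually has to cope with missing or mismatched identifiers.

```python
# Two made-up datasets that describe some of the same people.
# "person_id" is a hypothetical shared identifier, not a real NHS field.
health_records = [
    {"person_id": "A1", "asthma": True},
    {"person_id": "B2", "asthma": False},
    {"person_id": "C3", "asthma": True},
]
education_records = [
    {"person_id": "A1", "school_absence_days": 12},
    {"person_id": "C3", "school_absence_days": 3},
    {"person_id": "D4", "school_absence_days": 0},
]


def link_records(left, right, key):
    """Join two lists of records on a shared key.

    Only people who appear in both datasets are kept.
    """
    right_by_key = {row[key]: row for row in right}
    linked = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            # Merge the two records into one richer record.
            linked.append({**row, **match})
    return linked


linked_dataset = link_records(health_records, education_records, "person_id")
for row in linked_dataset:
    print(row)
```

Only the two people who appear in both datasets end up in the linked dataset, and each linked record now holds both health and education information.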
Data Sharing
The disclosure of data from one or more organisations to another organisation or organisations, or the sending of data between different parts of a single organisation. This can take the form of routine data sharing, where the same data sets are shared between the same organisations for an on-going established purpose; and exceptional, one-off decisions to share data for a specific purpose.
Data Use Register
A data use register is a register or list of data to which a data custodian has granted access for research. It is a public record of how data is being used, by whom and for what purpose.
Data Utility
A term used to refer to the usefulness of a dataset for a given purpose.
De-identified Data
De-identified data refers to data from which all personally identifiable information has been removed to protect individual identities and privacy. The information removed includes items such as your NHS number, name, address, and date of birth (except the year).
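The sketch below shows, in Python, what removing these items from a single record might look like. It is a simplified illustration only: the record, the field names and the `de_identify` function are assumptions made for this example, and real de-identification also has to consider whether the remaining details could identify someone when combined.

```python
from datetime import date

# Hypothetical field names for details that directly identify a person.
DIRECT_IDENTIFIERS = {"nhs_number", "name", "address"}


def de_identify(record):
    """Return a copy of the record without direct identifiers.

    The full date of birth is dropped and only the year is kept.
    """
    cleaned = {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS and field != "date_of_birth"
    }
    cleaned["year_of_birth"] = record["date_of_birth"].year
    return cleaned


# A made-up patient record (not real data).
patient = {
    "nhs_number": "000 000 0000",
    "name": "Example Person",
    "address": "1 Example Street",
    "date_of_birth": date(1980, 5, 17),
    "diagnosis": "asthma",
}

safe_record = de_identify(patient)
print(safe_record)
```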
Disclosive Data
Any information from which you can deduce information about someone or identify individuals.
Electronic Health Record
An Electronic Health Record (EHR) or Electronic Medical Record (EMR) is a digital version of a patient’s medical history. EHRs are real-time, patient-centred records that make information available instantly and securely to authorised users.
FAIR Principles
The FAIR principles contain guidelines for good data management practice that aim to make data FAIR: Findable, Accessible, Interoperable, and Reusable.
- Findable: this means that data can be discovered by both humans and machines, for instance by exposing meaningful machine-actionable metadata and keywords to search engines and research data catalogues.
- Accessible: this means that data are archived in long-term storage and can be made available using standard technical procedures. This does not mean that the data have to be openly available for everyone, but information on how the data could be retrieved (or not) has to be available.
- Interoperable: this means that the data can be exchanged and used across different applications and systems – also in the future, for example, by using different file formats. It also means that the data can be integrated with other data from the same research field or data from other research fields.
- Reusable: this means that the data are well documented and curated and provide rich information about the context of data creation. The data should conform to community standards and include clear terms and conditions on how the data may be accessed and reused. Find out more here.
Federated
In data, the term federated means that datasets are held separately in different locations.
Federated Analytics
In data, federated analytics is where various distributed datasets are analysed, and research with those datasets can be carried out as though they are a single data source. The datasets are never combined together, and always stay separated, but researchers can analyse them as though they were a combined dataset.
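One simple way to picture this is calculating an average across several sites without moving any individual records. In the Python sketch below, each site shares only a total and a count, and these summaries are combined centrally. The three sites, their values and the function names are invented for illustration; real federated systems use more sophisticated methods and extra safeguards.

```python
# Made-up blood pressure readings held separately at three sites.
# In a real federated setup these lists would never leave their site.
site_a = [120, 135, 128]
site_b = [140, 110]
site_c = [125, 130, 122, 138]


def local_summary(values):
    """Run at each site: share only a total and a count, not the records."""
    return {"total": sum(values), "count": len(values)}


def combine(summaries):
    """Run centrally: work out the overall average from the site summaries."""
    total = sum(s["total"] for s in summaries)
    count = sum(s["count"] for s in summaries)
    return total / count


summaries = [local_summary(site) for site in (site_a, site_b, site_c)]
federated_mean = combine(summaries)
print(round(federated_mean, 2))
```

The result is the same average a researcher would get from a single combined dataset, even though the individual readings were never brought together.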
Health Data
Refers to data related to health conditions, reproductive outcomes, causes of death, and quality of life. Health data includes, for example: patient data, studies about the health of groups of people, data from blood or tissue samples, imaging data, and data from health and fitness devices.
Health data research
A growing area of work that combines maths, statistics and technology to manage and analyse very large amounts of different datasets across our health and care systems. The information we get from health data research will enable us to make advances in healthcare.
Health Data Research Innovation Gateway
The Health Data Research Innovation Gateway (‘Gateway’) provides a common entry point for researchers and innovators (anyone who can use health data to make discoveries that lead to patient benefit i.e., researchers, clinicians, health data scientists, industry researchers) to discover and request access to health data held within UK health datasets. It does not hold or store any health data. It allows researchers to see descriptions (metadata) of health datasets available in the UK so that they can request access to them for research.
Health Data Research Hubs
HDR UK centres of excellence with expertise, tools, knowledge and ways of working to maximise the insights and innovations developed from health data.
Information Commissioner’s Office
The UK’s independent body set up to uphold information rights https://ico.org.uk/about-the-ico/who-we-are/.
Innovator
Anyone who can use health data to make discoveries that lead to patient benefit i.e., researchers, clinicians, health data scientists, industry researchers.
Longitudinal Population Study (including cohort and panel studies)
Longitudinal Population Studies (LPS) track the health of large groups of people over time. These include cohorts, panel surveys and biobanks. They are valuable resources for the scientific community as there is no other way of understanding how biological, social and environmental factors interact over time in a population to produce health outcomes. LPS become more valuable with time, and often develop in unexpected ways, yielding outcomes that could not have been predicted at the outset.
Machine Learning
Machine learning is a type of artificial intelligence that provides computer programs with the ability to automatically learn and improve from experience, without being explicitly programmed. It focuses on the development of computer algorithms that can access data and use it to learn for themselves.
Metadata
Descriptions and information about data, for example how many records, quality of certain details, or where further information can be found. Each dataset listed on the Gateway has metadata associated with it which can help users decide whether it would be of use to their work and whether they would be eligible to access the dataset.
National Core Studies
The National Core Studies (NCS) programme is enabling the UK to use health data and research to inform both our near and long-term responses to COVID-19, as well as accelerating progress to establish a world-leading health data and research infrastructure for the future.
- Epidemiology and Surveillance – collecting data to inform safe levels of restrictions and protection against imminent outbreaks (led by Ian Diamond, UK National Statistician, Office for National Statistics).
- Clinical Trials Infrastructure – building on established NIHR infrastructure (and equivalent in the devolved administrations) to accelerate delivery of large scale COVID trials for drugs and vaccines (led by Patrick Chinnery, Clinical Director, MRC and Divya Chadha Manek, Head of Business Development, Vaccines Task Force).
- Transmission and Environment (also known as PROTECT) – improving our understanding of how the virus is transmitted in different settings and environments, including in workplaces, on transport and in other public places (led by Andrew Curran, Chief Scientific Adviser, Health and Safety Executive).
- Immunity – understanding immunity against COVID to inform back-to-work policies (led by Paul Moss, Professor of Haematology, University of Birmingham).
- Longitudinal Health and Wellbeing – using data from longitudinal studies to address the impact of COVID-19 and other associated viral suppression measures on health and wealth to inform mitigating strategies (led by Nishi Chaturvedi, Professor of Clinical Epidemiology, University College London).
- Data and Connectivity – making data from all of the above studies (and wider) available and accessible to inform decision makers and catalyse COVID-19 research (led by Andrew Morris, Director, HDR UK).
National Data Guardian
The National Data Guardian (NDG) advises and challenges the health and care system to help ensure that citizens’ confidential information is safeguarded securely and used properly.
Office for National Statistics Framework
The Office for National Statistics (ONS) is committed to providing access to de-identified survey and administrative data for statistical research that delivers a public benefit for the UK. Access is granted to researchers in a safe and consistent way, in line with legislation and ONS policies. We believe that better statistics lead to better decisions and that the research community is pivotal to achieving our national goals. Their research helps the UK understand and make informed decisions regarding the economy, society and people’s quality of life.
This framework provides certainty and clarity on the governance arrangements adopted by the ONS for access to data for research purposes, including the conditions under which access to these data is granted and the framework for accrediting researchers, research projects and data processors. This will help to ensure a consistent and transparent approach for access to data for research purposes so that the benefits are more easily realised, and the confidentiality of the data is protected.
Patient Data
Data that is collected about a patient whenever they go to a doctor or receive social care. It may include details about the individual’s physical or mental health, such as height and weight or detail of any allergies, and their social care needs and services received. It may also include next of kin information.
Pseudonymisation
Pseudonymisation means the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject or individual without the use of additional information. This process de-identifies data and replaces personally identifiable information fields within the data record with artificial identifiers. For example, the NHS numbers of the individuals whose data is represented may be removed from a dataset, and the exact date of birth and home postcode may be replaced with the year of birth and a larger postal area.
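The Python sketch below gives a feel for how this can work. It swaps the NHS number for an artificial identifier created with a secret key (a keyed hash, which is one common technique and not necessarily the one used by any particular organisation), keeps only the year of birth, and keeps only the first part of the postcode. The record, the field names and the key are all made up; in practice the key is the "additional information" and is stored securely and separately from the data.

```python
import hashlib
import hmac

# Hypothetical secret key. In practice it would be held securely and
# separately from the pseudonymised data.
SECRET_KEY = b"example-key-held-separately"


def pseudonymise(record, key=SECRET_KEY):
    """Replace identifying fields with an artificial identifier and coarser values."""
    # A keyed hash gives the same artificial ID for the same NHS number,
    # but cannot be traced back to it without the key.
    digest = hmac.new(key, record["nhs_number"].encode(), hashlib.sha256)
    return {
        "pseudo_id": digest.hexdigest()[:16],
        # Keep only the year from an ISO date such as "1980-05-17".
        "year_of_birth": int(record["date_of_birth"][:4]),
        # Keep only the first part of the postcode, e.g. "LS1" from "LS1 4AP".
        "postcode_area": record["postcode"].split()[0],
        "diagnosis": record["diagnosis"],
    }


# A made-up patient record (not real data).
patient = {
    "nhs_number": "000 000 0000",
    "date_of_birth": "1980-05-17",
    "postcode": "LS1 4AP",
    "diagnosis": "asthma",
}

pseudonymised = pseudonymise(patient)
print(pseudonymised)
```

Because the same NHS number always produces the same artificial identifier, records belonging to the same person can still be matched up for research without revealing who that person is.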
Public Contributor (Lay Member)
The term public contributor (or lay member) is widely used to refer to a member of the public or a patient who is involved in research or activities with an organisation, for example by being part of a board or group, to provide their personal experiences and views and be a ‘critical friend’ for the organisation. Lay members will have an expert understanding of what matters most for people and help to bring that perspective.
Research Priorities – Better Care
The HDR UK Better Care programme aims to improve people’s lives by equipping clinicians and patients in the UK with the best possible data-based information to make decisions about their care. Our vision is that by 2030, patients across the UK will benefit from healthcare decisions informed by large scale data and advanced analytics to identify what will work best for them.
Research Priorities – Clinical Trials
Getting enough people for a clinical trial is critical, as there is a level below which a trial’s results are just not robust enough. These days, a large portion of NHS records are electronic, so we are able to search the database for all patients that could benefit from a particular trial, and source participants from the entire population. The HDR UK Clinical Trials programme uses health data to ensure that every individual across the UK has access to the latest treatments and technologies through access to clinical trials.
Clinical trials track the long-term health of participants by having consented access to participants’ NHS medical records. This allows for more efficient and longer-term follow-up processes. We want to provide a system that all clinical triallists can use to identify large numbers of the right trial participants. We also work on new approaches to analysing trial results.
Research Priorities – Improving Public Health
The causes of poor health can be unexpected, long or short term, as well as local or global. We don’t always have the right data to know what effect a behaviour or intervention might have on public health. In many cases, the data we need already exists in GP surgeries and hospitals; the challenge is to create the infrastructure to link it all together. The HDR UK Improving Public Health programme aims to develop the capability to identify the things that cause ill-health in people, as well as the factors that result in improved health – no matter where they live or what their socioeconomic background – across the whole population of the UK.
We will enable data science to transform public health research through linking to data beyond health care, for example, to other government sectors, organisations, and data on environments that influence our health.
Research Priorities – Understanding the Causes of Diseases
The HDR UK Understanding the Causes of Disease programme aims to help advance understanding of disease prediction, causation, and progression through the integration of molecular data and other intermediate phenotypes with routine clinical data. The overarching hypothesis is that insights into biology and disease aetiology (the causes of a disease or condition) can be revealed by integration of information, at scale, on genomics, other biomolecular traits, and high-resolution electronic health records (EHRs).
The ultimate vision is to create new informatics infrastructures and data science methods that help achieve a deep integration of biology, biomedicine, and population health science.
Trusted Research Environment (Data Safe Haven)
Trusted Research Environments, also known as ‘Data Safe Havens’, are highly secure spaces for researchers to access sensitive data. They are based on the idea that researchers should access and use data within a single secure environment. In other words: users go to the data, the data doesn’t travel to them. Trusted Research Environments have multiple layers of security and safeguards in place, designed to minimise the risk of anyone’s data being misused.
UK Health Data Research Alliance
HDR UK’s Alliance of leading health, care and research organisations united to establish best practice around the ethical use of UK health data for research and innovation at scale. Alliance members have listed their datasets on the Gateway.