There was something very tangible about holding a set of Lloyd George notes when I first trained as a GP: the thickness of the envelope in your hand telling a story about the patient before they even walked through the door, their list of addresses from birth and their occupations scrawled in different hands across the front. The translation of that envelope of thin brown card into electronic records reflects the ever-evolving digital transformation of our modern lives.

In step with that transformation, I moved from clinical practice into clinical epidemiology, analysing the who, when and where of the distributions and determinants of mental health. The case definition was king. Surveys of well-defined populations created beautiful, clean datasets with data dictionaries, although I appreciate I am likely looking back with bunting-like nostalgia. But there were problems with these studies: often the people most likely to experience adverse outcomes are the least likely to participate. An hour-long interview or a questionnaire is often challenging for the anxious or unwell, and therefore avoided. In my field, where many people are socially excluded, this is particularly important. Many measures are self-reported and subject to bias; it is likely that answers to questions about issues stigmatised in society, like drinking, self-harming and smoking, are under-reported. Linking survey data to health data enhances it and allows for long-term follow-up.

And so, with enthusiasm, I embraced mental health data science, working initially with electronic health records: anonymised, privacy-protecting, securely held data collected routinely as health professionals and patients go about their days in general practices, emergency departments and hospitals. Then with linked health, education and social care data, exploring the wider determinants of health: blue and green spaces, deprivation, prescribed medication.

The majority of people participate in this type of research without burden; it is democratising on many levels, for patients and for different types of researchers.

Many researchers who for years had worked on the same problems in isolation, barely acknowledging each other, were brought together as their data linked in the virtual world: geneticists and social scientists, primary care and psychiatry. Suddenly public health, the whole population, primary care and inequalities (of deprivation, of outcomes, of contacts) have gained traction, because real-world demonstration of distributions and determinants is possible and speaks directly to clinicians and policy makers.

But with scale come problems. Multitudes of data inputters, focussed on clinical care and far removed from the research uses of the data, have made case definitions the stuff of nightmares. The Read and ICD-10 codes used to define 'cases' are varied, and their clinical meaning shifts with clinician and setting. Each nation in the UK does it a little differently. Many researchers create bespoke code lists and algorithms, each study measuring something slightly different from the next.

Garbage in, garbage out unless we proceed with care.
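The effect of divergent code lists can be sketched in a few lines. This is a purely illustrative toy, not a real code list: the code strings, patients and counts below are invented, but the mechanism is the one described above, namely that a narrow and a broad bespoke definition applied to the same records yield different case counts.

```python
# Hypothetical clinical codes recorded for four patients (Read-code-style
# strings, invented for illustration).
records = {
    "patient_1": {"E112."},   # diagnosis code: depressive episode
    "patient_2": {"Eu32."},   # diagnosis code: depressive disorder
    "patient_3": {"E2B.."},   # mixed anxiety and depression
    "patient_4": {"1B17."},   # symptom code: low mood
}

# Two bespoke code lists, as two research teams might draw them up.
study_a_codes = {"E112.", "Eu32."}                    # narrow: diagnoses only
study_b_codes = {"E112.", "Eu32.", "E2B..", "1B17."}  # broad: symptoms too

def count_cases(records, code_list):
    """Count patients with at least one code from the given list."""
    return sum(1 for codes in records.values() if codes & code_list)

print(count_cases(records, study_a_codes))  # narrow definition: 2 cases
print(count_cases(records, study_b_codes))  # broad definition: 4 cases
```

The same records, counted twice, give two different prevalences: exactly why harmonised, shared code lists matter.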

Computer science offers ways of analysing the sheer volume of data available, but using machine learning and AI to 'predict' health may reproduce the inherent biases in who accesses care and whose needs are recognised and recorded. We can address these issues with the same attention to study design as we've always employed.

Often we are creating health technologies, and these need to be trialled in the same way as we would trial any intervention affecting patient care. Data harmonisation is key, as is the involvement of patients. In response, academia must shift from competition to open science, co-operation and collaboration across disciplines, and recognise patients' voices in the use of their data. It's happening, supported by initiatives like the UK Health Data Research Alliance. Positively data driven.