Reproducible Machine Learning in Health Data Science

HDR UK Site

Multiple

Project Description

Four key challenges will be investigated to improve the trustworthiness of machine learning in medicine:

How should researchers report machine learning in health data science? Here we will work with international experts to develop a set of guidelines to help healthcare professionals report how they actually use machine learning in decision making.
Can synthetic datasets be used to evaluate the stability of machine learning models in health data science? Here we will develop methods to generate synthetic, or mock, time-series healthcare datasets.
What are the minimal requirements for reproducibility in restricted ‘safe-haven’ environments? Here methods will be developed to support risk prediction in restricted NHS Scotland ‘safe-haven’ environments.
Can we strengthen the culture of reproducible machine learning within HDR UK? To help start a culture of reproducible machine learning in UK health data science, we will train the next generation of researchers and create awards for good practice in reproducibility.

Project Milestones

Work package 1 – reporting (Gary Collins, Oxford)

Original aim: Our first milestone will be to publish a draft framework on a preprint server such as medRxiv at month 12 of the project. This will be updated based on relevant changes in direction to HDR UK priorities, or new insights gained after its deployment in use-case scenarios for work packages 2 and 3.

Current progress: The TRIPOD-AI protocol was recently published in BMJ Open¹. An additional article was recently published in the Journal of Clinical Oncology² on the critical need for better reporting of machine learning methods in health data science for risk prediction. The Delphi process to create the new guidance is well underway (despite initial delays due to COVID-19).

Work package 2 – synthetic data generation to evaluate stability of machine learning models (Aiden Doherty, Oxford)

Original aim: We anticipate pre-registering this research project by month 9 of the project. A manuscript will then be submitted to a preprint server such as medRxiv, by month 24, to report the utility and pitfalls of synthetic data generation models to support reproducible machine learning in wearable sensor data and electronic health records.

Current progress: An article has been accepted for publication in the British Journal of Sports Medicine³ showing that robust evaluation of machine learning methods in wearables can provide new insights into physical activity, sleep, and their association with cardiovascular disease. This follows on from a publication earlier this year in PLOS Medicine⁴ showing that physical activity is much more important than previously thought for the prevention of cardiovascular disease (using reproducible analysis methods for time-series wearable datasets). We have driven new HDR UK activities around synthetic data generation (including a workshop with ~160 attendees) and will continue to support new activities in this space.

Work package 3 – restricted safe-haven environments (Catalina Vallejos, Edinburgh)

Original aim: A manuscript will be submitted to a preprint server such as medRxiv, by month 28, to report on the utility of continuous analysis in restricted data safe-haven environments and on general guidelines for reproducibility within such environments.

Current progress: A preprint⁵ has been released describing machine learning risk prediction methods developed for the Scottish Patients At Risk of admission and Re-Admission (SPARRA) project, a collaboration between the Alan Turing Institute and Public Health Scotland, as an exemplar for reproducibility practices within restricted data safe-haven (DSH) environments. The preprint conforms to the TRIPOD guidelines. An article has also been published at the AISTATS 2020 conference⁶ to illustrate the challenges that arise when updating risk prediction models that are already widely deployed in clinical practice. We are continuing to collaborate with NHS Scotland to develop tools to enhance reproducibility in these environments.

Work package 4 – training (Sebastian Vollmer, Alan Turing Institute)

Original aim: We aim to have the reproducibility ambassador programme in place and host the first training event by 12 months. Further training events will take place around other HDR UK substantial sites, with material refined as more knowledge accrues through this project. Online training material will be made available along with the second workshop.

Current progress: A training workshop on reproducible machine learning of time-series healthcare data was successfully run for 15 PhD students at Oxford in March. We plan to include this in the next HDR UK summer school (which was due to take place in September but has been postponed). Reproducibility awards were prominent during HDR UK’s flagship annual conference. We are currently working with DNAnexus to create a hands-on tutorial on reproducible machine learning of wearables data, which will be made available to the 20,000 registered research users of the UK Biobank resource.

Work package 5 – PPIE (Sophie Staniszewska, Warwick)

Original aim: We will establish a PPI Reference Group of 8-12 people with different areas of expertise and knowledge related to health data, including patients with conditions of relevance as well as public members who provide a broader societal view (aim 1), drawing on principles and values of co-production (aim 2).

Current progress: After substantial interest (70 applicants), we now have a PPIE reference group of ~15 people. The group has met three times, and a regular bi-monthly meeting is now in place.

Project Team / Collaborators

Aiden Doherty, Chris Holmes, Martin Landray, Shing Chan, Arabella Pratt, James Liley, Gary Collins, Catalina Vallejos, Sebastian Vollmer, Emanuele Di Angelantonio, Louis Aslett, Alastair Denniston, Verena Heise, Bilal Mateen, James Rudd, Sophie Staniszewska, Thanasis Tsanas, Kirstie Whittaker, Gabriella Rustici, John Danesh, Harry Hemingway, Tim Hubbard, Cathie Sudlow

Keywords

Machine learning, risk prediction, reporting, synthetic data, training

Areas of work

Science

Scientific priorities

Applied Analytics

Health priorities

Artificial Intelligence
