Health Data Research UK (HDR UK) PhD students Fabian Falck (Oxford) and Haoting Zhang (Cambridge) have developed a new approach to identifying different patterns in complex data sets, which could be used to reveal insights in health data.

Fabian Falck

Consider, for example, a collection of images of hand-written digits. There are 10 possible single digits, the numbers 0 to 9, and we may classify each image into one of 10 groups, each group containing the images of the same digit. However, is this the only way of categorising these images? No. We could also have categorised the images based on how slanted the handwriting is, or on the thickness of the lines. These groupings could also be valid – it really depends on the question you are asking.
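The idea that one data set can support several valid groupings can be sketched with a toy example. The snippet below is only an illustration and is not the algorithm developed by Fabian and Haoting: it builds synthetic data with two independent, made-up properties (the names `shape_facet` and `style_facet` are hypothetical stand-ins for "which digit" and "how slanted"), then runs a very simple two-group clustering on each feature separately. Under these assumptions, each clustering recovers a different, equally valid grouping of the same samples.

```python
import numpy as np

def two_means_1d(values, n_iter=20):
    """Minimal 1-D k-means with k=2; centres start at the min and max."""
    centres = np.array([values.min(), values.max()], dtype=float)
    for _ in range(n_iter):
        # Assign each value to its nearest centre.
        labels = np.abs(values[:, None] - centres[None, :]).argmin(axis=1)
        # Move each centre to the mean of the values assigned to it.
        for k in range(2):
            if np.any(labels == k):
                centres[k] = values[labels == k].mean()
    return labels

rng = np.random.default_rng(0)
n = 200

# Two independent, hypothetical properties ("facets") of each sample.
shape_facet = rng.integers(0, 2, size=n)  # stand-in for "which digit"
style_facet = rng.integers(0, 2, size=n)  # stand-in for "upright vs slanted"

# Feature 0 depends only on shape, feature 1 only on style, plus small noise.
data = np.column_stack([
    10.0 * shape_facet + rng.normal(0.0, 0.5, n),
    10.0 * style_facet + rng.normal(0.0, 0.5, n),
])

# Clustering on different features gives different groupings of the same data.
by_shape = two_means_1d(data[:, 0])
by_style = two_means_1d(data[:, 1])

print("agreement with shape facet:", np.mean(by_shape == shape_facet))
print("agreement with style facet:", np.mean(by_style == style_facet))
print("agreement between the two groupings:", np.mean(by_shape == by_style))
```

Here the two groupings are found only because each feature was clustered by hand, one at a time. A method that looks at the data from a single perspective would return just one grouping, and which one would depend on its built-in assumptions.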

Haoting Zhang

After studying complex electronic health care record data, as part of their HDR UK COVID-19 short project with NHS Digital and the British Heart Foundation Data Science Centre, Fabian and Haoting realised similar issues could arise with health data. What if there are different ways of looking at health data?

However, traditional machine learning algorithms for tackling these problems are normally “single-faceted” – they look at data from only one perspective and give you a single set of groupings. The set identified will depend on the design assumptions that went into the algorithm. As a consequence, before they could consider health data, Fabian and Haoting needed to develop a “multi-faceted” algorithm that could capture multiple different perspectives of the same data set.

Working with researchers including HDR UK PhD Programme Director Christopher Yau and Professor Chris Holmes, Director of the Alan Turing Institute Health & Medical Sciences Programme, Fabian and Haoting devised a novel machine learning approach to this problem, called “multi-facet clustering”, which extends a class of algorithms known as “variational autoencoders”. Their algorithm automatically identifies multiple potential groupings of a data set, enabling the user to determine which might be relevant to their research question.

Professor Christopher Yau, University of Manchester and Health Data Research UK, said:

“Fabian and Haoting have produced a stunning piece of novel machine learning research that is now ready for submission to a major international machine learning conference. Their collaboration, stimulated by an HDR UK first-year short project but then driven independently by themselves, has been outstanding, particularly since it happened entirely during the pandemic lockdown. Future extensions of their fundamental methodological research could provide novel exploratory tools for investigating health data.”

A preprint of their work, “Multi-Facet Clustering Variational Autoencoders”, is now available on arXiv.

HDR UK Researchers Involved:

  • Fabian Falck, University of Oxford
  • Haoting Zhang, University of Cambridge
  • Christopher Yau, University of Manchester and Health Data Research UK
  • Chris Holmes, University of Oxford and the Alan Turing Institute