T. Katrien J. Groenhof, Laurien R. Koers, Enja Blasse, Mark de Groot, Diederick E. Grobbee, Michiel L. Bots, Folkert W. Asselbergs, A. Titia Lely, Saskia Haitjema

Journal of Clinical Epidemiology 118 (2020) 100-106

Abstract

Objectives Researchers are increasingly using routine clinical data for care evaluations and feedback to patients and clinicians. The quality of these evaluations depends on the quality and completeness of the input data.

Study Design and Setting We assessed the performance of an electronic health record (EHR)-based data mining algorithm, using smoking status in a cardiovascular population as an example. As a reference standard, we used the questionnaire from the Utrecht Cardiovascular Cohort (UCC). To assess diagnostic accuracy, we calculated sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV).
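
The four accuracy measures named above all follow from a 2x2 table that cross-tabulates the data mining result against the questionnaire. The sketch below is an illustration only, not the authors' code: the `ConfusionCounts` class, the `diagnostic_accuracy` function, and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConfusionCounts:
    """2x2 table: index test (data mining) versus reference (questionnaire)."""
    tp: int  # mined as current smoker, questionnaire says current smoker
    fp: int  # mined as current smoker, questionnaire says non-smoker
    fn: int  # mined as non-smoker, questionnaire says current smoker
    tn: int  # mined as non-smoker, questionnaire says non-smoker


def diagnostic_accuracy(c: ConfusionCounts) -> Dict[str, float]:
    """Return sensitivity, specificity, PPV and NPV as proportions."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),  # smokers correctly identified
        "specificity": c.tn / (c.tn + c.fp),  # non-smokers correctly identified
        "ppv": c.tp / (c.tp + c.fp),          # mined smokers who truly smoke
        "npv": c.tn / (c.tn + c.fn),          # mined non-smokers who truly do not
    }


# Hypothetical counts for illustration only; these are NOT the study's data.
example = ConfusionCounts(tp=80, fp=20, fn=20, tn=880)
metrics = diagnostic_accuracy(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and specificity are conditioned on the reference standard, whereas PPV and NPV are conditioned on the data mining result, so the latter two shift with the proportion of smokers in the population.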

Results We analyzed 1,661 patients included in the UCC up to January 18, 2019. Of those, 14% (n = 238) had missing information on smoking status in the UCC questionnaire. Data mining provided information on smoking status for 99% of the 1,661 participants. For current smoking, data mining had a sensitivity of 88%, a specificity of 92%, an NPV of 98%, and a PPV of 63%. Of the false positives, 85% reported that they had quit smoking by the time of the UCC questionnaire.
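
The gap between the high NPV and the modest PPV is what one would expect when current smokers form a minority of the population. The sketch below applies Bayes' rule to the reported sensitivity and specificity. The function name `predictive_values` and the prevalence values are illustrative assumptions, since the abstract does not report the prevalence of current smoking. An assumed prevalence of about 13% gives values close to the reported PPV and NPV, but that figure is an inference from rounded numbers, not a study result.

```python
from typing import Tuple


def predictive_values(
    sensitivity: float, specificity: float, prevalence: float
) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence              # true positive fraction
    fn = (1 - sensitivity) * prevalence        # false negative fraction
    tn = specificity * (1 - prevalence)        # true negative fraction
    fp = (1 - specificity) * (1 - prevalence)  # false positive fraction
    return tp / (tp + fp), tn / (tn + fn)


# Sensitivity and specificity as reported in the abstract;
# the prevalence values are assumptions chosen for illustration.
for prevalence in (0.05, 0.13, 0.30):
    ppv, npv = predictive_values(0.88, 0.92, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.2f}  NPV={npv:.2f}")
```

With sensitivity and specificity held fixed, PPV rises and NPV falls as prevalence increases, which is one reason the implications of misclassification depend on the population in which the algorithm is applied.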

Conclusion Data mining showed great potential for retrieving information on smoking, with a near-complete yield (99%). Its diagnostic performance is good for negative smoking statuses (NPV 98%). The implications of misclassification with data mining depend on how the data are applied.

What is new?

Key findings Data mining showed great potential in retrieving information on smoking from the electronic health record (EHR). Its diagnostic performance is good for negative smoking statuses.

What this adds to what is known? Via data mining, we can successfully extract information from both structured and unstructured fields in the EHR for scientific evaluations. Data quality evaluation, comparing the EHR information to a reference standard, should be part of the mining process.

What is the implication and what should change now? If EHR-based data mining algorithms are used to retrieve information for care or scientific purposes, the effect of time and clinical practice on the outcome and the implications of misclassification need to be taken into account.