Collecting data when missingness is unknown: a method for improving model performance given under-reporting in patient populations

Kevin Wu, Dominik Dahlem, Christopher Hane, Eran Halperin, James Zou

Research output: Contribution to journal > Conference article > peer-review

1 Scopus citation

Abstract

Machine learning models for healthcare commonly use binary indicator variables to represent the diagnosis of specific health conditions in medical records. However, in populations with significant under-reporting, the absence of a recorded diagnosis does not rule out the presence of a condition, making it difficult to distinguish between negative and missing values. This effect, which we refer to as latent missingness, may lead to model degradation and perpetuate existing biases in healthcare. To address this issue, we propose that healthcare providers and payers allocate a budget towards data collection (e.g., subsidies for check-ups or lab tests). However, given finite resources, only a subset of data points can be collected. Additionally, most models cannot be retrained after deployment. In this paper, we propose a method for efficient data collection in order to maximize a fixed model’s performance on a given population. Through simulated and real-world data, we demonstrate the potential value of targeted data collection to address model degradation.
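
As an illustration of the latent missingness described in the abstract, the following is a minimal sketch, not the authors' method. It simulates binary diagnosis indicators in which a true positive is only sometimes recorded, then reveals the true values for a budget-limited subset of patients. All function names, the `report_rate` parameter, and the uniform random selection rule are assumptions introduced here for illustration; the paper's targeted selection strategy is not reproduced.

```python
import random


def simulate_records(n, prevalence, report_rate, seed=0):
    """Return (true_labels, recorded_labels) for n hypothetical patients.

    A true positive is recorded only with probability report_rate, so a
    recorded 0 is either a true negative or an unreported positive.
    """
    rng = random.Random(seed)
    true_labels = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    recorded = [t if rng.random() < report_rate else 0 for t in true_labels]
    return true_labels, recorded


def hidden_positives(true_labels, recorded):
    """Count patients who have the condition but no recorded diagnosis."""
    return sum(1 for t, r in zip(true_labels, recorded) if t == 1 and r == 0)


def random_collection(recorded, budget, seed=0):
    """Pick up to `budget` patients with a recorded 0, uniformly at random.

    Assumption: this is a naive baseline, not a targeted strategy.
    """
    rng = random.Random(seed)
    candidates = [i for i, r in enumerate(recorded) if r == 0]
    return rng.sample(candidates, min(budget, len(candidates)))


def collect(true_labels, recorded, chosen):
    """Simulate paying to observe the true value for the chosen patients."""
    updated = list(recorded)
    for i in chosen:
        updated[i] = true_labels[i]
    return updated


if __name__ == "__main__":
    true_labels, recorded = simulate_records(1000, prevalence=0.3, report_rate=0.5)
    chosen = random_collection(recorded, budget=100)
    updated = collect(true_labels, recorded, chosen)
    print("hidden positives before:", hidden_positives(true_labels, recorded))
    print("hidden positives after: ", hidden_positives(true_labels, updated))
```

In this sketch the recorded indicators alone cannot separate true negatives from unreported positives; only the collected subset resolves the ambiguity, which is why the choice of whom to collect from matters under a fixed budget.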

Original language: English
Pages (from-to): 229-243
Number of pages: 15
Journal: Proceedings of Machine Learning Research
Volume: 209
State: Published - 2023
Externally published: Yes
Event: 2nd Conference on Health, Inference, and Learning, CHIL 2023 - Cambridge, United States
Duration: 22 Jun 2023 to 24 Jun 2023
