TY - JOUR
T1 - Collecting data when missingness is unknown
T2 - 2nd Conference on Health, Inference, and Learning, CHIL 2023
AU - Wu, Kevin
AU - Dahlem, Dominik
AU - Hane, Christopher
AU - Halperin, Eran
AU - Zou, James
N1 - Publisher Copyright: © 2023 K. Wu, D. Dahlem, C. Hane, E. Halperin & J. Zou.
PY - 2023
Y1 - 2023
N2 - Machine learning models for healthcare commonly use binary indicator variables to represent the diagnosis of specific health conditions in medical records. However, in populations with significant under-reporting, the absence of a recorded diagnosis does not rule out the presence of a condition, making it difficult to distinguish between negative and missing values. This effect, which we refer to as latent missingness, may lead to model degradation and perpetuate existing biases in healthcare. To address this issue, we propose that healthcare providers and payers allocate a budget towards data collection (e.g., subsidies for check-ups or lab tests). However, given finite resources, only a subset of data points can be collected. Additionally, most models cannot be retrained after deployment. In this paper, we propose a method for efficient data collection in order to maximize a fixed model’s performance on a given population. Through simulated and real-world data, we demonstrate the potential value of targeted data collection to address model degradation.
UR - http://www.scopus.com/inward/record.url?scp=85173571987&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85173571987
SN - 2640-3498
VL - 209
SP - 229
EP - 243
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 22 June 2023 through 24 June 2023
ER -