Short-Term Prediction Model for Breast Cancer Risk Based on One Million Medical Records

Ofer Feinstein, Dan Ofer, Eitan Bachmat, Sivan Gazit, Michal Linial, Tehillah S. Menes*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Despite progress in breast cancer screening, many women are diagnosed at an advanced stage. We sought to develop a short-term (one-year) prediction model for breast cancer risk, based on readily available data from electronic medical records (EMRs), to support decision-making.

Methods: We conducted a retrospective cohort study using data from 1,039,212 members of a large healthcare organization between 1985 and 2021. During the study years, 18,959 people were diagnosed with breast cancer. Longitudinal personal medical information, such as demographics, cancer-related family history, smoking habits, medical history, fertility treatments, surgeries, biopsies, medications, BMI, blood pressure, and lab tests, was used to predict the outcome: breast cancer diagnosis one year from the recorded data. Prediction models were trained using the CatBoost decision tree methodology. SHapley Additive exPlanations (SHAP) values were used to estimate the marginal impact of a feature on the model performance, considering the other features.

Results: The model includes numerous features available from the EMR that are not utilized in existing breast cancer risk models (e.g., medications, systolic blood pressure, and TSH levels). The informative features, ranked by SHAP values, include age, the number of surgical consultations, and the number of breast biopsies. The model achieved high performance, with an area under the ROC curve (AUC-ROC) of 0.85.

Conclusions: Data readily available from the EMR can assist clinicians in assessing short-term breast cancer risk.
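To make the reported AUC-ROC concrete: it can be read as the probability that a randomly chosen person who was diagnosed receives a higher risk score than a randomly chosen person who was not. The following is a minimal sketch of that definition only, not the authors' code; the function name `auc_roc` and the toy labels and scores are hypothetical and carry no data from the study.

```python
def auc_roc(labels, scores):
    """AUC-ROC via its pairwise definition.

    Returns the fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied scores count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy example: 1 = diagnosed within the window, 0 = not diagnosed.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)  # 5 of 6 pairs ranked correctly
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; on a cohort of the size described here, a rank-based computation would be the practical choice.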

Original language: English
Journal: Clinical Breast Cancer
State: Accepted/In press - 2025

Funding

Funders                                Funder number
Ministry of Health, State of Israel
Applebaum Foundation                   3035000440

Keywords

• Big data
• CatBoost
• Machine learning
• Risk model
