TY - JOUR
T1 - Confidence Intervals for Parameters of Unobserved Events
AU - Painsky, Amichai
N1 - Publisher Copyright: © 2024 The Author(s). Published with license by Taylor & Francis Group, LLC.
PY - 2025
Y1 - 2025
N2 - Consider a finite sample from an unknown distribution over a countable alphabet. Unobserved events are alphabet symbols which do not appear in the sample. Estimating the probabilities of unobserved events is a basic problem in statistics and related fields, which was extensively studied in the context of point estimation. In this work we introduce a novel interval estimation scheme for unobserved events. Our proposed framework applies selective inference, as we construct confidence intervals (CIs) for the desired set of parameters. Interestingly, we show that obtained CIs are dimension-free, as they do not grow with the alphabet size. Further, we show that these CIs are (almost) tight, in the sense that they cannot be further improved without violating the prescribed coverage rate. We demonstrate the performance of our proposed scheme in synthetic and real-world experiments, showing a significant improvement over the alternatives. Finally, we apply our proposed scheme to large alphabet modeling. We introduce a novel simultaneous CI scheme for large alphabet distributions which outperforms currently known methods while maintaining the prescribed coverage rate. Supplementary materials for this article are available online including a standardized description of the materials available for reproducing the work.
KW - Categorical data analysis
KW - Count data
KW - Large alphabet probability estimation
KW - Missing mass
KW - Rule-of-three
KW - Selective inference
UR - http://www.scopus.com/inward/record.url?scp=85185781255&partnerID=8YFLogxK
U2 - 10.1080/01621459.2024.2314318
DO - 10.1080/01621459.2024.2314318
M3 - Article
AN - SCOPUS:85185781255
SN - 0162-1459
VL - 120
SP - 226
EP - 236
JO - Journal of the American Statistical Association
JF - Journal of the American Statistical Association
IS - 549
ER -