TY - GEN

T1 - A Data-driven Missing Mass Estimation Framework

AU - Painsky, Amichai

N1 - Publisher Copyright:
© 2022 IEEE.

PY - 2022

Y1 - 2022

N2 - Consider a finite sample from an unknown distribution over a countable alphabet. The missing mass refers to the probability of symbols that do not appear in the sample. Missing mass estimation is a fundamental problem in statistics, information theory and related fields, which dates back to the early work of Laplace, and the more recent seminal contribution of Good and Turing. Most popular missing mass estimation schemes are universal, in the sense that they perform well for every possible distribution. Interestingly, the worst-case distribution, for which these schemes perform the worst, is known to be uniform. On the other hand, real-world distributions are typically heavy-tailed. This means that current frameworks may be over-pessimistic in many cases of interest. In this work we suggest a data-dependent estimation scheme to address this caveat. Specifically, we infer a subset of distributions from the sample, and control the worst-case performance only over that subset. Our suggested scheme demonstrates improved performance guarantees compared to alternative methods.

AB - Consider a finite sample from an unknown distribution over a countable alphabet. The missing mass refers to the probability of symbols that do not appear in the sample. Missing mass estimation is a fundamental problem in statistics, information theory and related fields, which dates back to the early work of Laplace, and the more recent seminal contribution of Good and Turing. Most popular missing mass estimation schemes are universal, in the sense that they perform well for every possible distribution. Interestingly, the worst-case distribution, for which these schemes perform the worst, is known to be uniform. On the other hand, real-world distributions are typically heavy-tailed. This means that current frameworks may be over-pessimistic in many cases of interest. In this work we suggest a data-dependent estimation scheme to address this caveat. Specifically, we infer a subset of distributions from the sample, and control the worst-case performance only over that subset. Our suggested scheme demonstrates improved performance guarantees compared to alternative methods.

UR - http://www.scopus.com/inward/record.url?scp=85136296072&partnerID=8YFLogxK

U2 - 10.1109/ISIT50566.2022.9834566

DO - 10.1109/ISIT50566.2022.9834566

M3 - Conference contribution

AN - SCOPUS:85136296072

T3 - IEEE International Symposium on Information Theory - Proceedings

SP - 2991

EP - 2995

BT - 2022 IEEE International Symposium on Information Theory, ISIT 2022

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 26 June 2022 through 1 July 2022

ER -