TY - JOUR

T1 - Concentration bounds for unigram language models

AU - Drukh, Evgeny

AU - Mansour, Yishay

PY - 2005

Y1 - 2005

N2 - We show several high-probability concentration bounds for learning unigram language models. One interesting quantity is the probability of all words appearing exactly k times in a sample of size m. A standard estimator for this quantity is the Good-Turing estimator. The existing analysis of its error shows a high-probability bound of approximately O(k/√m). We improve its dependency on k to O(k^(1/4)/√m + k/m). We also analyze the empirical frequencies estimator, showing that with high probability its error is bounded by approximately O(1/k + √k/m). We derive a combined estimator, which has an error of approximately O(m^(-2/5)), for any k. A standard measure for the quality of a learning algorithm is its expected per-word log-loss. The leave-one-out method can be used for estimating the log-loss of the unigram model. We show that its error has a high-probability bound of approximately O(1/√m), for any underlying distribution. We also bound the log-loss a priori, as a function of various parameters of the distribution.

KW - Chernoff bounds

KW - Good-Turing estimators

KW - Leave-one-out estimation

KW - Logarithmic loss

UR - http://www.scopus.com/inward/record.url?scp=23744441936&partnerID=8YFLogxK

M3 - Article

AN - SCOPUS:23744441936

SN - 1533-7928

VL - 6

JO - Journal of Machine Learning Research

JF - Journal of Machine Learning Research

ER -