TY - JOUR
T1 - Semi-supervised empirical risk minimization
T2 - Using unlabeled data to improve prediction
AU - Yuval, Oren
AU - Rosset, Saharon
N1 - Publisher Copyright: © 2022, Institute of Mathematical Statistics. All rights reserved.
PY - 2022
Y1 - 2022
N2 - We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner based on estimation of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions.
AB - We present a general methodology for using unlabeled data to design semi-supervised learning (SSL) variants of the Empirical Risk Minimization (ERM) learning process. Focusing on generalized linear regression, we analyze the effectiveness of our SSL approach in improving prediction performance. The key ideas are carefully considering the null model as a competitor, and utilizing the unlabeled data to determine signal-noise combinations where SSL outperforms both supervised learning and the null model. We then use SSL in an adaptive manner based on estimation of the signal and noise. In the special case of linear regression with Gaussian covariates, we prove that the non-adaptive SSL version is in fact not capable of improving on both the supervised estimator and the null model simultaneously, beyond a negligible O(1/n) term. On the other hand, the adaptive model presented in this work can achieve a substantial improvement over both competitors simultaneously, under a variety of settings. This is shown empirically through extensive simulations, and extended to other scenarios, such as non-Gaussian covariates, misspecified linear regression, or generalized linear regression with non-linear link functions.
KW - Generalized linear model
KW - Predictive modeling
KW - Semi-supervised regression
UR - http://www.scopus.com/inward/record.url?scp=85128469319&partnerID=8YFLogxK
U2 - 10.1214/22-EJS1985
DO - 10.1214/22-EJS1985
M3 - Article
AN - SCOPUS:85128469319
SN - 1935-7524
VL - 16
SP - 1434
EP - 1460
JO - Electronic Journal of Statistics
JF - Electronic Journal of Statistics
ER -