TY - JOUR
T1 - Testing Dependency of Unlabeled Databases
AU - Paslev, Vered
AU - Huleihel, Wasim
N1 - Publisher Copyright: © 1963-2012 IEEE.
PY - 2024
Y1 - 2024
N2 - In this paper, we investigate the problem of deciding whether two random databases $X \in \mathcal{X}^{n \times d}$ and $Y \in \mathcal{Y}^{n \times d}$ are statistically dependent or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these two databases are statistically independent, while under the alternative, there exists an unknown row permutation $\sigma$, such that $X$ and $Y^{\sigma}$, a permuted version of $Y$, are statistically dependent with some known joint distribution, but have the same marginal distributions as the null. We characterize the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$, $d$, and some spectral properties of the generative distributions of the datasets. For example, we prove that if a certain function of the eigenvalues of the likelihood function and $d$, is below a certain threshold, as $d \to \infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, no matter what the value of $n$ is. This mimics the performance of an efficient test that thresholds a centered version of the log-likelihood function of the observed matrices. We also analyze the case where $d$ is fixed, for which we derive strong (vanishing error) and weak detection lower and upper bounds.
AB - In this paper, we investigate the problem of deciding whether two random databases $X \in \mathcal{X}^{n \times d}$ and $Y \in \mathcal{Y}^{n \times d}$ are statistically dependent or not. This is formulated as a hypothesis testing problem, where under the null hypothesis, these two databases are statistically independent, while under the alternative, there exists an unknown row permutation $\sigma$, such that $X$ and $Y^{\sigma}$, a permuted version of $Y$, are statistically dependent with some known joint distribution, but have the same marginal distributions as the null. We characterize the thresholds at which optimal testing is information-theoretically impossible and possible, as a function of $n$, $d$, and some spectral properties of the generative distributions of the datasets. For example, we prove that if a certain function of the eigenvalues of the likelihood function and $d$, is below a certain threshold, as $d \to \infty$, then weak detection (performing slightly better than random guessing) is statistically impossible, no matter what the value of $n$ is. This mimics the performance of an efficient test that thresholds a centered version of the log-likelihood function of the observed matrices. We also analyze the case where $d$ is fixed, for which we derive strong (vanishing error) and weak detection lower and upper bounds.
KW - Detection
KW - algorithms
KW - error analysis
KW - hypothesis testing
UR - http://www.scopus.com/inward/record.url?scp=85201289569&partnerID=8YFLogxK
U2 - 10.1109/TIT.2024.3442977
DO - 10.1109/TIT.2024.3442977
M3 - Article
AN - SCOPUS:85201289569
SN - 0018-9448
VL - 70
SP - 7410
EP - 7431
JO - IEEE Transactions on Information Theory
JF - IEEE Transactions on Information Theory
IS - 10
ER -