TY - GEN
T1 - Clustering small samples with quality guarantees
T2 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
AU - Cohen, Edith
AU - Chechik, Shiri
AU - Kaplan, Haim
N1 - Publisher Copyright:
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2018
Y1 - 2018
AB - Clustering of data points is a fundamental tool in data analysis. We consider points X in a relaxed metric space, where the triangle inequality holds within a constant factor. A clustering of X is a partition of X defined by a set of points Q (centroids), according to the closest centroid. The cost of clustering X by Q is V(Q) = Σ_{x ∈ X} d_{xQ}, the sum over points of the distance to the closest centroid. This formulation generalizes classic k-means clustering, which uses squared distances. Two basic tasks, parametrized by k ≥ 1, are cost estimation, which returns (approximate) V(Q) for queries Q such that |Q| = k, and clustering, which returns an (approximate) minimizer of V(Q) of size |Q| = k. When the data set X is very large, we seek efficient constructions of small samples that can act as surrogates for performing these tasks. Existing constructions that provide quality guarantees, however, are either worst-case, and unable to benefit from structure of real data sets, or make explicit strong assumptions on the structure. We show here how to avoid both these pitfalls using adaptive designs. The core of our design is the novel one2all probabilities, computed for a set M of centroids and α ≥ 1: the clustering cost of each Q with cost V(Q) ≥ V(M)/α can be estimated well from a sample of size O(α|M|ε^{-2}). For cost estimation, we apply one2all with a bicriteria approximate M, while adaptively balancing |M| and α to optimize sample size per quality. For clustering, we present a wrapper that adaptively applies a base clustering algorithm to a sample S, using the smallest sample that provides the desired statistical guarantees on quality. We demonstrate experimentally the huge gains of using our adaptive instead of worst-case methods.
UR - http://www.scopus.com/inward/record.url?scp=85060497358&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85060497358
T3 - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
SP - 2884
EP - 2891
BT - 32nd AAAI Conference on Artificial Intelligence, AAAI 2018
PB - AAAI Press
Y2 - 2 February 2018 through 7 February 2018
ER -