Abstract
We propose a new formulation of the clustering problem that differs from previous work in several aspects. First, the goal is to explicitly output a collection of simple and meaningful conjunctive descriptions of the clusters. Second, the clusters might overlap, i.e., a point can belong to multiple clusters. Third, the clusters might not cover all points, i.e., not every point is clustered. Finally, we allow a point to be assigned to a conjunctive cluster description even if it does not completely satisfy all of the attributes, but rather only satisfies most. A convenient way to view our clustering problem is that of finding a collection of large bicliques in a bipartite graph. Identifying one largest conjunctive cluster is equivalent to finding a maximum edge biclique. Since this problem is NP-hard and there is evidence that it is difficult to approximate, we solve a relaxed version where the objective is to find a large subgraph that is close to being a biclique. We give a randomized algorithm that finds a relaxed biclique with almost as many edges as the maximum biclique. We then extend this algorithm to identify a good collection of large relaxed bicliques. A key property of these algorithms is that their running time is independent of the number of data points and linear in the number of attributes.
Original language | English |
---|---|
Pages (from-to) | 448-462 |
Number of pages | 15 |
Journal | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
Volume | 2777 |
DOIs | |
State | Published - 2003 |
Event | 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003 - Washington, DC, United States Duration: 24 Aug 2003 → 27 Aug 2003 |
Keywords
- Conceptual Clustering
- Max Edge Biclique
- Unsupervised Learning