TY - JOUR
T1 - CM-tree
T2 - A dynamic clustered index for similarity search in metric databases
AU - Aronovich, Lior
AU - Spiegler, Israel
PY - 2007/12
Y1 - 2007/12
N2 - Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.
AB - Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.
KW - Clustering methods
KW - Database indexing
KW - Metric access methods
KW - Metric spaces
KW - Similarity search
UR - http://www.scopus.com/inward/record.url?scp=34548784401&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2007.06.001
DO - 10.1016/j.datak.2007.06.001
M3 - Article
AN - SCOPUS:34548784401
SN - 0169-023X
VL - 63
SP - 919
EP - 946
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
IS - 3
ER -