SpectralCAT: Categorical spectral clustering of numerical and nominal data

Gil David*, Amir Averbuch

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

55 Scopus citations

Abstract

Data clustering is a common technique for data analysis that is used in many fields, including machine learning, data mining, customer segmentation, trend analysis, pattern recognition and image analysis. Although many clustering algorithms have been proposed, most of them deal with clustering of one data type (numerical or nominal) or with mixed data types (numerical and nominal), and only a few of them provide a generic method that clusters all types of data. Most real-world applications require handling both feature types and their mix. In this paper, we propose an automated technique, called SpectralCAT, for unsupervised clustering of high-dimensional data that contains numerical attributes, nominal attributes or a mix of both. We suggest automatically transforming the high-dimensional input data into categorical values. This is done by discovering the optimal transformation according to the Calinski-Harabasz index for each feature and attribute in the dataset. Then, a method for spectral clustering via dimensionality reduction of the transformed data is applied. This is achieved by automatic non-linear transformations, which identify geometric patterns in the data and find the connections among them while projecting them onto low-dimensional spaces. We compare our method to several clustering algorithms using 16 public datasets from different domains and types. The experiments demonstrate that our method outperforms these algorithms in most cases.
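
The first step described in the abstract, turning a numerical feature into categorical values by choosing the transformation that maximizes the Calinski-Harabasz index, might be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify how candidate transformations are generated, so the sketch assumes a plain 1-D k-means as a stand-in, and the names `kmeans_1d`, `calinski_harabasz_1d`, `categorize_feature` and the parameter `k_max` are hypothetical. Nominal features are already categorical, and the subsequent spectral (Diffusion Maps) step is not shown.

```python
from statistics import mean


def kmeans_1d(values, k, iters=50):
    """Plain 1-D Lloyd's k-means with deterministic initial centers (assumed stand-in)."""
    ordered = sorted(values)
    n = len(ordered)
    # Initial centers: evenly spaced order statistics (an illustrative choice).
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Recompute each center as the mean of its members (keep it if empty).
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def calinski_harabasz_1d(values, labels):
    """Calinski-Harabasz index: (between / (k - 1)) / (within / (n - k))."""
    n = len(values)
    used = sorted(set(labels))
    k = len(used)
    if k < 2 or k >= n:
        return float("-inf")
    overall = mean(values)
    between = within = 0.0
    for j in used:
        members = [v for v, lab in zip(values, labels) if lab == j]
        center = mean(members)
        between += len(members) * (center - overall) ** 2
        within += sum((v - center) ** 2 for v in members)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def categorize_feature(values, k_max=6):
    """Return category labels for one numerical feature, using the k that scores best."""
    best_score, best_labels = float("-inf"), [0] * len(values)
    for k in range(2, min(k_max, len(values) - 1) + 1):
        labels = kmeans_1d(values, k)
        score = calinski_harabasz_1d(values, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels
```

For a feature such as `[0.9, 1.0, 1.1, 4.9, 5.0, 5.1, 8.9, 9.0, 9.1]`, `categorize_feature` selects three categories, one per group of nearby values. Applying it to every numerical column would yield an all-categorical dataset of the kind the abstract passes to the spectral clustering step.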

Original language: English
Pages (from-to): 416-433
Number of pages: 18
Journal: Pattern Recognition
Volume: 45
Issue number: 1
State: Published - Jan 2012

Keywords

  • Categorical data clustering
  • Diffusion Maps
  • Dimensionality reduction
  • Spectral clustering
