Geometric component analysis and its applications to data analysis

Amit Bermanis, Moshe Salhov, Amir Averbuch*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Dimensionality reduction methods are designed to overcome the ‘curse of dimensionality’ phenomenon that makes the analysis of high-dimensional big data difficult. Many of these methods are based on principal component analysis (PCA), which is statistically driven and does not directly address the geometry of the data. Thus, machine learning tasks, such as classification and anomaly detection, may not benefit from a PCA-based methodology. This work provides a dictionary-based framework for geometrically driven data analysis, for both linear and non-linear (diffusion) geometries, that includes dimensionality reduction, out-of-sample extension and anomaly detection. This paper proposes the Geometric Component Analysis (GCA) methodology for dimensionality reduction of linear and non-linear data. The main algorithm greedily picks multidimensional data points that span linear subspaces in the ambient space containing as much information as possible from the original data. For non-linear data, this greedy approach is applied to the “diffusion kernel” commonly used in diffusion geometry; GCA-based diffusion maps are thus a direct application of the greedy algorithm to the kernel matrix constructed in diffusion maps. The algorithm greedily selects data points according to their distances from the subspace spanned by the previously selected data points, and stops when the distances of all remaining data points fall below a prespecified threshold. The extracted geometry of the data is preserved up to a user-defined distortion rate. In addition, the algorithm identifies a subset of landmark data points, known as a dictionary, that supports geometry-based dimensionality reduction. The performance of the method is demonstrated and evaluated on both synthetic and real-world data sets, where it achieves good results for unsupervised learning tasks. The proposed algorithm is attractive for its simplicity, low computational complexity and tractability.
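The greedy selection described above is essentially a pivoted Gram-Schmidt (incomplete pivoted QR) procedure. The following is a minimal sketch, not the authors' implementation: the function names `greedy_dictionary` and `gaussian_kernel` and the parameters `eps` and `sigma` are illustrative assumptions, and the kernel shown is a plain Gaussian affinity without the row normalization that full diffusion maps apply.

```python
import numpy as np

def greedy_dictionary(X, eps):
    """Greedy landmark selection on the rows of X (n samples x d features).

    At each step, pick the point farthest from the span of the points
    selected so far; stop once every remaining point lies within eps of
    that span. This mirrors an incomplete pivoted QR factorization.
    """
    residual = np.asarray(X, dtype=float).copy()   # components orthogonal to the current span
    dist = np.linalg.norm(residual, axis=1)        # distance of each point to that span
    selected = []
    while dist.max() > eps:
        j = int(np.argmax(dist))                   # farthest remaining point (pivot)
        selected.append(j)
        q = residual[j] / np.linalg.norm(residual[j])   # new orthonormal direction
        residual -= np.outer(residual @ q, q)           # project that direction out of all points
        dist = np.linalg.norm(residual, axis=1)
    return selected

def gaussian_kernel(X, sigma):
    """Gaussian affinity matrix; a simplified stand-in for a diffusion kernel."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))
```

For linear data one would call `greedy_dictionary(X, eps)` directly; for non-linear data the same selection can be run on the kernel matrix, e.g. `greedy_dictionary(gaussian_kernel(X, sigma), eps)`, which is the spirit of the GCA-based diffusion maps described in the abstract.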

Original language: English
Pages (from-to): 20-43
Number of pages: 24
Journal: Applied and Computational Harmonic Analysis
Volume: 54
DOIs
State: Published - Sep 2021

Keywords

  • Dictionary construction
  • Diffusion maps
  • Incomplete pivoted QR
  • Landmark data points
  • Linear and non-linear dimensionality reduction
