TY - JOUR

T1 - Geometric component analysis and its applications to data analysis

AU - Bermanis, Amit

AU - Salhov, Moshe

AU - Averbuch, Amir

N1 - Publisher Copyright:
© 2021 Elsevier Inc.

PY - 2021/9

Y1 - 2021/9

N2 - Dimensionality reduction methods are designed to overcome the ‘curse of dimensionality’, which makes the analysis of high-dimensional big data difficult. Many of these methods are based on principal component analysis (PCA), which is statistically driven and does not directly address the geometry of the data. Thus, machine learning tasks such as classification and anomaly detection may not benefit from a PCA-based methodology. This work provides a dictionary-based framework for geometrically driven data analysis, for both linear and non-linear (diffusion) geometries, that includes dimensionality reduction, out-of-sample extension and anomaly detection. The paper proposes the Geometric Component Analysis (GCA) methodology for dimensionality reduction of linear and non-linear data. The main algorithm greedily picks multidimensional data points that span linear subspaces in the ambient space containing as much information as possible from the original data. For non-linear data, this greedy approach is applied to the “diffusion kernel” commonly used in diffusion geometry; GCA-based diffusion maps amount to a direct application of the greedy algorithm to the kernel matrix constructed in diffusion maps. The algorithm greedily selects data points according to their distances from the subspace spanned by the previously selected points, and stops when the distances of all remaining points fall below a prespecified threshold. The extracted geometry of the data is thereby preserved up to a user-defined distortion rate. In addition, the algorithm identifies a subset of landmark data points, known as a dictionary, for geometry-based dimensionality reduction. The performance of the method is demonstrated and evaluated on both synthetic and real-world data sets, achieving good results for unsupervised learning tasks. The proposed algorithm is attractive for its simplicity, low computational complexity and tractability.

AB - Dimensionality reduction methods are designed to overcome the ‘curse of dimensionality’, which makes the analysis of high-dimensional big data difficult. Many of these methods are based on principal component analysis (PCA), which is statistically driven and does not directly address the geometry of the data. Thus, machine learning tasks such as classification and anomaly detection may not benefit from a PCA-based methodology. This work provides a dictionary-based framework for geometrically driven data analysis, for both linear and non-linear (diffusion) geometries, that includes dimensionality reduction, out-of-sample extension and anomaly detection. The paper proposes the Geometric Component Analysis (GCA) methodology for dimensionality reduction of linear and non-linear data. The main algorithm greedily picks multidimensional data points that span linear subspaces in the ambient space containing as much information as possible from the original data. For non-linear data, this greedy approach is applied to the “diffusion kernel” commonly used in diffusion geometry; GCA-based diffusion maps amount to a direct application of the greedy algorithm to the kernel matrix constructed in diffusion maps. The algorithm greedily selects data points according to their distances from the subspace spanned by the previously selected points, and stops when the distances of all remaining points fall below a prespecified threshold. The extracted geometry of the data is thereby preserved up to a user-defined distortion rate. In addition, the algorithm identifies a subset of landmark data points, known as a dictionary, for geometry-based dimensionality reduction. The performance of the method is demonstrated and evaluated on both synthetic and real-world data sets, achieving good results for unsupervised learning tasks. The proposed algorithm is attractive for its simplicity, low computational complexity and tractability.

KW - Dictionary construction

KW - Diffusion maps

KW - Incomplete pivoted QR

KW - Landmark data points

KW - Linear and non-linear dimensionality reduction

UR - http://www.scopus.com/inward/record.url?scp=85102506329&partnerID=8YFLogxK

U2 - 10.1016/j.acha.2021.02.005

DO - 10.1016/j.acha.2021.02.005

M3 - Article

AN - SCOPUS:85102506329

SN - 1063-5203

VL - 54

SP - 20

EP - 43

JO - Applied and Computational Harmonic Analysis

JF - Applied and Computational Harmonic Analysis

ER -