• Spectroscopy and Spectral Analysis
  • Vol. 42, Issue 9, 2694 (2022)

Abstract

Terahertz (THz) waves, characterized by low energy, transient nature, and suitability for spectral analysis, have a promising future in material identification. Although existing THz-based substance identification methods have achieved some success, they are prone to falling into local optima, which results in low identification accuracy. Uniform Manifold Approximation and Projection (UMAP), a nonlinear dimensionality reduction method, assumes that the data are uniformly distributed on a Riemannian manifold and models that manifold with a fuzzy topological structure. UMAP optimizes the layout of the data representation in the low-dimensional space by minimizing the cross-entropy between two topological representations, that of the high-dimensional data and that of its low-dimensional embedding. In the traditional fuzzy C-means clustering method (FCM), the initial cluster centers are often chosen randomly. When the initial cluster centers are poorly chosen, the algorithm easily falls into a local optimum, which leads to incorrect clustering. To this end, this paper proposes a UMAP-assisted fuzzy C-means clustering algorithm. First, UMAP is used to reduce the dimensionality of the input THz sample matrix. Then, appropriate initial cluster centers are selected based on the principle of maximizing the distance between categories. Finally, the fuzzy C-means method is employed to perform the clustering analysis. The proposed algorithm can solve the overcrowding problem between categories in the clustering process, and it reflects the distance information between categories, which facilitates the selection of appropriate initial cluster centers. To verify the reliability of the proposed algorithm, four types of genetically modified cotton seeds, Lu Mianyan28, Lu Mianyan29, Lu Mianyan36, and Zhongmian28, were measured using THz time-domain spectroscopy.
Then, the UMAP-assisted fuzzy C-means clustering method was used to cluster the absorbance spectra of the four types of genetically modified cotton seeds. The different cotton seeds were well separated, with an overall clustering accuracy of 0.9833. This result demonstrates that the proposed UMAP-assisted fuzzy C-means clustering method has good application prospects for identifying materials from their THz spectra.
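
The last two steps of the pipeline described above (choosing well-separated initial centers, then running fuzzy C-means) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not state the exact selection rule, so the greedy farthest-point rule in `maxmin_init` is an assumption, as are the function names, the fuzzifier `m = 2`, and the synthetic two-dimensional data, which merely stands in for a UMAP embedding of THz absorbance spectra.

```python
import numpy as np


def maxmin_init(X, c, first=0):
    """Greedy farthest-point choice of c initial centers (assumed rule)."""
    idx = [first]
    # Distance from every sample to its nearest already-chosen center.
    d = np.linalg.norm(X - X[first], axis=1)
    for _ in range(c - 1):
        nxt = int(np.argmax(d))  # sample farthest from all chosen centers
        idx.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return X[idx].copy()


def fuzzy_c_means(X, centers, m=2.0, tol=1e-6, max_iter=300):
    """Standard FCM alternating updates, started from the given centers."""
    V = np.asarray(centers, dtype=float).copy()
    for _ in range(max_iter):
        # Euclidean distances, shape (n_samples, c); clipped to avoid division by zero.
        D = np.fmax(np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2), 1e-12)
        # Membership u_ik is proportional to d_ik^(-2/(m-1)), normalized per sample.
        W = D ** (-2.0 / (m - 1.0))
        U = W / W.sum(axis=1, keepdims=True)
        # Each center becomes the mean of the samples weighted by u_ik^m.
        Um = U ** m
        V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]
        shift = np.linalg.norm(V_new - V)
        V = V_new
        if shift < tol:
            break
    return U, V


# Synthetic stand-in for a 2-D embedding: four groups of 30 samples each.
rng = np.random.default_rng(0)
true_centers = np.array([[0, 0], [6, 0], [0, 6], [6, 6]], dtype=float)
X = np.vstack([mu + 0.5 * rng.standard_normal((30, 2)) for mu in true_centers])

V0 = maxmin_init(X, c=4)       # well-separated starting centers
U, V = fuzzy_c_means(X, V0)    # fuzzy memberships and final centers
labels = U.argmax(axis=1)      # hard label = cluster with highest membership
```

With real spectra, `X` would be replaced by the low-dimensional output of the UMAP step, for example from the `umap-learn` package; that step is omitted here so the sketch depends only on NumPy.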