Wasserstein GAN for the Classification of Unbalanced THz Database

Rong-sheng ZHU; Tao SHEN; Ying-li LIU; Yan ZHU; Xiang-wei CUI

doi:10.3964/j.issn.1000-0593(2021)02-0425-05

Abstract

Every substance has a unique terahertz spectrum. Terahertz spectrum recognition that combines advanced machine learning methods with large-scale spectral databases has become a focus of research in terahertz application technology. Class-balanced spectral data covering multiple materials is the basis for classifying terahertz spectra, but such data are difficult to collect. This paper proposes a recognition method for unbalanced terahertz spectra based on the Wasserstein generative adversarial network (WGAN). As a data generation method, WGAN supplements the data set with data generated once the model has reached Nash equilibrium, and a support vector machine (SVM) is then trained on the supplemented data set. The experimental results show that the generated data closely follow the distribution of the real data, and that mixing generated data with real data improves the recognition accuracy on unbalanced spectral data. Three maltose compounds with similar characteristic spectra are used for verification. We first apply Savitzky-Golay (S-G) filtering and cubic spline interpolation to normalize the spectral data of the three substances, and then construct a WGAN model to expand the unbalanced terahertz spectral data until the classes are balanced. All experiments are evaluated on the same test set, and three sets of comparative experiments demonstrate the effectiveness of WGAN in handling unbalanced data sets. First, we use WGAN to generate data. As the number of iterations increases, the generated data gradually approach the real data distribution, and when the model reaches Nash equilibrium the generated data largely conform to the original data distribution.
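The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "normalizing" means smoothing each spectrum, resampling it onto a shared frequency grid, and scaling it to [0, 1]. The function name `preprocess_spectrum`, the filter settings, the 0.2 to 1.6 THz range, and the synthetic spectrum are all hypothetical.

```python
import numpy as np
from scipy.signal import savgol_filter
from scipy.interpolate import CubicSpline


def preprocess_spectrum(freq_thz, absorbance, grid, window_length=11, polyorder=3):
    """Smooth one spectrum, resample it on a shared grid, and scale it to [0, 1].

    window_length and polyorder are illustrative defaults, not values from the paper.
    """
    # Savitzky-Golay filtering suppresses noise while preserving peak shape.
    smoothed = savgol_filter(absorbance, window_length=window_length, polyorder=polyorder)
    # Cubic spline interpolation puts every spectrum on the same frequency grid,
    # so all samples have the same length (freq_thz must be strictly increasing).
    resampled = CubicSpline(freq_thz, smoothed)(grid)
    # Min-max scaling (an assumed reading of "normalize").
    span = resampled.max() - resampled.min()
    if span == 0:
        return np.zeros_like(resampled)
    return (resampled - resampled.min()) / span


# Synthetic stand-in for a measured spectrum: one absorption peak plus noise.
rng = np.random.default_rng(0)
freq = np.linspace(0.2, 1.6, 120)
raw = np.exp(-((freq - 0.9) / 0.1) ** 2) + 0.02 * rng.standard_normal(freq.size)

grid = np.linspace(0.2, 1.6, 256)  # shared grid for all samples
clean = preprocess_spectrum(freq, raw, grid)
```

Resampling every spectrum to the same length is what allows the spectra to be stacked into a fixed-size input for a generator, a critic, or an SVM.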
The experimental results show that training the SVM model on the WGAN-extended data set resolves the bias whereby the model assigns test samples from the small-sample classes (maltotriose, maltohexaose) to the large-sample class (maltoheptaose). Comparing WGAN with two traditional methods for handling unbalanced data sets, FWSVM and COPY, we find that all three approaches reach a training-set accuracy above 90% on dataset-1. However, because of their limited generalization ability, the traditional methods perform unsatisfactorily on the test set, whereas the test-set accuracy with WGAN reaches 91.54%. To examine different degrees of imbalance, data sets with imbalances of 16, 81, and 256 were used for verification. The accuracies on the three test sets are 92.08%, 91.54%, and 90.27%, respectively, which meets the requirements of dealing with different degrees of imbalance in practical work.
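As a rough illustration of what bringing a data set to class balance involves, the sketch below computes how many generated spectra each class would need. It assumes that the degree of imbalance is the ratio of the largest class size to the smallest, which the abstract does not state. The helper names and the sample counts are hypothetical and are not taken from the paper.

```python
def imbalance_ratio(class_counts):
    """Ratio of the largest class size to the smallest (assumed definition)."""
    return max(class_counts.values()) / min(class_counts.values())


def synthetic_counts(class_counts):
    """Number of generated samples each class needs to match the largest class."""
    target = max(class_counts.values())
    return {name: target - n for name, n in class_counts.items()}


# Hypothetical sample counts, chosen only to give an imbalance of 16.
counts = {"maltotriose": 10, "maltohexaose": 10, "maltoheptaose": 160}

ratio = imbalance_ratio(counts)        # 16.0
to_generate = synthetic_counts(counts)
# {'maltotriose': 150, 'maltohexaose': 150, 'maltoheptaose': 0}
```

Under these assumptions, the generator would supply 150 spectra for each minority class, and the SVM would then be trained on the mixture of real and generated samples.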