• Spectroscopy and Spectral Analysis
  • Vol. 39, Issue 4, 1047 (2019)
LI Si-hai1,* and ZHAO Lei2
Author Affiliations
  • 1[in Chinese]
  • 2[in Chinese]
    DOI: 10.3964/j.issn.1000-0593(2019)04-1047-06
    LI Si-hai, ZHAO Lei. A Variable Selection Method Based on Ensemble-SISPLS for Near Infrared Spectroscopy[J]. Spectroscopy and Spectral Analysis, 2019, 39(4): 1047

    Abstract

    Near-infrared spectral data are typically high-dimensional with small sample sizes: the number of variables far exceeds the number of samples. Variable selection is an effective way to improve the robustness and interpretability of quantitative analysis models for near-infrared spectroscopy. Sure Independence Screening (SIS), a feature selection method for ultrahigh-dimensional spaces based on the marginal correlation between each predictor and the response, is widely used for variable selection in gene microarray data. SIS can reduce the dimensionality of the data to the scale of the sample size, which is comparable to the reduction ability of LASSO. Under a fairly general asymptotic framework, SIS has the sure screening property: with probability tending to one, all significant variables are retained after screening. The variable selection method that combines sure independence screening with partial least squares regression (SIS-SPLS) is an iterative SIS method. First, SIS is used to make an initial selection of significant variables. Stepwise forward selection is then carried out on the basis of the marginal correlations of the selected variables: a partial least squares regression model is established, and the final variable selection result is determined according to the Bayesian Information Criterion (BIC). SIS-SPLS thus screens important variables incrementally in a stepwise forward manner. As the number of latent variables increases and the residual gradually decreases, the number of variables selected by SIS-SPLS stabilizes.
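The initial SIS screening step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sis_screen`, the default cutoff of n - 1 retained variables, and the synthetic data are assumptions made for the example, and the subsequent stepwise PLS/BIC stage of SIS-SPLS is not shown.

```python
import numpy as np

def sis_screen(X, y, d=None):
    """Keep the d predictors with the largest absolute marginal correlation with y."""
    n, p = X.shape
    if d is None:
        d = min(p, n - 1)  # assumed cutoff: shrink to roughly the sample size
    Xc = X - X.mean(axis=0)           # center each predictor
    yc = y - y.mean()                 # center the response
    x_norm = np.linalg.norm(Xc, axis=0)
    x_norm[x_norm == 0] = np.inf      # constant columns get zero correlation
    corr = (Xc.T @ yc) / (x_norm * np.linalg.norm(yc))
    order = np.argsort(-np.abs(corr))  # strongest marginal correlation first
    return np.sort(order[:d]), corr

# Synthetic high-dimensional, small-sample data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n, p = 60, 500
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 10] - 3.0 * X[:, 250] + 0.1 * rng.normal(size=n)

selected, corr = sis_screen(X, y)
print(len(selected), "variables kept out of", p)
```

Ranking by absolute correlation treats positively and negatively correlated wavelengths alike; the retained set would then be passed to the forward-selection stage.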
However, when the number of spectral variables is much larger than the number of samples, evaluating the importance of variables by marginal correlation alone can leave the number of selected variables large or make the variable selection results insufficiently robust. To improve the robustness of variable selection in the small-sample case, this paper develops a new variable selection method based on ensemble learning, SIS and partial least squares regression (Ensemble-SISPLS). First, following the bagging ensemble strategy, the bootstrap method was used to resample the calibration set at random. Variable selection was performed by SIS-SPLS on each calibration subset, and the selection results of all calibration subsets were aggregated by a voting rule. Variables whose selection frequency exceeded a given threshold were retained, and a partial least squares regression model was built on them to calculate the root mean square error of 5-fold cross-validation. A grid search was used to optimize the two key parameters, the frequency threshold and the number of latent variables. The performance of each sub-model was evaluated jointly on its cross-validation root mean square error and its number of variables, and the variables included in the optimal sub-model were taken as the final variable selection result. Variable selection experiments were performed on the Corn dataset and the Angelica sinensis dataset, and several variable selection methods, including Ensemble-SISPLS, SIS-SPLS and UVE-PLS, were compared in terms of the number of selected variables and model robustness. A total of 77 Angelica sinensis samples were collected from Minxian and Weiyuan Counties in Gansu Province. Near-infrared spectra of all samples were acquired with a Nicolet-6700 near-infrared spectrometer for the prediction of ferulic acid content in Angelica sinensis.
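The bootstrap-and-vote step described above can be sketched as follows. This is only an illustration under stated assumptions: the base selector `top_k_by_correlation` is a simple placeholder standing in for SIS-SPLS, and the names `bagging_vote`, `n_bags`, `k`, the 0.8 threshold and the synthetic data are all hypothetical. The grid search over the frequency threshold and the number of latent variables with 5-fold cross-validated PLS is not shown.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Placeholder base selector (NOT SIS-SPLS): top-k by absolute marginal correlation."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    denom[denom == 0] = np.inf  # guard against constant columns or a constant response
    corr = (Xc.T @ yc) / denom
    return np.argsort(-np.abs(corr))[:k]

def bagging_vote(X, y, n_bags=50, k=20, seed=0):
    """Return, for each variable, the fraction of bootstrap subsets that selected it."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)  # draw n calibration samples with replacement
        counts[top_k_by_correlation(X[idx], y[idx], k)] += 1
    return counts / n_bags

# Synthetic calibration set (hypothetical, for illustration only)
rng = np.random.default_rng(1)
n, p = 60, 300
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 5] - 3.0 * X[:, 120] + 0.1 * rng.normal(size=n)

freq = bagging_vote(X, y)
threshold = 0.8  # assumed value; the paper tunes this threshold by grid search
selected = np.flatnonzero(freq >= threshold)
print("variables passing the vote:", selected)
```

Variables that are selected only by chance in a few resamples receive low frequencies and fall below the threshold, which is the mechanism by which the ensemble is expected to stabilize the selection.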
On the Corn dataset, the number of selected variables, RMSEP and coefficient of determination were 22, 0.0008 and 0.9998 for Ensemble-SISPLS, compared with 97, 0.0073 and 0.9988 for SIS-SPLS. On the Angelica sinensis dataset, the corresponding values were 24, 0.0181 and 0.9963 for Ensemble-SISPLS, compared with 38, 0.0226 and 0.9943 for SIS-SPLS. The results show that Ensemble-SISPLS further improves the robustness and predictive ability of the variable selection result. By combining the variable selection ability of SIS-SPLS with the good generalization capacity of ensemble learning, Ensemble-SISPLS improves the robustness of variable selection. In addition, the evaluation criterion for sub-models strikes a compromise between prediction performance and the number of selected variables, which reduces the number of selected variables to some extent and at the same time improves the interpretability of the model.
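For reference, the two reported metrics can be computed as in the sketch below. It assumes the usual definitions, RMSEP as the root mean square error on the prediction set and the coefficient of determination as 1 - SS_res/SS_tot; the abstract does not state which definition of the coefficient of determination the authors used, and the function names and toy values are illustrative only.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Toy reference values and predictions (hypothetical, not data from the paper)
y_ref = [1.0, 2.0, 3.0, 4.0]
y_hat = [1.0, 2.0, 3.0, 6.0]
print(rmsep(y_ref, y_hat), r_squared(y_ref, y_hat))
```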