| Category | Method | Proposer, year | Characteristics |
| --- | --- | --- | --- |
| Intelligent optimizing algorithm (IOA)-based | GA (genetic algorithm) | Holland, 1975 [43] | Returns to the mathematical essence of combinatorial variable optimization and retains the benefit of evaluating variables in combination; however, the number of combinations to search is very large, more preset parameters are usually required, and the search can become trapped in local optima. |
| | SA (simulated annealing) | Metropolis, 1953 [44] | |
| | PSO (particle swarm optimization) | Eberhart & Kennedy, 1995 [45] | |
| | ACO (ant colony optimization) | Colorni, 1991 [46] | |
| | GWO (grey wolf optimizer) | Mirjalili, 2014 [47] | |
| Model population analysis (MPA)-based | BOSS (bootstrapping soft shrinkage) | Liang, 2016 [48] | Replaces the traditional strategy of rigidly eliminating variables by a single index with a flexible weighting strategy, which preserves effective variables more safely; the randomness introduced helps retain combination effects among spectral variables, but it also makes the computation more complicated. |
| | VCPA (variable combination population analysis) | Liang, 2015 [49] | |
| | VISSA (variable iterative space shrinkage approach) | Liang, 2014 [50] | |
| | ICO (interval combination optimization) | Xiong & Min, 2016 [51] | |
| | iRF (interval random frog) | Liang, 2013 [52] | |
| Collinearity minimization-based | SPA (successive projections algorithm) [53, 54] | Araujo, 2001 [55] | Minimizes the influence of multicollinear variables on the model; because each variable serves as a starting point during the optimization, the computational load is heavy, so the method is only practical for small data sets. |
| | SR (stepwise regression) [56] | | |
| Category model-based | LDA (linear discriminant analysis) | Fisher, 1936 [57] | Preserves the correlation between variables and the model, and overall prediction accuracy can be improved by combining different classification algorithms; the computational cost is small, but the result is limited by the performance of the classification model. |
| | ULDA (uncorrelated linear discriminant analysis) [58] | Jin, 2001 [59] | |
| | RF (random forest) [60, 61, 62] | Breiman, 2001 [63] | |
| | SVM (support vector machine) | Vapnik, 1995 [64] | |
| Regularization-based | LASSO (least absolute shrinkage and selection operator) [65] | Tibshirani, 1996 [66] | Performs parameter estimation and variable selection simultaneously and quickly; over-fitting can be avoided when the number of variables is large, but a suitable regularization parameter must be chosen. |
| | EN (elastic net) | Zou, 2003 [67] | |
| | RR (ridge regression) | Hoerl & Kennard, 1998 [68] | |
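As an illustration of the IOA row, a GA-based variable selection can be sketched in a few lines of NumPy. Everything beyond the table is an assumption for the demo: the fitness function (negative training RMSE of a least-squares fit plus a small penalty per retained variable), the population size, and the mutation rate are arbitrary choices, not settings from any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """Score a variable subset: negative RMSE of a least-squares fit
    on the selected columns, minus a small penalty per kept variable
    (empty subsets score -inf)."""
    if mask.sum() == 0:
        return -np.inf
    Xs = X[:, mask.astype(bool)]
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rmse = np.sqrt(np.mean((Xs @ coef - y) ** 2))
    return -rmse - 0.01 * mask.sum()

def ga_select(X, y, pop=30, gens=40, p_mut=0.05):
    """Binary-chromosome GA: tournament selection, one-point
    crossover, bit-flip mutation."""
    n_var = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n_var))
    for _ in range(gens):
        scores = np.array([fitness(ind, X, y) for ind in population])
        # tournament selection: keep the better of two random parents
        idx = rng.integers(0, pop, size=(pop, 2))
        parents = population[np.where(scores[idx[:, 0]] > scores[idx[:, 1]],
                                      idx[:, 0], idx[:, 1])]
        # one-point crossover between consecutive parents
        children = parents.copy()
        for i in range(0, pop - 1, 2):
            cut = rng.integers(1, n_var)
            children[i, cut:] = parents[i + 1, cut:]
            children[i + 1, cut:] = parents[i, cut:]
        # bit-flip mutation
        flip = rng.random(children.shape) < p_mut
        children[flip] = 1 - children[flip]
        population = children
    scores = np.array([fitness(ind, X, y) for ind in population])
    return population[np.argmax(scores)]

# toy data: only variables 0 and 3 carry signal
X = rng.normal(size=(60, 8))
y = 2 * X[:, 0] - 3 * X[:, 3]
best = ga_select(X, y)
```

The penalty term in the fitness is one simple way to express the preference for smaller subsets; without it, the GA has no pressure to drop uninformative variables that marginally reduce training error.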
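The MPA row's "flexible weighting" idea can be sketched as follows. This is a loose simplification of bootstrapping-soft-shrinkage-style selection, not the published BOSS algorithm: the subset sampling scheme, the number of sub-models, and the use of accumulated absolute regression coefficients as sampling weights are all assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)

def boss_like_weights(X, y, n_sub=200, n_keep=50, rounds=5):
    """Soft-shrinkage sketch: sample variable subsets by weight, fit
    least squares on each, and let the best sub-models' |coefficients|
    update the sampling weights instead of hard-deleting variables."""
    n, p = X.shape
    w = np.full(p, 1.0 / p)
    for _ in range(rounds):
        coef_sum = np.zeros(p)
        scores = []
        subsets = []
        for _ in range(n_sub):
            # weighted sampling with replacement, as in bootstrapping
            vars_ = np.unique(rng.choice(p, size=p, p=w))
            b, *_ = np.linalg.lstsq(X[:, vars_], y, rcond=None)
            rmse = np.sqrt(np.mean((X[:, vars_] @ b - y) ** 2))
            scores.append(rmse)
            subsets.append((vars_, b))
        # keep the best sub-models and accumulate |coefficients|
        for i in np.argsort(scores)[:n_keep]:
            vars_, b = subsets[i]
            coef_sum[vars_] += np.abs(b)
        w = coef_sum / coef_sum.sum()   # soft shrinkage: reweight, don't delete
    return w

# toy data: variables 1 and 7 carry the signal
X = rng.normal(size=(50, 10))
y = X[:, 1] + 0.5 * X[:, 7] + 0.1 * rng.normal(size=50)
w = boss_like_weights(X, y)
```

Note how uninformative variables are never explicitly removed; their sampling weight simply decays toward zero, which is the "changing weight" strategy the table contrasts with rigid single-index elimination.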
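The SPA row's projection step can be written compactly: starting from a chosen column, each iteration projects all columns onto the orthogonal complement of the selected ones and keeps the column with the largest residual norm. This sketch covers only the core projection chain; the published algorithm also scans every variable as a starting point and validates each chain, which is the computational cost noted in the table.

```python
import numpy as np

def spa(X, k, start):
    """Successive projections: greedily add the column with the largest
    norm after projecting out the span of the already-selected columns."""
    Xp = X.astype(float).copy()
    selected = [start]
    for _ in range(k - 1):
        v = Xp[:, selected[-1]].copy()
        v /= np.linalg.norm(v)
        # remove the component along v from every column
        Xp = Xp - np.outer(v, v @ Xp)
        norms = np.linalg.norm(Xp, axis=0)
        norms[selected] = -1.0          # never re-select a column
        selected.append(int(np.argmax(norms)))
    return selected

# toy data: column 2 is almost a copy of column 0
rng = np.random.default_rng(2)
a = rng.normal(size=20)
b = rng.normal(size=20)
X = np.column_stack([a, b,
                     a + 0.01 * rng.normal(size=20),
                     rng.normal(size=20)])
sel = spa(X, 3, start=0)
```

Because column 2 lies almost entirely in the span of column 0, its residual norm collapses after the first projection, so SPA avoids it: exactly the multicollinearity-minimizing behaviour described in the table.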
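For the category-model-based row, one lightweight illustration is to rank variables by Fisher's class-separability criterion, the quantity underlying LDA. This per-variable ratio is a generic sketch, not any specific published selector.

```python
import numpy as np

def fisher_ratio(X, y):
    """Rank variables by Fisher's criterion: between-class scatter of
    the class means divided by the pooled within-class scatter."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        between += len(Xc) * (mean_c - overall) ** 2
        within += ((Xc - mean_c) ** 2).sum(axis=0)
    return between / within

# toy data: two classes separated only along variable 0
rng = np.random.default_rng(3)
y = np.repeat([0, 1], 25)
X = rng.normal(size=(50, 6))
X[y == 1, 0] += 3.0
scores = fisher_ratio(X, y)
```

As the table notes, the cost of such model-based criteria is small, but the resulting ranking inherits the assumptions of the underlying classifier (here, class-conditional means and a pooled variance, as in LDA).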
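The regularization row's claim that LASSO performs estimation and selection simultaneously can be made concrete with a small cyclic coordinate-descent solver. The objective scaling (1/2n) and the choice of lambda are assumptions for the demo, not values from the cited work.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for
    min_b (1/2n)*||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ b                        # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]          # restore j's contribution
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]          # update the residual
    return b

# toy data: only variables 0 and 2 are informative
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
true_coef = np.array([2.0, 0.0, -1.5, 0.0, 0.0])
y = X @ true_coef + 0.1 * rng.normal(size=100)
b = lasso_cd(X, y, lam=0.2)
```

The soft-threshold step is what sets coefficients exactly to zero, so the fitted vector is sparse: coefficient estimation and variable selection happen in the same pass, as the table states. The lambda value trades sparsity against shrinkage of the retained coefficients, which is the parameter-choice caveat in the table.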