Language Identification Using Joint Voice Activity Detection and Dynamic Range Control

Yankai Wang; Hua Long; Yubin Shao; Qingzhi Du; Yao Wang

doi:10.3788/LOP202259.1307001

Laser & Optoelectronics Progress
Vol. 59, Issue 13, 1307001 (2022)

Yankai Wang, Hua Long^*, Yubin Shao, Qingzhi Du, and Yao Wang

Author Affiliations

Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China

Yankai Wang, Hua Long, Yubin Shao, Qingzhi Du, Yao Wang. Language Identification Using Joint Voice Activity Detection and Dynamic Range Control[J]. Laser & Optoelectronics Progress, 2022, 59(13): 1307001

Fig. 1. MFCC₀ feature voice activity detection. (a) Voice waveform; (b) MFCC₀ features; (c) MFCC₀ feature voice activity detection result after median filtering
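The caption above describes a frame-level speech/non-speech decision that is smoothed by median filtering. A minimal sketch of that idea follows. It assumes that MFCC₀ is compared against a fixed threshold and that the median filter is applied to the binary decision; neither detail is given in this excerpt, and the names `vad_from_mfcc0`, `threshold`, and `width` are hypothetical.

```python
from statistics import median


def median_filter(values, width=5):
    """Sliding median over an odd-length window; edges use a truncated window."""
    half = width // 2
    return [
        median(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


def vad_from_mfcc0(mfcc0, threshold, width=5):
    """Return one 0/1 flag per frame (1 = speech).

    Assumption: a frame counts as speech when its MFCC0 value exceeds
    `threshold`. The threshold rule and window width are illustrative only.
    """
    raw = [1 if v > threshold else 0 for v in mfcc0]
    smoothed = median_filter(raw, width)
    # A truncated edge window can yield a median of 0.5; treat that as speech.
    return [1 if v >= 0.5 else 0 for v in smoothed]
```

With a window of three frames, an isolated one-frame spike is discarded and a one-frame dropout inside a speech run is filled, which is the usual reason for smoothing a frame-wise decision this way.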

Fig. 2. DRC input/output processing unit

Fig. 3. Voice changes before and after DRC processing. (a) Voice waveform changes before and after DRC processing; (b) spectrogram before DRC processing; (c) spectrogram after DRC processing

Fig. 4. Comparison of different frequency scales. (a) Linear scale spectrogram; (b) log scale spectrogram

Fig. 5. Flow chart of language recognition

Fig. 6. Multi-classification task evaluation parameters

Fig. 7. Results of different frequency coordinate scales

Fig. 8. ResNet classification results

Fig. 9. ResNeSt classification results

Fig. 10. Language recognition result confusion matrix
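For readers who want to derive summary figures from a confusion matrix such as the one in Fig. 10, the sketch below computes overall accuracy and per-class precision and recall. It assumes rows are true languages and columns are predicted languages; the excerpt does not state the orientation or which metrics the paper reports, so the function names and the example matrix are purely illustrative.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def per_class_precision_recall(cm):
    """Return a (precision, recall) pair per class.

    Assumption: cm[i][j] counts samples of true class i predicted as class j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        true_positive = cm[k][k]
        predicted_as_k = sum(cm[i][k] for i in range(n))  # column sum
        actually_k = sum(cm[k])                           # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        results.append((precision, recall))
    return results


# Made-up two-class example, not data from the paper.
example = [[8, 2],
           [1, 9]]
print(overall_accuracy(example))            # 0.85
print(per_class_precision_recall(example))
```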

Probability distribution before VAD	$\xi$	1	0
Probability distribution before VAD	$P$	$\frac{L_{2}}{S}$	$\frac{L_{1}}{S}$
Probability distribution after VAD	$\xi$	1	0
Probability distribution after VAD	$P$	$\frac{L_{2} + L_{3}}{S}$	$\frac{L_{1} - L_{3}}{S}$

Table 1. Probability distribution change before and after VAD
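Reading Table 1 as the distribution of a two-valued variable $\xi$, the shift caused by VAD can be worked out directly from the tabulated probabilities. The excerpt does not define $L_1$, $L_2$, $L_3$, or $S$, so the lines below are only the arithmetic implied by the table, not an interpretation of those symbols.

```latex
% Normalization of the "before" row fixes S:
\frac{L_2}{S} + \frac{L_1}{S} = 1 \;\Longrightarrow\; S = L_1 + L_2
% The "after" row is normalized by the same S:
\frac{L_2 + L_3}{S} + \frac{L_1 - L_3}{S} = \frac{L_1 + L_2}{S} = 1
% Expectation of \xi before and after VAD:
E[\xi]_{\text{before}} = 1 \cdot \frac{L_2}{S} + 0 \cdot \frac{L_1}{S} = \frac{L_2}{S},
\qquad
E[\xi]_{\text{after}} = \frac{L_2 + L_3}{S}
% Net change in P(\xi = 1):
E[\xi]_{\text{after}} - E[\xi]_{\text{before}} = \frac{L_3}{S}
```

VAD therefore moves probability mass $L_3 / S$ from $\xi = 0$ to $\xi = 1$ while leaving the total unchanged.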

Language type	Training wav number	Training people number	Testing wav number	Testing people number	Total wav number	Duration /s
French	1200	150	300	149	1500	3
German	1200	150	300	150	1500	3
Spanish	1200	151	300	151	1500	3
English	1200	169	300	154	1500	3
Italian	1200	151	300	150	1500	3
Russian	1200	150	300	148	1500	3
Total	7200	921	1800	902	9000

Table 2. Data allocation of training set and testing set

Feature	(Frame_number, Data_dimension)	A_accuracy /%
MFCC-SDC	(374, 56)	65.72
MFCC	(374, 39)	80.88
GFCC	(374, 32)	85.44
Log scale Fbank feature	(374, 64)	93.05
Linear scale spectrogram	(374, 128)	93.66
Log scale spectrogram (proposed)	(374, 128)	97.94

Table 3. Comparison of language identification results of several different features
