E-portfolio Entry
Speaker identification is the process of automatically recognizing who is speaking from speaker-specific information embedded in the speech waveform, which is obtained through feature extraction. The extracted features can then be used to verify the identity claimed by a person accessing the system, and a maximum selector over the matching scores identifies the speaker.
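As a concrete illustration (my own sketch, not taken from the reviewed papers), a maximum selector can be expressed as an argmax over similarity scores between the feature vector of a test utterance and the stored feature vectors of enrolled speakers; the function and variable names below are hypothetical.

import numpy as np

# Minimal sketch of a maximum selector: pick the enrolled speaker whose
# feature embedding is most similar (cosine similarity) to the test utterance.
def identify_speaker(test_embedding, enrolled_embeddings):
    scores = {}
    for speaker, embedding in enrolled_embeddings.items():
        scores[speaker] = np.dot(test_embedding, embedding) / (
            np.linalg.norm(test_embedding) * np.linalg.norm(embedding))
    return max(scores, key=scores.get)  # speaker with the highest score wins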
Based on the research papers reviewed, GFCC in its default structure is more noise-robust than MFCC. However, the key factor behind this robustness is the non-linear rectification using the cubic root. Hence, noise robustness could be improved by taking the cubic-root rectification that follows the Gammatone filterbank in GFCC and inserting it into the stages after the Mel filterbank in MFCC. The papers also show that the Mel-spectrogram outperforms both MFCC and GFCC in noisy and clean environments when feature extraction is involved in the model training process, because the Mel-spectrogram preserves most of the complex representation of the voice information while using a simpler feature extraction structure. Beyond feature extraction, several papers were found that study suitable methods for the model training process. A Siamese Neural Network (SNN) is found to be more suitable than a Convolutional Neural Network (CNN) for speaker identification because it learns from a similarity score between pairs of inputs; a speaker identification model trained with an SNN can therefore still identify a new speaker even when no speech from that speaker was available during training.
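To make the suggestion concrete, the following is a minimal sketch (my own illustration, assuming librosa and SciPy are available) of an MFCC-style pipeline in which the usual log compression after the Mel filterbank is replaced by the cubic-root rectification borrowed from GFCC; the sample rate and coefficient counts are illustrative only.

import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_with_cubic_root(wav_path, n_mels=40, n_coeffs=13):
    y, sr = librosa.load(wav_path, sr=16000)
    # Mel filterbank energies (power spectrogram projected onto Mel filters)
    mel_energies = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Cubic-root rectification in place of the usual log compression,
    # mirroring the non-linearity applied after the Gammatone filterbank in GFCC
    rectified = np.cbrt(mel_energies)
    # DCT decorrelates the rectified filterbank outputs into cepstral coefficients
    return dct(rectified, axis=0, norm='ortho')[:n_coeffs]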
After collecting the relevant information, the training process starts with Mel-spectrogram feature extraction on the VoxCeleb2 dataset, producing Mel-spectrogram images. These images are then fed into the SNN, and training begins with MobileNetV2 as the baseline subnetwork of the SNN. MobileNetV2 is subsequently replaced by SqueezeNet and MCUNet in turn. The whole process is then repeated with contrastive loss replacing binary cross-entropy as the loss function. A summary of the training results is shown in the table below.
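The sketch below illustrates the general pair-based setup described above, assuming PyTorch and torchvision are available; it uses torchvision's MobileNetV2 as the weight-shared subnetwork and a standard contrastive loss, and is not the exact training code used in the project.

import torch
import torch.nn as nn
from torchvision import models

class SiameseNet(nn.Module):
    # Both inputs pass through the same weight-shared subnetwork (MobileNetV2 here),
    # so the network learns an embedding space rather than a fixed set of speaker classes.
    def __init__(self, embedding_dim=128):
        super().__init__()
        backbone = models.mobilenet_v2(weights=None)
        backbone.classifier = nn.Linear(backbone.last_channel, embedding_dim)
        self.backbone = backbone

    def forward(self, x1, x2):
        # x1, x2: batches of 3-channel Mel-spectrogram images
        return self.backbone(x1), self.backbone(x2)

def contrastive_loss(e1, e2, same_speaker, margin=1.0):
    # same_speaker is 1.0 for a genuine pair and 0.0 for an impostor pair
    d = nn.functional.pairwise_distance(e1, e2)
    return torch.mean(same_speaker * d.pow(2) +
                      (1 - same_speaker) * torch.clamp(margin - d, min=0).pow(2))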
A speaker identification system is also created by integrating a wake-word detector with the speaker identification model. Incoming speech first triggers the wake-word detector, and the speaker identification model is invoked only when the detector finds the correct wake word in the incoming speech. This gating saves power effectively, since the speaker identification model does not need to stay loaded and running continuously to identify the speaker; it is activated on demand by the wake-word detector. The relevant result is shown in the diagram below.
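A simplified gating loop of this kind could look like the sketch below; the detector, identifier, and microphone objects are hypothetical placeholders, since the point is only that the heavier model runs on demand after the cheap wake-word check.

# Hypothetical gating loop: the always-on wake-word detector is lightweight, and the
# heavier speaker identification model is invoked only after the wake word fires.
def listen_loop(wake_word_detector, speaker_identifier, microphone):
    for audio_chunk in microphone.stream():
        if wake_word_detector.detect(audio_chunk):
            speaker = speaker_identifier.identify(audio_chunk)
            print("Wake word detected; identified speaker:", speaker)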
In conclusion, contrastive loss outperforms binary cross-entropy loss in the SNN. The SNN with SqueezeNet as the subnetwork achieves the best inference time, while the SNN with MCUNet as the subnetwork achieves the best accuracy and loss. Thus, lightweight networks and Neural Architecture Search (NAS) can be used to optimize the performance of a speaker identification model on an embedded system effectively.
References
1. B. Bushofa and N. Bazina, "User Authentication based on his/her Speech," 2014.
2. F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," 2016.
3. G. J. Ginovart-Panisello, E. Vidaña-Vila, S. Caro-Via, C. Martínez-Suquía and M. Freixes, "Low-Cost WASN for Real-Time Soundmap Generation," 2021.
4. Hajavi and A. Etemad, "Siamese Capsule Network for End-to-End Speaker Recognition In The Wild," 2020.
5. J. Joshy and K. Sambyo, "A Comparison and Contrast of the Various Feature Extraction Techniques in Speaker Recognition," 2016.
6. J. Lin, W.-M. Chen, Y. Lin, J. Cohn, C. Gan and S. Han, "MCUNet: Tiny Deep Learning on IoT Devices," 2020.
7. J. S. Chung, A. Nagrani and A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," 2018.
8. A. M. Jalil, F. S. Hasan and H. A. Alabbasi, "Speaker Identification using Convolutional Neural Network for Clean and Noisy Speech Samples," 2019.
9. M. S. Sharifuddin, S. Nordin and A. M. Ali, "Comparison of CNNs and SVM for Voice Control Wheelchair," vol. 9, pp. 387-393, 2020.
10. Vélez, C. Rascon and G. Fuentes-Pineda, "One-Shot Speaker Identification for a Service Robot using a CNN-based Generic Verifier," 2018.
11. C. Vimala and V. Radha, "Suitable Feature Extraction and Speech Recognition Technique for Isolated Tamil Spoken Words," 2014.
12. W. Burgos, "Gammatone and MFCC features in Speaker Recognition," 2014.
13. W. Glegoła, A. Karpus and A. Przybyłek, "MobileNet family tailored for Raspberry Pi," Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 25th International Conference KES2021, vol. 192, pp. 2249-2258, 2021.
14. X. Wang, G. Li and P. Wang, "Qt-Based Cross-platform Design of Management System for Distributed Real-time Simulation Platform," 2016.
15. X. Zhao and D. Wang, "Analyzing Noise Robustness of MFCC and GFCC Features in Speaker Identification," 2013.