FYP2 Progress Report

E-portfolio Entry

    Speaker identification is the process of automatically recognizing the person who is speaking, based on speaker-specific information embedded in the speech waveform and obtained through feature extraction. This process can then verify the identity claimed by a person accessing the system, and a maximum selector finally identifies the most likely speaker.
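    As a minimal sketch of the maximum-selector idea, assuming every enrolled speaker already has an embedding vector produced by the trained model (the function and variable names below are illustrative, not taken from the project code):

import numpy as np

def select_speaker(query_emb, enrolled):
    # enrolled: dict mapping speaker name -> embedding (1-D numpy array)
    names = list(enrolled.keys())
    sims = [np.dot(query_emb, enrolled[n]) /
            (np.linalg.norm(query_emb) * np.linalg.norm(enrolled[n]))
            for n in names]                 # cosine similarity scores
    return names[int(np.argmax(sims))]      # maximum selector picks the best match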

    Based on the research papers, the default GFCC structure is more robust to noise than MFCC. However, the key factor behind this noise robustness is the non-linear rectification using the cubic root. Hence, it is suggested that the noise robustness of MFCC can be improved by adding the non-linear rectification, which is executed after the Gammatone filterbank in GFCC, to the stages after the Mel filterbank in MFCC. It is also shown that the Mel-spectrogram performs better than MFCC and GFCC in both noisy and clean environments when the feature extraction technique is part of the model training process, because the Mel-spectrogram preserves most of the complex representation of the voice information while using a simpler feature extraction structure. Beyond feature extraction, research papers on suitable model training methods were also studied. It is found that the Siamese Neural Network (SNN) is more suitable than the Convolutional Neural Network (CNN) for speaker identification, as it learns from a similarity score. Hence, a speaker identification model trained with an SNN can still identify a new speaker even when that speaker's speech was not available during training.
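    A minimal sketch of the suggested modification, assuming librosa and scipy are available: the usual log compression after the Mel filterbank is swapped for the cubic-root rectification borrowed from GFCC (the parameters are illustrative, not the exact project pipeline):

import numpy as np
import librosa
from scipy.fftpack import dct

def cubic_root_mfcc(wav_path, n_mels=40, n_coeff=13):
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)   # Mel filterbank energies
    rectified = np.cbrt(mel)             # cubic-root rectification instead of log compression
    return dct(rectified, type=2, axis=0, norm='ortho')[:n_coeff]     # cepstral coefficients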

    After collecting the relevant information, the training process starts with feature extraction on the VoxCeleb2 dataset using the Mel-spectrogram, resulting in Mel-spectrogram images. These images are then fed into the SNN, and training starts with MobileNetv2 as the baseline subnetwork of the SNN. After that, MobileNetv2 is replaced by SqueezeNet and MCUNet respectively. The whole process is then repeated with contrastive loss in place of binary cross-entropy as the loss function. The summary of the training results is shown in the table below, and a small sketch of the feature-extraction step follows it.

[Table: summary of the training results for each subnetwork and loss function]
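    A minimal sketch of the feature-extraction step above, assuming librosa and matplotlib are available (the image size and Mel parameters are illustrative assumptions):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def save_mel_spectrogram_image(wav_path, out_path, sr=16000, n_mels=64):
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)      # convert power to decibel scale
    plt.figure(figsize=(2.24, 2.24))                   # small fixed-size image for the subnetwork
    librosa.display.specshow(mel_db, sr=sr)
    plt.axis('off')
    plt.savefig(out_path, bbox_inches='tight', pad_inches=0)
    plt.close()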

    A speaker identification system is then created by integrating a wake word detector with the speaker identification model. Incoming speech first triggers the wake word detector, and the speaker identification model is only invoked if the detector recognizes the correct wake word in that speech. This saves power effectively, because the speaker identification model does not need to stay loaded and running; it is gated by the wake word detector. The relevant result is shown in the diagram below, followed by a small sketch of the gating logic.

[Figure: result of the two-stage speaker identification system with the wake word detector]
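    A minimal sketch of this two-stage gating, assuming the pvporcupine package (depending on the Porcupine version, an AccessKey argument may also be required); next_audio_frame() and identify_speaker() are hypothetical stand-ins for the microphone capture and the trained SNN model:

import pvporcupine

porcupine = pvporcupine.create(keywords=['porcupine'])    # stage 1: wake word detector

while True:
    pcm = next_audio_frame(porcupine.frame_length)        # hypothetical microphone helper
    if porcupine.process(pcm) >= 0:                       # correct wake word detected
        speaker = identify_speaker()                      # stage 2: hypothetical SNN inference
        print('Identified speaker:', speaker)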

    In conclusion, contrastive loss outperforms binary cross-entropy loss in the SNN. The SNN with SqueezeNet as the subnetwork achieves the best inference time, while the SNN with MCUNet as the subnetwork achieves the best accuracy and loss. Thus, lightweight networks and Neural Architecture Search (NAS) can be utilized to optimize the performance of the speaker identification model on an embedded system effectively.

 

 

 

References

1. B. Bushofa and N. Bazina, "User Authentication based on his/her Speech," 2014.

2. F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally and K. Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size," 2016.

3. G. J. Ginovart-Panisello, E. Vidaña-Vila, S. Caro-Via, C. Martínez-Suquía and M. Freixes, "Low-Cost WASN for Real-Time Soundmap Generation," 2021.

4. A. Hajavi and A. Etemad, "Siamese Capsule Network for End-to-End Speaker Recognition in the Wild," 2020.

5. J. Joshy and K. Sambyo, "A Comparison and Contrast of the Various Feature Extraction Techniques in Speaker Recognition," 2016.

6. J. Lin, W.-M. Chen, Y. Lin, J. Cohn, C. Gan and S. Han, "MCUNet: Tiny Deep Learning on IoT Devices," 2020.

7. J. S. Chung, A. Nagrani and A. Zisserman, "VoxCeleb2: Deep Speaker Recognition," 2018.

8. A. M. Jalil, F. S. Hasan and H. A. Alabbasi, "Speaker Identification using Convolutional Neural Network for Clean and Noisy Speech Samples," 2019.

9. M. S. Sharifuddin, S. Nordin and A. M. Ali, "Comparison of CNNs and SVM for Voice Control Wheelchair," vol. 9, pp. 387-393, 2020.

10. I. Vélez, C. Rascon and G. Fuentes-Pineda, "One-Shot Speaker Identification for a Service Robot using a CNN-based Generic Verifier," 2018.

11. C. Vimala and V. Radha, "Suitable Feature Extraction and Speech Recognition Technique for Isolated Tamil Spoken Words," 2014.

12. W. Burgos, "Gammatone and MFCC features in Speaker Recognition," 2014.

13. W. Glegoła, A. Karpus and A. Przybyłek, "MobileNet family tailored for Raspberry Pi," Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 25th International Conference KES2021, vol. 192, pp. 2249-2258, 2021.

14. X. Wang, G. Li and P. Wang, "Qt-Based Cross-platform Design of Management System for Distributed Real-time Simulation Platform," 2016.

15. X. Zhao and D. Wang, "Analyzing Noise Robustness of MFCC and GFCC Features in Speaker Identification," 2013.

 


FYP2 Progress Report 3

    From week 10 to week 12, the Raspberry Pi 4 environment is set up again to make sure it is compatible with the touchless GUI and the speaker identification models. The 64-bit Raspberry Pi OS (Raspbian) is installed, so Debian 11 (bullseye) is being used, and the relevant Python libraries are installed so that the touchless GUI executes successfully on the Raspberry Pi 4. The touchless GUI itself is created with the Qt framework in Python (pyqt5). The tricky part of designing the touchless GUI is that the repaint() function must be used so that the display updates immediately and the GUI is not blocked by other events.
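    A minimal sketch of the repaint() trick, assuming a QLabel that must show a status message before a long-running step (run_identification() is a hypothetical placeholder for the blocking work):

from PyQt5.QtWidgets import QLabel, QVBoxLayout, QWidget

class TouchlessPanel(QWidget):
    def __init__(self):
        super().__init__()
        self.status = QLabel('Idle')
        layout = QVBoxLayout(self)
        layout.addWidget(self.status)

    def start_identification(self):
        self.status.setText('Listening...')
        self.status.repaint()     # force an immediate redraw before the event loop is blocked
        run_identification()      # hypothetical long-running step (wake word + speaker model)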

    In order to embed the speaker identification model into the touchless GUI, another stage, the wake word detector, is deployed in front of it. This builds a two-stage system in which the speaker identification model is only triggered when the wake word detector reports the correct wake word. The result of the speaker identification model is then displayed at the end, as shown in the figure below.

[Figure: speaker identification result displayed on the Raspberry Pi]

 

    Even though the desired result has been achieved, the touchless GUI is continuously explored to exploit its potential, for example by making it run automatically in full screen after booting (a small sketch of the full-screen call is shown below). At the same time, the IEEE journal paper, presentation slides and FYP report are being prepared.
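    A minimal sketch of the full-screen part, assuming the main window is a pyqt5 widget; TouchlessGui is a hypothetical class name, and the boot-time autorun itself is handled outside Python (for example by a desktop autostart entry on Raspberry Pi OS):

import sys
from PyQt5.QtWidgets import QApplication

app = QApplication(sys.argv)
window = TouchlessGui()        # hypothetical main window class of the touchless GUI
window.showFullScreen()        # occupy the whole screen without a window frame
sys.exit(app.exec_())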

 

 

 

References:

1. https://arxiv.org/pdf/1602.07360.pdf (SqueezeNet)

2. https://arxiv.org/pdf/2007.10319.pdf (MCUNet)

3. https://github.com/gmalivenko/pytorch2keras (pytorch2keras)

4. https://github.com/rouyunpan/mcunet (MCUNet GitHub Repository)

5. https://github.com/Picovoice/porcupine (wake word detector GitHub Repository)

6. https://build-system.fman.io/pyqt5-tutorial (Simple pyqt5 tutorial link)

7. https://stackoverflow.com/questions/4510712/qlabel-settext-not-displaying-text-immediately-before-running-other-method (repaint())

 


FYP2 Progress Report 2

From week 5 to week 9, the speaker identification models trained with different subnetworks are fine-tuned based on the Siamese neural network structure shown in the figure below. This is done mainly to prevent overfitting.

[Figure: Siamese neural network structure]

    Based on this structure, the subnetwork is swapped among the networks listed in the result section, while the binary cross-entropy loss function is replaced with contrastive loss, as it works more effectively in a Siamese neural network.
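    A minimal PyTorch sketch of the contrastive loss, following the common margin-based formulation (the margin value and the label convention are assumptions, not taken from the project code):

import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    # label = 1 when both utterances come from the same speaker, 0 otherwise
    dist = F.pairwise_distance(emb1, emb2)
    same = label * dist.pow(2)                                      # pull matching pairs together
    diff = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)   # push mismatched pairs apart
    return (same + diff).mean()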

 

The summary of the overall results is shown in the figure below (lr = learning rate, BS = batch size).

[Figure: summary of the overall results for each subnetwork]

    The default learning rate and optimizer are 3e-4 and Adam respectively, while the default batch size is 64 with 1000 steps per epoch. However, the batch size is reduced to 32 when training with MobileNet because the GPU memory is insufficient. In addition, the learning rate of MCUNet512kb is set to 1e-5 to make sure it converges successfully, as the default learning rate for this experiment (3e-4) fails to make it converge.
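    A minimal PyTorch sketch of this training configuration (model is a placeholder for whichever Siamese network is being trained):

import torch

LEARNING_RATE = 3e-4      # default; reduced to 1e-5 for MCUNet512kb so it converges
BATCH_SIZE = 64           # reduced to 32 for MobileNet due to limited GPU memory
STEPS_PER_EPOCH = 1000

optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)   # model: placeholder network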

 

    According to the results, MCUNet256kb has the smallest memory size, although its inference time is slightly longer than SqueezeNet's. Even though SqueezeNet has the shortest inference time, its memory size is slightly larger than MCUNet256kb's and its accuracy is also lower.

 

    Furthermore, a simple GUI is created with pyqt5 to act as the frontend of the speaker identification system; it will later be modified into a touchless system. The speaker identification system will then be enhanced by integrating the wake word detector through the pvporcupinedemo module, which can work in offline mode.

 

 

 

References:

1. https://arxiv.org/pdf/1602.07360.pdf (SqueezeNet)

2. https://arxiv.org/pdf/2007.10319.pdf (MCUNet)

3. https://github.com/gmalivenko/pytorch2keras (pytorch2keras)

4. https://github.com/rouyunpan/mcunet (MCUNet GitHub Repository)

5. https://github.com/Picovoice/porcupine (wake word detector GitHub Repository)

6. https://build-system.fman.io/pyqt5-tutorial (Simple pyqt5 tutorial link)


FYP2 Progress Report 1

Week 1

-Resize the dataset to around 200k samples and use the Mel spectrogram for feature extraction.

-Train and fine-tune the model by using MobileNetv2.

 

Week 2

-Study MCUNet and NAS architecture.

-Train and fine-tune the model by using SqueezeNet.

 

Week 3

-Port the existing solution from TensorFlow to PyTorch to train the Siamese Neural Network, as MCUNet and the TinyNAS architecture are available in PyTorch only.

-Deploy the existing models on the Raspberry Pi and observe their performance.

 

Week 4

-Use the pytorch2keras module to convert the PyTorch model to a Keras model and change the data format from NCHW to NHWC so that the model can be executed on the Raspberry Pi's CPU (a small conversion sketch is shown after this week's items).

-Train the model, convert it with pytorch2keras, then deploy the converted model on the Raspberry Pi and observe its performance.
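A minimal sketch of the conversion mentioned above, assuming a 3x224x224 input; pytorch_model is an illustrative placeholder for the trained PyTorch subnetwork:

import torch
from pytorch2keras import pytorch_to_keras

dummy_input = torch.randn(1, 3, 224, 224)        # NCHW dummy input used for tracing
keras_model = pytorch_to_keras(
    pytorch_model,                               # placeholder: the trained PyTorch subnetwork
    dummy_input,
    [(3, 224, 224)],                             # input shape without the batch dimension
    change_ordering=True,                        # re-order NCHW to NHWC for the Raspberry Pi CPU
    verbose=False,
)
keras_model.save('subnetwork_nhwc.h5')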

 

Week 5

-Fine-tune and retrain the model using ProxylessNAS with a larger-scale dataset to overcome the overfitting issue.

-Start training the MCUNet.

-Deploy the model on the Raspberry Pi and test the model.

 

Remark:

The model is trained using the architecture shown below, where the subnetwork is replaced with MobileNetv2, SqueezeNet and ProxylessNAS respectively (a minimal PyTorch sketch of this arrangement is given after the figure). As a further improvement, the binary cross-entropy loss is replaced with the contrastive loss, which suits the Siamese Neural Network better. Besides, the implementation of MCUNet is being explored.

[Figure: Siamese network training architecture with an interchangeable subnetwork]
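The sketch below uses torchvision's MobileNetV2 as the interchangeable subnetwork for illustration; the embedding size and the replaced classifier head are assumptions rather than the project's exact settings:

import torch.nn as nn
from torchvision import models

class SiameseNet(nn.Module):
    def __init__(self, emb_dim=128):
        super().__init__()
        backbone = models.mobilenet_v2(pretrained=False)                # swappable: SqueezeNet, ProxylessNAS, ...
        backbone.classifier = nn.Linear(backbone.last_channel, emb_dim) # project features to an embedding
        self.subnetwork = backbone                                      # the same weights process both inputs

    def forward(self, x1, x2):
        e1 = self.subnetwork(x1)      # embedding of the first spectrogram image
        e2 = self.subnetwork(x2)      # embedding of the second spectrogram image
        return e1, e2                 # compared by a BCE head or by the contrastive loss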

 

 

Training result of MobileNetv2. BCE_Loss stands for Binary Cross Entropy Loss. Contra_Loss stands for Contrastive Loss.

[Figure: MobileNetv2 training results]

 

Training result of SqueezeNet. BCE_Loss stands for Binary Cross Entropy Loss. Contra_Loss stands for Contrastive Loss.

[Figure: SqueezeNet training results]

 

Training result of ProxylessNAS. BCE_Loss stands for Binary Cross Entropy Loss. Contra_Loss stands for Contrastive Loss.

[Figure: ProxylessNAS training results]

 

Training result of MCUNet by using Contrastive Loss.

[Figure: MCUNet training results with contrastive loss]

 

References:

1. https://arxiv.org/pdf/1602.07360.pdf (SqueezeNet)

2. https://arxiv.org/pdf/2007.10319.pdf (MCUNet)

3. https://github.com/gmalivenko/pytorch2keras (pytorch2keras)

4. https://github.com/mit-han-lab/proxylessnas (ProxylessNAS)

5. https://datahacker.rs/019-siamese-network-in-pytorch-with-application-to-face-similarity/ (Siamese Network in Pytorch)

6. https://github.com/rouyunpan/mcunet (MCUNet GitHub Repository)
