Søren Steen defended his Master's thesis on 'Deep learning for speaker recognition in noisy environments'


Speaker recognition is an emerging biometric technology with applications in IT security, such as online banking, access control, call center user validation, and consumer electronics. It is attractive because microphones are present in virtually all mobile devices, which makes it an obvious verification tool for smartphone use cases. Mobile devices are limited in both computing power and storage space, however, so efficient methods are required. In real-world applications, sound from every source in the environment propagates through the air to the microphone, so recordings contain environmental noise. Voice recordings of short duration are challenging as well, because they do not contain enough information to model each individual speaker. Good performance under these conditions is nevertheless highly desirable, since practical applications such as phone conversations or quick access control require a short capture time for the biometric characteristic.

The goal of this thesis is to investigate how the performance of speaker recognition systems can be improved with respect to sample completeness and signal degradation.
In this work, performance under sample incompleteness and signal degradation is improved by using quality information about the type of external degradation source. The quality information is used in the decision-making stage, after the biometric comparisons have been computed, to adjust the threshold adaptively. Conventional cohort-based methods have the disadvantage that cohort data must be stored, which raises privacy and storage concerns because biometric data is retained, and they also require many comparisons. This work proposes a novel approach that uses neural networks to perform quality-informed comparison score normalization, using machine learning to find patterns in the effect of environmental interactions on a cohort dataset. The privacy, storage, and computational concerns of traditional cohort-based methods are avoided because only the network weights need to be stored, and these do not contain biometric information.
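
As a rough illustration of the idea, and not the architecture used in the thesis, the sketch below passes a raw comparison score together with quality information through a small feed-forward network to obtain a normalized score. The choice of quality features (signal-to-noise ratio and sample duration), the layer sizes, the random untrained weights, the function names, and the decision threshold of 0.0 are all assumptions made for the example.

```python
import numpy as np

def init_weights(n_in, n_hidden, seed=0):
    # Hypothetical, untrained weights; a real system would learn them
    # from comparison scores computed on a cohort dataset.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def normalize_scores(raw_scores, quality, weights):
    # Each input row is [raw score, quality features...].
    x = np.column_stack([raw_scores, quality])
    hidden = np.tanh(x @ weights["W1"] + weights["b1"])
    return (hidden @ weights["W2"] + weights["b2"]).ravel()

# Three comparisons with assumed quality features: SNR in dB, duration in s.
raw_scores = np.array([0.82, 0.41, 0.77])
quality = np.array([[20.0, 10.0], [5.0, 3.0], [0.0, 5.0]])

weights = init_weights(n_in=3, n_hidden=8)
normalized = normalize_scores(raw_scores, quality, weights)

# A single fixed threshold on the normalized score plays the role of an
# adaptive threshold on the raw score (0.0 is an arbitrary choice here).
accept = normalized >= 0.0

# Only these parameters are kept at run time; no cohort samples are stored.
n_stored = sum(w.size for w in weights.values())
```

In this sketch the deployed system holds only the `n_stored` floating-point weights, and each decision costs one forward pass instead of a set of comparisons against stored cohort samples.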

In this thesis, the viability of the proposed method is investigated by examining the performance of different neural network configurations. The baseline neural network achieves an improvement of 5.68% over the raw comparison scores, averaged over all examined signal degradation conditions. By tuning the network size, a performance gain in equal error rate of 13.3% over the raw comparison scores is reached across the whole dataset. The gain is present for every quality degradation pattern examined and is higher in difficult conditions, which promises performance gains in every situation. The configuration of the empirically tuned network is confirmed by Bayesian optimization. The sensitivity of the method to unknown degradation patterns is also studied: the model proves robust to unknown noise types, but less robust to unknown patterns with shorter duration or higher noise level. Quality-informed comparison score normalization using neural networks yields tangible performance improvements and is shown in this work to have great potential for flexible integration, while remaining light enough for computation on mobile devices.
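
For readers unfamiliar with the metric, the following minimal sketch shows one common way to estimate an equal error rate from genuine and impostor comparison scores. It assumes similarity scores where higher means more likely the same speaker. The `relative_gain` helper further assumes that gains are expressed as relative reductions in equal error rate, which the abstract does not state explicitly. The function names and toy scores are made up for the example.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    # Sweep every observed score as a candidate decision threshold.
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # False non-match rate: genuine comparisons rejected at threshold t.
    fnmr = np.array([(genuine < t).mean() for t in thresholds])
    # False match rate: impostor comparisons accepted at threshold t.
    fmr = np.array([(impostor >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    i = np.argmin(np.abs(fnmr - fmr))
    return (fnmr[i] + fmr[i]) / 2.0

def relative_gain(eer_before, eer_after):
    # Relative reduction in EER, in percent (assumed interpretation).
    return 100.0 * (eer_before - eer_after) / eer_before

# Toy scores, invented for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine_scores, impostor_scores)
```

With only a handful of scores this threshold sweep is coarse; evaluations on real datasets usually interpolate between neighboring operating points.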