13-04-2017, 03:51 PM
This article addresses unsupervised speaker change detection. Three speaker segmentation systems are examined. The first system investigates the AudioSpectrumCentroid and AudioWaveformEnvelope features, implements a dynamic merging scheme, and applies the Bayesian Information Criterion (BIC). The second system consists of three modules: in the first module, a second-order statistical measure is extracted; in the second, the Euclidean distance and Hotelling's T² statistic are applied sequentially; and in the third, BIC is applied. The third system first uses a metric-based approach to detect candidate speaker change points and then applies BIC to validate them. The experiments are performed on a data set created by concatenating speakers from the TIMIT database. A systematic performance comparison of the three systems is carried out using one-way ANOVA and Tukey's post hoc method.
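As a rough illustration of how BIC is commonly used to test a candidate change point, the sketch below compares one Gaussian fitted to a whole window against two Gaussians fitted to its two parts. It is not the implementation of any of the three systems. The function name `delta_bic`, the `penalty_weight` parameter, and the restriction to a single scalar feature per frame are assumptions made to keep the example short.

```python
import math
from statistics import pvariance


def delta_bic(left, right, penalty_weight=1.0):
    """Delta-BIC for splitting one window into `left` and `right`.

    Each segment is modelled as one univariate Gaussian (an assumption
    of this sketch). A positive value favours a change point between
    the two segments. Segments must not be constant, otherwise the
    variance is zero and the logarithm is undefined.
    """
    both = list(left) + list(right)
    n, n1, n2 = len(both), len(left), len(right)

    # Log-likelihood gain of modelling the window with two Gaussians
    # instead of one.
    gain = 0.5 * (
        n * math.log(pvariance(both))
        - n1 * math.log(pvariance(left))
        - n2 * math.log(pvariance(right))
    )

    # Complexity penalty: the two-model hypothesis has one extra mean
    # and one extra variance, i.e. 2 extra parameters for scalar data.
    penalty = 0.5 * penalty_weight * 2 * math.log(n)
    return gain - penalty


if __name__ == "__main__":
    # Toy data (hypothetical): two segments with clearly different means.
    seg_a = [-1.0, 1.0] * 100
    seg_b = [9.0, 11.0] * 100
    print("different segments:", delta_bic(seg_a, seg_b))
    print("identical segments:", delta_bic(seg_a, seg_a))
```

With real speech features the same comparison would be made on multivariate frames with full covariance matrices; the scalar case is used here only so the example runs with the standard library.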
The purpose of speaker (audio) segmentation is to detect the speaker (acoustic) change boundaries in an audio stream. In the last decade, researchers in the speech-processing community have devoted considerable effort to this problem because of its application to many speech and audio processing tasks, such as audio classification, automatic transcription of audio recordings, speaker tracking, and speaker diarization. Existing audio segmentation approaches generally fall into two categories: distance-based segmentation and model-decoding-based segmentation. In distance-based segmentation, a distance measure between two audio segments is first defined, and then an acoustic change detection strategy based on that measure is designed. In contrast to model-decoding-based segmentation, which detects acoustic changes in a supervised manner, distance-based segmentation can detect acoustic changes in an unsupervised way, i.e., no a priori knowledge of the content of the input audio stream is required. In this article, we focus on distance-based segmentation.
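The two steps of distance-based segmentation described above (define a distance between two segments, then design a detection strategy on top of it) can be sketched as follows. This is a minimal example under stated assumptions, not the method of any particular system: the Euclidean distance between window means, the local-maximum rule, and the names `distance_curve`, `detect_changes`, `win`, and `threshold` are all chosen here for illustration.

```python
import math


def mean_vector(frames):
    """Component-wise mean of a list of equal-length feature vectors."""
    dim = len(frames[0])
    return [sum(f[k] for f in frames) / len(frames) for k in range(dim)]


def distance_curve(frames, win):
    """Distance between the two adjacent windows meeting at each frame.

    For every index t, the left window is frames[t-win:t] and the right
    window is frames[t:t+win]. The distance is the Euclidean distance
    between the two window means (an assumption of this sketch).
    """
    curve = {}
    for t in range(win, len(frames) - win + 1):
        left = mean_vector(frames[t - win:t])
        right = mean_vector(frames[t:t + win])
        curve[t] = math.dist(left, right)
    return curve


def detect_changes(frames, win=20, threshold=1.0):
    """Return frame indices where the distance curve peaks above threshold."""
    curve = distance_curve(frames, win)
    changes = []
    for t, d in curve.items():
        # Keep local maxima of the curve that exceed the threshold.
        is_peak = d >= curve.get(t - 1, 0.0) and d > curve.get(t + 1, 0.0)
        if d > threshold and is_peak:
            changes.append(t)
    return changes


if __name__ == "__main__":
    # Toy stream (hypothetical): 50 frames of one "speaker", then 50 of another.
    stream = [[0.0, 0.0]] * 50 + [[3.0, 3.0]] * 50
    print(detect_changes(stream, win=20, threshold=1.0))
```

No labels or speaker models are needed at any point, which is what makes the approach unsupervised; the window length and threshold still have to be chosen, and in practice they are tuned on held-out data.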