Joint Audio-Visual Speech Processing
Visual speech information present in the speaker's mouth region
has long been viewed as a source for improving the robustness
and naturalness of human-computer interfaces. When the acoustic channel is corrupted, the performance of automatic speech recognition
(ASR) systems falls below usable levels, and this is where a joint audio-visual system comes into use.
Introduction:
Human speech is by nature bimodal, both in its production and
perception: humans integrate audio
and visual stimuli to perceive speech. Research has long been under way on integrating the visual modality into the speech channel of the human-computer interface (HCI), with the aim of improving its robustness and naturalness. The visual channel can benefit processes such as speaker identification, verification, and localization, speech event detection, speech signal separation, coding, video indexing and retrieval, and text-to-speech.
The Visual Front End:
Visual speech features generally fit into one of the following
three categories:
a) Appearance-based features: assume that all video pixels
within a region of interest (ROI) are informative about the
spoken utterance.
b) Shape-based features: assume that most speechreading information
is contained in the contours of the speaker's lips or, more generally,
of the face.
c) A combination of both.
Audio-visual features in our system:
The system used here produces appearance-based
features and operates on full-face video with no artificial
face markings, so both face detection and ROI
extraction are required. Tracking provides the mouth location, size, and orientation,
which are then smoothed over time to improve robustness. Based on the resulting estimates, a 64×64 pixel ROI is obtained
for every video frame. A two-dimensional, separable discrete cosine
transform (DCT) is applied to the ROI, and the 100 highest-energy
DCT coefficients are retained. An intraframe
linear discriminant analysis (LDA) projection is then applied to reduce dimensionality, resulting in a 30-dimensional feature vector. Finally, a maximum likelihood linear transformation (MLLT) is applied, which improves maximum-likelihood-based statistical data modelling.
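
The DCT stage of this front end can be illustrated with a minimal sketch in plain Python. It is not the system's actual implementation: the function names (dct_1d, dct_2d, top_energy_positions, dct_features) are hypothetical, the orthonormal DCT-II scaling is one common convention, and choosing the retained coefficient positions by energy accumulated over a set of sample frames is an assumption, since the text does not say how "highest energy" is measured. The LDA and MLLT stages are left out because they need labelled training data.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed scaling convention)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(roi):
    """Separable 2-D DCT: 1-D DCT along every row, then along every column."""
    rows = [dct_1d(r) for r in roi]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]


def top_energy_positions(rois, keep=100):
    """Return the (row, col) positions of the `keep` DCT coefficients
    with the highest energy summed over the given sample ROIs."""
    energy = {}
    for roi in rois:
        coeffs = dct_2d(roi)
        for u, row in enumerate(coeffs):
            for v, c in enumerate(row):
                energy[(u, v)] = energy.get((u, v), 0.0) + c * c
    ranked = sorted(energy, key=energy.get, reverse=True)
    return ranked[:keep]


def dct_features(roi, positions):
    """Feature vector for one frame: DCT coefficients at fixed positions."""
    coeffs = dct_2d(roi)
    return [coeffs[u][v] for u, v in positions]


if __name__ == "__main__":
    # Synthetic 64x64 grey-level ROI standing in for a tracked mouth region.
    size = 64
    roi = [[float((x * y) % 17) for x in range(size)] for y in range(size)]
    positions = top_energy_positions([roi], keep=100)
    features = dct_features(roi, positions)
    print(len(features))  # 100
```

Fixing the coefficient positions once and reusing them for every frame keeps the feature vectors comparable across frames, which the later LDA projection would require. A real front end would use an optimized DCT routine instead of this direct summation.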
Report:
http://citeseerx.ist.psu.edu/viewdoc/dow...1&type=pdf