03-05-2011, 05:07 PM
Abstract
Biometrics has been a topic of great interest since the advent of the information age and promises a safer, simpler lifestyle in which passcodes and keys are replaced by traits inherent to the user. We describe a
system capable of automatically extracting visual features from a human face for use in dynamic visual
system capable of automatically extracting visual features from a human face for use in dynamic visual
biometrics. Automatic speech and speaker recognition have recently moved towards incorporating visual
information to improve upon audio-only recognition systems. With few exceptions, however, investigations
into audio-visual and visual-only automatic speech and speaker recognition have utilized ideal
visual databases in their audio-visual (AV-ASR) and visual-only automatic speech recognition (V-ASR)
experiments. Our system incorporates robust and efficient computer vision algorithms to automatically
detect, track and identify a speaker based on visual features extracted from the speaker’s mouth region.
The features are extracted in real-time, even under adverse visual conditions. The system's recognition
performance is evaluated by comparing speaker recognition results obtained using automatic tracking data
with those obtained using ground truth tracking data: 52.3% with ground truth tracking and 59.3% with
automatic tracking. The results are discussed and future improvements and experiments are suggested.
1 Introduction
Over recent years the field of biometrics has become a mainstream research topic and has expanded to
include speaker verification and identification (recognition) through speakers’ dynamic visual articulation
characteristics. Advances in the field of biometrics will enhance security and safety and ultimately lead to
a simpler lifestyle where passwords and keys are superfluous.
A large part of the research effort in visual biometrics has been based on identification using static
images of a speaker’s face or parts of the face [29, 26, 18, 8, 20]. Unfortunately, these types of systems
are susceptible to impostor attacks using a photograph of an authorized user. This drawback has prompted
the development of dynamic biometrics where the time-varying characteristics of a speaker’s voice and
appearance are modeled in addition to the static appearance.
We describe a system capable of automatically extracting visual features from a human face for use
in dynamic visual biometrics. Many visual speech and speaker recognition experiments have utilized
ideal visual databases either to simplify visual feature representation and extraction, or due to the lack of
publicly available audio-visual speech corpora.
The system we propose utilizes robust and efficient computer vision algorithms to automatically detect,
track, and identify a speaker based on visual features extracted from the speaker’s mouth region. The
system operates in approximately real-time and can handle most adverse visual conditions.
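This excerpt does not specify the feature representation or classifier. As a hedged illustration only, pipelines of this kind often compute low-frequency 2D-DCT coefficients of the tracked mouth patch and identify the speaker with a statistical model; the sketch below (NumPy only) uses that common feature choice with a deliberately simple nearest-centroid classifier. All function names and the classifier are assumptions for illustration, not the authors' method:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)
    C = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] /= np.sqrt(2)
    return C * np.sqrt(2.0 / n)

def mouth_features(patch, keep=6):
    """2D-DCT of a grayscale mouth patch; keep the low-frequency
    keep x keep block as the feature vector (a common choice in
    visual speech work, assumed here)."""
    n, m = patch.shape
    coeffs = dct_matrix(n) @ patch @ dct_matrix(m).T
    return coeffs[:keep, :keep].ravel()

def enroll(patches_per_speaker):
    """Model each speaker as the mean feature vector of their
    training patches -- a stand-in for the richer statistical
    models typically used in practice."""
    return {spk: np.mean([mouth_features(p) for p in ps], axis=0)
            for spk, ps in patches_per_speaker.items()}

def identify(models, patch):
    """Return the enrolled speaker whose centroid is nearest to
    the test patch's feature vector."""
    f = mouth_features(patch)
    return min(models, key=lambda s: np.linalg.norm(models[s] - f))
```

In a full system, per-frame features like these would be collected over the utterance and scored against per-speaker temporal models rather than a single centroid.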
We compare speaker recognition performance obtained using automatic tracking data to that obtained
using ground truth data. We show a 7 percentage point improvement in speaker recognition rate, from
52.3% with ground truth tracking data to 59.3% with automatic tracking data. Finally, we discuss the
results and suggest future refinements to the system.
Download full report
http://cslgreenhouse.csl.illinois.edu/al...s/0050.pdf