01-03-2010, 07:20 PM
Automatic Speech Recognition
Automatic speech recognition
• What is the task?
• What are the main difficulties?
• How is it approached?
• How good is it?
• How much better could it be?
What is the task?
• Getting a computer to understand spoken language
• By "understand" we might mean
  - React appropriately
  - Convert the input speech into another medium, e.g. text
• Several variables impinge on this (see later)
How do humans do it?
• Articulation produces
• sound waves which
• the ear conveys to the brain
• for processing
How might computers do it?
• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
What's hard about that?
• Digitization
  - Converting analogue signal into digital representation
• Signal processing
  - Separating speech from background noise
• Phonetics
  - Variability in human speech
• Phonology
  - Recognizing individual sound distinctions (similar phonemes)
• Lexicology and syntax
  - Disambiguating homophones
  - Features of continuous speech
• Syntax and pragmatics
  - Interpreting prosodic features
• Pragmatics
  - Filtering of performance errors (disfluencies)
Digitization
• Analogue-to-digital conversion
• Sampling and quantizing
• Use filters to measure energy levels at various points on the frequency spectrum
• Knowing the relative importance of different frequency bands (for speech) makes this process more efficient
• E.g. high-frequency sounds are less informative, so they can be measured using broader frequency bands (log scale)
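Sampling, quantizing and the log-scale idea can be sketched in a few lines. This is only an illustration: the 8 kHz sample rate, the 8-bit depth and the function names are assumptions, and the mel formula is one widely used log-like frequency scale, not necessarily the one the slides had in mind.

```python
import math

def sample_and_quantize(freq_hz, duration_s, sample_rate_hz=8000, bits=8):
    """Sample a pure tone and quantize each sample to a signed integer."""
    max_level = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    n_samples = round(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                   # sampling: discrete time points
        amplitude = math.sin(2 * math.pi * freq_hz * t)   # value in [-1, 1]
        samples.append(round(amplitude * max_level))      # quantizing: nearest level
    return samples

def hz_to_mel(freq_hz):
    """Common mel-scale formula: near-linear at low frequencies, log-like higher up."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

pcm = sample_and_quantize(freq_hz=440, duration_s=0.01)

# The same 1000 Hz step covers far less of the mel scale at high frequencies,
# which is why high-frequency filter bands can be made broader.
low_step = hz_to_mel(2000) - hz_to_mel(1000)
high_step = hz_to_mel(8000) - hz_to_mel(7000)
```

Here `low_step` comes out larger than `high_step`, so a filter bank spaced evenly in mels has narrow bands at the bottom of the spectrum and wide ones at the top.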
Separating speech from background noise
• Noise-cancelling microphones
  - Two mics, one facing the speaker, the other facing away
  - Ambient noise is roughly the same for both mics
• Knowing which bits of the signal relate to speech
  - Spectrograph analysis
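The two-microphone idea amounts to subtracting the noise-only channel from the speech-plus-noise channel. The sketch below is idealized: it assumes both mics pick up exactly the same ambient noise and that the signals are already time-aligned, which real devices cannot count on (they generally need some form of adaptive filtering). The function name and the sample values are made up for illustration.

```python
def cancel_ambient(primary, reference, gain=1.0):
    """Subtract the reference (noise-only) mic from the primary (speech + noise) mic."""
    if len(primary) != len(reference):
        raise ValueError("both channels must have the same number of samples")
    return [p - gain * r for p, r in zip(primary, reference)]

# Invented toy signals: the primary mic hears speech plus noise,
# the mic facing away hears (ideally) the noise alone.
speech = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
noise = [0.2, -0.1, 0.3, 0.0, -0.2, 0.1]
primary = [s + n for s, n in zip(speech, noise)]

cleaned = cancel_ambient(primary, noise)
```

Under these assumptions `cleaned` matches `speech` up to floating-point rounding.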
Variability in individuals' speech
• Variation among speakers due to
  - Vocal range (f0, and pitch range; see later)
  - Voice quality (growl, whisper, physiological elements such as nasality, adenoidality, etc.)
  - ACCENT!!! (especially vowel systems, but also consonants, allophones, etc.)
• Variation within speakers due to
  - Health, emotional state
  - Ambient conditions
• Speech style: formal read vs spontaneous
Speaker-(in)dependent systems
• Speaker-dependent systems
  - Require training to teach the system your individual idiosyncrasies
    • The more the merrier, but typically nowadays 5 or 10 minutes is enough
    • User is asked to pronounce some key words which allow the computer to infer details of the user's accent and voice
    • Fortunately, languages are generally systematic
  - More robust
  - But less convenient
  - And obviously less portable
• Speaker-independent systems
  - Language coverage is reduced to compensate for the need to be flexible in phoneme identification
  - Clever compromise is to learn on the fly
Identifying phonemes
• Differences between some phonemes are sometimes very small
  - May be reflected in speech signal (e.g. vowels have more or less distinctive f1 and f2)
  - Often show up in coarticulation effects (transition to next sound)
    • e.g. aspiration of voiceless stops in English
  - Allophonic variation
Disambiguating homophones
• Humans mostly resolve the differences from context and the need to make sense
It's hard to wreck a nice beach
What dime's a neck's drain to stop port
• Systems can only recognize words that are in their lexicon, so limiting the lexicon is an obvious ploy
• Some ASR systems include a grammar which can help disambiguation
• Discontinuous speech is much easier to recognize
• Single words tend to be pronounced more clearly
• Continuous speech involves contextual coarticulation effects
• Weak forms
• Assimilation
• Contractions
Interpreting prosodic features
• Pitch, length and loudness are used to indicate stress
• All of these are relative
  - On a speaker-by-speaker basis
  - And in relation to context
• Pitch and length are phonemic in some languages
Pitch
• Pitch contour can be extracted from speech signal
  - But pitch differences are relative
  - One man's high is another (wo)man's low
  - Pitch range is variable
• Pitch contributes to intonation
  - But has other functions in tone languages
• Intonation can convey meaning
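One way to handle the relativity of pitch is to express each speaker's contour relative to that speaker's own typical f0. The sketch below converts a contour to semitones around the speaker's median; the choice of median and semitones, the convention that 0 marks an unvoiced frame, and the contour values are all assumptions made for illustration.

```python
import math
import statistics

def to_semitones(f0_hz, reference_hz):
    """Distance between two frequencies in semitones (12 per octave)."""
    return 12.0 * math.log2(f0_hz / reference_hz)

def normalize_contour(f0_contour_hz):
    """Express a pitch contour in semitones relative to the speaker's median f0."""
    voiced = [f for f in f0_contour_hz if f > 0]   # 0 marks an unvoiced frame (assumption)
    reference = statistics.median(voiced)
    return [to_semitones(f, reference) if f > 0 else None for f in f0_contour_hz]

# Invented contours: the same rise-fall pattern an octave apart.
low_voice = [100, 110, 120, 0, 110]
high_voice = [200, 220, 240, 0, 220]

low_norm = normalize_contour(low_voice)
high_norm = normalize_contour(high_voice)
```

After normalization the two contours are the same shape on the same scale, even though the raw frequencies differ by a factor of two.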
Length
• Length is easy to measure but difficult to interpret
• Again, length is relative
• It is phonemic in many languages
• Speech rate is not constant: it slows down at the end of a sentence
Template-based approach
• Hard to distinguish very similar templates
• And quickly degrades when input differs from templates
• Therefore needs techniques to mitigate this degradation:
  - More subtle matching techniques
  - Multiple templates which are aggregated
• Taken together, these suggested ...
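A classic example of a "more subtle matching technique" is dynamic time warping (DTW), which lets the input be stretched or compressed in time before it is compared with a template. The slides do not name DTW, so treat this as one possible illustration; the one-dimensional "features" and the two templates are invented.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # b[j-1] is matched to one more frame of a
                cost[i][j - 1],      # a[i-1] is matched to one more frame of b
                cost[i - 1][j - 1],  # advance in both sequences
            )
    return cost[len(a)][len(b)]

def recognize(utterance, templates):
    """Return the word whose template is closest to the utterance under DTW."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))

# Hypothetical templates and a slowly spoken "yes" (each frame roughly doubled).
templates = {"yes": [1, 3, 4, 3, 1], "no": [4, 2, 1, 1, 2]}
slow_yes = [1, 1, 3, 3, 4, 4, 3, 1, 1]

best_word = recognize(slow_yes, templates)
```

A frame-by-frame comparison would fail here because the sequences differ in length; DTW absorbs the difference in speaking rate.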
Rule-based approach
• Use knowledge of phonetics and linguistics to guide search process
• Templates are replaced by rules expressing everything (anything) that might help to decode:
  - Phonetics, phonology, phonotactics
  - Syntax
  - Pragmatics
Statistics-based approach
• Can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools
• Sometimes seen as an "anti-linguistic" approach
  - Fred Jelinek (IBM, 1988): "Every time I fire a linguist my system improves"
• Collect a large corpus of transcribed speech recordings
• Train the computer to learn the correspondences ("machine learning")
• At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one
Machine learning
• Acoustic and Lexical Models
  - Analyse training data in terms of relevant features
  - Learn from large amount of data different possibilities
    • different phone sequences for a given word
    • different combinations of elements of the speech signal for a given phone/phoneme
  - Combine these into a Hidden Markov Model expressing the probabilities
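To make the Hidden Markov Model idea concrete, here is a toy model in which the hidden states are phones and the observations are coarse acoustic labels. Decoding uses the Viterbi algorithm, the standard way to find the most probable state sequence in an HMM (the slides do not name it). Every probability, phone label and observation label below is invented; in a real system they would be learned from the training data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability, path) of the best path so far that ends in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state for s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical left-to-right model for the word "cat".
states = ["k", "ae", "t"]
start_p = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans_p = {
    "k": {"k": 0.3, "ae": 0.7, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.6, "t": 0.4},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit_p = {
    "k": {"burst": 0.8, "voiced": 0.1, "silence": 0.1},
    "ae": {"burst": 0.1, "voiced": 0.8, "silence": 0.1},
    "t": {"burst": 0.5, "voiced": 0.1, "silence": 0.4},
}

probability, path = viterbi(["burst", "voiced", "voiced", "burst"],
                            states, start_p, trans_p, emit_p)
```

With these numbers the decoder explains the two "voiced" frames with the vowel state and assigns the final burst to "t" rather than returning to "k".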
The Noisy Channel Model
• Use the acoustic model to give a set of likely phone sequences
• Use the lexical and language models to judge which of these are likely to result in probable word sequences
• The trick is having sophisticated algorithms to juggle the statistics
• A bit like the rule-based approach except that it is all learned automatically from data
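In the usual formulation of the noisy channel model, the recognizer picks the word sequence W that maximizes P(audio | W) x P(W): the acoustic score says how well the words fit the signal, the language model says how plausible the words are in the first place. The sketch below shows that combination on two candidates; all the probabilities are invented for illustration and the dictionary and function names are hypothetical.

```python
import math

# Hypothetical scores, invented for illustration only.
acoustic_logprob = {          # log P(audio | words)
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}
language_logprob = {          # log P(words)
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
}

def decode(candidates):
    """Pick the candidate with the highest combined log probability."""
    return max(candidates,
               key=lambda words: acoustic_logprob[words] + language_logprob[words])

best = decode(list(acoustic_logprob))
```

Here the acoustically slightly better candidate loses because the language model considers it far less likely. Working in log probabilities turns the product into a sum and avoids underflow when many small probabilities are combined.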
Evaluation
• Funders have been very keen on competitive quantitative evaluation
• Subjective evaluations are informative, but not cost-effective
• For transcription tasks, word-error rate is popular (though it can be misleading: not all words are equally important)
• For task-based dialogues, other measures of understanding are needed
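Word-error rate is conventionally computed as the minimum number of word substitutions, deletions and insertions needed to turn the reference transcript into the system output, divided by the number of words in the reference. A minimal sketch follows (whitespace tokenization and the function name are assumptions; evaluation toolkits also normalize case, punctuation and so on):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("it is hard to recognize speech",
                      "it is hard to wreck a nice beach")
```

For this pair there are two substitutions and two insertions against six reference words, so the rate is 4/6. Note that the measure treats every word alike, which is exactly the limitation mentioned above.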
Comparing ASR systems
• Factors include
  - Speaking mode: isolated words vs continuous speech
  - Speaking style: read vs spontaneous
  - "Enrollment": speaker (in)dependent
  - Vocabulary size (small < 20 ... large > 20,000)
  - Equipment: good quality noise-cancelling mic ... telephone
  - Size of training set (if appropriate) or rule set
  - Recognition method
Remaining problems
• Robustness: graceful degradation, not catastrophic failure
• Portability: independence of computing platform
• Adaptability: to changing conditions (different mic, background noise, new speaker, new task domain, new language even)
• Language Modelling: is there a role for linguistics in improving the language models?
• Confidence Measures: better methods to evaluate the absolute correctness of hypotheses.
• Out-of-Vocabulary (OOV) Words: systems must have some method of detecting OOV words, and dealing with them in a sensible way.
• Spontaneous Speech: disfluencies (filled pauses, false starts, hesitations, ungrammatical constructions, etc.) remain a problem.
• Prosody: stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger)
• Accent, dialect and mixed language: non-native speech is a huge problem, especially where code-switching is commonplace