Multilingual Automated Speech-to-Speech Translator
1. INTRODUCTION
Automatic speech-to-speech (S2S) translation breaks down communication barriers between people who do not share a common language, and hence enables instant oral cross-lingual communication for many critical applications such as emergency medical care. Developing an accurate, efficient and robust S2S translation system poses many challenges, especially for colloquial speech and resource-deficient languages.
The IBM MASTOR speech-to-speech translation system has been developed for the DARPA (Defense Advanced Research Projects Agency) CAST and TRANSTAC programs, whose mission is to develop technologies that enable rapid deployment of real-time S2S translation of low-resource languages on portable devices. It originated from the IBM MARS S2S system for the air travel reservation domain described in [1], which was significantly improved in all components, including ASR (Automatic Speech Recognition), MT (Machine Translation) and TTS (Text To Speech), and later evolved into the MASTOR multilingual S2S system covering much broader domains such as medical treatment and force protection. More recently, we have further broadened our efforts to very rapidly develop systems for under-studied languages, such as regional dialects of Arabic. The intent of this program is to provide language support to military, medical and humanitarian personnel during operations in foreign territories, by deciphering possibly critical language communications with a two-way real-time speech-to-speech translation system designed for specific tasks such as medical triage and force protection.
The initial data collection effort for the project has shown that the domain of force protection and medical triage, though limited, is rather broad. In fact, defining domain coverage is difficult where the speech of responding foreign-language speakers is concerned, as their responses are less constrained and may include out-of-domain words and concepts. Moreover, flexible, casual or colloquial speaking styles inevitably appear in human-to-human conversational communication. The project is therefore a great challenge that calls for major research efforts.
Among all the challenges of speech recognition and translation for under-studied languages, there are two main issues: 1) the lack of an adequate amount of speech data that represents both the domain of interest and the oral language spoken by the target speakers, which makes accurate estimation of statistical models for speech recognition and translation difficult; and 2) the lack of codified linguistic knowledge in the form of spelling standards, transcriptions, lexicons, dictionaries, or annotated corpora. A variety of approaches therefore have to be explored.
Another critical challenge is embedding complicated algorithms and programs in small devices for mobile users. A hand-held computing device may have a 256MHz CPU and 64MB of memory; fitting the programs, models and data files into this memory while operating the system in real time is a tremendous challenge.
In this paper, we will describe the overall framework of the MASTOR system and our approaches for each major component, i.e., speech recognition and translation. Various statistical approaches are explored and used to solve different technical challenges. We will show how we addressed the challenges that arise when building automatic speech recognition (ASR) and machine translation (MT) for colloquial Arabic on both the laptop and handheld PDA platforms.
2. SYSTEM OVERVIEW
IBM MASTOR (Multilingual Automatic Speech-To-Speech TranslatOR) is IBM’s highly trainable speech-to-speech translation system, targeting conversational spoken language translation between English and Mandarin Chinese for limited domains. Figure 1 depicts the architecture of MASTOR. The speech input is processed and decoded by a large-vocabulary speech recognition system. Then the transcribed text is analyzed by a statistical parser for semantic and syntactic features.
A sentence-level natural language generator based on maximum entropy (ME) modeling is used to generate sentences in the target language from the parser output. The produced sentence in target language is synthesized into speech by a high quality text-to-speech system.
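To make the generation step concrete, the following is a minimal sketch of a maximum-entropy model, realized here as multinomial logistic regression over indicator features, choosing a target-language phrase from parser-derived features. The concept labels, features and phrases are hypothetical placeholders, not the actual MASTOR generator.

```python
# Minimal sketch (not the actual MASTOR generator): a maximum-entropy model,
# realized as multinomial logistic regression over sparse indicator features,
# scores candidate target-language phrases given source-side concepts from the
# statistical parser. All feature and label names are hypothetical placeholders.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data: parser features -> target phrase chosen by the generator.
train_features = [
    {"concept": "GREETING", "prev": "<s>"},
    {"concept": "QUERY-PAIN", "prev": "GREETING"},
    {"concept": "BODY-PART", "prev": "QUERY-PAIN"},
]
train_labels = ["hello", "do you feel pain", "in your chest"]

vec = DictVectorizer()
X = vec.fit_transform(train_features)

# Multinomial logistic regression over indicator features is a conditional
# maximum-entropy classifier.
me_model = LogisticRegression(max_iter=1000)
me_model.fit(X, train_labels)

test = vec.transform([{"concept": "QUERY-PAIN", "prev": "GREETING"}])
print(me_model.predict(test))        # most likely target phrase
print(me_model.predict_proba(test))  # full distribution over candidate phrases
```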
The general framework of our MASTOR system, illustrated in Figure 2, consists of ASR (Automatic Speech Recognition), MT (Machine Translation) and TTS (Text To Speech) components. ASR converts the user's speech to text in the source language, MT translates the source text into the target language, and finally TTS creates synthesized speech from the target text. This cascaded approach allows us to leverage existing advanced speech and language processing techniques while concentrating on the problems unique to speech-to-speech translation. Figure 3 illustrates the MASTOR GUI (Graphical User Interface) on the laptop and PDA, respectively.
Figure 2. IBM MASTOR Speech-to-Speech Translation System
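As a minimal illustration of this cascade (assuming hypothetical asr, mt and tts components rather than the actual MASTOR engines), the three stages can be chained as follows:

```python
# Minimal sketch of the cascaded pipeline described above (ASR -> MT -> TTS).
# The three component objects are hypothetical stand-ins, not the actual
# MASTOR engines; each stage consumes the previous stage's output.
from dataclasses import dataclass

@dataclass
class TranslationTurn:
    source_audio: bytes
    source_text: str = ""
    target_text: str = ""
    target_audio: bytes = b""

def run_pipeline(audio: bytes, asr, mt, tts) -> TranslationTurn:
    turn = TranslationTurn(source_audio=audio)
    turn.source_text = asr.decode(turn.source_audio)      # speech -> source text
    turn.target_text = mt.translate(turn.source_text)     # source -> target text
    turn.target_audio = tts.synthesize(turn.target_text)  # target text -> speech
    return turn
```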
Baseline acoustic models for English and Mandarin are developed for large-vocabulary continuous speech and trained on over 200 hours of speech collected from about 2000 speakers per language. The Arabic dialect speech recognizer, however, was trained on only about 50 hours of dialectal speech; the Arabic training data consists of about 200K short utterances. Large efforts were invested in the initial cleaning and normalization of the training data because of the large number of irregular dialectal words and variations in spelling. We experimented with three approaches to pronunciation and acoustic modeling: grapheme, phonetic, and context-sensitive grapheme, as described in Section 3.A. We found that using context-sensitive pronunciation rules reduces the WER of the grapheme-based acoustic model by about 3% (from 36.7% to 35.8%). Based on these results, we decided to use context-sensitive grapheme models in our system.
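For reference, the WER figures quoted above are the word-level edit distance between the reference transcript and the recognizer output, normalized by the reference length. A generic sketch of the computation (not IBM's scoring tool) follows:

```python
# Illustration of how word error rate (WER), the metric quoted above, is
# computed: the Levenshtein distance between reference and hypothesis word
# sequences, divided by the number of reference words. Generic sketch only.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("hello how are you", "hello how you"))  # 0.25
```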
The Arabic language model (LM) is an interpolated model consisting of a trigram LM, a class-based LM and a morphologically processed LM, all trained on a corpus of a few hundred thousand words. We also built a compact language model for the hand-held system, in which singletons are eliminated and bigram and trigram counts are pruned with increased thresholds. The LM footprint is 10MB.
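The following is a minimal sketch of the two ideas above, linear interpolation of the component LMs and count-based pruning; the interpolation weights, count thresholds and LM interfaces are illustrative assumptions, not the actual values used in MASTOR.

```python
# Sketch of (1) linearly interpolating the word trigram, class-based and
# morphologically processed LM probabilities, and (2) pruning low-count
# n-grams to shrink the hand-held model. Weights, thresholds and the
# `.prob(word, history)` interface are illustrative assumptions only.
def interpolated_prob(word, history, trigram_lm, class_lm, morph_lm,
                      weights=(0.6, 0.25, 0.15)):
    w1, w2, w3 = weights
    return (w1 * trigram_lm.prob(word, history)
            + w2 * class_lm.prob(word, history)
            + w3 * morph_lm.prob(word, history))

def prune_counts(ngram_counts, min_bigram=2, min_trigram=3):
    """Drop singletons and n-grams below the (raised) count thresholds."""
    kept = {}
    for ngram, count in ngram_counts.items():
        order = len(ngram)
        if order == 1 and count < 2:            # eliminate singleton words
            continue
        if order == 2 and count < min_bigram:   # prune rare bigrams
            continue
        if order == 3 and count < min_trigram:  # prune rare trigrams
            continue
        kept[ngram] = count
    return kept
```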
There are two approaches to translation. The concept-based approach uses natural language understanding (NLU) and natural language generation models trained from an annotated corpus. The other approach is a phrase-based finite-state transducer trained on an unannotated parallel corpus. A trainable phrase-splicing and variable-substitution TTS system is adopted to synthesize speech from the translated sentences; it has the special ability to generate mixed-language speech seamlessly. In addition, a small-footprint TTS engine is developed for the handheld devices using embedded concatenative TTS technologies. Next, we describe our approaches to automatic speech recognition and machine translation in greater detail.
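As a toy illustration of the phrase-based idea (the real system compiles phrase pairs into a weighted finite-state transducer), the sketch below greedily covers the source sentence with the longest phrases found in a phrase table; the table entries and the pass-through rule for unknown words are invented for illustration.

```python
# Toy illustration of the phrase-based approach: greedily cover the source
# sentence with the longest phrases found in a phrase table extracted from a
# parallel corpus. The real system uses a weighted finite-state transducer;
# this sketch and its table entries are illustrative only.
def phrase_translate(source_tokens, phrase_table, max_len=4):
    out, i = [], 0
    while i < len(source_tokens):
        for span in range(min(max_len, len(source_tokens) - i), 0, -1):
            phrase = " ".join(source_tokens[i:i + span])
            if phrase in phrase_table:
                out.append(phrase_table[phrase])
                i += span
                break
        else:  # unknown word: pass it through untranslated
            out.append(source_tokens[i])
            i += 1
    return " ".join(out)

table = {"where is": "أين", "the hospital": "المستشفى"}  # toy entries
print(phrase_translate("where is the hospital".split(), table))
```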
Figure 3. IBM MASTOR system in Windows XP and Windows CE
3. AUTOMATIC SPEECH RECOGNITION
A. Acoustic Models

Acoustic models and the pronunciation dictionary greatly influence ASR performance. In particular, creating an accurate pronunciation dictionary poses a major challenge when changing the language. Deriving pronunciations for resource-rich languages like English or Mandarin is relatively straightforward using existing dictionaries or letter-to-sound models. In certain languages such as Arabic and Hebrew, however, the written form does not typically contain the short vowels that a native speaker infers from context. Deriving automatic phonetic transcriptions for speech corpora is thus difficult. This problem is even more apparent for colloquial Arabic, mainly because of the large number of irregular dialectal words.
One approach to overcoming the absence of short vowels is to use grapheme-based acoustic models. This leads to straightforward construction of pronunciation lexicons and hence facilitates model training and decoding. However, the same grapheme may produce different phonetic sounds depending on its context, which results in less accurate acoustic models. For this reason we experimented with two other approaches: the first is a full phonetic approach that uses short vowels, and the second uses context-sensitive graphemes for the letter "A" (Alif), where two different phonemes are used for "A" depending on its position in the word.
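A minimal sketch of the context-sensitive grapheme idea follows; the acoustic unit names and the word-initial rule for Alif are illustrative assumptions, not the exact rules used in the system.

```python
# Minimal sketch of the context-sensitive grapheme idea described above:
# every grapheme maps to a single acoustic unit, except Alif, which is mapped
# to one of two units depending on whether it is word-initial. The unit names
# and the position rule are illustrative placeholders.
ALIF = "\u0627"  # Arabic letter Alif

def grapheme_pronunciation(word: str) -> list[str]:
    units = []
    for i, ch in enumerate(word):
        if ch == ALIF:
            units.append("A_initial" if i == 0 else "A_medial")
        else:
            units.append(f"G_{ch}")
    return units

print(grapheme_pronunciation("\u0627\u0644\u0645\u0627\u0621"))  # الماء (water)
```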
Using phoneme-based pronunciations would require vowelization of every word. To perform vowelization, we used a mix of dictionary search and a statistical approach: the word is first looked up in an existing vowelized dictionary, and if it is not found it is passed to the statistical vowelizer. Because of the difficulty of accurately vowelizing dialectal words, our experiments have not shown any improvement from phoneme-based ASR over grapheme-based ASR.
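The lookup-then-fallback flow can be sketched as follows, assuming a hypothetical vowelized dictionary and statistical vowelizer:

```python
# Minimal sketch of the vowelization flow described above: look the word up in
# an existing vowelized dictionary first, and only fall back to a statistical
# vowelizer when it is missing. Both components are hypothetical placeholders.
def vowelize(word: str, vowelized_dict: dict, statistical_vowelizer) -> str:
    if word in vowelized_dict:
        return vowelized_dict[word]              # trusted dictionary entry
    return statistical_vowelizer.predict(word)   # back-off statistical guess
```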
Speech recognition for both the laptop and hand-held systems is based on the IBM ViaVoice engine. This highly robust and efficient framework uses rank-based acoustic scores derived from tree-clustered, context-dependent Gaussian models. These acoustic scores, together with n-gram LM probabilities, are incorporated into a stack-based search algorithm to yield the most probable word sequence given the input speech.
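A simplified sketch of how such a search combines the two score streams when extending partial hypotheses is shown below; the LM weighting, beam size and data structures are illustrative, not ViaVoice internals.

```python
# Simplified sketch of the score combination used when extending hypotheses in
# a stack (best-first) search: each partial hypothesis is ranked by its
# accumulated acoustic score plus a weighted n-gram LM log-probability.
# Weights, beam size and the `lm.prob` interface are illustrative assumptions.
import heapq
import math

def extend_hypotheses(stack, word_candidates, lm, lm_weight=0.8, beam=10):
    """stack: list of (neg_score, words); word_candidates: (word, acoustic_logp)."""
    new_stack = []
    for neg_score, words in stack:
        for word, acoustic_logp in word_candidates:
            lm_logp = math.log(lm.prob(word, tuple(words[-2:])))  # trigram history
            total = -neg_score + acoustic_logp + lm_weight * lm_logp
            heapq.heappush(new_stack, (-total, words + [word]))
    # keep only the `beam` best partial hypotheses for the next extension step
    return heapq.nsmallest(beam, new_stack)
```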
The English acoustic models use an alphabet of 52 phones. Each phone is modeled with a 3-state left-to-right hidden Markov model (HMM). The system has approximately 3,500 context dependent states modeled using 42K Gaussian distributions and trained using 40 dimensional features. The context-dependent states are generated using a decision-tree classifier. The colloquial Arabic acoustic models use about 30 phones that essentially correspond to graphemes in the Arabic alphabet. The colloquial Arabic HMM structure is the same as that of the English model. The Arabic acoustic models are also built using 40 dimensional features. The compact model for the PDA has about 2K leaves and 28K Gaussian distributions. The laptop version has over 3K leaves and 60K Gaussians. All acoustic models are trained using discriminative training.