Multilingual Automatic Speech-to-Speech Translator (MASTOR)
#1

Can you send the seminar report on MASTOR?
Reply
#2
Described here is IBM MASTOR, a speech-to-speech translation system that can translate spontaneous free-form speech in real time on both laptops and hand-held PDAs.

1. INTRODUCTION
Automatic speech-to-speech (S2S) translation breaks down communication barriers between people speaking different languages. Building an accurate, efficient and robust S2S translation system poses many challenges, especially for colloquial speech and under-studied languages such as regional dialects of Arabic. The aim of this project is to provide language support to military, medical and humanitarian personnel during operations in foreign territories, through a two-way real-time speech-to-speech translation system designed for specific tasks such as medical triage and force protection. The casual, colloquial speaking style of such conversations is a major challenge.

Translation for under-studied languages is also difficult, owing to the lack of an appropriate amount of speech data representing the domain of interest, adverse acoustic environments, and the lack of linguistic knowledge captured in spelling standards, transcriptions, lexicons, dictionaries, or annotated corpora. On top of this, the complicated algorithms and programs must be made to fit into small devices for mobile users.

SYSTEM OVERVIEW
The general framework of the MASTOR system has components for ASR, MT and TTS. Baseline acoustic models for English and Mandarin are developed for large-vocabulary continuous speech and trained on over 200 hours of speech collected from about 2,000 speakers per language. Three approaches are used for pronunciation and acoustic modeling: grapheme, phonetic, and context-sensitive grapheme.


AUTOMATIC SPEECH RECOGNITION

Acoustic Models:
Acoustic models and the pronunciation dictionary greatly influence ASR performance, and a major challenge when changing the language is the creation of an appropriate pronunciation dictionary. One approach to overcoming the absence of written short vowels in Arabic is to use grapheme-based acoustic models. Because the same grapheme may map to different phonetic sounds depending on its context, two further approaches were tried: a full phonetic approach that uses short vowels, and a context-sensitive grapheme approach for the letter "A". Using phoneme-based pronunciations would require vowelization of every word; due to the difficulty of accurately vowelizing dialectal words, phoneme-based ASR has not shown improvement over the grapheme-based approach.
Speech recognition for both the laptop and hand-held systems is based on the IBM ViaVoice engine. This highly efficient and robust framework uses rank-based acoustic scores derived from tree-clustered, context-dependent Gaussian models. These acoustic scores, together with n-gram LM probabilities, drive a stack-based search algorithm that yields the most probable word sequence given the input speech.

Language Modeling
Language modeling (LM), which estimates the probability of various word sequences, is crucial for high-performance ASR. The approaches used here to build statistical trigram LMs fall into three categories: 1) obtaining additional training material automatically; 2) interpolating domain-specific LMs with other LMs; and 3) improving the robustness and accuracy of distribution estimation with limited in-domain resources. The English language model has two linearly interpolated components: the first is built from in-domain data, while the second acts as a background model.
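To make the interpolation concrete, here is a minimal Python sketch of linearly mixing an in-domain LM with a background LM. The component models, vocabulary size and 0.8/0.2 weights are invented for illustration; in practice the weights are tuned on held-out data.

```python
# Toy sketch of linear LM interpolation (illustrative, not MASTOR code).
# Each component model maps (word, history) -> probability.

def interpolate(models_and_weights):
    def prob(word, history):
        return sum(w * m(word, history) for m, w in models_and_weights)
    return prob

def in_domain_lm(word, history):
    table = {(("how", "are"), "you"): 0.4}    # toy in-domain trigram table
    return table.get((history, word), 0.001)  # small floor for unseen events

def background_lm(word, history):
    return 1.0 / 20000.0  # flat background over an assumed 20k-word vocabulary

lm = interpolate([(in_domain_lm, 0.8), (background_lm, 0.2)])
print(lm("you", ("how", "are")))  # 0.8*0.4 + 0.2/20000 = 0.32001
```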

SPEECH TRANSLATION
NLU/NLG-based Speech Translation:

A statistical translation method based on natural language understanding (NLU) and natural language generation (NLG) is used. Statistical machine translation methods translate a sentence W in the source language into a sentence A in the target language using a statistical model that estimates the probability P(A|W).
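For intuition, statistical MT commonly selects the target sentence A that maximizes P(A|W), which by Bayes' rule is proportional to P(W|A)·P(A): a translation-model score times a target language-model score. Below is a toy sketch of this decision rule; the candidate sentences and probabilities are made up, and real decoders search a vast hypothesis space rather than a fixed candidate list.

```python
import math

# Toy noisy-channel decision rule: argmax over A of P(W|A) * P(A), in log space.
def best_translation(source, candidates, tm_prob, lm_prob):
    return max(candidates,
               key=lambda a: math.log(tm_prob(source, a)) + math.log(lm_prob(a)))

candidates = ["where does it hurt", "where it hurt does"]
tm_prob = lambda w, a: 0.02  # pretend both candidates cover the source equally
lm_prob = {"where does it hurt": 1e-4, "where it hurt does": 1e-8}.__getitem__
print(best_translation("<source sentence>", candidates, tm_prob, lm_prob))
# -> "where does it hurt": the language model prefers the fluent word order
```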

full report download:
[attachment=1306]
Reply
#3
[attachment=12579]
Reply
#4
Presented by-
Kh. Santosh Singh

[attachment=13154]
What is MASTOR?
• Speech-to-speech translation system
• Translates spontaneous free-form speech in real time
• Runs on laptops and hand-held PDAs
• Supports multiple languages
General Framework
 ASR – Automatic Speech Recognition
 MT – Machine Translation
 TTS – Text To Speech
Speech Translation
 NLU/NLG based Speech Translation
 Fast and Memory Efficient Translation using SIPL
Note:
• NLU – Natural Language Understanding
• NLG – Natural Language Generation
• WFST – Weighted Finite-State Transducer
• SIPL – Statistical Integrated Phrase Lattices
Applications
Some popular S2S translation systems available on handheld devices:
 NEC Speech Translation System
 Babylon
 Speechlator
Challenges
 Linguistic and communicative quality of the translations
 Embedding complicated algorithms and programs into small devices for mobile users
Advantages
 Travelling
 Business Networking
 Globalization of Social Networking
 Learning a foreign language
Reply
#5
[attachment=13164]
1. INTRODUCTION
Automatic Speech-To-Speech (S2S) translation breaks down communication barriers between people who do not share a common language, and hence enables instant oral cross-lingual communication for many critical applications such as emergency medical care. The development of an accurate, efficient and robust S2S translation system poses many challenges. This is especially true for colloquial speech and resource-deficient languages.
The IBM MASTOR speech-to-speech translation system has been developed for DARPA (Defense Advanced Research Projects Agency)'s CAST and TRANSTAC programs, whose mission is to develop technologies that enable rapid deployment of real-time S2S translation of low-resource languages on portable devices. It originated from the IBM MARS S2S system handling the air travel reservation domain described in [1]; that system was significantly improved in all components, including ASR (Automatic Speech Recognition), MT (Machine Translation) and TTS (Text To Speech), and evolved into the MASTOR multilingual S2S system covering much broader domains such as medical treatment and force protection. More recently, we have further broadened our experience and efforts to very rapidly develop systems for under-studied languages, such as regional dialects of Arabic. The intent of this program is to provide language support to military, medical and humanitarian personnel during operations in foreign territories, by deciphering possibly critical language communications with a two-way real-time speech-to-speech translation system designed for specific tasks such as medical triage and force protection.
The initial data collection effort for the project has shown that the domain of force protection and medical triage, though limited, is rather broad. In fact, defining domain coverage is difficult where the speech of responding foreign-language speakers is concerned, as their responses are less constrained and may include out-of-domain words and concepts. Moreover, a flexible, casual or colloquial speaking style inevitably appears in human-to-human conversational communications. The project is therefore a great challenge that calls for major research efforts.
Among all the challenges for speech recognition and translation for under-studied languages, two main issues stand out: 1) lack of an appropriate amount of speech data representing the domain of interest and the oral language spoken by the target speakers, resulting in difficulties in accurately estimating statistical models for speech recognition and translation; and 2) lack of linguistic knowledge captured in spelling standards, transcriptions, lexicons and dictionaries, or annotated corpora. Various approaches therefore have to be explored.
Another critical challenge is to embed complicated algorithms and programs into small devices for mobile users. A hand-held computing device may have a 256 MHz CPU and 64 MB of memory; fitting the programs, as well as the models and data files, into this memory and operating the system in real time are tremendous challenges.
In this paper, we will describe the overall framework of the MASTOR system and our approaches for each major component, i.e., speech recognition and translation. Various statistical approaches are explored and used to solve different technical challenges. We will show how we addressed the challenges that arise when building automatic speech recognition (ASR) and machine translation (MT) for colloquial Arabic on both the laptop and handheld PDA platforms.
2. SYSTEM OVERVIEW
IBM MASTOR (Multilingual Automatic Speech-To-Speech TranslatOR) is IBM’s highly trainable speech-to-speech translation system, targeting conversational spoken language translation between English and Mandarin Chinese for limited domains. Figure 1 depicts the architecture of MASTOR. The speech input is processed and decoded by a large-vocabulary speech recognition system. Then the transcribed text is analyzed by a statistical parser for semantic and syntactic features.
A sentence-level natural language generator based on maximum entropy (ME) modeling is used to generate sentences in the target language from the parser output. The produced sentence in the target language is then synthesized into speech by a high-quality text-to-speech system.
The general framework of our MASTOR system, illustrated in Figure 2, has components for ASR (Automatic Speech Recognition), MT (Machine Translation) and TTS (Text To Speech). ASR converts the user's speech to text in the source language, MT translates the source text into the target language, and finally TTS creates synthesized speech from the target text. The cascaded approach allows us to deploy the power of existing advanced speech and language processing techniques while concentrating on the problems unique to speech-to-speech translation. Figure 3 illustrates the MASTOR GUI (Graphical User Interface) on laptop and PDA, respectively.
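As a minimal illustration of this cascade, the sketch below simply wires the three stages together; the stage functions are placeholders (the translated string is an invented transliteration), not IBM's actual engine APIs.

```python
# Minimal sketch of the cascaded S2S pipeline: ASR -> MT -> TTS.

def speech_to_speech(audio, asr, mt, tts):
    source_text = asr(audio)       # speech -> source-language text
    target_text = mt(source_text)  # source text -> target-language text
    return tts(target_text)       # target text -> synthesized speech

# Toy usage with stub stages:
print(speech_to_speech(
    b"...pcm samples...",
    asr=lambda audio: "where does it hurt",
    mt=lambda text: "wayn al-alam",           # invented output
    tts=lambda text: f"<audio for: {text}>",
))
```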
Figure 2. IBM MASTOR Speech-to-Speech Translation System
Acoustic models for the English and Mandarin baseline are developed for large-vocabulary continuous speech and trained on over 200 hours of speech collected from about 2,000 speakers per language. The Arabic dialect speech recognizer, however, was trained on only about 50 hours of dialectal speech, consisting of about 200K short utterances. Large efforts were invested in initial cleaning and normalization of the training data because of the large number of irregular dialectal words and variations in spelling. We experimented with three approaches for pronunciation and acoustic modeling, namely grapheme, phonetic, and context-sensitive grapheme, as described in section 3.A. We found that using context-sensitive pronunciation rules reduces the WER of the grapheme-based acoustic model by about 3% (from 36.7% to 35.8%). Based on these results, we decided to use context-sensitive grapheme models in our system.
The Arabic language model (LM) is an interpolated model consisting of a trigram LM, a class-based LM and a morphologically processed LM, all trained from a corpus of a few hundred thousand words. We also built a compact language model for the hand-held system, where singletons are eliminated and bigram and trigram counts are pruned with increased thresholds. The LM footprint size is 10MB.
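A rough sketch of the count-threshold pruning described above: n-grams whose counts fall below a per-order threshold are dropped, and a threshold of 2 eliminates singletons. The threshold values here are illustrative; the paper does not state the exact settings.

```python
# Keep an n-gram only if its count reaches the threshold for its order.
def prune_counts(ngram_counts, thresholds=None):
    thresholds = thresholds or {2: 2, 3: 2}  # illustrative bigram/trigram cutoffs
    return {ng: c for ng, c in ngram_counts.items()
            if c >= thresholds.get(len(ng), 1)}

counts = {("the",): 50, ("the", "pain"): 1, ("is", "the"): 7,
          ("where", "is", "the"): 1, ("is", "the", "pain"): 3}
print(prune_counts(counts))
# singleton bigram/trigram dropped:
# {('the',): 50, ('is', 'the'): 7, ('is', 'the', 'pain'): 3}
```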
There are two approaches for translation. The concept based approach uses natural language understanding (NLU) and natural language generation models trained from an annotated corpus. Another approach is the phrase-based finite state transducer which is trained using an un-annotated parallel corpus. A trainable, phrase-splicing and variable substitution TTS system is adopted to synthesize speech from translated sentences, which has a special ability to generate speech of mixed languages seamlessly. In addition, a small footprint TTS is developed for the handheld devices using embedded concatenative TTS technologies. Next, we will describe our approaches in automatic speech recognition and machine translation in greater detail.
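Before moving on, here is a heavily simplified illustration of the phrase-based idea: greedy longest-match lookup in a toy phrase table, with no reordering, scoring or transducer machinery. The table entries are invented; the real system compiles phrase pairs into weighted finite-state transducers.

```python
PHRASE_TABLE = {                    # invented source-phrase -> target-phrase pairs
    ("where", "is"): ["wayn"],
    ("the", "pain"): ["al-alam"],
}

def translate(words):
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):  # try the longest source span first
            phrase = tuple(words[i:j])
            if phrase in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[phrase])
                i = j
                break
        else:                               # no phrase matched:
            out.append(words[i])            # copy the word through unchanged
            i += 1
    return " ".join(out)

print(translate("where is the pain".split()))  # wayn al-alam
```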
Figure 3. IBM MASTOR system in Windows XP and Windows CE
3. AUTOMATIC SPEECH RECOGNITION
A. Acoustic Models

Acoustic models and the pronunciation dictionary greatly influence ASR performance. In particular, creating an accurate pronunciation dictionary poses a major challenge when changing the language. Deriving pronunciations for resource-rich languages like English or Mandarin is relatively straightforward using existing dictionaries or letter-to-sound models. In certain languages such as Arabic and Hebrew, the written form does not typically contain the short vowels that a native speaker infers from context. Deriving automatic phonetic transcriptions for speech corpora is thus difficult. This problem is even more apparent in colloquial Arabic, mainly due to the large number of irregular dialectal words.
One approach to overcome the absence of short vowels is to use grapheme based acoustic models. This leads to straightforward construction of pronunciation lexicons and hence facilitates
model training and decoding. However, the same grapheme may lead to different phonetic sounds depending on its context. This results in less accurate acoustic models. For this reason we experimented with two other different approaches. The first is a full phonetic approach which uses short vowels, and the second uses context-sensitive graphemes for the letter "A" (Alif) where two different phonemes are used for "A" depending on its position in the word.
Using phoneme based pronunciations would require vowelization of every word. To perform vowelization, we used a mix of dictionary search and a statistical approach. The word is first searched in an existing vowelized dictionary, and if not found it is passed to the statistical vowelizer. Due to the difficulties in accurately vowelizing dialectal words, our experiments have not shown any improvements using phoneme based ASR compared to grapheme based.
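A minimal sketch of this dictionary-first, statistical-fallback lookup; the dictionary entry and the fallback function are stand-ins, not the actual MASTOR resources.

```python
VOWELIZED_DICT = {"ktb": "kataba"}  # toy mapping: consonantal form -> vowelized form

def vowelize(word, statistical_vowelizer):
    if word in VOWELIZED_DICT:
        return VOWELIZED_DICT[word]        # dictionary hit
    return statistical_vowelizer(word)     # model-based guess for OOV words

print(vowelize("ktb", lambda w: w))        # kataba (found in the dictionary)
print(vowelize("qlb", lambda w: w + "?"))  # qlb? (placeholder fallback)
```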
Speech recognition for both the laptop and hand-held systems is based on the IBM ViaVoice engine. This highly robust and efficient framework uses rank-based acoustic scores derived from tree-clustered, context-dependent Gaussian models. These acoustic scores, together with n-gram LM probabilities, are incorporated into a stack-based search algorithm to yield the most probable word sequence given the input speech.
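The sketch below shows the general shape of such a best-first ("stack") search over partial hypotheses, where each extension is scored by the sum of acoustic and LM log probabilities. It is a didactic toy, not the ViaVoice decoder: real decoders also handle time alignment, pruning and rank-based scoring.

```python
import heapq

def stack_decode(expand, is_complete):
    heap = [(0.0, ())]  # (negated total log score, hypothesis word tuple)
    while heap:
        neg_score, hyp = heapq.heappop(heap)
        if is_complete(hyp):
            return hyp, -neg_score
        for word, log_ac, log_lm in expand(hyp):
            heapq.heappush(heap, (neg_score - (log_ac + log_lm), hyp + (word,)))
    return None, float("-inf")

# Toy one-word utterance: acoustically similar words, the LM breaks the tie.
acoustic = {"hello": -1.0, "hollow": -1.1}
lm = {"hello": -0.5, "hollow": -3.0}
expand = lambda h: [] if h else [(w, acoustic[w], lm[w]) for w in acoustic]
print(stack_decode(expand, is_complete=lambda h: len(h) == 1))
# (('hello',), -1.5)
```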
The English acoustic models use an alphabet of 52 phones. Each phone is modeled with a 3-state left-to-right hidden Markov model (HMM). The system has approximately 3,500 context dependent states modeled using 42K Gaussian distributions and trained using 40 dimensional features. The context-dependent states are generated using a decision-tree classifier. The colloquial Arabic acoustic models use about 30 phones that essentially correspond to graphemes in the Arabic alphabet. The colloquial Arabic HMM structure is the same as that of the English model. The Arabic acoustic models are also built using 40 dimensional features. The compact model for the PDA has about 2K leaves and 28K Gaussian distributions. The laptop version has over 3K leaves and 60K Gaussians. All acoustic models are trained using discriminative training.
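As an illustration of the 3-state left-to-right topology, the sketch below builds a transition matrix in which each state either repeats (self-loop) or advances to the next state; the probabilities are invented, and real systems estimate them from data.

```python
import numpy as np

def left_to_right_transitions(n_states=3, self_loop=0.6):
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = self_loop            # stay in the current state
        A[i, i + 1] = 1.0 - self_loop  # advance to the next state (no skips)
    A[-1, -1] = 1.0  # last state absorbs; phone exit is handled by the decoder
    return A

print(left_to_right_transitions())
# [[0.6 0.4 0. ]
#  [0.  0.6 0.4]
#  [0.  0.  1. ]]
```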
Reply
