20-06-2011, 11:10 AM
[attachment=14130]
INTRODUCTION
One of the most important inventions of the nineteenth century was the telephone. Then, at the midpoint of the twentieth century, the invention of the digital computer amplified the power of our minds, enabled us to think and work more efficiently, and made us more imaginative than we could ever have imagined.
Now several new technologies have empowered us to teach computers to talk to us in our native languages and to listen to us when we speak (recognition); haltingly, computers have begun to understand what we say.
Having given our computers both oral and aural abilities, we have been able to produce innumerable computer applications that further enhance our productivity. Such capabilities enable us to route phone calls automatically and to obtain and update computer-based information by telephone, using a group of activities collectively referred to as Voice Processing.
Speech is one of the most natural ways to interact, and with computers it is no different. If an application can be controlled solely by voice commands, the opportunities are unlimited. Even though the idea of using speech as an input mechanism for an application is not new, there are not many applications that actually use speech as an input. In other words, speech is still a big opportunity that is yet to be fully explored.
Speech recognition allows you to provide input to an application with your voice. Just like clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by talking. In the desktop world, you need a microphone to be able to do this.
Broadly, speech analysis can be divided into two paradigms: text-to-speech and speech-to-text conversion.
Speech recognition can be of two types, based on the grammar the recognition uses (a grammar being, in other words, the list of possible recognition outputs that can be generated). An application can limit the possible combinations of spoken words by choosing an appropriate grammar.
In a command-and-control scenario, a developer provides a limited set of possible word combinations, and the speech recognition engine matches the words spoken by the user against this limited list. In command and control the accuracy of recognition is very high, so it is generally better for applications to implement command and control where possible: the higher recognition accuracy makes the application respond better.
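The command-and-control idea can be sketched with a toy matcher: the "grammar" is just a fixed list of allowed phrases, and a transcribed utterance is accepted only if it matches one of them. (A real engine constrains its search to the grammar rather than filtering afterwards; the phrase list and function names here are illustrative, not any engine's API.)

```python
# Hypothetical command grammar: the only utterances the application accepts.
GRAMMAR = ["open file", "save file", "close window", "exit"]

def match_command(transcription, grammar=GRAMMAR):
    """Return the grammar entry matching the spoken phrase, or None.

    Normalizes case and whitespace before matching, so "Save File"
    and "save file" are treated as the same command.
    """
    phrase = " ".join(transcription.lower().split())
    return phrase if phrase in grammar else None

print(match_command("Save File"))    # accepted: "save file"
print(match_command("delete file"))  # not in the grammar: None
```

Because every input is forced into one of a handful of known phrases, ambiguity is low, which is exactly why command-and-control recognition is so accurate.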
In dictation mode, the recognition engine compares the input speech against the entire list of dictionary words. For dictation mode to achieve high recognition accuracy, it is important that the user has previously trained the recognition engine by speaking into it. Training, or creating a profile, can be done using the Speech properties in the Control Panel.
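The dictation-mode contrast can be illustrated with another toy sketch: every hypothesis is scored against a full word list rather than a small command grammar. Here a tiny stand-in "dictionary" and Python's standard-library `difflib` take the place of a real engine's acoustic and language models, so this shows only the lookup idea, not how an engine actually works.

```python
from difflib import get_close_matches

# A tiny stand-in for a dictation engine's full vocabulary.
DICTIONARY = ["recognition", "recognize", "record", "dictation", "speech"]

def best_word(hypothesis, dictionary=DICTIONARY):
    """Return the dictionary word most similar to the hypothesis, or None.

    get_close_matches ranks candidates by string similarity; the cutoff
    rejects hypotheses that resemble nothing in the dictionary.
    """
    matches = get_close_matches(hypothesis.lower(), dictionary, n=1, cutoff=0.6)
    return matches[0] if matches else None

print(best_word("speach"))  # misspelled/misheard input -> "speech"
print(best_word("xyz"))     # resembles nothing in the dictionary -> None
```

With tens of thousands of candidate words instead of five, many inputs resemble several entries at once, which is why dictation needs per-speaker training to reach the accuracy that a small command grammar gets for free.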
Speaker Dependence vs. Speaker Independence
Speaker dependence describes the degree to which a speech recognition system requires knowledge of a speaker’s individual voice characteristics to successfully process speech. The speech recognition engine can “learn” how you speak words and phrases; it can be trained to your voice.
Speech recognition systems that require a user to train the system to his/her voice are known as speaker-dependent systems. If you are familiar with desktop dictation systems, most are speaker dependent. Because they operate on very large vocabularies, dictation systems perform much better when the speaker has spent the time to train the system to his/her voice.
Speech recognition systems that do not require a user to train the system are known as speaker-independent systems. Speech recognition in the VoiceXML world must be speaker-independent. Think of how many users (hundreds, maybe thousands) may be calling into your web site: you cannot require each caller to train the system to his or her voice. The speech recognition system in a voice-enabled web application MUST successfully process the speech of many different callers without having to understand the individual voice characteristics of each caller.
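To make the VoiceXML point concrete, a minimal (illustrative, not production) document might look like the sketch below: a form field prompts the caller and an inline grammar lists the phrases any caller may say, with no per-speaker profile involved anywhere.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="confirm_form">
    <field name="confirm">
      <prompt>Say yes or no.</prompt>
      <!-- Inline grammar: any caller's "yes" or "no" is matched
           against these alternatives; no speaker training is used. -->
      <grammar root="answer" xml:lang="en-US">
        <rule id="answer">
          <one-of>
            <item>yes</item>
            <item>no</item>
          </one-of>
        </rule>
      </grammar>
      <filled>
        <prompt>You said <value expr="confirm"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```

Note that this is essentially command and control over the telephone: the grammar keeps the recognition space small, which is what lets the same document serve thousands of untrained callers.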