04-05-2011, 12:55 PM
Abstract
Subvocal electromyogram (EMG) signal classification is used to control a modified web browser interface. Recorded surface signals from the larynx and sublingual areas below the jaw are filtered and transformed into features using a complex dual quad tree wavelet transform. Feature sets for six sub-vocally pronounced control words, 10 digits, 17 vowels, and 23 consonants are trained using a scaled conjugate gradient neural network. The sub-vocal signals are classified and used to initiate web browser queries through a matrix-based alphabet coding scheme. Hyperlinks on web pages returned by the browser are numbered sequentially and queried using digits only. Classification methodology, accuracy, and feasibility for scale-up to real-world human-machine interface tasks are discussed in the context of vowel and consonant recognition accuracy.

Index Terms — EMG, sub-vocal speech, wavelet, neural network, speech recognition, web browsing, vowels, consonants
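The excerpt does not spell out the paper's actual matrix layout, but the idea of a matrix-based alphabet coding scheme driven by recognized digits can be illustrated with a hypothetical sketch: characters are laid out in a grid, and each one is selected by a (row, column) pair of digits, so the 10 recognized digits suffice to spell arbitrary queries. The 6x6 grid and the `encode`/`decode` helpers below are illustrative assumptions, not the paper's scheme.

```python
# Hypothetical illustration of a matrix-based alphabet coding scheme.
# The paper's actual matrix layout is not given in this excerpt; here
# 36 symbols are arranged in a 6x6 grid and each character is selected
# by a (row, column) pair of recognized digits.
import string

SYMBOLS = string.ascii_lowercase + "0123456789"  # 36 symbols -> 6x6 grid
GRID = [list(SYMBOLS[r * 6:(r + 1) * 6]) for r in range(6)]

def encode(text):
    """Map each character to the (row, col) digit pair that selects it."""
    pairs = []
    for ch in text:
        idx = SYMBOLS.index(ch)
        pairs.append((idx // 6, idx % 6))
    return pairs

def decode(pairs):
    """Reconstruct text from a sequence of recognized (row, col) pairs."""
    return "".join(GRID[r][c] for r, c in pairs)
```

With this layout, `decode(encode("nasa"))` round-trips to `"nasa"`; a query is spelled by sub-vocalizing two digits per character.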
I. INTRODUCTION
HUMAN to human or human to machine communication can occur in many ways [4]. Traditionally, visual and verbal processes tend to dominate both the method and the presentation format. As a result, technology to enhance human communication has focused on public, audible tasks such as those addressed by commercial speech recognition. However, audible tasks place a number of constraints on situation suitability. These constraints include vulnerability to ambient noise, requirements for clear formation and enunciation of words, and a shared language. When sound production limitations intervene, they can become very problematic. Examples of such situations include HAZMAT operations, underwater or space EVA, crowded environments, high privacy requirements, and medical speech impairment. In many situations, very private communication is desirable, such as telephone calls, password entry, offline discussion while teleconferencing, military operations, or human/machine data queries. Vision-based modalities, such as email, can also cause problems because non-visual emotional information otherwise recognizable during speech is lost. In addition, the intensity or forcefulness of the communication may be lost or misinterpreted.

This work was supported by NASA Ames Research Center under the CICT/ITSR program, program manager Dr. Eugene Tu. †Dr. Chuck Jorgensen is with the Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA 94035 (e-mail: cjorgensen[at]mail.arc.nasa.gov). ††Dr. Kim Binstead is with the University of Hawaii, Honolulu (e-mail: binsted[at]hawaii.edu).
A communication alternative that can be private, not dependent on physical production of audible signals, and still contain the emotional subtleties of speech could add valuable enrichment to the communication process.

An alternative way of communicating being considered at NASA Ames Research Center is the direct interpretation of nervous system control signals sent to speech muscles [9]. Specifically, we use non-invasive aggregate surface measurements of electromyographic signals, or EMGs, to categorize muscle activation prior to sound generation [3]. Such signals arise when reading or speaking to oneself, with or without actual lip or facial movements. Hence the information we are using does not show up under external observation, nor in current methods used to enhance speech recognition, such as machine lip reading.

In the present paper we demonstrate one EMG approach to the recognition of discrete, speaker-dependent, non-vocalized speech used to control a web browser. In previous work we showed the adequacy of EMG signals for the control of a virtual joystick and virtual numeric keypad entry [2]. In [1] we demonstrated recognition of a small sub-acoustic control vocabulary.

The present control demonstration uses differential EMG signals measured on the side of the throat near the larynx and under the chin to pick up weak signals associated with aggregate muscle activity of the vocal tract and tongue. We capitalize on the fact that muscle activation leading to speech must remain relatively consistent and standardized to be understood by others. The concept is to intercept speech signals prior to sound generation and use them directly, bypassing auditory models such as mel cepstrums to filter signals.

After an appropriate feature transformation, the EMG signals are input into a neural network or support vector machine classifier for recognition training and testing.
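The pipeline just described — windowed EMG, a wavelet feature transform, then a trainable classifier — can be sketched in miniature. This is only a minimal stand-in: a plain Haar decomposition replaces the paper's complex dual quad tree wavelet transform, and a nearest-centroid rule replaces the scaled conjugate gradient neural network, neither of which is reproduced in this excerpt. The function and class names are hypothetical.

```python
# Sketch of the recognition pipeline: signal window -> wavelet sub-band
# energy features -> classifier. A plain Haar decomposition and a
# nearest-centroid classifier stand in for the paper's complex dual
# quad tree wavelet transform and scaled conjugate gradient network.
import numpy as np

def haar_features(signal, levels=4):
    """Log-energy of each Haar wavelet sub-band as a feature vector."""
    approx = np.asarray(signal, dtype=float)
    feats = []
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        detail = (even - odd) / np.sqrt(2)   # high-pass band at this level
        approx = (even + odd) / np.sqrt(2)   # low-pass band, halved length
        feats.append(np.log1p(np.sum(detail ** 2)))
    feats.append(np.log1p(np.sum(approx ** 2)))
    return np.array(feats)

class NearestCentroid:
    """Minimal stand-in classifier: one mean feature vector per word."""
    def fit(self, X, y):
        self.labels_ = sorted(set(y))
        self.centroids_ = {
            lab: np.mean([x for x, t in zip(X, y) if t == lab], axis=0)
            for lab in self.labels_
        }
        return self

    def predict(self, x):
        return min(self.labels_,
                   key=lambda lab: np.linalg.norm(x - self.centroids_[lab]))
```

In use, each recorded window would be passed through `haar_features`, the classifier fit on labeled examples of each control word, and new windows labeled by `predict` — the same train/test structure the paper applies with its stronger transform and network.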
Given sufficiently precise sensors, optimal feature selection, and a valid signal processing architecture, it is possible to use these extremely weak signals to perform usable tasks non-invasively and without vocalization. In a sense, we are approximating a totally silent control methodology such as that sought using EEG (i.e., thought-based approaches [11]), but with much lower signal and measurement complexity.
Download full report
http://citeseerx.ist.psu.edu/viewdoc/dow...1&type=pdf