04-05-2011, 09:49 AM
Abstract
The Intelligent Multimodal Multimedia and the Adaptive
Systems Groups at the Navy Center for Applied Research in
Artificial Intelligence have been investigating a natural
language and gesture interface to a mobile robot. Our
interface utilizes robust natural language understanding and
resolves some of the ambiguities in natural language by
means of gesture input. The natural language and gestural
information is integrated with knowledge of a particular
environment and appropriate robotic responses are
produced.
So-called “deictic” elements or objects (e.g. “this chair,”
“that table,” “him” or “her”) and directional elements (e.g.
“over there,” “my left” and “your right”), when parsed by a
natural language system, can be comprehensible but mean
nothing if the utterance is unaccompanied by gesture. A
command such as “Go/Move over there” is ambiguous
without an appropriate gesture to indicate some place in the
environment to which to move. Moreover, a command such
as “Turn left fifteen degrees” can be confusing if an
inappropriate or contradictory gesture is perceived. Our
interface resolves natural language ambiguity and handles
both appropriate and inappropriate (contradictory) gestures.
Introduction
Our research implementing a natural language and
gestural interface to a semi-autonomous robot is based on
two assumptions. The first, or linguistic, assumption is that
certain types of ambiguity in natural language can be
resolved when gestures are incorporated in the input. For
example, a sentence such as “Go over there” is devoid of
meaning unless it is accompanied by a gesture indicating
the place where the speaker wishes the hearer to move.
Furthermore, while gestures are an integral part of
communication [1], our second, or gestural, assumption is
that stylized or symbolic gestures place a heavier burden
on the human, frequently requiring a learning period, since
such gestures tend to be arbitrary in nature. Natural
gestures, i.e. gestures that do not require learning and
which any human might produce as a natural co-occurrence
to a particular verbal command, are simpler means of
imparting certain kinds of information in human-computer
interaction. With systems that have fairly robust vision
capabilities, natural gestures obviate the need for additional
interactive devices, such as computer terminals,
touchscreens, or data gloves. So from a linguistic and
gestural standpoint, certain utterances, such as those that
involve movement or location information, can be
disambiguated by means of a natural, accompanying gesture.
For this study, we limit ourselves to two types of
commands: commands that involve direction, e.g. “Turn
left,” and those that involve locomotion, e.g. “Go over
there.” For such commands, environmental conditions
permitting, people communicate with each other by
pointing to objects in their surroundings, or gesturing in the
specified direction. Granted, if environmental or
meteorological conditions are not favorable, for example
when it is too dark to see or when fog or heavy precipitation
prevails, humans may rely on other methods of
communication, which will not concern us here. However,
given a more or less ideal environment, human-to-human
communication typically involves the use of natural
language and gesture, and it is this type of interaction that
we have emulated in our human-computer interface to a
semi-autonomous robot.
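The fusion of these two command types with perceived gesture can be sketched in code. The following is a minimal illustration in our own terms, not the paper's implementation; the function name, command representation, and sign conventions are all assumptions made for the example.

```python
# Hypothetical sketch: fuse a parsed command with an optional perceived
# gesture and flag ambiguous or contradictory combinations.
# Convention (assumed): angles in degrees, negative = speaker's left,
# positive = speaker's right, 0 = straight ahead.

def interpret(command, gesture_angle=None):
    """Return a robot action dict, or a diagnostic string.

    command: parsed command tuple, e.g. ("turn", -15.0) or ("go", "over there")
    gesture_angle: perceived pointing direction in degrees, or None if
                   no gesture was observed
    """
    verb, arg = command
    if verb == "go":
        # A locomotive command with a deictic target ("over there") is
        # uninterpretable without a gesture indicating the place.
        if gesture_angle is None:
            return "ambiguous: 'go' needs a gesture to fix the target"
        return {"action": "goto", "heading": gesture_angle}
    if verb == "turn":
        # A directional command is complete on its own, but a gesture
        # toward the opposite side contradicts the speech channel.
        if gesture_angle is not None and gesture_angle * arg < 0:
            return "contradictory: speech and gesture indicate opposite sides"
        return {"action": "turn", "degrees": arg}
    return f"unknown command: {verb}"
```

With this scheme, “Go over there” plus a pointing gesture yields a goto action toward the indicated heading, while “Turn left fifteen degrees” accompanied by a rightward gesture is flagged rather than executed.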
For the kinds of interaction that we have outlined above,
touchscreens or data gloves also allow humans to
communicate and talk about deictic elements in various
computer applications. So-called “deictic” elements are
linguistic strings that refer to objects in the discourse which
in turn usually refer to objects in the real world. For
example, in the sentence “the box in the corner is red,” the
subject of the sentence, “the box in the corner,” can be
analyzed as a deictic element if such a box exists in the same
environment as the speaker and/or hearer of this utterance.
If the intended referent, namely “the box,” does not exist in
this environment, either the speaker is playing some sort of
linguistic trick, or the utterance is uninterpretable.
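The grounding of a deictic expression against the shared environment can be sketched as a simple matching step. This is our own illustration, assuming a toy world model of feature dictionaries; the paper does not specify this representation.

```python
# Hypothetical sketch of deictic reference resolution: match a parsed
# noun phrase against the objects the robot knows about.

def resolve_referent(description, world):
    """Return the unique object matching the description, else None.

    description: required features, e.g. {"type": "box", "location": "corner"}
    world: list of object dicts representing the shared environment
    """
    matches = [obj for obj in world
               if all(obj.get(k) == v for k, v in description.items())]
    # Exactly one match: the deictic expression picks out a real referent.
    # Zero matches: no such object exists, so the utterance cannot be
    # grounded. Several matches: still ambiguous; a gesture could narrow
    # the choice down.
    return matches[0] if len(matches) == 1 else None

world = [
    {"type": "box", "location": "corner", "color": "red"},
    {"type": "chair", "location": "door", "color": "blue"},
]
referent = resolve_referent({"type": "box", "location": "corner"}, world)
```

Here “the box in the corner” resolves to the red box, while a description with no match in the world model is returned as uninterpretable, mirroring the failure case described above.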
Download full report
http://citeseerx.ist.psu.edu/viewdoc/dow...p1&type=ps