28-01-2012, 03:56 PM
Improving Data Quality with Dynamic Forms
I. INTRODUCTION
Organizations and individuals routinely make important
decisions based on inaccurate data stored in supposedly authoritative
databases. Data errors in some domains, such as
medicine, may have particularly severe consequences. These
errors can arise at a variety of points in the lifecycle of data,
from data entry, through storage, integration and cleaning to
analysis and decision-making [1]. While each step presents
an opportunity to address data quality, entry-time offers the
earliest opportunity to catch and correct errors. The database
community has focused on data cleaning once data has
been collected into a database, and has paid relatively little
attention to data quality at collection time [1], [2].
The contributions of this paper are fourfold:
1) We describe our designs for two probabilistic models of
an arbitrary data entry form that capture both question
ordering and error likelihood.
2) We describe how USHER uses these models to provide
three forms of guidance: static form design, dynamic
question ordering, and re-asking.
3) We present experiments showing that USHER has the potential
to improve data quality at reduced cost. We study
two representative data sets: direct electronic entry of
survey results about political opinion, and transcription
of paper-based patient intake forms from an HIV/AIDS
clinic in Tanzania.
4) Extending our ideas on form dynamics, we propose new
user interface principles for designing contextualized,
intuitive feedback about the likelihood of data as it is entered.
This provides a foundation for incorporating data
cleaning visualizations directly into the entry process.
II. RELATED WORK
Our work builds on several areas of prior research, which
we survey in this section.
III. SYSTEM
A. A Data-driven Approach
USHER builds a probabilistic model for an arbitrary data
entry form in two steps. First, it learns the relationships
between form questions via structure learning. Second, it
estimates the parameters of the resulting Bayesian network,
which is then used to generate predictions and error
probabilities for the form.
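
The two steps can be illustrated with a minimal Python sketch. It is not USHER's implementation: the text above does not say which structure learning algorithm is used, so the sketch substitutes a simple stand-in (a maximum-weight spanning tree over pairwise mutual information, in the style of Chow-Liu) and estimates conditional probabilities from smoothed counts. The question names, the sample entries, and the function names are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(rows, a, b):
    """Empirical mutual information (in nats) between questions a and b."""
    n = len(rows)
    pa = Counter(r[a] for r in rows)
    pb = Counter(r[b] for r in rows)
    pab = Counter((r[a], r[b]) for r in rows)
    return sum(
        (c / n) * math.log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )

def learn_tree_structure(rows, questions):
    """Step 1 (stand-in): maximum-weight spanning tree over pairwise MI."""
    edges = sorted(
        combinations(questions, 2),
        key=lambda e: mutual_information(rows, *e),
        reverse=True,
    )
    parent = {q: q for q in questions}  # union-find parent pointers

    def find(q):
        while parent[q] != q:
            q = parent[q]
        return q

    tree = []
    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:  # keep the edge only if it joins two components
            parent[ra] = rb
            tree.append((a, b))
    return tree

def estimate_cpt(rows, parent_q, child_q, child_values, alpha=1.0):
    """Step 2: P(child | parent) from counts with add-alpha smoothing."""
    counts = Counter((r[parent_q], r[child_q]) for r in rows)
    totals = Counter(r[parent_q] for r in rows)
    k = len(child_values)
    return {
        p: {v: (counts[(p, v)] + alpha) / (t + alpha * k) for v in child_values}
        for p, t in totals.items()
    }

# Made-up previous form entries (hypothetical questions and answers).
raw = [
    ("north", "en", "f"), ("north", "en", "m"),
    ("north", "en", "f"), ("north", "sw", "m"),
    ("south", "sw", "f"), ("south", "sw", "m"),
    ("south", "sw", "f"), ("south", "en", "m"),
]
QUESTIONS = ["region", "language", "gender"]
rows = [dict(zip(QUESTIONS, r)) for r in raw]

tree = learn_tree_structure(rows, QUESTIONS)
cpt = estimate_cpt(rows, "region", "language", ["en", "sw"])
```

In this toy data, region and language are correlated while gender is independent of both, so the region-language edge is selected first. Orienting the tree edges (here, treating "region" as the parent of "language") is a further assumption of the sketch.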
IV. LEARNING A MODEL FOR DATA ENTRY
The core of the USHER system is its probabilistic model of
the data, represented as a Bayesian network over form questions.
This network captures relationships between a form’s
question elements in a stochastic manner. In particular, given
input values for some subset of the questions of a particular
form instance, the model can infer probability distributions
over values of that instance’s remaining unanswered questions.
In this section, we show how standard machine learning
techniques can be used to induce this model from previous
form entries.
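
To make the inference step concrete, the following Python sketch computes the distribution over one unanswered question given answers to others, using exact enumeration over a tiny hand-specified network. The three questions, their domains, and every probability below are invented for illustration; the sketch does not reproduce USHER's inference procedure, which this section does not detail.

```python
from itertools import product

# Hypothetical three-question form; all values and probabilities are made up.
DOMAINS = {
    "region": ["north", "south"],
    "language": ["en", "sw"],
    "clinic": ["A", "B"],
}

# question -> (parents, conditional probability table keyed by parent values)
NETWORK = {
    "region": ((), {(): {"north": 0.5, "south": 0.5}}),
    "language": (("region",), {
        ("north",): {"en": 0.8, "sw": 0.2},
        ("south",): {"en": 0.3, "sw": 0.7},
    }),
    "clinic": (("region",), {
        ("north",): {"A": 0.9, "B": 0.1},
        ("south",): {"A": 0.2, "B": 0.8},
    }),
}

def joint_probability(assignment):
    """Probability of a complete assignment: product of each node's CPT entry."""
    p = 1.0
    for question, (parents, cpt) in NETWORK.items():
        key = tuple(assignment[par] for par in parents)
        p *= cpt[key][assignment[question]]
    return p

def posterior(target, evidence):
    """P(target | evidence), summing the joint over all unanswered questions."""
    hidden = [q for q in DOMAINS if q != target and q not in evidence]
    scores = {}
    for value in DOMAINS[target]:
        total = 0.0
        for combo in product(*(DOMAINS[h] for h in hidden)):
            assignment = {**evidence, target: value, **dict(zip(hidden, combo))}
            total += joint_probability(assignment)
        scores[value] = total
    z = sum(scores.values())  # normalizing constant, equal to P(evidence)
    return {v: s / z for v, s in scores.items()}

# The "clinic" question has been answered; "region" and "language" have not.
language_given_clinic = posterior("language", {"clinic": "B"})
```

Enumeration is exponential in the number of unanswered questions, so it only suits very small forms; it is used here because it keeps the computation easy to follow.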