Record Matching over Query Results from Multiple Web Databases

Abstract
[...] integration. Most state-of-the-art record matching methods are supervised, which requires the user to provide training data. These methods are not applicable for the Web database scenario, where the records to match are query results dynamically generated on-the-fly. Such records are query-dependent and a prelearned method using training examples from previous query results may fail on the results of a new query. To address the problem of record matching in the Web database scenario, we present an unsupervised, online record matching method, UDD, which, for a given query, can effectively identify duplicates from the query result records of multiple Web databases. After removal of the same-source duplicates, the "presumed" nonduplicate records from the same source can be used as training examples, alleviating the burden of users having to manually label training examples. Starting from the nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experimental results show that UDD works well for the Web database scenario where existing supervised methods do not apply.
Index Terms: Record matching, duplicate detection, record linkage, data deduplication, data integration, Web database, query result record, SVM.
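To make the idea of a weighted component similarity summing classifier more concrete, the following is a minimal sketch, not the authors' implementation. It assumes records are dictionaries of string fields and scores a pair as a weighted sum of per-field similarities. The function names, the use of `difflib.SequenceMatcher` as the field similarity, the fixed weights, and the threshold are all illustrative assumptions; the abstract does not specify how UDD computes or adjusts them.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Similarity of two field values in [0, 1].

    SequenceMatcher is a stand-in; the source does not name a measure.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities (weights assumed to sum to 1)."""
    return sum(
        w * field_similarity(r1.get(field, ""), r2.get(field, ""))
        for field, w in weights.items()
    )


def is_duplicate(r1: dict, r2: dict, weights: dict, threshold: float = 0.85) -> bool:
    """Declare a pair a duplicate when its score reaches a hypothetical threshold."""
    return weighted_similarity(r1, r2, weights) >= threshold


# Hypothetical field weights, chosen only for illustration.
WEIGHTS = {"title": 0.5, "author": 0.3, "isbn": 0.2}

record_a = {"title": "Harry Potter", "author": "Rowling", "isbn": "0439708184"}
record_b = {"title": "Harry Potter", "author": "Rowling", "isbn": "0439708184"}
print(is_duplicate(record_a, record_b, WEIGHTS))
```

In this sketch the weights are fixed by hand. In the method described above, the presumed nonduplicate records from the same source serve as training examples, so a fuller version would derive the weights from that set rather than hard-coding them.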
1 INTRODUCTION
Today, more and more databases that dynamically generate Web pages in response to user queries are available on the Web. These Web databases compose the deep or hidden Web, which is estimated to contain a much larger amount of high quality, usually structured information and to have a faster growth rate than the static Web. Most Web databases are only accessible via a query interface through which users can submit queries. Once a query is received, the Web server will retrieve the corresponding results from the back-end database and return them to the user.
To build a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, a crucial task is to match the different sources' records that refer to the same real-world entity. For example, Fig. 1 shows some of the query results returned by two online bookstores, booksamillion.com and abebooks.com, in response to the same query "Harry Potter" over the Title field. It can be seen that the record numbered 3 in Fig. 1a and the third record in Fig. 1b refer to the same book, since they have the same ISBN number although their authors differ somewhat. In comparison, the record numbered 5 in Fig. 1a and the second record in Fig. 1b also refer to the same book if we are interested only in the book title and author.
The problem of identifying duplicates, that is, two (or more) records describing the same entity, has attracted much attention from many research fields, including Databases, Data Mining, Artificial Intelligence, and Natural Language Processing. Most previous work is based on predefined matching rules hand-coded by domain experts or matching rules learned offline by some learning method from a set of training examples. Such approaches work well in a traditional database environment, where all instances of the target databases can be readily accessed, as long as a set of high-quality representative records can be examined by experts or selected for the user to label.
In the Web database scenario, the records to match are highly query-dependent, since they can only be obtained through online queries. Moreover, they are only a partial and biased portion of all the data in the source Web databases. Consequently, hand-coding or offline-learning approaches are not appropriate for two reasons. First, the full data set is not available beforehand, and therefore, good representative data for training are hard to obtain.


Download full report
http://ieeexplore.ieeeiel5/69/4358933/04...er=4840347