ABSTRACT
The world-wide web has become the most important information source for most of us. Unfortunately, there is no guarantee for the correctness of information on the web. Moreover, different web sites often provide conflicting information on a subject, such as different specifications for the same product. In this paper we propose a new problem called Veracity, i.e., conformity to truth, which studies how to find true facts from a large amount of conflicting information on many subjects that is provided by various web sites. We design a general framework for the Veracity problem, and invent an algorithm called TruthFinder, which utilizes the relationships between web sites and their information, i.e., a web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites. Our experiments show that TruthFinder successfully finds true facts among conflicting information, and identifies trustworthy web sites better than the popular search engines.

Keywords: data quality, web mining, page link analysis.
1. INTRODUCTION
The world-wide web has become a necessary part of our lives, and might have become the most important information source for most people. Every day, people retrieve all kinds of information from the web. For example, when shopping online, people find product specifications from web sites like Amazon.com or ShopZilla.com. When looking for interesting DVDs, they get information and read movie reviews on web sites such as NetFlix.com or IMDB.com.

“Is the world-wide web always trustable?” Unfortunately, the answer is “no”. There is no guarantee for the correctness of information on the web. Even worse, different web sites often provide conflicting information, as shown below.

Example 1: Authors of books. We tried to find out who wrote the book “Rapid Contextual Design” (ISBN: 0123540518). We found many different sets of authors from different online bookstores, and we show several of them in Table 1. From the image of the book cover we found that A1 Books provides the most accurate information. In comparison, the information from Powell’s books is incomplete, and that from Lakeside books is incorrect.

Table 1: Conflicting information about book authors

The trustworthiness problem of the web has been realized by today’s Internet users. According to a survey on credibility of web sites, 54% of Internet users trust news web sites at least most of the time, while this ratio is only 26% for web sites that sell products, and is merely 12% for blogs.

There have been many studies on ranking web pages according to authority based on hyperlinks, such as Authority-Hub analysis [2], PageRank [4], and more general link-based analysis [1]. But does authority or popularity of web sites lead to accuracy of information? The answer is unfortunately no. For example, according to our experiments the bookstores ranked on top by Google (Barnes & Noble and Powell’s books) contain many errors on book author information, and some small bookstores (e.g., A1 Books) provide more accurate information.

In this paper we propose a new problem called the Veracity problem, which is formulated as follows: Given a large amount of conflicting information about many objects, which is provided by multiple web sites (or other types of information providers), how to discover the true fact about each object. We use the word “fact” to represent something that is claimed as a fact by some web site, and such a fact can be either true or false. There are often conflicting facts on the web, such as different sets of authors for a book.
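
To make the problem formulation concrete, the following Python sketch illustrates the mutual dependence stated in the abstract: a web site is trustworthy if it provides many true facts, and a fact is likely to be true if trustworthy web sites provide it. This is only an illustration under stated assumptions and not the TruthFinder formulas. The update rule (a trust-weighted vote normalized per object, with site trust taken as the mean confidence of the site's facts), the function names `estimate_trust` and `best_facts`, and the sample sites, objects, and facts are all hypothetical.

```python
from collections import defaultdict


def estimate_trust(claims, rounds=20, initial_trust=0.5):
    """Iteratively estimate site trust and fact confidence.

    claims: iterable of (site, obj, fact) triples, meaning that
    `site` states `fact` about `obj`.
    The update rule is an illustrative assumption, not the paper's.
    """
    providers = defaultdict(set)  # (obj, fact) -> sites stating it
    for site, obj, fact in claims:
        providers[(obj, fact)].add(site)
    sites = {site for site, _, _ in claims}

    trust = {site: initial_trust for site in sites}
    confidence = {}
    for _ in range(rounds):
        # Fact confidence: share of the trust-weighted votes that an
        # object's competing facts receive.
        support = {key: sum(trust[s] for s in ps)
                   for key, ps in providers.items()}
        totals = defaultdict(float)
        for (obj, _), value in support.items():
            totals[obj] += value
        confidence = {key: support[key] / totals[key[0]] for key in support}

        # Site trust: mean confidence of the facts the site provides.
        for site in sites:
            own = [confidence[k] for k, ps in providers.items() if site in ps]
            trust[site] = sum(own) / len(own)
    return trust, confidence


def best_facts(confidence):
    """Pick the highest-confidence fact for each object."""
    best = {}
    for (obj, fact), value in confidence.items():
        if obj not in best or value > best[obj][1]:
            best[obj] = (fact, value)
    return {obj: fact for obj, (fact, _) in best.items()}


# Hypothetical data: three sites, two books, one conflict on book_1.
claims = [
    ("site_a", "book_1", "authors_v1"),
    ("site_b", "book_1", "authors_v1"),
    ("site_c", "book_1", "authors_v2"),
    ("site_a", "book_2", "authors_v3"),
    ("site_b", "book_2", "authors_v3"),
    ("site_c", "book_2", "authors_v3"),
]
trust, confidence = estimate_trust(claims)
print(best_facts(confidence))
print(sorted(trust.items()))
```

In this toy input the fact backed by two sites wins for `book_1`, and the dissenting site ends with lower trust than the other two. Because the vote is weighted by trust rather than by a raw count of sites, a few reliable sites can in principle outweigh several unreliable ones, which is the intuition the introduction contrasts with ranking by authority or popularity.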