Abstract
Phishing websites, fraudulent sites that impersonate a trusted third party to gain access to private data, continue to cost Internet users over a billion dollars each year. In this paper, we describe the design and performance characteristics of a scalable machine learning classifier we developed to detect phishing websites. We use this classifier to maintain Google's phishing blacklist automatically. Our classifier analyzes millions of pages a day, examining the URL and the contents of a page to determine whether or not a page is phishing. Unlike previous work in this field, we train the classifier on a noisy dataset consisting of millions of samples from previously collected live classification data. Despite the noise in the training data, our classifier learns a robust model for identifying phishing pages which correctly classifies more than 90% of phishing pages several weeks after training concludes.
1 Introduction
Phishing is a social engineering crime generally defined as impersonating a trusted third party to gain access to private data. For example, an adversary might send the victim an email directing him to a fraudulent website that looks like a page belonging to a bank. The adversary can use any information the victim enters into the phishing page to drain the victim's bank account or steal the victim's identity. Despite increasing public awareness, phishing continues to be a major threat to Internet users. Gartner estimates that phishers stole $1.7 billion in 2008, and the Anti-Phishing Working Group identified roughly twenty thousand unique new phishing sites each month between July and December of 2008 [3], [17]. To help combat phishing, Google publishes a blacklist of phishing URLs and phishing URL patterns [7], [29]. The anti-phishing features in Firefox 3, Google Chrome, and Apple Safari use this blacklist. We provide access to the list to other clients through our public API [18].

In order for an anti-phishing blacklist to be effective, it must be comprehensive, error-free, and timely. A blacklist that is not comprehensive fails to protect a portion of its users. One that is not error-free subjects users to unnecessary warnings and ultimately trains its users to ignore the warnings. A blacklist that is not timely may fail to warn its users about a phishing page in time to protect them. Considering that phishing pages only remain active for an average of approximately three days, with the majority of pages lasting less than a day, a delay of only a few hours can significantly degrade the quality of a blacklist [2], [30].

Currently, human reviewers maintain some blacklists, like the one published by PhishTank [25]. With PhishTank, the user community manually verifies potential phishing pages submitted by community members to keep their blacklist mostly error-free.
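A blacklist of URLs and URL patterns like the one described above can be consulted with a straightforward match against both forms. The sketch below is purely illustrative and does not reflect Google's actual blacklist format or API; the entries and pattern syntax are hypothetical.

```python
import re

# Hypothetical blacklist: a set of exact URLs plus compiled URL patterns.
BLACKLIST_URLS = {"http://evil.example.com/login.html"}
BLACKLIST_PATTERNS = [re.compile(r"^http://[^/]*\.evil\.example\.com/.*$")]

def is_blacklisted(url):
    """Return True if the URL matches an exact entry or any pattern."""
    if url in BLACKLIST_URLS:
        return True
    return any(p.match(url) for p in BLACKLIST_PATTERNS)
```

A browser using such a list would run this check before rendering a page and show a warning on a match; real deployments additionally hash entries and fetch incremental updates so the full list never leaves the server.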
Unfortunately, this review process takes a considerable amount of time, ranging from a median of over ten hours in March, 2009 to a median of over fifty hours in June, 2009, according to PhishTank's statistics. Omitting verification to improve the timeliness of the data is not a good option for PhishTank. Without verification, the list would have many false positives coming from either innocent confusion or malicious abuse.

An automatic classifier could handle this verification task. Previously published efforts have shown that a classification system could examine the same signals a human reviewer uses to evaluate whether a page is phishing [13], [16], [20], [21], [35]. Such a system could add verified phishing pages to the blacklist automatically, substantially reducing the verification time and improving the throughput. With higher throughput, the system could even examine large numbers of questionable, automatically collected URLs to look for otherwise missed phishing pages.

This paper describes such an automatic phishing classifier that we built and currently use to evaluate phishing pages and maintain our blacklist. Since its activation in November, 2008, this system has evaluated millions of potential phishing pages every day. To evaluate each page, the classifier considers features regarding the page's URL, content, and hosting information. We retrain this classifier daily using approximately ten million samples from classification data collected over the last three months.
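To make the idea of URL-derived features concrete, here is a minimal sketch of the kind of signals such a classifier might extract from a URL alone. The specific features shown (dot and hyphen counts, IP-address hosts, "@" in the URL) are common heuristics from the phishing-detection literature, not the paper's actual feature set, which also draws on page content and hosting information.

```python
from urllib.parse import urlparse

def url_features(url):
    """Extract a few illustrative phishing signals from a URL.

    Hypothetical feature set for demonstration; a production classifier
    would combine many more URL, content, and hosting features.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "num_dots_in_host": host.count("."),      # many subdomains is suspicious
        "num_hyphens_in_host": host.count("-"),   # e.g. "secure-login" lookalikes
        "host_is_ip": host.replace(".", "").isdigit(),  # raw IP instead of a domain
        "has_at_symbol": "@" in url,              # "@" can hide the real host
        "path_length": len(parsed.path),
    }
```

Feature vectors like this, built for millions of candidate pages per day, would then feed a model retrained daily on the recent classification data the paper describes.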
Download full report
http://isocisoc/conferences/ndss/10/pdf/08.pdf