Title: Automatic identification and removal of low quality online information

dc.contributor.advisor Pu, Calton
dc.contributor.author Webb, Steve en_US
dc.contributor.committeeMember Ahamad, Mustaque
dc.contributor.committeeMember Feamster, Nick
dc.contributor.committeeMember Liu, Ling
dc.contributor.committeeMember Wu, Shyhtsun Felix
dc.contributor.department Computing en_US
dc.date.accessioned 2009-01-22T15:55:32Z
dc.date.available 2009-01-22T15:55:32Z
dc.date.issued 2008-11-17 en_US
dc.description.abstract The advent of the Internet has generated a proliferation of online information-rich environments, which provide information consumers with an unprecedented amount of freely available information. However, the openness of these environments has also made them vulnerable to a new class of attacks called Denial of Information (DoI) attacks. Attackers launch these attacks by deliberately inserting low quality information into information-rich environments to promote that information or to deny access to high quality information. These attacks directly threaten the usefulness and dependability of online information-rich environments, and as a result, an important research question is how to automatically identify and remove this low quality information from these environments. The first contribution of this thesis research is a set of techniques for automatically recognizing and countering various forms of DoI attacks in email systems. We develop a new DoI attack based on camouflaged messages, and we show that spam producers and information consumers are entrenched in a spam arms race. To break free of this arms race, we propose two solutions. One solution involves refining the statistical learning process by associating disproportionate weights to spam and legitimate features, and the other solution leverages the existence of non-textual email features (e.g., URLs) to make the classification process more resilient against attacks. The second contribution of this thesis is a framework for collecting, analyzing, and classifying examples of DoI attacks in the World Wide Web. We propose a fully automatic Web spam collection technique and use it to create the Webb Spam Corpus -- a first-of-its-kind, large-scale, and publicly available Web spam data set. Then, we perform the first large-scale characterization of Web spam using content and HTTP session analysis. Next, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information. The final contribution of this thesis research is a collection of techniques that detect and help prevent DoI attacks within social environments. First, we provide detailed descriptions for each of these attacks. Then, we propose a novel technique for capturing examples of social spam, and we use our collected data to perform the first characterization of social spammers and their behaviors. en_US
dc.description.degree Ph.D. en_US
dc.identifier.uri http://hdl.handle.net/1853/26669
dc.publisher Georgia Institute of Technology en_US
dc.subject Denial of information en_US
dc.subject Email spam en_US
dc.subject Web spam en_US
dc.subject Social spam en_US
dc.subject Applied machine learning en_US
dc.subject Information security en_US
dc.subject.lcsh Spam filtering (Electronic mail)
dc.subject.lcsh World Wide Web
dc.subject.lcsh Online social networks
dc.subject.lcsh Computer networks -- Security measures
dc.title Automatic identification and removal of low quality online information en_US
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.advisor Pu, Calton
local.contributor.corporatename College of Computing
relation.isAdvisorOfPublication fc48a3de-da43-4d32-af59-414047eb7cd7
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
Files
Original bundle
Name: webb_steve_r_200812_phd.pdf
Size: 4.77 MB
Format: Adobe Portable Document Format