Automatic identification and removal of low quality online information

Webb, Steve

dc.contributor.advisor	Pu, Calton
dc.contributor.author	Webb, Steve	en_US
dc.contributor.committeeMember	Ahamad, Mustaque
dc.contributor.committeeMember	Feamster, Nick
dc.contributor.committeeMember	Liu, Ling
dc.contributor.committeeMember	Wu, Shyhtsun Felix
dc.contributor.department	Computing	en_US
dc.date.accessioned	2009-01-22T15:55:32Z
dc.date.available	2009-01-22T15:55:32Z
dc.date.issued	2008-11-17	en_US
dc.description.abstract	The advent of the Internet has generated a proliferation of online information-rich environments, which provide information consumers with an unprecedented amount of freely available information. However, the openness of these environments has also made them vulnerable to a new class of attacks called Denial of Information (DoI) attacks. Attackers launch these attacks by deliberately inserting low quality information into information-rich environments to promote that information or to deny access to high quality information. These attacks directly threaten the usefulness and dependability of online information-rich environments, and as a result, an important research question is how to automatically identify and remove this low quality information from these environments. The first contribution of this thesis research is a set of techniques for automatically recognizing and countering various forms of DoI attacks in email systems. We develop a new DoI attack based on camouflaged messages, and we show that spam producers and information consumers are entrenched in a spam arms race. To break free of this arms race, we propose two solutions. One solution involves refining the statistical learning process by associating disproportionate weights to spam and legitimate features, and the other solution leverages the existence of non-textual email features (e.g., URLs) to make the classification process more resilient against attacks. The second contribution of this thesis is a framework for collecting, analyzing, and classifying examples of DoI attacks in the World Wide Web. We propose a fully automatic Web spam collection technique and use it to create the Webb Spam Corpus -- a first-of-its-kind, large-scale, and publicly available Web spam data set. Then, we perform the first large-scale characterization of Web spam using content and HTTP session analysis. Next, we present a lightweight, predictive approach to Web spam classification that relies exclusively on HTTP session information. The final contribution of this thesis research is a collection of techniques that detect and help prevent DoI attacks within social environments. First, we provide detailed descriptions for each of these attacks. Then, we propose a novel technique for capturing examples of social spam, and we use our collected data to perform the first characterization of social spammers and their behaviors.	en_US
dc.description.degree	Ph.D.	en_US
dc.identifier.uri	http://hdl.handle.net/1853/26669
dc.publisher	Georgia Institute of Technology	en_US
dc.subject	Denial of information	en_US
dc.subject	Email spam	en_US
dc.subject	Web spam	en_US
dc.subject	Social spam	en_US
dc.subject	Applied machine learning	en_US
dc.subject	Information security	en_US
dc.subject.lcsh	Spam filtering (Electronic mail)
dc.subject.lcsh	World Wide Web
dc.subject.lcsh	Online social networks
dc.subject.lcsh	Computer networks -- Security measures
dc.title	Automatic identification and removal of low quality online information	en_US
dc.type	Text
dc.type.genre	Dissertation
dspace.entity.type	Publication
local.contributor.advisor	Pu, Calton
local.contributor.corporatename	College of Computing
relation.isAdvisorOfPublication	fc48a3de-da43-4d32-af59-414047eb7cd7
relation.isOrgUnitOfPublication	c8892b3c-8db6-4b7b-a33a-1b67f7db2021

Files

Original bundle

Name: webb_steve_r_200812_phd.pdf
Size: 4.77 MB
Format: Adobe Portable Document Format
Collections

Theses and Dissertations
