Title:
Modern aspects of unsupervised learning

dc.contributor.advisor Balcan, Maria-Florina
dc.contributor.author Liang, Yingyu
dc.contributor.committeeMember Blum, Avrim
dc.contributor.committeeMember Fortnow, Lance
dc.contributor.committeeMember Isbell, Charles L.
dc.contributor.committeeMember Randall, Dana
dc.contributor.committeeMember Song, Le
dc.contributor.department Computer Science
dc.date.accessioned 2014-08-27T13:38:49Z
dc.date.available 2014-08-27T13:38:49Z
dc.date.created 2014-08
dc.date.issued 2014-06-23
dc.date.submitted August 2014
dc.date.updated 2014-08-27T13:38:49Z
dc.description.abstract Unsupervised learning has become more and more important due to the recent explosion of data. Clustering, a key topic in unsupervised learning, is a well-studied task arising in many applications ranging from computer vision to computational biology to the social sciences. This thesis is a collection of work exploring two modern aspects of clustering: stability and scalability. In the first part, we study clustering under a stability property called perturbation resilience. As an alternative approach to worst-case analysis, this novel theoretical framework aims at understanding the complexity of clustering instances that satisfy natural stability assumptions. In particular, we show how to correctly cluster instances whose optimal solutions are resilient to small multiplicative perturbations on the distances between data points, significantly improving existing guarantees. We further propose a generalized property that allows small changes in the optimal solutions after perturbations, and provide the first known positive results in this more challenging setting. In the second part, we study the problem of clustering large-scale data distributed across nodes which communicate over the edges of a connected graph. We provide algorithms with small communication cost and provable guarantees on the clustering quality. We also propose algorithms for distributed principal component analysis, which can be used to reduce the communication cost of clustering high-dimensional data while only slightly compromising the clustering quality. In the third part, we study community detection, the modern extension of clustering to network data. We propose a theoretical model of communities that are stable in the presence of noisy nodes in the network, and design an algorithm that provably detects all such communities. We also provide a local algorithm for large-scale networks, whose running time depends on the sizes of the output communities but not that of the entire network.
dc.description.degree Ph.D.
dc.format.mimetype application/pdf
dc.identifier.uri http://hdl.handle.net/1853/52282
dc.language.iso en_US
dc.publisher Georgia Institute of Technology
dc.subject Unsupervised learning
dc.subject Clustering
dc.subject Perturbation resilience
dc.subject Distributed clustering
dc.subject Community detection
dc.title Modern aspects of unsupervised learning
dc.type Text
dc.type.genre Dissertation
dspace.entity.type Publication
local.contributor.corporatename College of Computing
relation.isOrgUnitOfPublication c8892b3c-8db6-4b7b-a33a-1b67f7db2021
thesis.degree.level Doctoral
Files
Original bundle
Name: LIANG-DISSERTATION-2014.pdf
Size: 943.86 KB
Format: Adobe Portable Document Format
License bundle
Name: LICENSE_1.txt
Size: 3.86 KB
Format: Plain Text
Name: LICENSE.txt
Size: 3.86 KB
Format: Plain Text