Organizational Unit:

School of Computational Science and Engineering

Permanent Link

https://hdl.handle.net/1853/70780

Parent Organization

College of Computing

ArchiveSpace Name Record

https://finding-aids.library.gatech.edu/agents/corporate_entities/1111

Publication Search Results

Now showing 1 - 10 of 27

Matrix algorithms for data clustering and nonlinear dimension reduction

(Georgia Institute of Technology, 2008-10-03) Zha, Hongyuan ; Zhang, Ming

Workshop on Future Direction in Numerical Algorithms and Optimization

(Georgia Institute of Technology, 2008-01-15) Park, Haesun ; Golub, Gene ; Wu, Weili ; Du, Ding-Zhu

Sparse Nonnegative Matrix Factorization for Clustering

(Georgia Institute of Technology, 2008) Kim, Jingu ; Park, Haesun

Properties of Nonnegative Matrix Factorization (NMF) as a clustering method are studied by relating its formulation to other methods such as K-means clustering. We show how interpreting the objective function of K-means as that of a lower rank approximation with special constraints allows comparisons between the constraints of NMF and K-means and provides the insight that some constraints can be relaxed from K-means to achieve the NMF formulation. By introducing sparsity constraints on the coefficient matrix factor in the NMF objective function, we can in turn view NMF as a clustering method. We tested sparse NMF as a clustering method, and our experimental results with synthetic and text data show that sparse NMF does not simply provide an alternative to K-means, but rather gives much better and more consistent solutions to the clustering problem. In addition, the consistency of solutions further explains how NMF can be used to determine the unknown number of clusters from data. We also tested with a recently proposed clustering algorithm, Affinity Propagation, and achieved comparable results. A fast alternating nonnegative least squares algorithm was used to obtain NMF and sparse NMF.

SNARE: Spatio-temporal Network-level Automatic Reputation Engine

(Georgia Institute of Technology, 2008) Feamster, Nick ; Gray, Alexander ; Krasser, Sven ; Syed, Nadeem Ahmed

Current spam filtering techniques classify email based on content and IP reputation blacklists or whitelists. Unfortunately, spammers can alter spam content to evade content-based filters, and spammers continually change the IP addresses from which they send spam. Previous work has suggested that filters based on network-level behavior might be more efficient and robust, by making decisions based on how messages are sent, as opposed to what is being sent or who is sending them. This paper presents a technique to identify spammers based on features that exploit the network-level spatio-temporal behavior of email senders to differentiate the spamming IPs from legitimate senders. Our behavioral classifier has two benefits: (1) it is early (i.e., it can automatically detect spam without seeing a large amount of email from a sending IP address, sometimes even upon seeing only a single packet); (2) it is evasion-resistant (i.e., it is based on spatial and temporal features that are difficult for a sender to change). We build classifiers based on these features using two different machine learning methods, support vector machines and decision trees, and we study the efficacy of these classifiers using labeled data from a deployed commercial spam-filtering system. Surprisingly, using only features from a single IP packet header (i.e., without looking at packet contents), our classifier can identify spammers with about 93% accuracy and a reasonably low false-positive rate (about 7%). After looking at a single message, spammer identification accuracy improves to more than 94% with a false rate of just over 5%. These results suggest an effective sender reputation mechanism.

Toward Faster Nonnegative Matrix Factorization: A New Algorithm and Comparisons

(Georgia Institute of Technology, 2008) Kim, Jingu ; Park, Haesun

Nonnegative Matrix Factorization (NMF) is a dimension reduction method that has been widely used for various tasks including text mining, pattern analysis, clustering, and cancer class discovery. The mathematical formulation for NMF appears as a non-convex optimization problem, and various types of algorithms have been devised to solve the problem. The alternating nonnegative least squares (ANLS) framework is a block coordinate descent approach for solving NMF, which was recently shown to be theoretically sound and empirically efficient. In this paper, we present a novel algorithm for NMF based on the ANLS framework. Our new algorithm builds upon the block principal pivoting method for the nonnegativity constrained least squares problem that overcomes some limitations of active set methods. We introduce ideas to efficiently extend the block principal pivoting method within the context of NMF computation. Our algorithm inherits the convergence theory of the ANLS framework and can easily be extended to other constrained NMF formulations. Comparisons of algorithms using datasets from real-life applications as well as artificially generated datasets show that the proposed new algorithm outperforms existing ones in computational speed.
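For orientation, the NMF problem and the ANLS iteration mentioned in the abstract above are commonly written as follows. The symbols A, W, H, and k are notation assumed here for illustration and are not taken from the listing; the block shows only the generic framework, not the block principal pivoting solver the paper proposes for each subproblem.

```latex
% Assumed notation: A is the nonnegative data matrix, k the target rank.
\min_{W \ge 0,\; H \ge 0} \; \tfrac{1}{2}\,\lVert A - WH \rVert_F^2,
\qquad A \in \mathbb{R}^{m \times n},\;
W \in \mathbb{R}^{m \times k},\;
H \in \mathbb{R}^{k \times n}

% ANLS: fix one factor and solve a nonnegativity-constrained
% least squares problem for the other, then alternate.
H \leftarrow \arg\min_{H \ge 0} \lVert W H - A \rVert_F^2
W \leftarrow \arg\min_{W \ge 0} \lVert H^{T} W^{T} - A^{T} \rVert_F^2
```

Each subproblem is convex even though the joint problem is not, which is why the framework can be treated as block coordinate descent over the two factors.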

Fast Linear Discriminant Analysis using QR Decomposition and Regularization

(Georgia Institute of Technology, 2007-03-23) Park, Haesun ; Drake, Barry L. ; Lee, Sangmin ; Park, Cheong Hee

Linear Discriminant Analysis (LDA) is one of the most effective dimension reduction methods for classification, providing a high degree of class separability for numerous applications in science and engineering. However, problems arise with this classical method when one or both of the scatter matrices are singular. Singular scatter matrices are not unusual in many applications, especially for high-dimensional data. For high-dimensional undersampled and oversampled problems, the classical LDA requires modification in order to solve a wider range of problems. In recent work, the generalized singular value decomposition (GSVD) has been shown to mitigate the issue of singular scatter matrices, and a new algorithm, LDA/GSVD, has been shown to be very robust for many applications in machine learning. However, the GSVD inherently has a considerable computational overhead. In this paper, we propose fast algorithms based on the QR decomposition and regularization that solve the LDA/GSVD computational bottleneck. In addition, we present fast algorithms for classical LDA and regularized LDA utilizing the framework based on LDA/GSVD and preprocessing by the Cholesky decomposition. Experimental results are presented that demonstrate substantial speedup in all of the classical LDA, regularized LDA, and LDA/GSVD algorithms without any sacrifice in classification performance for a wide range of machine learning applications.

Non-Negative Matrix Factorization Based on Alternating Non-Negativity Constrained Least Squares and Active Set Method

(Georgia Institute of Technology, 2007) Kim, Hyunsoo ; Park, Haesun

Parallel Shortest Path Algorithms for Solving Large-Scale Instances

(Georgia Institute of Technology, 2006-08-30) Madduri, Kamesh ; Bader, David A. ; Berry, Jonathan W. ; Crobak, Joseph R.

We present an experimental study of parallel algorithms for solving the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs. We implement Meyer and Sanders' Δ-stepping algorithm and report performance results on the Cray MTA-2, a multithreaded parallel architecture. The MTA-2 is a high-end shared memory system offering two unique features that aid the efficient implementation of irregular parallel graph algorithms: the ability to exploit fine-grained parallelism, and low-overhead synchronization primitives. Our implementation exhibits remarkable parallel speedup when compared with a competitive sequential algorithm, for low-diameter sparse graphs. For instance, Δ-stepping on a directed scale-free graph of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the first performance results of a parallel NSSP problem on realistic graph instances on the order of billions of vertices and edges.
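To make the Δ-stepping idea in the abstract above concrete, here is a minimal sequential sketch in Python. It is not the authors' multithreaded MTA-2 implementation: the function name `delta_stepping`, the adjacency-dict graph format, and the bucket handling are illustrative assumptions. Vertices are grouped into buckets of width `delta` by tentative distance; light edges (weight at most `delta`) are relaxed repeatedly within the current bucket, and heavy edges are relaxed once when the bucket is finished.

```python
import math


def delta_stepping(graph, source, delta):
    """Single-source shortest paths with non-negative weights.

    graph: dict mapping every vertex to a list of (neighbor, weight) pairs
           (assumed format; each vertex must appear as a key).
    delta: positive bucket width.
    Returns a dict of shortest distances (math.inf if unreachable).
    """
    dist = {v: math.inf for v in graph}
    buckets = {}  # bucket index -> set of vertices with tentative distance in it

    def relax(v, d):
        # Move v to the bucket matching its improved tentative distance.
        if d < dist[v]:
            if dist[v] != math.inf:
                buckets.get(int(dist[v] // delta), set()).discard(v)
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0)
    while buckets:
        i = min(buckets)  # smallest bucket index still present
        settled = set()
        # Phase 1: relax light edges until bucket i stops refilling.
        while buckets.get(i):
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in graph[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        buckets.pop(i, None)  # drop the bucket if it is left empty
        # Phase 2: relax heavy edges once for everything settled in bucket i.
        for u in settled:
            for v, w in graph[u]:
                if w > delta:
                    relax(v, dist[u] + w)
    return dist
```

In a parallel setting, the relaxations inside each phase are the units of work that can be distributed across threads; this sketch runs them one at a time.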

Parallel Algorithms for Evaluating Centrality Indices in Real-World Networks

(Georgia Institute of Technology, 2006-04-14) Bader, David A. ; Madduri, Kamesh

This paper discusses fast parallel algorithms for evaluating several centrality indices frequently used in complex network analysis. These algorithms have been optimized to exploit properties typically observed in real-world large-scale networks, such as the low average distance, high local density, and heavy-tailed power law degree distributions. We test our implementations on real datasets such as the web graph, protein-interaction networks, movie-actor and citation networks, and report impressive parallel performance for evaluation of the computationally intensive centrality metrics (betweenness and closeness centrality) on high-end shared memory symmetric multiprocessor and multithreaded architectures. To our knowledge, these are the first parallel implementations of these widely-used social network analysis metrics. We demonstrate that it is possible to rigorously analyze networks three orders of magnitude larger than instances that can be handled by existing social network analysis (SNA) software packages. For instance, we compute the exact betweenness centrality value for each vertex in a large US patent citation network (3 million patents, 16 million citations) in 42 minutes on 16 processors, using 20 GB of RAM on the IBM p5 570. Current SNA packages, on the other hand, cannot handle graphs with more than a hundred thousand edges.
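As a point of reference for the betweenness centrality metric named in the abstract above, the following is a small sequential sketch for unweighted graphs using Brandes' well-known algorithm. The listing does not say which formulation the authors parallelized, so treat this as a swapped-in illustration; the function name and adjacency-dict format are assumptions.

```python
from collections import deque


def betweenness_centrality(graph):
    """Exact betweenness for an unweighted graph (Brandes-style sketch).

    graph: dict mapping every vertex to a list of neighbors (assumed format).
    Scores count ordered (source, target) pairs, so an undirected graph
    stored with both edge directions yields twice the unordered-pair value.
    """
    bc = {v: 0.0 for v in graph}
    for s in graph:
        sigma = {v: 0 for v in graph}    # number of shortest s-v paths
        dist = {v: -1 for v in graph}    # BFS distance from s, -1 if unseen
        preds = {v: [] for v in graph}   # predecessors on shortest paths
        sigma[s] = 1
        dist[s] = 0
        order = []                       # vertices in non-decreasing distance
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in graph[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate pair dependencies from the farthest vertices inward.
        dep = {v: 0.0 for v in graph}
        for v in reversed(order):
            for u in preds[v]:
                dep[u] += sigma[u] / sigma[v] * (1.0 + dep[v])
            if v != s:
                bc[v] += dep[v]
    return bc
```

The outer loop over sources is independent per source, which is one natural place to introduce parallelism.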

Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2

(Georgia Institute of Technology, 2006-02-26) Bader, David A. ; Madduri, Kamesh

Graph abstractions are extensively used to understand and solve challenging computational problems in various scientific and engineering domains. They have particularly gained prominence in recent years for applications involving large-scale networks. In this paper, we present fast parallel implementations of three fundamental graph theory problems, Breadth-First Search, st-connectivity and shortest paths for unweighted graphs, on multithreaded architectures such as the Cray MTA-2. The architectural features of the MTA-2 aid the design of simple, scalable and high-performance graph algorithms. We test our implementations on large scale-free and sparse random graph instances, and report impressive results, both for algorithm execution time and parallel performance. For instance, Breadth-First Search on a scale-free graph of 200 million vertices and 1 billion edges takes less than 5 seconds on a 40-processor MTA-2 system with an absolute speedup of close to 30. This is a significant result in parallel computing, as prior implementations of parallel graph algorithms report very limited or no speedup on irregular and sparse graphs, when compared to the best sequential implementation.
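The three problems named in the abstract above are closely related, which a short sequential sketch can show. This is a level-by-level Breadth-First Search in plain Python, not the authors' multithreaded MTA-2 code; the function names and adjacency-dict format are assumptions. The returned level of each vertex is its unweighted shortest path length from the source, and st-connectivity reduces to checking whether the target was reached.

```python
def bfs_levels(graph, source):
    """Level-synchronous BFS.

    graph: dict mapping every vertex to a list of neighbors (assumed format).
    Returns a dict of hop distances for all vertices reachable from source.
    """
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        # All vertices in the current frontier could be expanded concurrently.
        for u in frontier:
            for v in graph[u]:
                if v not in dist:
                    dist[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return dist


def st_connected(graph, s, t):
    """True if t is reachable from s."""
    return t in bfs_levels(graph, s)
```

Processing one whole frontier before moving to the next is what exposes the per-level parallelism that multithreaded implementations exploit.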