ArchiveSpace Name Record
Publication Search Results
Item: Detecting the Change of Clustering Structure in Categorical Data Streams
(Georgia Institute of Technology, 2005) Chen, Keke; Liu, Ling
Clustering data streams can provide critical information for making decisions in real time. We argue that detecting changes in the clustering structure of data streams can benefit many real-time monitoring applications. In this paper, we present a framework for detecting changes of clustering structure in categorical data streams. A change of clustering structure is detected as a change in the best number of clusters in the data stream. The framework consists of two main components: the BkPlot method for determining the best number of clusters in a categorical dataset, and a summarization structure, the Hierarchical Entropy Tree (HE-Tree), for efficiently capturing the entropy property of categorical data streams. The HE-Tree lets us quickly and precisely extract from the data stream the clustering information that the BkPlot method needs to identify a change in the best number of clusters. By combining snapshots of the HE-Tree information with the BkPlot method, we are able to observe changes of clustering structure online. The experiments show that the HE-Tree + BkPlot method can efficiently and precisely detect changes of clustering structure in categorical data streams.
Item: A Random Rotation Perturbation Approach to Privacy Preserving Data Classification
(Georgia Institute of Technology, 2005) Chen, Keke; Liu, Ling
This paper presents a random rotation perturbation approach for privacy-preserving data classification. Concretely, we identify the importance of classification-specific information with respect to the loss-of-information factor, and present a random rotation perturbation framework for privacy-preserving data classification. Our approach has two unique characteristics. First, we identify that many classification models utilize the geometric properties of datasets, which can be preserved by geometric rotation. We prove that the three types of classifiers will deliver the same performance over the rotation-perturbed dataset as over the original dataset. Second, we propose a multi-column privacy model to address the problems of evaluating privacy quality for multidimensional perturbation. With this metric, we develop a locally optimal algorithm to find a good rotation perturbation in terms of privacy guarantee. We also analyze both naive estimation and ICA-based reconstruction attacks with the privacy model. Our initial experiments show that the random rotation approach can provide a high privacy guarantee while maintaining zero loss of accuracy for the discussed classifiers.
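The geometric claim in this abstract, that rotation preserves the properties many classifiers rely on, can be illustrated with a minimal sketch. It is not the authors' algorithm: the function names (`random_orthogonal`, `perturb`, `dist`) are hypothetical, and the matrix is built by Gram-Schmidt on random Gaussian vectors, which yields an orthogonal transform (a rotation, possibly combined with a reflection). Either way, pairwise Euclidean distances are unchanged, which is the property distance-based classifiers depend on.

```python
import math
import random


def random_orthogonal(d, rng):
    """Build a d x d orthogonal matrix (list of rows) by Gram-Schmidt.

    Assumption: this stands in for the paper's rotation matrix. The result
    may include a reflection, which also preserves distances.
    """
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the rows already accepted.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip (very unlikely) degenerate draws
            rows.append([a / norm for a in v])
    return rows


def perturb(matrix, x):
    """Apply the orthogonal transform to one record x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def dist(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


rng = random.Random(0)
R = random_orthogonal(4, rng)
x, y = [1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 5.0, 2.0]
# The two distances agree up to floating-point rounding.
print(dist(x, y), dist(perturb(R, x), perturb(R, y)))
```

The sketch only demonstrates distance preservation; it says nothing about the privacy model or the attack analysis described in the abstract.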
Item: Towards Finding Optimal Partitions of Categorical Datasets
(Georgia Institute of Technology, 2003) Chen, Keke; Liu, Ling
A considerable amount of work has been dedicated to clustering numerical datasets, but only a handful of categorical clustering algorithms have been reported to date. Furthermore, almost none have addressed the following two important cluster validity problems: (1) Given a dataset and a clustering algorithm that partitions the dataset into k clusters, how can we determine the best k with respect to the given dataset? (2) Given a dataset and a set of clustering algorithms with a fixed k, how can we determine which one will produce k clusters of the best quality? In this paper, we investigate the entropy and expected-entropy concepts for clustering categorical data, and propose a cluster validity method based on the characteristics of expected entropy. In addition, we develop an agglomerative hierarchical algorithm (HierEntro) to incorporate the proposed cluster validity method into the clustering process. We report our initial experimental results showing the effectiveness of the proposed cluster validity method and the benefits of the HierEntro clustering algorithm.
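As a rough illustration of the entropy and expected-entropy concepts mentioned in this abstract, the sketch below uses one common formulation, assumed here rather than taken from the paper: the entropy of a cluster is the sum of the Shannon entropies of its columns, and the expected entropy of a partition is the size-weighted average of its cluster entropies. The function names are hypothetical.

```python
import math
from collections import Counter


def cluster_entropy(rows):
    """Sum of per-column Shannon entropies (in bits) of one cluster.

    Assumption: columns are treated as independent, a common simplification
    for categorical clustering; the paper's exact definition may differ.
    """
    n = len(rows)
    total = 0.0
    for column in zip(*rows):
        for count in Counter(column).values():
            p = count / n
            total -= p * math.log2(p)
    return total


def expected_entropy(clusters):
    """Size-weighted average entropy over the clusters of a partition."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
# One cluster holding everything: each column is a 50/50 split, so 1 bit each.
print(expected_entropy([data]))
# Two homogeneous clusters: no uncertainty is left inside either cluster.
print(expected_entropy([data[:2], data[2:]]))
```

Under this formulation a lower expected entropy means more homogeneous clusters, which is why the value can serve as a quality signal when comparing partitions with the same k.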
Item: Validating and Refining Clusters via Visual Rendering
(Georgia Institute of Technology, 2003) Chen, Keke; Liu, Ling
Clustering is an important technique for understanding large multidimensional datasets. Most clustering research to date has focused on developing automatic clustering algorithms and cluster validation methods. The automatic algorithms are known to work well with clusters of regular shapes, e.g. compact spherical shapes, but may incur higher error rates when dealing with arbitrarily shaped clusters. Although some effort has been devoted to addressing the problem of skewed datasets, the handling of clusters with irregular shapes is still in its infancy, especially in terms of the dimensionality of the datasets and the precision of the clustering results considered. Not surprisingly, statistical indices are also ineffective in validating clusters of irregular shapes. In this paper, we address the problem of clustering and validating arbitrarily shaped clusters with a visual framework (VISTA). The main idea of the VISTA approach is to capitalize on the power of visualization and interactive feedback to encourage domain experts to participate in the clustering revision and clustering validation process. The VISTA system has two unique features. First, it implements a linear and reliable visualization model to interactively visualize multidimensional datasets in a 2D star-coordinate space. Second, it provides a rich set of user-friendly interactive rendering operations, allowing users to validate and refine the cluster structure based on their visual experience as well as their domain knowledge.
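The 2D star-coordinate space named in this abstract can be sketched with the generic star-coordinates mapping; this is an assumption about the general technique, not VISTA's exact model. Each of the d dimensions gets an axis at angle 2πi/d, each value is normalized to [0, 1], and the 2D position is the sum of the scaled axis vectors. The `scales` parameter is a hypothetical stand-in for the kind of per-axis adjustment an interactive tool might expose.

```python
import math


def star_coordinates(point, mins, maxs, scales=None):
    """Map a d-dimensional point to 2D with a generic star-coordinates layout.

    Assumptions: min-max normalization per dimension and evenly spaced axes;
    `scales` holds hypothetical per-axis weights (default 1.0 each).
    """
    d = len(point)
    if scales is None:
        scales = [1.0] * d
    x = y = 0.0
    for i, value in enumerate(point):
        span = maxs[i] - mins[i]
        norm = (value - mins[i]) / span if span else 0.0
        theta = 2.0 * math.pi * i / d  # direction of axis i
        x += scales[i] * norm * math.cos(theta)
        y += scales[i] * norm * math.sin(theta)
    return (x, y)


mins, maxs = [0.0] * 4, [1.0] * 4
# A point that is maximal only on the first axis lands on that axis.
print(star_coordinates([1.0, 0.0, 0.0, 0.0], mins, maxs))
```

Because the position is a weighted sum of the normalized values, the mapping is linear in the data, so changing one axis weight moves all points in a predictable way.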
Item: Vista: Looking Into the Clusters in Very Large Multidimensional Datasets
(Georgia Institute of Technology, 2002) Chen, Keke; Liu, Ling
Information visualization is commonly recognized as a useful method for understanding sophistication in large datasets. In this paper, we introduce an efficient and flexible clustering approach that combines visual clustering and fast disk labelling for very large datasets. This paper makes three contributions. First, we propose a framework, Vista, that incorporates information visualization methods into the clustering process in order to enhance the understanding of the intermediate clustering results and allow users to revise the clustering results before the disk-labelling phase. Second, we introduce a fast and flexible disk-labelling algorithm, ClusterMap, which utilizes the visual clustering result to improve the overall performance of clustering on very large datasets. Third, we develop a visualization model that maps a multidimensional dataset to a 2D visualization while preserving or partially preserving clusters. Experiments show that Vista combined with ClusterMap is faster and has a lower error rate than existing algorithms for very large datasets. It is also flexible because the "cluster map" can be easily adjusted to meet application-specific clustering requirements.