ArchiveSpace Name Record
Publication Search Results
Process Mining, Discovery, and Integration Using Distance Measures
(Georgia Institute of Technology, 2006) Bae, Joonsoo; Caverlee, James; Liu, Ling; Rouse, William B.
Business processes continue to play an important role in today's service-oriented enterprise computing systems. Mining, discovering, and integrating process-oriented services has attracted growing attention in recent years. In this paper we present a quantitative approach to modeling and capturing the similarity and dissimilarity between different workflow designs. Concretely, we introduce a graph-based distance measure and a framework for utilizing this distance measure to mine the process repository and discover workflow designs that are similar to a given design pattern or to produce one integrated workflow design by merging two or more business workflows of similar designs. We derive the similarity measures by analyzing the workflow dependency graphs of the participating workflow processes. Such an analysis is conducted in two phases. We first convert each workflow dependency graph into a normalized process network matrix. Then we calculate the metric space distance between the normalized matrices. This distance measure can be used as a quantitative and qualitative tool in process mining, process merging, and process clustering, and ultimately it can reduce or minimize the costs involved in design, analysis, and evolution of workflow systems.
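The two-phase analysis described in this abstract can be illustrated with a minimal sketch. The abstract does not specify how the process network matrix is normalized or which metric is used, so the choices here are assumptions: the matrix is an adjacency matrix scaled so its entries sum to 1, and the distance is the Frobenius (entrywise Euclidean) distance. The function names and the sample workflows are hypothetical.

```python
from math import sqrt

def normalized_matrix(edges, activities):
    """Phase 1 (assumed form): adjacency matrix over a shared activity list,
    scaled so that all entries sum to 1."""
    index = {name: i for i, name in enumerate(activities)}
    n = len(activities)
    matrix = [[0.0] * n for _ in range(n)]
    for src, dst in edges:
        matrix[index[src]][index[dst]] += 1.0
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return matrix
    return [[value / total for value in row] for row in matrix]

def matrix_distance(a, b):
    """Phase 2 (assumed metric): Frobenius distance between equal-sized matrices."""
    return sqrt(sum((x - y) ** 2
                    for row_a, row_b in zip(a, b)
                    for x, y in zip(row_a, row_b)))

# Two hypothetical workflow designs over the same four activities.
activities = ["receive", "check", "ship", "bill"]
design_a = [("receive", "check"), ("check", "ship"), ("ship", "bill")]
design_b = [("receive", "check"), ("check", "bill"), ("bill", "ship")]

m_a = normalized_matrix(design_a, activities)
m_b = normalized_matrix(design_b, activities)
print(matrix_distance(m_a, m_a))  # identical designs
print(matrix_distance(m_a, m_b))  # designs that order two steps differently
```

Because both matrices are built over the same activity list, a smaller distance indicates designs that share more dependencies, which is the property a repository search or clustering step would rely on.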
Detecting the Change of Clustering Structure in Categorical Data Streams
(Georgia Institute of Technology, 2005) Chen, Keke; Liu, Ling
Clustering data streams can provide critical information for making decisions in real time. We argue that detecting the change of clustering structure in data streams can be beneficial to many real-time monitoring applications. In this paper, we present a framework for detecting changes of clustering structure in categorical data streams. The change of clustering structure is detected by the change of the best number of clusters in the data stream. The framework consists of two main components: the BkPlot method for determining the best number of clusters in a categorical dataset, and the summarization structure, Hierarchical Entropy Tree (HE-Tree), for efficiently capturing the entropy property of the categorical data streams. HE-Tree enables us to quickly and precisely draw from the data stream the clustering information that the BkPlot method needs to identify a change in the best number of clusters. Combining the snapshots of the HE-Tree information and the BkPlot method, we are able to observe the change of clustering structure online. The experiments show that the HE-Tree + BkPlot method can efficiently and precisely detect the change of clustering structure in categorical data streams.
A Random Rotation Perturbation Approach to Privacy Preserving Data Classification
(Georgia Institute of Technology, 2005) Chen, Keke; Liu, Ling
This paper presents a random rotation perturbation approach for privacy preserving data classification. Concretely, we identify the importance of classification-specific information with respect to the loss of information factor, and present a random rotation perturbation framework for privacy preserving data classification. Our approach has two unique characteristics. First, we identify that many classification models utilize the geometric properties of datasets, which can be preserved by geometric rotation. We prove that the three types of classifiers will deliver the same performance over the rotation perturbed dataset as over the original dataset. Second, we propose a multi-column privacy model to address the problems of evaluating privacy quality for multidimensional perturbation. With this metric, we develop a locally optimal algorithm to find a good rotation perturbation in terms of privacy guarantee. We also analyze both naive estimation and ICA-based reconstruction attacks with the privacy model. Our initial experiments show that the random rotation approach can provide a high privacy guarantee while maintaining zero loss of accuracy for the discussed classifiers.
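The geometric property this abstract relies on, that a rotation leaves distances between records unchanged, can be checked with a small sketch. It is restricted to two dimensions and a single random angle for brevity; how the paper generates multidimensional rotations and optimizes them for privacy is not shown, and the function names are hypothetical.

```python
import math
import random

def rotate(points, theta):
    """Rotate 2D points about the origin by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def pairwise_distances(points):
    """Euclidean distance for every unordered pair of points."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]

rng = random.Random(0)  # fixed seed so the run is repeatable
original = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(5)]
theta = rng.uniform(0, 2 * math.pi)
perturbed = rotate(original, theta)

before = pairwise_distances(original)
after = pairwise_distances(perturbed)

# The coordinates change, but the largest change in any pairwise distance
# is only floating-point rounding error.
print(max(abs(a - b) for a, b in zip(before, after)))
```

Any classifier whose decisions depend only on such distances (a nearest-neighbour rule, for instance) sees the same geometry before and after the perturbation.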
Towards Finding Optimal Partitions of Categorical Datasets
(Georgia Institute of Technology, 2003) Chen, Keke; Liu, Ling
A considerable amount of work has been dedicated to clustering numerical data sets, but only a handful of categorical clustering algorithms have been reported to date. Furthermore, almost none has addressed the following two important cluster validity problems: (1) Given a data set and a clustering algorithm that partitions the data set into k clusters, how can we determine the best k with respect to the given dataset? (2) Given a dataset and a set of clustering algorithms with a fixed k, how can we determine which one will produce k clusters of the best quality? In this paper, we investigate the entropy and expected-entropy concepts for clustering categorical data, and propose a cluster validity method based on the characteristics of expected-entropy. In addition, we develop an agglomerative hierarchical algorithm (HierEntro) to incorporate the proposed cluster validity method into the clustering process. We report our initial experimental results showing the effectiveness of the proposed clustering validity method and the benefits of the HierEntro clustering algorithm.
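As a rough illustration of how entropy can score a partition of categorical records, the sketch below computes a size-weighted average of per-cluster entropies. This is an assumed, simplified reading of "expected entropy"; the paper's exact definition and the HierEntro algorithm itself are not reproduced, and all names and data are hypothetical.

```python
from collections import Counter
from math import log2

def column_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def cluster_entropy(rows):
    """Sum of attribute entropies within one cluster (attributes treated independently)."""
    return sum(column_entropy(column) for column in zip(*rows))

def expected_entropy(clusters):
    """Size-weighted average of cluster entropies; lower means purer clusters.
    Assumes at least one non-empty cluster."""
    total = sum(len(rows) for rows in clusters)
    return sum(len(rows) / total * cluster_entropy(rows) for rows in clusters)

rows = [("red", "small"), ("red", "small"), ("blue", "large"), ("blue", "large")]
pure_partition = [rows[:2], rows[2:]]                        # groups identical records
mixed_partition = [[rows[0], rows[2]], [rows[1], rows[3]]]   # mixes the two kinds

print(expected_entropy(pure_partition))
print(expected_entropy(mixed_partition))
```

Comparing this score across candidate partitions, or across values of k, is the kind of comparison the two validity questions in the abstract call for.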
Validating and Refining Clusters via Visual Rendering
(Georgia Institute of Technology, 2003) Chen, Keke; Liu, Ling
Clustering is an important technique for understanding large multi-dimensional datasets. Most clustering research to date has focused on developing automatic clustering algorithms and cluster validation methods. The automatic algorithms are known to work well in dealing with clusters of regular shapes, e.g. compact spherical shapes, but may incur higher error rates when dealing with arbitrarily shaped clusters. Although some efforts have been devoted to addressing the problem of skewed datasets, the problem of handling clusters with irregular shapes is still in its infancy, especially in terms of dimensionality of the datasets and the precision of the clustering results considered. Not surprisingly, statistical indices are also ineffective in validating clusters of irregular shapes. In this paper, we address the problem of clustering and validating arbitrarily shaped clusters with a visual framework (VISTA). The main idea of the VISTA approach is to capitalize on the power of visualization and interactive feedback to encourage domain experts to participate in the clustering revision and clustering validation process. The VISTA system has two unique features. First, it implements a linear and reliable visualization model to interactively visualize multidimensional datasets in a 2D star-coordinate space. Second, it provides a rich set of user-friendly interactive rendering operations, allowing users to validate and refine the cluster structure based on their visual experience as well as their domain knowledge.
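A generic star-coordinate projection gives a feel for the kind of linear 2D mapping this abstract mentions. The sketch below is the textbook form, in which each dimension gets an axis at an evenly spaced angle and a point is the weighted sum of its axis vectors; VISTA's actual model, normalization, and interactive operations are not specified in the abstract, and the `alphas` scaling parameter is a hypothetical stand-in for user adjustment.

```python
import math

def star_coordinates(row, alphas=None):
    """Map a d-dimensional record to 2D.
    Axis i points at angle 2*pi*i/d; alphas are per-axis scale factors."""
    d = len(row)
    if alphas is None:
        alphas = [1.0] * d
    x = sum(a * v * math.cos(2 * math.pi * i / d)
            for i, (a, v) in enumerate(zip(alphas, row)))
    y = sum(a * v * math.sin(2 * math.pi * i / d)
            for i, (a, v) in enumerate(zip(alphas, row)))
    return x, y

# With equal weights, very different records can land on the same 2D spot.
print(star_coordinates([1.0, 1.0, 1.0, 1.0]))
print(star_coordinates([0.0, 0.0, 0.0, 0.0]))

# Rescaling one axis pulls them apart again.
print(star_coordinates([1.0, 1.0, 1.0, 1.0], alphas=[2.0, 1.0, 1.0, 1.0]))
```

The overlap in the first two calls is why interactive adjustment matters: a single static projection can hide cluster structure that a change of axis weights reveals.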
PeerTrust: A Trust Mechanism for an Open Peer-to-Peer Information System
(Georgia Institute of Technology, 2002) Xiong, Li; Liu, Ling
In an open peer-to-peer information system, peers often have to interact with unknown or unfamiliar peers and need to manage the risk that is involved with the interactions without the presence of trusted third parties or trust authorities. It is important for peers to be able to reason about trust when interacting with each other to accomplish a task. This paper presents PeerTrust, a simple yet effective trust mechanism for quantifying and comparing the trustworthiness of peers. We argue that the amount of satisfaction a peer obtains through interactions, the total number of interactions that a peer has with other peers, and the balancing factor of trust all play a crucial role in evaluating the trustworthiness of the peer. This paper also discusses the architecture and the design considerations in implementing this mechanism in a decentralized peer-to-peer system. We report a set of initial experiments, showing the feasibility, the cost, and the benefit of our approach.
A New Document Placement Scheme for Cooperative Caching on the Internet
(Georgia Institute of Technology, 2002) Ramaswamy, Lakshmish Macheeri; Liu, Ling
The sharing of caches among proxies is an important technique to reduce Web traffic, alleviate network bottlenecks and improve response time of document requests. Most existing work on cooperative caching has focused on serving misses collaboratively. Very few have studied the effect of cooperation on document placement schemes and its potential enhancements on cache hit ratio and latency reduction. In this paper we propose a new document placement scheme, which takes into account the contention at individual caches in order to limit the replication of documents within a cache group and increase document hit ratio. The main idea of this new scheme is to view the aggregate disk space of the cache group as a global resource of the group, and to use the concept of cache expiration age to measure the contention of individual caches. The decision of whether to cache a document at a proxy is made collectively among the caches that already have a copy of this document. We refer to this new document placement scheme as the expiration age based scheme (EA scheme for short). The EA scheme effectively reduces the replication of documents across the cache group, while ensuring that a copy of the document always resides in a cache where it is likely to stay for the longest time. We report our initial study on the potential and limits of the EA scheme using both analytic modeling and trace-based simulation. The experiments show that the EA scheme yields higher hit rates and better response times compared to the existing document placement schemes used in most of the caching proxies.
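One way to read the placement decision in this abstract is as a comparison of expiration ages between the requesting proxy and the caches that already hold the document. The sketch below is an assumption-laden illustration only: the abstract does not define how expiration age is computed or the exact decision rule, so here each cache is simply given a number (a larger value meaning documents survive longer there), and a new copy is stored only if the requester would keep it longer than every current holder. All names and values are hypothetical.

```python
def should_cache_locally(requester_age, holder_ages):
    """Return True if the requesting proxy should store its own copy.

    requester_age: expiration age of the requesting cache (assumed given).
    holder_ages: expiration ages of caches that already hold the document.
    """
    if not holder_ages:
        # Nobody in the group has the document, so keep a copy here.
        return True
    # Replicate only when this cache is likely to keep the copy longest.
    return requester_age > max(holder_ages)

# A lightly contended proxy (high expiration age) takes a copy.
print(should_cache_locally(requester_age=120.0, holder_ages=[45.0, 80.0]))

# A heavily contended proxy (low expiration age) serves the request without storing.
print(should_cache_locally(requester_age=30.0, holder_ages=[45.0, 80.0]))
```

Under this reading, replication is limited because a heavily contended cache declines to store documents that a less contended neighbour already holds.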
PeerCQ: A Scalable and Self-Configurable Peer-to-Peer Information Monitoring System
(Georgia Institute of Technology, 2002) Gedik, Bugra; Liu, Ling
PeerCQ is a peer-to-peer Continual Query system for information monitoring on the Internet. It uses Continual Queries (CQs) as its primitives to express information-monitoring requests. A primary objective of the PeerCQ system is to build a decentralized, Internet-scale distributed information-monitoring system that is highly scalable and self-configurable and supports an efficient and robust way of processing CQs. In this paper we describe the basic architecture of the PeerCQ system and focus on the mechanisms used for service partitioning and service lookup. There are two unique characteristics of PeerCQ. First, it introduces a donation-based peer-aware mechanism for handling peer heterogeneity. Second, it integrates CQ-aware and peer-aware information into its service partitioning scheme, while maintaining decentralization and self-configurability. We report a set of initial experiments demonstrating the sensitivity of our approach to peer heterogeneity and the effectiveness of our service partitioning algorithm with respect to load balancing and system utilization.
Distributed Workflow Restructuring: An Experimental Study
(Georgia Institute of Technology, 2002) Ruiz, Duncan Dubugras; Liu, Ling; Pu, Calton
Workflow systems have been one of the enabling technologies for automation of business processes in corporate enterprises. Many modern production workflows need to incorporate deadline control throughout the workflow management system. However, the increasing volume and diversity of digital information available online and the unpredictable amount of network delays or server failures have led to a growing problem that conventional workflow management systems do not address, namely how to reorganize existing workflow activities in order to meet deadlines in the presence of unexpected delays. We refer to this problem as the workflow-restructuring problem. This paper describes the notation and issues of workflow restructuring, and discusses a set of workflow activity restructuring operators. We illustrate the inherent semantics of these restructuring operators using the Transactional Activity Model (TAM). The paper contains two main contributions. First, we study the environmental instabilities (e.g., resource shortages and network delays) that cause workflows to perform sub-optimally and how workflow restructuring can address this problem. Second, we evaluate the effectiveness of workflow-restructuring operators through simulation. Our initial experiments demonstrate that run-time workflow restructuring can improve response time significantly for unstable environments.
Vista: Looking Into the Clusters in Very Large Multidimensional Datasets
(Georgia Institute of Technology, 2002) Chen, Keke; Liu, Ling
Information Visualization is commonly recognized as a useful method for understanding sophistication in large datasets. In this paper, we introduce an efficient and flexible clustering approach that combines visual clustering and fast disk labelling for very large datasets. This paper has three contributions. First, we propose a framework, Vista, that incorporates information visualization methods into the clustering process in order to enhance the understanding of the intermediate clustering results and allow users to revise the clustering results before the disk labelling phase. Second, we introduce a fast and flexible disk-labelling algorithm, ClusterMap, which utilizes the visual clustering result to improve the overall performance of clustering on very large datasets. Third, we develop a visualization model that maps a multidimensional dataset to a 2D visualization while preserving or partially preserving clusters. Experiments show that Vista, combined with ClusterMap, is faster and has a lower error rate than existing algorithms for very large datasets. It is also flexible because the "cluster map" can be easily adjusted to meet application-specific clustering requirements.