Title:
Towards Finding Optimal Partitions of Categorical Datasets

Thumbnail Image
Author(s)
Chen, Keke
Liu, Ling
Authors
Person
Advisor(s)
Advisor(s)
Editor(s)
Associated Organization(s)
Organizational Unit
Supplementary to
Abstract
A considerable amount of work has been dedicated to clustering numerical data sets, but only a handful of categorical clustering algorithms are reported to date. Furthermore, almost none has addressed the following two important cluster validity problems: (1) Given a data set and a clustering algorithm that partitions the data set into k clusters, how can we determine the best k with respect to the given dataset? (2) Given a dataset and a set of clustering algorithms with a fixed k, how to determine which one will produce k clusters of the best quality? In this paper, we investigate the entropy and expected-entropy concepts for clustering categorical data, and propose a cluster validity method based on the characteristics of expected-entropy. In addition, we develop an agglomerative hierarchical algorithm (HierEntro) to incorporate the proposed cluster validity method into the clustering process. We report our initial experimental results showing the effectiveness of the proposed clustering validity method and the benefits of the HierEntro clustering algorithm.
Sponsor
Date Issued
2003
Extent
171223 bytes
Resource Type
Text
Resource Subtype
Technical Report
Rights Statement
Rights URI