Title:
Towards Finding Optimal Partitions of Categorical Datasets
Authors
Chen, Keke
Liu, Ling
Abstract
A considerable amount of work has been dedicated to clustering numerical datasets, but only a handful of categorical
clustering algorithms have been reported to date. Furthermore, almost none of them addresses the following two important
cluster validity problems: (1) Given a dataset and a clustering algorithm that partitions the dataset into k clusters, how
can we determine the best k with respect to the given dataset? (2) Given a dataset and a set of clustering algorithms with
a fixed k, how can we determine which one will produce k clusters of the best quality? In this paper, we investigate the
entropy and expected-entropy concepts for clustering categorical data, and propose a cluster validity method based on the
characteristics of expected-entropy. In addition, we develop an agglomerative hierarchical algorithm (HierEntro) that
incorporates the proposed cluster validity method into the clustering process. We report initial experimental results
showing the effectiveness of the proposed cluster validity method and the benefits of the HierEntro clustering algorithm.
Date Issued
2003
Extent
171223 bytes
Resource Type
Text
Resource Subtype
Technical Report