Algorithms for Clustering Data

Cluster analysis is an important technique in the rapidly growing field known as exploratory data analysis and is being applied in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, and remote sensing. Cluster analysis organizes data by abstracting underlying structure either as a grouping of individuals or as a hierarchy of groups. The representation can then be investigated to see if the data group according to preconceived ideas or to suggest new experiments. Cluster analysis is a tool for exploring the structure of the data that does not require the assumptions common to most statistical methods. It is called ‘unsupervised learning’ in the literature of pattern recognition and artificial intelligence.

This book will be useful for those in the scientific community who gather data and seek tools for analyzing and interpreting data. It will be a valuable reference for scientists in a variety of disciplines and can serve as a textbook for a graduate course in exploratory data analysis as well as a supplemental text in courses on research methodology, pattern recognition, image processing, and remote sensing. The book emphasizes informal algorithms for clustering data and interpreting results. Graphical procedures and other tools for visually representing data are introduced both to evaluate the results of clustering and to explore data. Mathematical and statistical theory are introduced only when necessary.

Most existing books on cluster analysis are written by mathematicians, numerical taxonomists, social scientists, and psychologists who emphasize either the methods that lend themselves to mathematical treatment or the applications in their particular area. Our book strives for a sense of completeness and for a balanced presentation. We bring together many results that are scattered through the literature of several fields. The most distinctive feature of this book is its thorough, understandable treatment of cluster validity, or the objective validation of the results of cluster analysis, from the application viewpoint.

This book resulted from class notes that the authors have used in a graduate course on clustering and scaling algorithms in the Department of Computer Science at Michigan State University. The prerequisites for this course are probability theory, matrix algebra, computer programming, and data structures. In addition to homework problems and an exam, the students in this course work on a project that can range from the analysis of a real data set to a comparative analysis of various algorithms. This course is particularly useful for students who wish to pursue research in pattern recognition, image processing, and artificial intelligence. Interested readers may contact the authors for homework problems for this course.