
Beyond Kleinberg’s Impossibility Theorem of Clustering: My Study Note of a Pragmatic Clustering Evaluation Framework | by Michio Suginoo | Jun, 2024

Now, let’s focus on internal validation and external validation. Below, I will list some metrics of my choice, with hyperlinks where you can trace their definitions and formulas in detail. Since I will not cover the formulas for these metrics, readers are advised to follow the hyperlinks provided below to find them.

A. Metrics used for Internal Validation

The objective of internal validation is to establish the quality of the clustering structure based solely on the given dataset.

Classification of internal evaluation methods: internal validation methods can be categorized according to the classes of clustering methodologies. A typical classification of clustering can be formulated as follows:

- Partitioning methods (e.g. K-means),
- Hierarchical methods (e.g. agglomerative clustering),
- Density-based methods (e.g. DBSCAN), and
- the rest.

Here, I cover the first two: partitioning clustering and hierarchical clustering.

a) Partitioning Methods: e.g. K-means

For partitioning methods, there are three bases for evaluation metrics: cohesion, separation, and a hybrid of the two.

Cohesion: Cohesion evaluates the closeness of the intra-cluster data structure. The lower the value of a cohesion metric, the better the quality of the clusters. An example of a cohesion metric is SSW: the Sum of Squared Errors Within Clusters.

Separation: Separation is an inter-cluster metric and evaluates the dispersion of the inter-cluster data structure. The idea behind a separation metric is to maximize the distance between clusters. An example of a separation metric is SSB: the Sum of Squared Errors Between Clusters.

Hybrid of both cohesion and separation: A hybrid metric quantifies the level of separation and cohesion in a single metric.
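To make the cohesion/separation distinction concrete, here is a minimal sketch of my own (not from the paper; the synthetic data and variable names are illustrative) that computes SSW and SSB for a K-means result with NumPy and scikit-learn:

```python
# Illustrative sketch: cohesion (SSW) and separation (SSB) for a K-means
# result on synthetic data. Variable names are hypothetical.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
labels, centers = km.labels_, km.cluster_centers_

# SSW: sum of squared distances of each point to its own cluster center
# (scikit-learn exposes the same quantity as km.inertia_).
ssw = sum(((X[labels == k] - centers[k]) ** 2).sum() for k in range(4))

# SSB: size-weighted squared distances of cluster centers to the grand mean.
grand_mean = X.mean(axis=0)
ssb = sum((labels == k).sum() * ((centers[k] - grand_mean) ** 2).sum()
          for k in range(4))

print(f"SSW (cohesion, lower is better): {ssw:.2f}")
print(f"SSB (separation, higher is better): {ssb:.2f}")
```

Note that, because a converged K-means center is the mean of its assigned points, SSW + SSB equals the total sum of squares of the dataset; for a fixed dataset and number of clusters, lowering SSW and raising SSB are two views of the same trade-off.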
Here is a list of examples:

i) The silhouette coefficient: in the range of [-1, 1]

This metric is a relative measure of the inter-cluster distance with the neighboring cluster. Here is a general interpretation of the metric:

- The best value: 1.
- The worst value: -1.
- Values near 0: overlapping clusters.
- Negative values: high possibility that a sample has been assigned to the wrong cluster.

Here is a use case example of the metric: https://www.geeksforgeeks.org/silhouette-index-cluster-validity-index-set-2/?ref=ml_lbp

ii) The Calinski-Harabasz coefficient:

Also known as the Variance Ratio Criterion, this metric measures the ratio of inter-cluster dispersion to intra-cluster dispersion, summed over all clusters. For a given assignment of clusters, the higher the value of the metric, the better the clustering result, since a higher value indicates that the resulting clusters are compact and well separated.

Here is a use case example of the metric: https://www.geeksforgeeks.org/dunn-index-and-db-index-cluster-validity-indices-set-1/?ref=ml_lbp

iii) Dunn Index:

For a given assignment of clusters, a higher Dunn index indicates better clustering.

Here is a use case example of the metric: https://www.geeksforgeeks.org/dunn-index-and-db-index-cluster-validity-indices-set-1/?ref=ml_lbp

iv) Davies-Bouldin Score:

The metric measures the ratio of intra-cluster distances to inter-cluster distances, averaged over the clusters. A lower score therefore indicates a denser intra-cluster structure and a more separated inter-cluster structure, and thus a better clustering result.

Here is a use case example of the metric: https://www.geeksforgeeks.org/davies-bouldin-index/

b) Hierarchical Methods: e.g. agglomerative clustering

i) Human judgement based on a visual representation of the dendrogram.

Although Palacio-Niño & Berzal did not include human judgement, it is one of the most useful tools for internal validation of hierarchical clustering, based on the dendrogram. Instead, the co-authors listed the following two correlation coefficient metrics specialized in evaluating the results of a hierarchical clustering. For both, higher values indicate better results, and both take values in the range of [-1, 1].

ii) The Cophenetic Correlation Coefficient (CPCC): [-1, 1]

It measures the correlation between the original pairwise distances between observations and the cophenetic distances implied by the linkage of the hierarchical clustering.

iii) Hubert Statistic: [-1, 1]

A higher Hubert value corresponds to a better clustering of the data.

c) Potential Category: Self-supervised Learning

Self-supervised learning can generate feature representations which can be used for clustering. Self-supervised methods have no explicit labels in the dataset but use the input data itself as labels for learning. Palacio-Niño & Berzal did not include self-supervised frameworks, such as autoencoders and GANs, in their proposal in this section. Well, they are not clustering algorithms per se. Nevertheless, I will keep this particular domain pending in my note. Time will tell whether any specialized metrics emerge from it.

Before closing the section on internal validation, here is a caveat from Gere (2023):

“Choosing the proper hierarchical clustering algorithm and number of clusters is always a key question … . In many cases, researchers do not publish any reason why it was chosen a given distance measure and linkage rule along with cluster numbers. The reason behind this could be that different cluster validation and comparison techniques give contradictory results in most cases. … The results of the validation methods deviate, suggesting that clustering depends heavily on the data set in question.
Although Euclidean distance, Ward’s method seems a safe choice, testing, and validation of different clustering combinations is strongly suggested.”

Yes, it is a hard task.

Now, let’s move on to external validation.
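As a practical footnote before turning to external validation: most of the internal metrics above are readily available in scikit-learn and SciPy. Here is a minimal sketch of my own on synthetic data (not from Palacio-Niño & Berzal):

```python
# Illustrative sketch of the internal metrics discussed above, using
# scikit-learn and SciPy on synthetic data. Names are hypothetical.
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Partitioning-style metrics: silhouette in [-1, 1] (higher is better),
# Calinski-Harabasz (higher is better), Davies-Bouldin (lower is better).
sil = silhouette_score(X, labels)
ch = calinski_harabasz_score(X, labels)
db = davies_bouldin_score(X, labels)
print("silhouette:", sil)
print("calinski-harabasz:", ch)
print("davies-bouldin:", db)

# Hierarchical-style metric: the cophenetic correlation coefficient (CPCC),
# the correlation between the original pairwise distances and the
# cophenetic distances implied by the dendrogram's merge heights.
Z = linkage(X, method="ward")
cpcc, _ = cophenet(Z, pdist(X))
print("CPCC:", cpcc)
```

Keep the directions straight when comparing runs: silhouette and Calinski-Harabasz are higher-is-better, Davies-Bouldin is lower-is-better, and a CPCC close to 1 means the dendrogram faithfully preserves the original pairwise distances.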
