Understanding Hierarchical Clustering

Dhruv Khanna
4 min read · Jun 9, 2021

What is Clustering?

Clustering is a technique that groups similar objects so that the objects in the same group are more similar to each other than to objects in other groups. A group of similar objects is called a cluster. Clustering falls under the category of unsupervised learning, meaning there is no labeled class or target variable for the given dataset.

Dendrograms

Before jumping into Hierarchical Clustering, we need a basic understanding of dendrograms, because they are used to visualize how the clustering works.

A dendrogram is a tree-like diagram that records each merge (or split) step the Hierarchical Clustering algorithm performs. In the dendrogram plot, the y-axis shows the distance at which clusters are merged, and the x-axis shows all the data points of the given dataset.
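
As a quick illustration, the sketch below builds a dendrogram with SciPy on a small synthetic dataset; the data, random seed, and Ward linkage are assumptions chosen purely for the example.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Small illustrative dataset: 20 points with 2 features
    rng = np.random.default_rng(42)
    X = rng.normal(size=(20, 2))

    # linkage() records every merge the algorithm performs, one row per merge
    Z = linkage(X, method="ward")

    # The dendrogram visualizes those merges:
    # x-axis = data points, y-axis = distance at which clusters are merged
    dendrogram(Z)
    plt.xlabel("Data points")
    plt.ylabel("Distance")
    plt.show()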

Hierarchical Clustering

Hierarchical Clustering creates clusters in a hierarchical, tree-like structure (also called a dendrogram): the root node corresponds to the entire dataset, and branches split off from the root to form progressively smaller clusters of similar data. It is an alternative to k-means clustering for identifying groups in a data set.

Hierarchical Clustering is of two types.

  1. Divisive Hierarchical Clustering: Also called the top-down approach. All of the data starts in a single cluster, which is split repeatedly until each observation forms its own cluster.
  2. Agglomerative Hierarchical Clustering: Popularly known as the bottom-up approach. Each observation starts as its own cluster, and pairs of clusters are merged step by step until all of the data sits in one big cluster (a minimal sketch of this approach follows below).
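
As a minimal sketch of the agglomerative (bottom-up) approach, the snippet below runs scikit-learn's AgglomerativeClustering on synthetic blob data; the dataset and the choice of three clusters are assumptions made only for illustration.

    from sklearn.datasets import make_blobs
    from sklearn.cluster import AgglomerativeClustering

    # Illustrative synthetic data with three well-separated groups
    X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

    # Merge clusters bottom-up (Ward linkage) until three clusters remain
    model = AgglomerativeClustering(n_clusters=3, linkage="ward")
    labels = model.fit_predict(X)   # one cluster label per observation
    print(labels[:10])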

For agglomerative methods, a question arises: how do we measure the dissimilarity between two clusters? To answer it, we use different linkage methods, which are compared in the code sketch after this list:

  • Maximum or complete linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
  • Minimum or single linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
  • Mean or average linkage clustering: Computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters. Can vary in the compactness of the clusters it creates.
  • Centroid linkage clustering: Computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p, one element for each variable) and the centroid for cluster 2.
  • Ward’s minimum variance method: Minimizes the total within-cluster variance. At each step, the pair of clusters whose merger leads to the smallest increase in total within-cluster variance is merged. It tends to produce compact clusters.
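
The sketch below, again on illustrative synthetic data, runs SciPy's linkage function with each of these criteria and cuts every resulting tree into three clusters, so you can see how the linkage choice changes the groupings.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Illustrative synthetic data: 30 points, 2 features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 2))

    for method in ("complete", "single", "average", "centroid", "ward"):
        Z = linkage(X, method=method)                    # build the merge tree
        labels = fcluster(Z, t=3, criterion="maxclust")  # cut it into 3 clusters
        print(method, "-> cluster sizes:", np.bincount(labels)[1:])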

Determining Optimal Clusters

Hierarchical clustering provides a complete dendrogram representing the cluster relationships, but we may still need to choose the number of clusters to extract. Fortunately, we can apply approaches similar to those used for k-means clustering, such as the elbow, silhouette, and gap statistic methods, and compare the cluster counts they suggest.

In that comparison there is no single, clearly optimal number of clusters, but it narrows the list down to three candidates (8, 9, 10) that can be compared directly.
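
As one example of these approaches, the sketch below scores a range of candidate cluster counts for agglomerative clustering with the silhouette coefficient; the synthetic data and the range of k values are assumptions for illustration, and higher scores indicate better-separated clusters.

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    # Illustrative synthetic data
    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # Try k = 2..10 and report the silhouette score for each cut
    for k in range(2, 11):
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        print(k, round(silhouette_score(X, labels), 3))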

Originally published at http://docs.google.com.
