Agglomerative clustering algorithm in R

4 min readNov 29, 2022

Usually when we talk about clustering, we’re going to think about K-means and Hierarchical Clustering, but here we’re going to talk about Hierarchical Clustering. Hierarchical Cluster analysis can be divided into discrete, top-down (DI), which is a method of separating data into detailed clusters, as opposed to Aggolomerative (Bottom-Up Method) (AG), which starts with individual data and binds similar data into clusters.

https://rpubs.com/Evan_Jung/hierarchical_clustering

What the researcher has to think about here is what methodology to approach. You can make a choice after comparing the two approaches. However, in general, AGs perform well when they want to find clusters of smaller units, and DIs perform well when they want to find larger clusters.

Hierarchical clustering is an algorithm (a binary tree configuration of data) that calculates the distance between all two clusters in a given data.
1. Bundle the data at the closest distance together (repeat)
2. Proceed until it eventually merges into one cluster
3. Align with natural hierarchies in the form of Dendrogram.

https://bioinformaticsandme.tistory.com/145

Linkage method

Complete: The longest connection method defines the longest distance between two clusters as the intercluster distance.
Single : Shortest connection, which defines the shortest distance between two clusters as the intercluster distance.
Ward.D : A method proposed by Ward to merge clusters based on the sum of squared deviations within clusters rather than on the distance between clusters. Define that the loss of information in merging clusters is minimal.
Ward.D2 : The Ward.D method uses standardized values and uses power values instead of absolute values.
Average : The mean connection method defines the average distance between all individuals in each cluster.
Mcquitty: Weighted Pair Group Method with Arithmetic Means (WPGMA), in which two nearest clusters combine to form a single group, the distance from the other clusters is obtained by the arithmetic mean.
Median: The median distance between clusters is defined as the median of all samples in a cluster.
Centroid: The distance between two clusters is defined as the distance between the centers of two clusters. Here, s,t represents the center point of each cluster.

Cluster linkage

single : Use the distance between two clusters between the nearest data
complete : Use the distance between two clusters between the farthest data
average : Leverage the average distance between all data in a cluster and all data in another cluster

Clustering algorithm in R

library(cluster) 

# use daisy() function to calculate gower distance
  gower.dist <- daisy(carAccident[, 2:9], metric = c("gower"))
  class(gower.dist)

# Using "complete" linkage - agglomerative
  agg_clust_c <- hclust(gower_distance, method = "complete")
  plot(agg_clust_c, main = "Agglomerative, complete linkages")

# Using "complete" linkage - divisive
divisive_clust <- diana(as.matrix(gower_distance), 
                  diss = TRUE, keep.diss = TRUE)
plot(divisive_clust, main = "Divisive")

# divide dendrogram into two clusters
  rect.hclust(divisive_clust, k=2)
  group <- cutree(divisive_clust, k =2)
  
  table(group)

Hierarchical cluster analysis shows cluster analysis results as dendrogram in tree structure format as shown above.

Advantages

It is easy to examine the structural relationship between the entire clusters
There is no need to determine the number of clusters in advance

Disadvantages

Calculation is slow when the data set is very large
Rearranging data, or excluding several observations, can result in completely different results
Once an object is assigned to a particular cluster, it can sometimes lead to incorrect cluster results because it has the property of never being included in another cluster again
Initial classification of outliers distorts the overall structure

Algorithms

Form n clusters of objects using each object as one cluster
Summarize the two nearest clusters by calculating the distance between each cluster
Repeat 2 times to continue merging clusters until all objects become one cluster
Dist() function is frequently used for distance calculation
The hclust() function, which combines two nearby clusters, is often used for hierarchical cluster analysis

ref.

https://rpubs.com/Evan_Jung/hierarchical_clustering

https://bioinformaticsandme.tistory.com/145

Agglomerative clustering algorithm in R

Linkage method

Cluster linkage

Clustering algorithm in R

Advantages

Disadvantages

Algorithms

Sign up to discover human stories that deepen your understanding of the world.

Free

Membership

Written by Yoonseul Choi

No responses yet