Clustering is an important data mining tool for statistics and machine learning. It belongs to the class of unsupervised learning algorithms and its main function is to group together objects that share similar features into clusters. Here I present a short demonstration of how a clustering algorithm can be applied in R and what it may be used for. To this end, I present a real use case from a research project on musical taste I conducted at the Max Planck Institute for Empirical Aesthetics.
Existing clustering methods can be roughly divided into two groups: those where the number of clusters is predefined (e.g. k-means clustering) and those where the number is determined by the algorithm (and additional heuristics). One approach belonging to the latter group is hierarchical clustering, which is useful when the number and structure of clusters are unknown. An agglomerative hierarchical clustering algorithm, for example, starts with each observation unit as its own cluster. It then merges the two most similar clusters into one and repeats this step, each time merging the two closest remaining clusters, until all objects are combined in one big cluster. The advantage of hierarchical clustering is that the number of clusters is not predefined but is instead evaluated based on the data structure and the output of the clustering procedure.
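To make the merging process concrete, here is a minimal sketch on toy data. The five values and their labels are hypothetical, and average linkage is chosen only to keep the arithmetic easy to follow; it is not the method used in the analysis further down.

```r
# Toy data: five points on a line, forming two visibly separated groups
# (hypothetical values, chosen only for illustration)
x <- c(a = 1.0, b = 1.2, c = 1.9, d = 8.0, e = 8.3)

# Pairwise Euclidean distances, then agglomerative clustering
d  <- dist(x)
hc <- hclust(d, method = "average")

# Each row of hc$merge is one agglomeration step: negative numbers are
# single observations, positive numbers refer to clusters formed in earlier rows
print(hc$merge)

# Dissimilarity at which each merge happened
print(hc$height)

# Cutting the tree into two clusters separates {a, b, c} from {d, e}
groups <- cutree(hc, k = 2)
print(groups)
```

With five observations there are four merge steps, and the last one joins everything into a single cluster.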
The clustering procedure is as follows: from a given set of features (a predefined and possibly engineered feature matrix), a measure of proximity is derived for each pair of objects, resulting in a distance matrix. There are several options for calculating the distances, but the most common and straightforward is the Euclidean distance. The clustering algorithm then operates on this distance matrix and yields “clusters” (groups or types of things) whose members share a characteristic set of features.
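As a quick illustration of what the Euclidean distance measures, the sketch below computes it by hand for two made-up listeners (the names and scores are hypothetical) and compares the result with R's dist() function.

```r
# Two hypothetical listeners scored on three taste dimensions
taste <- rbind(
  listener_1 = c(2.3, 2.3, 4.3),
  listener_2 = c(4.3, 3.2, 1.7)
)

# Euclidean distance by hand: square root of the summed squared differences
by_hand <- sqrt(sum((taste["listener_1", ] - taste["listener_2", ])^2))

# The same quantity from dist(), which stores only the lower triangle
# of the full distance matrix
d <- dist(taste, method = "euclidean")

print(by_hand)
print(as.matrix(d))  # full symmetric matrix with zeros on the diagonal
```

With more than two rows, dist() returns one such value for every pair of rows, which is exactly the input a hierarchical clustering algorithm needs.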
The data for this case was collected in an online survey on musical taste, where 1003 participants indicated their listening frequency for 22 musical styles. The first thing one might want to do is to reduce the number of variables by identifying underlying dimensions of musical taste. One common approach is principal component analysis. The idea is to determine n dimensions that optimally represent the information originally contained in the 22 variables. Since this post is mainly about clustering, I will not go into further detail about dimensionality reduction here. For the data here, six dimensions of musical taste were identified and used for the analysis. The data looked like this:
    ## # A tibble: 8 x 5
    ##    JAZZ CLASSICAL HOUSE   POP  FOLK
    ##   <dbl>     <dbl> <dbl> <dbl> <dbl>
    ## 1   2.3       2.3   4.3   2.8   4.0
    ## 2   4.3       3.2   1.7   2.4   3.5
    ## 3   2.8       2.3   2.3   4.3   4.3
    ## 4   4.0       4.3   3.2   1.7   4.0
    ## 5   5.1       2.8   2.4   4.3   1.7
    ## 6   3.2       4.0   5.1   3.2   2.3
    ## 7   1.7       2.8   2.4   4.3   1.7
    ## 8   2.4       4.0   5.1   3.2   3.2
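For readers curious about the dimensionality-reduction step mentioned above, here is a rough sketch using prcomp() from the stats package. It runs on simulated random numbers rather than the survey data, the variable names are made up, and the actual study may well have used a different variant (for example with rotated components), so treat it only as an outline of the idea.

```r
set.seed(1)  # reproducible simulated data

# Simulated stand-in for the survey: 100 respondents rating 22 styles
# (random numbers, NOT the study data)
ratings <- matrix(rnorm(100 * 22), nrow = 100, ncol = 22,
                  dimnames = list(NULL, paste0("style_", 1:22)))

# Principal component analysis on standardized variables
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained)[1:6], 2))

# Keep the scores on the first six components as the new feature matrix
n_dims <- 6
scores <- pca$x[, 1:n_dims]
print(dim(scores))
```

The resulting score matrix has one row per respondent and one column per retained dimension, which is the shape of the table shown above.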
Next, based on the music taste dimensions, a distance matrix (dist.mat) was computed with a Euclidean distance measure.
    library(dplyr)  # provides the %>% pipe

    dist.mat <- musical_taste %>%
      na.omit() %>%                  # eliminate rows with missing values
      dist(method = 'euclidean')     # compute Euclidean distance matrix
The distance matrix was then passed to the hierarchical clustering algorithm (using the hclust() function from the stats package) and the result was plotted. There are several heuristics for determining the optimal number of clusters, and the NbClust() function from the NbClust package provides 30 of them. Although it is computationally expensive, the function helpfully summarizes all 30 indices and recommends a number of clusters based on all of them. In the present case, the optimal number of clusters was three.
    # Applying the clustering algorithm
    # ('ward' was renamed 'ward.D' in R >= 3.1.0)
    dist.cluster <- dist.mat %>%
      hclust(method = 'ward.D')      # apply hierarchical clustering to the distance matrix
    plot(dist.cluster, hang = -1)    # plot the dendrogram, labels at the same level

    # Determining the optimal number of clusters
    library(NbClust)
    NbClust(data = na.omit(musical_taste), diss = NULL, distance = "euclidean",
            min.nc = 2, max.nc = 15, method = "ward.D", index = "all")

    # Partition the data according to the cluster solution
    groups <- cutree(dist.cluster, k = 3)
    plot(dist.cluster, hang = -1)    # redraw: NbClust's own plots may have replaced the dendrogram
    rect.hclust(dist.cluster, k = 3, border = 'red')
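To give a feel for what a single heuristic of this kind does, here is a small base-R sketch that tracks the total within-cluster sum of squares as the tree is cut into more and more clusters. This is a simple illustration on simulated data, not necessarily one of the indices NbClust computes, and the helper within_ss() is a hypothetical name introduced only for this example.

```r
set.seed(7)  # reproducible simulated data

# Three artificial, well-separated groups in two dimensions (NOT the study data)
centers <- rbind(c(0, 0), c(5, 0), c(0, 5))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(25, centers[i, 1], 0.4), rnorm(25, centers[i, 2], 0.4))
}))

hc <- hclust(dist(x), method = "ward.D2")

# Total within-cluster sum of squares for a given partition
# (hypothetical helper, defined here for illustration)
within_ss <- function(data, groups) {
  parts <- split(as.data.frame(data), groups)
  sum(sapply(parts, function(g) sum(scale(g, center = TRUE, scale = FALSE)^2)))
}

# Evaluate the partitions obtained by cutting the tree into 1 to 8 clusters
ks  <- 1:8
wss <- sapply(ks, function(k) within_ss(x, cutree(hc, k = k)))
print(round(wss, 1))

# Reduction gained by each additional cluster
drops <- -diff(wss)
print(round(drops, 1))
```

The value can only shrink as clusters are added, so one looks for the point where further splits stop paying off. In this simulated example that happens after three clusters, because the data were generated from three groups.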
The figure below shows the cluster dendrogram of the hierarchical agglomeration, which starts at the bottom with each case as its own cluster and ends at the top, where all cases are merged into one cluster. The red lines mark the division of the data set into three clusters.
Following the clustering of participants into three groups, their specific profiles of musical taste were analyzed. This was done by computing summary statistics for each cluster on each of the six dimensions of musical taste. Because the algorithm clustered the participants based on their scores on the musical taste dimensions, this is the first thing you would want to inspect. After that, you may want to look at the results (plotted below) and try to make sense of them, which is how the names of the clusters came about.
The “engaged listeners” cluster shows the greatest musical engagement with four of the six dimensions with peaks on Jazz and Classical. The “rock listeners” seem to only enjoy rock and the “conventional listeners” exhibit a medium engagement with peaks at Classical, House, and Pop.
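The per-cluster summary statistics described above can be computed along the following lines. This is a sketch on simulated scores with two made-up dimensions, not the code from the study. It also uses the "ward.D2" method, which applies Ward's criterion to unsquared Euclidean distances, whereas the analysis above used 'ward'.

```r
set.seed(42)  # reproducible simulated data

# Simulated scores on two hypothetical dimensions for three artificial groups
# (NOT the study data)
centers <- rbind(c(1, 1), c(4, 1), c(2.5, 4))
scores <- do.call(rbind, lapply(1:3, function(i) {
  cbind(dim_1 = rnorm(30, centers[i, 1], 0.3),
        dim_2 = rnorm(30, centers[i, 2], 0.3))
}))

# Cluster and cut the tree into three groups
hc <- hclust(dist(scores), method = "ward.D2")
groups <- cutree(hc, k = 3)

# Cluster sizes
print(table(groups))

# Mean score per cluster on each dimension: one row per cluster
profiles <- aggregate(as.data.frame(scores),
                      by = list(cluster = groups),
                      FUN = mean)
print(profiles)
```

Each row of the resulting table is one cluster's profile. Comparing the rows is what suggests labels such as the ones used above.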
In summary, the agglomerative hierarchical clustering approach was useful here because the exact number of clusters was previously unknown. It yielded a typology of music listeners with distinct profiles of musical engagement. These types were examined further in subsequent analyses of individual differences. They could also be used for marketing purposes, e.g. to inform decision-making processes or to target specific audiences, and they already provide basic information about how musical taste is configured.
If you found this short demonstration of a hierarchical clustering algorithm interesting, you can find more information about it in the full publication here. Note that the two figures above were previously published in this publication in the journal Frontiers in Psychology.