Cluster analysis is a method of organizing data into representative groups based upon similar characteristics. Each member of the cluster has more in common with other members of the same cluster than with members of the other groups. The most representative point within the group is called the centroid. Usually, this is the mean of the values of the points of data in the cluster.
Organize the data. If the data consists of a single variable, a histogram might be appropriate. If two variables are involved, graph the data on a coordinate plane. For example, if you were looking at the height and weight of school children in a classroom, plot the points of data for each child on a graph, with the weight being the horizontal axis and the height being the vertical axis. If more than two variables are involved, matrices may be needed to display the data.
Group the data into clusters. Each cluster should consist of the points of data closest to it. In the height and weight example, group any points of data that appear to be close together. The number of clusters, and whether every point of data has to be in a cluster, may depend upon the purposes of the study.
For each cluster, add the values of all members. For example, if a cluster of data consisted of the points (80, 56), (75, 53), (60, 50), and (68,54), the sum of the values would be (283, 213).
Divide the total by the number of members of the cluster. In the example above, 283 divided by four is 70.75, and 213 divided by four is 53.25, so the centroid of the cluster is (70.75, 53.25).
Plot the cluster centroids and determine whether any points are closer to a centroid of another cluster than they are to the centroid of their own cluster. If any points are closer to a different centroid, redistribute them to the cluster containing the closer centroid.
Repeat Steps 3, 4 and 5 until all points of data are in the cluster containing the centroid to which they are closest.
If the centroid has to be a particular point of data instead of a midpoint between the data, then the median may be used to determine it, instead of the mean.