The Rand index or Rand measure (named after William M. Rand) in statistics, and in particular in data clustering, is a measure of the similarity between two data clusterings. A form of the Rand index that is adjusted for the chance grouping of elements may also be defined; this is the adjusted Rand index. From a mathematical standpoint, the Rand index is related to accuracy, but is applicable even when class labels are not used.
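Given a set of $n$ elements $S = \{o_1, \ldots, o_n\}$ and two partitions of $S$ to compare, $X = \{X_1, \ldots, X_r\}$, a partition of $S$ into $r$ subsets, and $Y = \{Y_1, \ldots, Y_s\}$, a partition of $S$ into $s$ subsets, define the following:
$a$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in the same subset in $Y$;
$b$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in different subsets in $Y$;
$c$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in different subsets in $Y$;
$d$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in the same subset in $Y$.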
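The Rand index, $R$, is:
$$R = \frac{a+b}{a+b+c+d} = \frac{a+b}{\binom{n}{2}},$$
where $\binom{n}{2}$ is the total number of pairs among the $n$ elements.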
Intuitively, $a + b$ can be considered as the number of agreements between $X$ and $Y$, and $c + d$ as the number of disagreements between $X$ and $Y$.
The Rand index has a value between 0 and 1, with 0 indicating that the two clusterings do not agree on any pair of points and 1 indicating that the clusterings are exactly the same.
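The pair counts can be obtained by enumerating every pair of elements and checking whether the two clusterings place it together or apart. The following is a minimal sketch, assuming each clustering is given as a sequence of cluster labels indexed by element; the function name `rand_index` and this input format are illustrative choices, not part of any particular library.

```python
from itertools import combinations


def rand_index(labels_x, labels_y):
    """Rand index of two clusterings given as equal-length label sequences.

    labels_x[i] and labels_y[i] are the cluster labels of element i in the
    two clusterings (an assumed input format for this sketch).
    """
    if len(labels_x) != len(labels_y):
        raise ValueError("both labelings must cover the same elements")

    a = b = c = d = 0
    # Visit each unordered pair of elements exactly once.
    for i, j in combinations(range(len(labels_x)), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x and same_y:
            a += 1  # together in both clusterings
        elif not same_x and not same_y:
            b += 1  # apart in both clusterings
        elif same_x:
            c += 1  # together in X, apart in Y
        else:
            d += 1  # apart in X, together in Y

    total = a + b + c + d  # equals n choose 2
    if total == 0:
        raise ValueError("need at least two elements")
    return (a + b) / total
```

Because the index depends only on which pairs are grouped together, relabeling the clusters does not change the result: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` compares two identical partitions. The enumeration takes time quadratic in the number of elements, which is acceptable for illustration but not for large data sets.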
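In mathematical terms, $a$, $b$, $c$, $d$ are defined as follows, where $o_i, o_j$ are elements of $S$, $X_k$ denotes a subset of the partition $X$, and $Y_l$ a subset of the partition $Y$:
$$a = |\{(o_i, o_j) \mid o_i, o_j \in X_k,\ o_i, o_j \in Y_l\}|$$
$$b = |\{(o_i, o_j) \mid o_i \in X_{k_1},\ o_j \in X_{k_2},\ o_i \in Y_{l_1},\ o_j \in Y_{l_2}\}|$$
$$c = |\{(o_i, o_j) \mid o_i, o_j \in X_k,\ o_i \in Y_{l_1},\ o_j \in Y_{l_2}\}|$$
$$d = |\{(o_i, o_j) \mid o_i \in X_{k_1},\ o_j \in X_{k_2},\ o_i, o_j \in Y_l\}|$$
for some $1 \le i < j \le n$, $1 \le k, k_1, k_2 \le r$ with $k_1 \ne k_2$, and $1 \le l, l_1, l_2 \le s$ with $l_1 \ne l_2$.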
Adjusted Rand index
The adjusted Rand index is the corrected-for-chance version of the Rand index. Although the Rand index can only take values between 0 and 1, the adjusted Rand index can be negative when the index is less than its expected value.
The contingency table
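Given a set $S$ of $n$ elements, and two groupings (e.g. clusterings) of these elements, namely $X = \{X_1, X_2, \ldots, X_r\}$ and $Y = \{Y_1, Y_2, \ldots, Y_s\}$, the overlap between $X$ and $Y$ can be summarized in a contingency table $[n_{ij}]$ where each entry $n_{ij}$ denotes the number of objects in common between $X_i$ and $Y_j$: $n_{ij} = |X_i \cap Y_j|$. The row sums of the table are written $a_i = \sum_j n_{ij}$ and the column sums $b_j = \sum_i n_{ij}$.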
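As an illustration of the table just described, the sketch below builds it from two label sequences. The function name `contingency_table`, the label-sequence input, and the ordering of rows and columns by first appearance of each label are all assumptions made for this example.

```python
def contingency_table(labels_x, labels_y):
    """Return (table, row_sums, col_sums) for two equal-length labelings.

    table[i][j] counts the elements placed in the i-th group of the first
    labeling and the j-th group of the second. Groups are numbered in order
    of first appearance (an arbitrary choice for this sketch).
    """
    if len(labels_x) != len(labels_y):
        raise ValueError("both labelings must cover the same elements")

    # Map each distinct label to a row or column position.
    row_of, col_of = {}, {}
    for lx, ly in zip(labels_x, labels_y):
        row_of.setdefault(lx, len(row_of))
        col_of.setdefault(ly, len(col_of))

    table = [[0] * len(col_of) for _ in range(len(row_of))]
    for lx, ly in zip(labels_x, labels_y):
        table[row_of[lx]][col_of[ly]] += 1

    row_sums = [sum(row) for row in table]        # sizes of the groups in X
    col_sums = [sum(col) for col in zip(*table)]  # sizes of the groups in Y
    return table, row_sums, col_sums


table, row_sums, col_sums = contingency_table(
    ["a", "a", "b", "b", "b"], [0, 0, 0, 1, 1]
)
print(table)     # [[2, 0], [1, 2]]
print(row_sums)  # [2, 3]
print(col_sums)  # [3, 2]
```

The row sums and column sums are simply the sizes of the groups in each of the two groupings, so both lists add up to the number of elements.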
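The adjusted form of the Rand index, the adjusted Rand index, is
$$ARI = \frac{\text{Index} - \text{Expected index}}{\text{Max index} - \text{Expected index}},$$
more specifically
$$ARI = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2}\right] \Big/ \binom{n}{2}}$$
where $n_{ij}$, $a_i$, $b_j$ are values from the contingency table (its entries, row sums, and column sums, respectively) and $n$ is the total number of elements.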
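The adjusted index can be evaluated term by term from the contingency counts. The sketch below is one way to do so, assuming clusterings are given as label sequences and that Python 3.8 or later is available for `math.comb`; the function name `adjusted_rand_index` is illustrative. Returning 1.0 when the denominator vanishes is a convention adopted here, not something stated above; it arises only when both partitions are a single cluster or both consist entirely of singletons, in which case they are identical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_x, labels_y):
    """Adjusted Rand index of two clusterings given as label sequences."""
    if len(labels_x) != len(labels_y):
        raise ValueError("both labelings must cover the same elements")
    n = len(labels_x)
    if n < 2:
        raise ValueError("need at least two elements")

    # Contingency entries, row sums, and column sums.
    n_ij = Counter(zip(labels_x, labels_y))
    a_i = Counter(labels_x)
    b_j = Counter(labels_y)

    index = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a_i.values())
    sum_b = sum(comb(v, 2) for v in b_j.values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2

    if max_index == expected:
        # Degenerate case (assumed convention): both partitions are trivial
        # and identical, so report perfect agreement.
        return 1.0
    return (index - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # a negative value
```

In the second call no pair of elements is grouped together by both clusterings, so the index term is 0 while the expected index is positive, which is how a negative adjusted value arises.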