Fibonacci retracement

2013-11-03T21:39:08Z

62.227.224.100: /* External links */

{{orphan|date=July 2010}}

'''Cluster labeling''' is closely related to the concept of [[text clustering]]. This process tries to select descriptive labels for the clusters obtained through a clustering algorithm such as Flat Clustering and [[Hierarchical Clustering]]. For example, a cluster of documents that talks about various [[internet protocols]] can be best labeled as "Internet Protocols". Typically, the labels are obtained by examining the contents of the documents in a cluster. A good label not only summarizes the central concept of a cluster but also uniquely differentiates it from other clusters in the collection.

==Differential Cluster Labeling==
Differential cluster labeling labels a cluster by comparing the terms in one cluster with the terms occurring in other clusters. The techniques used for [[feature selection]] in [[information retrieval]], such as [[mutual information]] and [[Pearson's chi-squared test|chi-squared feature selection]], can also be applied to differential cluster labeling. Terms having very low frequency are not the best in representing the whole cluster and can be omitted in labeling a cluster. By omitting those rare terms and using a differential test, one can achieve the best results with differential cluster labeling.<ref>Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. ''Introduction to Information Retrieval''. Cambridge: Cambridge UP, 2008. ''Cluster Labeling''. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/cluster-labeling-1.html>.</ref>

===Mutual Information===

{{Main|Mutual information}}

In the fields of [[probability theory]] and [[information theory]], mutual information measures the degree of dependence of two [[random variables]]. The mutual information of two variables X and Y is defined as:

<math>I(X, Y) = \sum_{x\in X}{ \sum_{y\in Y} {p(x, y)log_2\left(\frac{p(x, y)}{p_1(x)p_2(y)}\right)}}</math>

where ''p(x, y)'' is the [[joint probability|joint probability distribution]] of the two variables, ''p1(x)'' is the probability distribution of X, and ''p2(y)'' is the probability distribution of Y.

In the case of cluster labeling, the variable X is associated with membership in a cluster, and the variable Y is associated with the presence of a term.<ref>Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. ''Introduction to Information Retrieval''. Cambridge: Cambridge UP, 2008. ''Mutual Information''. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html>.</ref> Both variables can have values of 0 or 1, so the equation can be rewritten as follows:

<math>I(C, T) = \sum_{c\in {0, 1}}{ \sum_{t\in {0, 1}} {p(C = c, T = t)log_2\left(\frac{p(C = c, T = t)}{p(C = c)p(T = t)}\right)}}</math>

In this case, ''p(C = 1)'' represents the probability that a randomly selected document is a member of a particular cluster, and ''p(C = 0)'' represents the probability that it isn't. Similarly, ''p(T = 1)'' represents the probability that a randomly selected document contains a given term, and ''p(T = 0)'' represents the probability that it doesn't. The [[joint probability|joint probability distribution function]] ''p(C, T)'' represents the probability that two events occur simultaneously. For example, ''p(0, 0)'' is the probability that a document isn't a member of cluster ''c'' and doesn't contain term ''t''; ''p(0, 1)'' is the probability that a document isn't a member of cluster ''c'' and does contain term ''t''; and so on.

====Example====
The following example calculates the mutual information between the cluster "commerce" and the term "tariff":
{| class="wikitable"
|-
!
! Documents in "commerce"
! Documents not in "commerce"
|-
| Documents containing "tariff"
| 60
| 10,000
|-
| Documents not containing "tariff"
| 200
| 500,000
|}
Total number of documents = (60 + 200 + 10,000 + 500,000) = 510,260

P (C = 1, T = 1) = 60/510,260

P (C = 1, T = 0) = 200/510,260

P (C = 0, T = 1) = 10,000/510,260

P (C = 0, T = 0) = 500,000/510,260

P (C = 1) = (# of documents in the cluster) / (total number of documents) = (60 + 200) / 510,260 = 260/510,260

P (C = 0) = (# of documents not in the cluster) / (total number of documents) = (10,000 + 500,000) / 510,260 = 510,000/510,260

P (T = 1) = (# of documents containing the term) / (total number of documents) = (60 + 10,000) / 510,260 = 10,060/510,260

P (T = 0) = (# of documents not containing the term) / (total number of documents) = (200 + 500,000) / 510,260 = 500,200/510,260

Plugging these probabilities into the above equation gives us the following MI value:

<math>I(C, T) = \frac{60}{510,260} log_2\left(\frac{60/510,260}{260/510,260 * 10,060/510,260}\right)</math>
<math>+ \frac{200}{510,260} log_2\left(\frac{200/510,260}{260/510,260 * 500,200/510,260}\right)</math>
<math>+ \frac{10,000}{510,260} log_2\left(\frac{10,000/510,260}{510,000/510,260 * 10,060/510,260}\right)</math>
<math>+ \frac{500,000}{510,260} log_2\left(\frac{500,000/510,260}{510,000/510,260 * 500,200/510,260}\right)</math>
<math>= \frac{60}{510,260} log_2\left(\frac{60*510,260}{260 * 10,060}\right) + \frac{200}{510,260} log_2\left(\frac{200*510,260}{260 * 500,200}\right)</math>
<math>+ \frac{10,000}{510,260} log_2\left(\frac{10,000*510,260}{510,000 * 10,060}\right) + \frac{500,000}{510,260} log_2\left(\frac{500,000*510,260}{510,000 * 500,200}\right)</math>

 = 0.000417322 - 0.000137100 - 0.000154725 + 0.000155158

 = 0.000280655

Therefore, the mutual information between the cluster "commerce" and the term "tariff" is 0.000280655. We can create a label for the cluster "commerce" by calculating the mutual information of each term in the cluster, and selected the k terms with the highest MI value.

===Chi-Squared Selection===
{{Main|Pearson's chi-squared test}}
The Pearson's chi-squared test can be used to calculate the probability that the occurrence of an event matches the initial expectations. In particular, it can be used to determine whether two events, A and B, are [[statistically independent]]. The value of the chi-squared statistic is:

<math>X^2 = \sum_{a \in A}{\sum_{b \in B}{\frac{(O_{a,b} - E_{a, b})^2}{E_{a, b}}}}</math>

where ''Oa,b'' is the ''observed'' frequency of a and b co-occurring, and ''Ea,b'' is the ''expected'' frequency of co-occurrence.

In the case of cluster labeling, the variable A is associated with membership in a cluster, and the variable B is associated with the presence of a term. Both variables can have values of 0 or 1, so the equation can be rewritten as follows:

<math>X^2 = \sum_{a \in {0,1}}{\sum_{b \in {0,1}}{\frac{(O_{a,b} - E_{a, b})^2}{E_{a, b}}}}</math>

For example, ''O1,0'' is the observed number of documents that are in a particular cluster but don't contain a certain term, and ''E1,0'' is the expected number of documents that are in a particular cluster but don't contain a certain term.
Our initial assumption is that the two events are independent, so the expected probabilities of co-occurrence can be calculated by multiplying individual probabilities:<ref>Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schutze. ''Introduction to Information Retrieval''. Cambridge: Cambridge UP, 2008. ''Chi2 Feature Selection''. Stanford Natural Language Processing Group. Web. 25 Nov. 2009. <http://nlp.stanford.edu/IR-book/html/htmledition/feature-selectionchi2-feature-selection-1.html>.</ref>

''E1,0 = N * P(C = 1) * P(T = 0)''

where N is the total number of documents in the collection.

====Example====
Using the same data for the mutual information example, we can calculate the expected probabilities and plug them into the equation to calculate the chi-squared statistic:
{| class="wikitable"
|-
!
! Documents in "commerce"
! Documents not in "commerce"
|-
| Documents containing "tariff"
| 60
| 10,000
|-
| Documents not containing "tariff"
| 200
| 500,000
|}

P (C = 1) = (# of documents in the cluster) / (total number of documents) = (60 + 200) / 510,260 = 260/510,260

P (C = 0) = (# of documents not in the cluster) / (total number of documents) = (10,000 + 500,000) / 510,260 = 510,000/510,260

P (T = 1) = (# of documents containing the term) / (total number of documents) = (60 + 10,000) / 510,260 = 10,060/510,260

P (T = 0) = (# of documents not containing the term) / (total number of documents) = (200 + 500,000) / 510,260 = 500,200/510,260

<math>E_{0, 0} = 510,260 * \frac {510,000}{510,260} * \frac {500,200}{510,260}</math>

= 499,945

<math>E_{0,1}= 510,260 * \frac {510,000}{510,260} * \frac {10.060}{510,260}</math>

= 10,055

<math>E_{1,0}= 510,260 * \frac {260}{510,260} * \frac {500,200}{510,260}</math>

= 254.9

<math>E_{1,1}= 510,260 * \frac {260}{510,260} * \frac {10,060}{510,260}</math>

= 5.13

<math>X^2 = \frac{(500,000 - 499,945)^2}{499,945} + \frac{(10,000 - 10,055)^2}{10,055} + \frac{(200 - 254.9)^2}{254.9} + \frac{(60 - 5.13)^2}{5.13}</math>

= 599

Since each variable can have two values, the number of [[degrees of freedom (statistics)|degrees of freedom]] is (2 - 1)(2 - 1) = 1. The [[chi-squared distribution]] for one degree of freedom states that the probability of observing a value greater than 10.83 is less than 0.001. Therefore, we can reject the [[null hypothesis]], which states that the two events are independent. Since the term "tariff" and the cluster "commerce" are dependent, we can assume that the term is a good label for the cluster.

==Cluster-Internal Labeling==
Cluster-internal labeling selects labels that only depend on the contents of the cluster of interest. No comparison is made with the other clusters.
Cluster-internal labeling can use a variety of methods, such as finding terms that occur frequently in the centroid or finding the document that lies closest to the centroid.

===Centroid Labels===
{{Main|Vector space model}}
A frequently used model in the field of [[information retrieval]] is the vector space model, which represents documents as vectors. The entries in the vector correspond to terms in the [[vocabulary]]. Binary vectors have a value of 1 if the term is present within a particular document and 0 if it is absent. Many vectors make use of weights that reflect the importance of a term in a document, and/or the importance of the term in a document collection. For a particular cluster of documents, we can calculate the [[centroid]] by finding the [[arithmetic mean]] of all the document vectors. If an entry in the centroid vector has a high value, then the corresponding term occurs frequently within the cluster. These terms can be used as a label for the cluster.
One downside to using centroid labeling is that it can pick up words like "place" and "word" that have a high frequency in written text, but have little relevance to the contents of the particular cluster.

===Title Labels===
An alternative to centroid labeling is title labeling. Here, we find the document within the cluster that has the smallest [[Euclidean distance]] to the centroid, and use its title as a label for the cluster. One advantage to using document titles is that they provide additional information that would not be present in a list of terms. However, they also have the potential to mislead the user, since one document might not be representative of the entire cluster.

===External knowledge labels===
Cluster labeling can be done indirectly using external knowledge such as pre-categorized knowledge such as the one of Wikipedia.<ref>David Carmel, Haggai Roitman, Naama Zwerdling. [http://portal.acm.org/citation.cfm?doid=1571941.1571967 Enhancing cluster labeling using wikipedia.] SIGIR 2009: 139-146</ref> In such methods, a set of important cluster text features are first extracted from the cluster documents. These features then can be used to retrieve the (weighted) K-nearest categorized documents from which candidates for cluster labels can be extracted. The final step involves the ranking of such candidates. Suitable methods are such that are based on a voting or a fusion process which is determined using the set of categorized documents and the original cluster features.

==External links==
* [http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-clustering-1.html Hierarchical Clustering]
* [http://erulemaking.ucsur.pitt.edu/doc/papers/dgo06-labeling.pdf Automatically Labeling Hierarchical Clusters]

==References==
<references/>

{{DEFAULTSORT:Cluster Labeling}}
[[Category:Information retrieval]]

formulasearchengine - User contributions [en]

Fibonacci retracement