Times interest earned: Difference between revisions

'''Term discrimination''' is a way to rank keywords by how useful they are for [[Information Retrieval]].
 
== Overview ==
 
This method is similar to [[tf-idf]], but it aims to identify which keywords are suitable for [[information retrieval]] and which are not. Please refer to the [[Vector Space Model]] first.
 
This method relies on the concept of ''vector-space density'': the less dense an [[occurrence matrix]] is, the better the results of an information retrieval query will be.
 
An optimal index term is one that can distinguish two different documents from each other and relate two similar documents. A sub-optimal index term, by contrast, cannot distinguish two different documents from two similar documents.
 
The discrimination value of an index term is the difference between the vector-space density of the occurrence matrix and the density of the same matrix with that index term removed.
 
Let:
* <math>A</math> be the occurrence matrix,
* <math>A_k</math> be the occurrence matrix without the index term <math>k</math>, and
* <math>Q(A)</math> be the density of <math>A</math>.
Then the discrimination value of the index term <math>k</math> is:
:<math>DV_k = Q(A) - Q(A_k)</math>
 
== How to compute ==
 
Given an [[occurrence matrix]] <math>A</math> and one keyword <math>k</math>:
* Find the global document [[centroid]] <math>C</math> (this is just the average document vector).
* Find the average [[euclidean distance]] from every document vector <math>D_i</math> to <math>C</math>.
* Find the average Euclidean distance from every document vector <math>D_i</math> to <math>C</math>, ''ignoring'' the keyword <math>k</math>.
* The difference between the two values above is the ''discrimination value'' for the keyword <math>k</math>.
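The steps above can be sketched in Python (a minimal illustration, assuming NumPy, an occurrence matrix whose rows are document vectors, and the distance-based density defined above; the function name <code>discrimination_values</code> is hypothetical):

```python
import numpy as np

def discrimination_values(A):
    """Compute DV_k = Q(A) - Q(A_k) for every term (column) of A.

    A : (n_docs, n_terms) occurrence matrix, one document per row.
    Q(M) is taken here as the average Euclidean distance from each
    document vector to the global centroid, following the steps above.
    """
    def density(M):
        centroid = M.mean(axis=0)                      # average document vector
        return np.linalg.norm(M - centroid, axis=1).mean()

    q_full = density(A)
    dv = np.empty(A.shape[1])
    for k in range(A.shape[1]):
        A_k = np.delete(A, k, axis=1)                  # ignore term k
        dv[k] = q_full - density(A_k)
    return dv
```

With this distance-based density, removing a column can only shrink (or preserve) the distances to the centroid, so the values are nonnegative; keywords are then ranked by how large their value is.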
 
A higher value is better because including the keyword will result in better information retrieval.
 
== Qualitative Observations ==
Keywords that are ''[[Sparse matrix|sparse]]'' should be poor discriminators because they have poor ''[[Precision and recall|recall]]'', whereas keywords that are ''frequent'' should be poor discriminators because they have poor ''[[Precision and recall|precision]]''.
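A small numerical sketch of this observation, using the distance-based density from the computation above (the occurrence matrix here is hypothetical data chosen for illustration): a term occurring in every document gets a discrimination value of zero, while a mid-frequency term scores higher than a term occurring in only one document.

```python
import numpy as np

def density(M):
    # Average Euclidean distance from each document (row) to the centroid.
    return np.linalg.norm(M - M.mean(axis=0), axis=1).mean()

# Occurrence matrix: 4 documents (rows) x 3 terms (columns).
# Column 0: a "frequent" term, present in every document.
# Column 1: a "sparse" term, present in a single document.
# Column 2: a mid-frequency term, present in half the documents.
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

dv = [density(A) - density(np.delete(A, k, axis=1)) for k in range(3)]
print(dv)  # frequent term scores 0; mid-frequency term scores highest
```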
 
== References ==
* [[Gerard Salton|G. Salton]], A. Wong, and C. S. Yang (1975), "[http://www.cs.uiuc.edu/class/fa05/cs511/Spring05/other_papers/p613-salton.pdf A Vector Space Model for Automatic Indexing]," ''Communications of the ACM'', vol. 18, nr. 11, pages 613–620. ''(The article in which the vector space model was first presented)''
 
* Can, F., and Ozkarahan, E. A. (1987), "Computation of term/document discrimination values by use of the cover coefficient concept," ''Journal of the American Society for Information Science'', vol. 38, nr. 3, pages 171–183.
 
[[Category:Information retrieval]]

Revision as of 12:42, 17 November 2013
