|
|
Line 1: |
Line 1: |
| {{lowercase|title=w-shingling}}
| | Greetings! I am Marvella and I feel comfy when people use over the counter std test ([http://kgmcscs.realmind.net/xe/467281 what is it worth]) full title. Body building is what my family and I enjoy. Hiring is my occupation. Her family members lives in Minnesota. |
| | |
| In [[natural language processing]] a '''w-shingling''' is a set of unique "shingles"—contiguous [[subsequence]]s of [[token (parser)|tokens]] in a [[document]]—that can be used to gauge the [[document classification|similarity of two documents]]. The w denotes the number of tokens in each shingle in the set.
| |
| | |
| The document, "a rose is a rose is a rose" can be [[tokenization|tokenized]] as follows:
| |
| | |
| :(a,rose,is,a,rose,is,a,rose)
| |
| | |
| The set of all contiguous sequences of 4 tokens ([[N-gram]]s, here: 4-grams) is
| |
| | |
| :{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) } = { (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }
| |
| | |
| == Resemblance ==
| |
| | |
| For a given shingle size, the degree to which two documents ''A'' and ''B'' resemble each other can be expressed as the ratio of the magnitudes of their shinglings' [[intersection (set theory)|intersection]] and [[union (set theory)|union]], or
| |
| | |
| :<math>r(A,B)={{|S(A)\cap S(B)|}\over {|S(A)\cup S(B)|}}</math>
| |
| | |
| where |A| is the size of set A. The resemblance is a number in the range [0,1], where 1 indicates that two documents are identical. This definition is identical with the [[Jaccard coefficient]] describing similarity and diversity of sample sets.
| |
| | |
| == See also ==
| |
| | |
| *[[Concept mining]] offers an alternative method for document similarity calculation with more computational complexity, but where the measure more closely models a human's perception of document similarity.
| |
| *[[N-gram]]
| |
| *[[k-mer]]
| |
| *[[MinHash]]
| |
| *[[Rolling hash]]
| |
| *[[Rabin fingerprint]]
| |
| *[[Vector space model]]
| |
| *[[Bag-of-words model]]
| |
| | |
| == References ==
| |
| *(Manber 1993) ''Finding Similar Files in a Large File System''. Does not yet use the term "shingling". Available [http://webglimpse.net/pubs/TR93-33.pdf as PDF]
| |
| *(Broder, Glassman, Manasse, and Zweig 1997) ''Syntactic [[Cluster analysis|Clustering]] of the Web''. SRC Technical Note #1997-015. Available at HTML [http://www.std.org/~msm/common/clustering.html here]
| |
| | |
| ==External links==
| |
| *[http://codingplayground.blogspot.com/2008/06/shingling-and-text-clustering.html An implementation of the shingling algorithm in C++]
| |
| [[Category:Natural language processing]]
| |
Greetings! I am Marvella and I feel comfy when people use over the counter std test (what is it worth) full title. Body building is what my family and I enjoy. Hiring is my occupation. Her family members lives in Minnesota.