The '''Count–min sketch''' (or '''CM sketch''') is a [[Randomized algorithm|probabilistic]] [[Complexity class|sub-linear space]] [[streaming algorithm]] that summarizes a data stream in a compact table of counters, from which several statistics of the stream, such as item frequencies, can be estimated. The algorithm was invented in 2003 by [[Graham Cormode]] and [[S. Muthu Muthukrishnan]].<ref>{{cite journal|last=Cormode|first=Graham|coauthors=S. Muthukrishnan|title=An Improved Data Stream Summary: The Count-Min Sketch and its Applications|journal=J. Algorithms|year=2004|volume=55|pages=29–38|accessdate=14 October 2011}}</ref>

Count–min sketches are somewhat similar to [[Bloom filter]]s; the main distinction is that Bloom filters represent sets, while CM sketches represent [[multiset]]s and [[Frequency (statistics)|frequency tables]].
== Algorithm ==

=== Setup ===
The data structure is parameterized by two constants, <math>w</math> and <math>d</math>, which determine its time and space requirements and the error probability of queries. It consists of a [[Array_data_structure#Two-dimensional_arrays|two-dimensional array]], here called ''count'', with <math>w</math> columns and <math>d</math> rows, initialized to zero. Each row is associated with its own hash function <math>h_j</math>; the <math>d</math> hash functions must be drawn at random from a [[Pairwise independence|pairwise independent]] [[hash function]] family.

For later convenience we set <math>w = \lceil e/\epsilon \rceil</math> and <math>d = \lceil \ln(1/\delta) \rceil</math>, so that the error in answering a query is within an additive factor of <math>\epsilon</math> with probability <math>1-\delta</math>.
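The setup can be illustrated with a minimal Python sketch. The names <code>sketch_dimensions</code>, <code>column</code>, <code>make_sketch</code> and <code>PRIME</code> are hypothetical, and the hash family <math>h(x) = ((ax + b) \bmod p) \bmod w</math> over integer keys is an assumption: it is one commonly used choice, not one prescribed by the text above.

```python
import math
import random

# Assumption: keys are non-negative integers smaller than this Mersenne prime.
PRIME = (1 << 61) - 1

def sketch_dimensions(epsilon, delta):
    """Return (w, d) with w = ceil(e / epsilon) and d = ceil(ln(1 / delta))."""
    w = math.ceil(math.e / epsilon)
    d = math.ceil(math.log(1.0 / delta))
    return w, d

def column(a, b, w, x):
    """Hash key x to a column in [0, w) using h(x) = ((a*x + b) mod PRIME) mod w."""
    return ((a * x + b) % PRIME) % w

def make_sketch(epsilon, delta, seed=None):
    """Allocate the zeroed d-by-w array and draw one (a, b) pair per row."""
    w, d = sketch_dimensions(epsilon, delta)
    rng = random.Random(seed)
    count = [[0] * w for _ in range(d)]
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(d)]
    return count, params

count, params = make_sketch(epsilon=0.01, delta=0.01, seed=1)
```

With <math>\epsilon = 0.01</math> and <math>\delta = 0.01</math> this allocates <math>w = 272</math> columns and <math>d = 5</math> rows, that is, 1360 counters regardless of the length of the stream.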
=== Update ===
When a new value <math>a</math> arrives, the sketch is updated as follows: <math>\forall j : 1 \leq j \leq d</math>, <math>count[j,h_j(a)] \leftarrow count[j,h_j(a)] + 1</math>. That is, for each row, the corresponding hash function is applied to the newly received value, and the counter in the column given by the hash value is incremented by one.
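As a rough illustration of the update rule, the following Python sketch increments one counter per row. The two fixed hash functions are toy stand-ins chosen so that the result can be checked by hand; they are not random draws from a pairwise independent family. Rows and columns are indexed from 0 here rather than from 1.

```python
def update(count, hashes, value):
    """Record one occurrence of value: add 1 to one counter in every row."""
    for j, h in enumerate(hashes):
        count[j][h(value)] += 1

# Toy parameters (assumed for illustration): w = 4 columns, d = 2 rows.
w = 4
hashes = [lambda x: x % w, lambda x: (3 * x + 1) % w]
count = [[0] * w for _ in hashes]

for value in [5, 5, 2]:
    update(count, hashes, value)
```

Each update touches exactly <math>d</math> counters, so every row of the array always sums to the number of values processed so far.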
=== Query ===
The array can be used at any point to estimate several different statistics of the stream. For instance, to estimate the number of times <math>a_i</math> that a specific value <math>i</math> has appeared so far, we compute <math>\hat a_i=\min_j count[j,h_j(i)]</math> (this assumes that all updates are positive). Because collisions with other values can only increase a counter, the estimate is never smaller than the true count, and it has the guarantee that <math>\hat a_i \leq a_i + \epsilon |a| </math> with probability <math>1-\delta</math>, where <math>|a|</math> is the total count of all values in the stream.

With small modifications, the data structure can also be used to sketch other stream statistics.
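The point query described above can be sketched in Python as follows. The table contents and the two fixed hash functions are assumptions made for illustration: they correspond to a toy sketch with <math>w = 4</math> and <math>d = 2</math> after the stream 5, 5, 2, using 0-based indices.

```python
def query(count, hashes, value):
    """Estimate how often value was seen: the minimum over its d counters."""
    return min(count[j][h(value)] for j, h in enumerate(hashes))

# Toy state (assumed): counters after processing the stream 5, 5, 2.
w = 4
hashes = [lambda x: x % w, lambda x: (3 * x + 1) % w]
count = [[0, 2, 1, 0],
         [2, 0, 0, 1]]

seen_twice = query(count, hashes, 5)  # true count is 2
never_seen = query(count, hashes, 6)  # true count is 0, but 6 collides with 2 in both rows
absent = query(count, hashes, 4)      # true count is 0, no collision in row 0
```

In this toy state the value 6 is estimated at 1 although it never appeared, because it shares a column with the value 2 in both rows. This shows the one-sided nature of the error: estimates can exceed the true count but never fall below it.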
== External links ==
* [http://www.corelab.ece.ntua.gr/courses/ds.grad/count-min.ppt PowerPoint presentation on the algorithm]
* [https://sites.google.com/site/countminsketch/home/faq Count–min FAQ]
== See also ==
* [[Bloom filter]]
* [[Feature hashing]]
* [[Locality-sensitive hashing]]
* [[MinHash]]
== References ==
<references />

{{DEFAULTSORT:Count-min sketch}}
[[Category:Hashing]]
[[Category:Probabilistic data structures]]