Frenkel defect: Difference between revisions

Revision as of 17:02, 7 November 2013

A rolling hash is a hash function where the input is hashed in a window that moves through the input.

A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a moving average function can be computed much more quickly than other low-pass filters.

One of the main applications is the Rabin-Karp string search algorithm, which uses the rolling hash described below.

Another popular application is rsync program which uses a checksum based on Mark Adler's adler-32 as its rolling hash.

Another application is the Low Bandwidth Network Filesystem (LBFS), which uses a Rabin fingerprint as its rolling hash.

At best, rolling hash values are pairwise independent^[1] or strongly universal. They cannot be 3-wise independent, for example.

Rabin-Karp rolling hash

The Rabin-Karp string search algorithm is normally used with a very simple rolling hash function that only uses multiplications and additions:

$H=c_{1}a^{k-1}+c_{2}a^{k-2}+c_{3}a^{k-3}+...+c_{k}a^{0}$ where $a$ is a constant and $c_{1},...,c_{k}$ are the input characters.

In order to avoid manipulating huge $H$ values, all math is done modulo $n$ . The choice of $a$ and $n$ is critical to get good hashing; see linear congruential generator for more discussion.

Removing and adding chars simply involves adding or subtracting the first or last term. Shifting all chars by one position to the left requires multiplying the entire sum $H$ by $a$ . Shifting all chars by one position to the right requires dividing the entire sum $H$ by $a$ . Note that in modulo arithmetic, $a$ can be chosen to have a multiplicative inverse $a^{-1}$ by which $H$ can be multiplied to get the result of the division without actually performing a division.

Content based slicing using Rabin-Karp hash

One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network and a simple byte addition at the front of the file would cause all the fixed size windows to become updated, while in reality, only the first ‘chunk’ has been modified.

The simplest approach to calculate the dynamic chunks is to calculate the rolling hash and if it matches a pattern (like the lower N bits are all zeroes) then it’s a chunk boundary. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but nothing else.

When the boundaries are known, the chunks need to be compared by their hash values to detect which one was modified and needs transfer across the network.^[2]

Cyclic polynomial

Hashing by cyclic polynomial^[3]—sometimes called Buzhash—is also simple, but it has the benefit of avoiding multiplications, using barrel shifts instead. It is a form of tabulation hashing: it presumes that there is some hash function $h$ from characters to integers in the interval $[0,2^{L})$ . This hash function might be simply an array or a hash table mapping characters to random integers. Let the function $s$ be a cyclic binary rotation (or barrel shift): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g., $s(10011)=00111$ . Let $\oplus$ be the bit-wise exclusive or. The hash values are defined as

$H=s^{k-1}(h(c_{1}))\oplus s^{k-2}(h(c_{2}))\oplus \ldots \oplus s(h(c_{k-1}))\oplus h(c_{k})$

where the multiplications by powers of two can be implemented by binary shifts. The result is a number in $[0,2^{L})$ .

Computing the hash values in a rolling fashion is done as follows. Let $H$ be the previous hash value. Rotate $H$ once: $H\leftarrow s(H)$ . If $c_{1}$ is the character to be removed, rotate it $k$ times: $s^{k}(h(c_{1}))$ . Then simply set

$H\leftarrow s(H)\oplus s^{k}(h(c_{1}))\oplus h(c_{k+1})$

where $c_{k+1}$ is the new character.

Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first $L-k+1$ bits. That is, take the result $H$ and dismiss any $k-1$ consecutive bits.^[1] In practice, this can be achieved by an integer division $H\rightarrow H\div 2^{k-1}$ .

Computational complexity

All rolling hash functions are linear in the number of characters, but their complexity with respect to the length of the window ( $k$ ) varies. Rabin-Karp rolling hash requires the multiplications of two $k$ -bit numbers, integer multiplication is in $O(k\log k2^{O(\log ^{*}k)})$ .^[4] Hashing ngrams by cyclic polynomials can be done in linear time.^[1]

Software

ngramhashing is a Free software C++ implementation of several rolling hash functions
rollinghashjava is an Apache licensed Java implementation of rolling hash functions

External links

MIT 6.006: Introduction to Algorithms 2011- Lecture Notes - Rolling Hash

Footnotes

↑ ^1.0 ^1.1 ^1.2 Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010. arXiv:0705.4676
↑ Template:Cite web
↑ Jonathan D. Cohen, Recursive Hashing Functions for n-Grams, ACM Trans. Inf. Syst. 15 (3), 1997
↑ M. Fürer, Faster integer multiplication, in: STOC ’07, 2007, pp. 57–66.

[lemirekaser-1] 1.0 ^1.1 ^1.2 Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010. arXiv:0705.4676

[2] Template:Cite web

[3] Jonathan D. Cohen, Recursive Hashing Functions for n-Grams, ACM Trans. Inf. Syst. 15 (3), 1997

[4] M. Fürer, Faster integer multiplication, in: STOC ’07, 2007, pp. 57–66.

[1]

[2]

[3]

[4]

@@ Line 1: / Line 1: @@
-The name of the writer is Numbers. Minnesota has always been his house but his wife desires them to move. Playing baseball is the pastime he will never quit doing. Since she was 18 she's been operating as a meter reader but she's usually needed her personal company.<br><br>Also visit my website: [http://srgame.co.kr/qna/21862 std home test]
+A '''rolling hash''' is a [[hash function]] where the input is hashed in a window that moves through the input.
+A few hash functions allow a rolling hash to be computed very quickly—the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window—similar to the way a [[moving average]] function can be computed much more quickly than other low-pass filters.
+One of the main applications is the [[Rabin-Karp string search algorithm]], which uses the rolling hash described below.
+Another popular application is [[rsync]] program which uses a checksum based on Mark Adler's [[adler-32]] as its rolling hash.
+Another application is the Low Bandwidth Network Filesystem (LBFS), which uses a [[Rabin fingerprint]] as its rolling hash.
+At best, rolling hash values are [[pairwise independent]]<ref name="lemirekaser">Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, Computer Speech & Language 24 (4), pages 698-710, 2010. [[arXiv:0705.4676]]</ref> or strongly [[universal hashing|universal]]. They cannot be [[k-independent hashing|3-wise independent]], for example.
+== Rabin-Karp rolling hash ==
+The [[Rabin-Karp string search algorithm]] is normally used with a very simple rolling hash function that only uses multiplications and additions:
+<math>H = c_1 a^{k-1} + c_2 a^{k-2} + c_3 a^{k-3} + ... + c_k a^{0}</math>
+where <math>a</math> is a constant and <math>c_1, ..., c_k</math> are the input characters.
+In order to avoid manipulating huge <math>H</math> values, all math is done [[modular arithmetic|modulo]] <math>n</math>. The choice of <math>a</math> and <math>n</math> is critical to get good hashing; see [[linear congruential generator]] for more discussion.
+Removing and adding chars simply involves adding or subtracting the first or last term. Shifting all chars by one position to the left requires multiplying the entire sum <math>H</math> by <math>a</math>. Shifting all chars by one position to the right requires dividing the entire sum <math>H</math> by <math>a</math>. Note that in modulo arithmetic, <math>a</math> can be chosen to have a [[multiplicative inverse]] <math>a^{-1}</math> by which <math>H</math> can be multiplied to get the result of the division without actually performing a division.
+== Content based slicing using Rabin-Karp hash ==
+One of the interesting use cases of the rolling hash function is that it can create dynamic, content-based chunks of a stream or file. This is especially useful when it is required to send only the changed chunks of a large file over a network and a simple byte addition at the front of the file would cause all the fixed size windows to become updated, while in reality, only the first ‘chunk’ has been modified.
+The simplest approach to calculate the dynamic chunks is to calculate the rolling hash and if it matches a pattern (like the lower N bits are all zeroes) then it’s a chunk boundary. This approach will ensure that any change in the file will only affect its current and possibly the next chunk, but nothing else.
+When the boundaries are known, the chunks need to be compared by their hash values to detect which one was modified and needs transfer across the network.<ref>{{cite web | first = Adam | last = Horvath | url = http://blog.teamleadnet.com/2012/10/rabin-karp-rolling-hash-dynamic-sized.html | title = Rabin Karp rolling hash - dynamic sized chunks based on hashed content | date = October 24, 2012 }}</ref>
+== Cyclic polynomial ==
+Hashing by cyclic polynomial<ref>Jonathan D. Cohen, [http://www.cparity.com/projects/AcmClassification/samples/256168.pdf Recursive Hashing Functions for n-Grams], ACM Trans. Inf. Syst. 15 (3), 1997</ref>&mdash;sometimes called Buzhash&mdash;is also simple, but it has the benefit of avoiding multiplications, using [[Barrel shifter|barrel shifts]] instead. It is a form of [[tabulation hashing]]: it presumes that there is some hash function <math>h</math> from characters to integers in the interval <math>[0,2^L)</math>. This hash function might be simply an array or a [[hash table]] mapping characters to random integers. Let the function <math>s</math> be a cyclic binary rotation (or [[Barrel shifter|barrel shift]]): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g., <math>s(10011)=00111</math>. Let <math>\oplus</math> be the bit-wise [[exclusive or]]. The hash values are defined as
+<math> H = s^{k-1}(h( c_1 )) \oplus s^{k-2}( h( c_2) )  \oplus \ldots \oplus  s( h( c_{k-1}) ) \oplus   h( c_k)</math>
+where the multiplications by powers of two can be implemented by binary shifts. The result is a number in <math>[0,2^L)</math>.
+Computing the hash values in a rolling fashion is done as follows. Let <math>H</math> be the previous hash value. Rotate <math>H</math> once: <math>H\leftarrow s(H)</math>.  If <math>c_1</math> is the character to be removed, rotate it <math>k</math> times:  <math>s^{k}(h( c_1 ))</math>. Then simply set
+<math>H\leftarrow s(H) \oplus s^{k}(h( c_1 )) \oplus h(c_{k+1})</math>
+where <math>c_{k+1}</math> is the new character.
+Hashing by cyclic polynomials is strongly universal or pairwise independent: simply keep the first <math>L-k+1</math> bits. That is, take the result <math>H</math> and dismiss any <math>k-1</math> consecutive bits.<ref name="lemirekaser" /> In practice, this can be achieved by an integer division <math>H \rightarrow H \div 2^{k-1}</math>.
+== Computational complexity ==
+All rolling hash functions are linear in the number of characters, but their complexity with respect to the length of the window (<math>k</math>) varies. Rabin-Karp rolling hash requires the multiplications of two <math>k</math>-bit numbers, [[integer multiplication]] is in <math>O(k \log k 2^{O(\log^*k)})</math>.<ref>M. Fürer, Faster integer multiplication, in: STOC ’07, 2007, pp. 57–66.</ref> Hashing [[ngram]]s by cyclic polynomials can be done in linear time.<ref name="lemirekaser" />
+== Software ==
+*[http://code.google.com/p/ngramhashing/ ngramhashing] is a [[Free software]] C++ implementation of several rolling hash functions
+*[http://code.google.com/p/rollinghashjava/ rollinghashjava] is an Apache licensed Java implementation of rolling hash functions
+==See also==
+*[[MinHash]]
+*[[w-shingling]]
+== External links ==
+*[http://courses.csail.mit.edu/6.006/spring11/rec/rec06.pdf MIT 6.006: Introduction to Algorithms 2011- Lecture Notes - Rolling Hash]
+== Footnotes ==
+<references/>
+[[Category:Hash functions]]

Frenkel defect: Difference between revisions

Revision as of 17:02, 7 November 2013

Contents

Rabin-Karp rolling hash

Content based slicing using Rabin-Karp hash

Cyclic polynomial

Computational complexity

Software

See also

External links

Footnotes

Navigation menu

Frenkel defect: Difference between revisions

Revision as of 17:02, 7 November 2013

Rabin-Karp rolling hash

Content based slicing using Rabin-Karp hash

Cyclic polynomial

Computational complexity

Software

See also

External links

Footnotes

Navigation menu

Search