In [[computer science]] and [[information theory]], '''Tunstall coding''' is a form of [[entropy coding]] used for [[lossless data compression]].
== History ==
Tunstall coding was the subject of Brian Parker Tunstall's PhD thesis in 1967, while at Georgia Institute of Technology. The subject of that thesis was "Synthesis of noiseless compression codes".<ref>{{cite book|last=Tunstall|first=Brian Parker|title=Synthesis of noiseless compression codes|accessdate=2013-01-20|date=December 1967|publisher=[[Georgia Institute of Technology]]}}</ref>
Its design is a precursor to [[Lempel-Ziv]].
== Properties ==
Unlike [[variable-length code]]s, which include [[Huffman coding|Huffman]] and [[Lempel–Ziv|Lempel–Ziv coding]], Tunstall coding is a [[code]] which maps source symbols to a fixed number of bits.<ref>http://www.rle.mit.edu/rgallager/documents/notes1.pdf, Study of Tunstall's algorithm at [[MIT]]</ref>
Unlike [[Typical set|typical set encoding]], Tunstall coding parses a stochastic source with codewords of variable length.
It can be shown<ref>[http://ipg.epfl.ch/lib/exe/fetch.php?media=en:courses:2013-2014:itc:tunstall.pdf], Study of Tunstall's algorithm from [[EPFL]]'s Information Theory department</ref> that, for a large enough dictionary, the number of bits per source letter can be arbitrarily close to <math>H(U)</math>, the [[Entropy (information theory)|entropy]] of the source.
== Algorithm ==
The algorithm requires as input an alphabet <math>\mathcal{U}</math>, along with a probability for each letter of that alphabet. It also requires an arbitrary constant <math>C</math>, which is an upper bound on the size of the dictionary that it will compute. The dictionary in question, <math>D</math>, is constructed as a tree of probabilities, in which each edge is associated with a letter from the input alphabet. The algorithm proceeds as follows:

 D := tree of <math>|\mathcal{U}|</math> leaves, one for each letter in <math>\mathcal{U}</math>.
 While <math>|D| < C</math>:
     Convert most probable leaf to tree with <math>|\mathcal{U}|</math> leaves.
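The construction above can be sketched in Python with a heap over the leaf probabilities. This is only an illustrative sketch, not part of the algorithm's original statement; the names <code>tunstall_dictionary</code>, <code>probs</code> and <code>max_size</code> are made up for the example.

```python
import heapq

def tunstall_dictionary(probs, max_size):
    """Build a Tunstall dictionary as (word, probability) pairs.

    probs: mapping from each source letter to its probability.
    max_size: the upper bound C on the dictionary size.
    """
    # One leaf per letter to start; heapq is a min-heap, so
    # probabilities are stored negated to pop the most probable leaf.
    heap = [(-p, w) for w, p in probs.items()]
    heapq.heapify(heap)
    # Each expansion replaces one leaf with |U| leaves, a net gain
    # of |U| - 1 words; stop before the dictionary would exceed C.
    while len(heap) + len(probs) - 1 <= max_size:
        neg_p, word = heapq.heappop(heap)
        for letter, p in probs.items():
            heapq.heappush(heap, (neg_p * p, word + letter))
    return [(w, -np) for np, w in sorted(heap)]

# Example: a binary source. With C = 4 the dictionary becomes
# {'aaa', 'aab', 'ab', 'b'}, a complete prefix-free set of words.
words = tunstall_dictionary({'a': 0.7, 'b': 0.3}, 4)
```

Note that the leaf probabilities in the returned dictionary always sum to 1, since every expansion splits a leaf's probability among its children.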
== Example ==
Let's imagine that we wish to encode the string "hello, world". Let's further assume (somewhat unrealistically) that the input alphabet <math>\mathcal{U}</math> contains only the characters that occur in "hello, world": 'h', 'e', 'l', ',', ' ', 'w', 'o', 'r', 'd'. We can therefore compute the probability of each character from its frequency in the input string. For instance, the letter 'l' appears three times in a string of 12 characters, so its probability is <math>\frac{3}{12}</math>.

We initialize the tree, starting with a tree of <math>|\mathcal{U}|=9</math> leaves. Each word is therefore directly associated with a letter of the alphabet. The 9 words that we thus obtain can be encoded into a fixed-sized output of <math>\lceil \log_2(9) \rceil = 4</math> bits.
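These initial probabilities and the codeword length can be checked with a few lines of Python (a sketch added for illustration; the variable names are made up):

```python
import math
from collections import Counter

text = "hello, world"
counts = Counter(text)  # character frequencies in the input
probs = {c: n / len(text) for c, n in counts.items()}

# 'l' occurs 3 times in the 12-character string.
assert probs['l'] == 3 / 12

# 9 distinct characters, so each of the 9 initial words
# needs ceil(log2(9)) = 4 bits.
bits = math.ceil(math.log2(len(probs)))
```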
[[File:Tunstall-1.png|Tunstall "hello, world" example — one iteration]]
We then take the leaf with highest probability (here, <math>w_1</math>), and convert it into yet another tree of <math>|\mathcal{U}|=9</math> leaves, one for each character. We re-compute the probabilities of those leaves. For instance, the sequence of two letters 'l' happens once: of the three occurrences of 'l', only one is followed by another 'l', so the resulting probability is <math>\frac{1}{3} \cdot \frac{3}{12} = \frac{1}{12}</math>.

We obtain 17 words, which can each be encoded into a fixed-sized output of <math>\lceil \log_2(17) \rceil = 5</math> bits.
[[File:Tunstall-2.png|Tunstall "hello, world" example — two iterations]]
Note that we could iterate further, increasing the number of words by <math>|\mathcal{U}|-1=8</math> every time.
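The growth of the dictionary, and of the corresponding codeword length, over a few more iterations can be tabulated with a short sketch (illustrative only):

```python
import math

alphabet_size = 9
size = alphabet_size           # initial dictionary: one word per letter
for iteration in range(1, 4):
    size += alphabet_size - 1  # one leaf becomes 9 leaves: net gain of 8
    bits = math.ceil(math.log2(size))
    print(iteration, size, bits)
# 1 17 5
# 2 25 5
# 3 33 6
```

Note that the second iteration (25 words) still fits in 5-bit codewords, so it packs more source letters per codeword at no extra cost; only at 33 words does the codeword length grow to 6 bits.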
== Limitations ==
Tunstall coding requires the algorithm to know, prior to the parsing operation, the probability distribution of the letters of the alphabet. This issue is shared with [[Huffman coding]].
Its fixed-length block output makes it less flexible than [[Lempel-Ziv]], which has a similar dictionary-based design but a variable-sized block output.
== References ==
{{reflist}}

{{Compression methods}}

[[Category:Lossless compression algorithms]]