The '''minimum description length (MDL) principle''' is a formalization of [[Occam's razor]] in which the best [[hypothesis]] for a given set of data is the one that leads to the best [[data compression|compression of the data]]. MDL was introduced by [[Jorma Rissanen]] in 1978.<ref>{{cite doi|10.1016/0005-1098(78)90005-5}}</ref> It is an important concept in [[information theory]] and [[computational learning theory]].<ref name="mdl">{{cite web
|url=http://www.mdl-research.org/
|title=Minimum Description Length
|last=
|first=
|date=
|publisher=[[University of Helsinki]]
|accessdate=2010-07-03}}</ref><ref name="mit">{{cite news
|url=http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11155
|title=the Minimum Description Length principle
|last=Grünwald
|first=P.
|date=June 2007
|publisher=[[MIT Press]]
|accessdate=2010-07-03}}</ref><ref name="catalog">{{cite news
|url=http://mitpress.mit.edu/catalog/item/default.asp?sid=4C100C6F-2255-40FF-A2ED-02FC49FEBE7C&ttype=2&tid=10478
|title=Advances in Minimum Description Length: Theory and Applications
|last=Grünwald
|first=P
|date=April 2005
|publisher=[[MIT Press]]
|accessdate=2010-07-03}}</ref>


==Overview==
Any set of data can be represented by a string of [[symbols]] from a finite (say, [[binary numeral system|binary]]) [[alphabet]].

<blockquote>"[The MDL Principle] is based on the following insight: any regularity in a given set of data can be used to [[data compression|compress the data]], i.e. to describe it using fewer symbols than needed to describe the data literally." (Grünwald, 1998)<ref name="peter">{{cite news
|url=http://homepages.cwi.nl/~pdg/ftp/mdlintro.pdf
|title=MDL Tutorial
|last=Grünwald
|first=Peter
|date=
|publisher=
|accessdate=2010-07-03}}</ref></blockquote>
 
To select the hypothesis that captures the most regularity in the data, scientists look for the hypothesis with which the best compression can be achieved. In order to do this, a [[code]] is fixed to compress the data, most generally with a ([[Turing-complete]]) [[computer]] [[formal language|language]].  A [[computer program|program]] to output the data is written in that language; thus the program effectively represents the data.  The length of the shortest program that outputs the data is called the [[Kolmogorov complexity]] of the data.  This is the central idea of [[Ray Solomonoff|Ray Solomonoff's]] idealized theory of [[inductive inference]].
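
As a rough illustration of this idea (a sketch only: a general-purpose compressor is used as a computable stand-in for the shortest program, and true Kolmogorov complexity is uncomputable), a highly regular string can be described far more briefly than one without apparent regularity:

<syntaxhighlight lang="python">
import os
import zlib

# A highly regular sequence of 1,000 symbols: "01" repeated 500 times.
regular = b"01" * 500

# A sequence with no apparent regularity: 1,000 random bytes.
irregular = os.urandom(1000)

# Compressed size serves here as a crude, computable proxy for
# description length; it is not the Kolmogorov complexity itself.
print(len(zlib.compress(regular)))    # far fewer than 1,000 bytes
print(len(zlib.compress(irregular)))  # about 1,000 bytes or slightly more
</syntaxhighlight>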
 
===Inference===
However, this mathematical theory does not provide a practical way of reaching an inference. The most important reasons for this are:
* Kolmogorov complexity is [[computability theory|uncomputable]]: there exists no algorithm that, when input an arbitrary sequence of data, outputs the shortest program that produces the data.
* Kolmogorov complexity depends on what [[computer language]] is used. This is an arbitrary choice, but it does influence the complexity up to some constant additive term. For that reason, constant terms tend to be disregarded in Kolmogorov complexity theory. In practice, however, where often only a small amount of data is available, such constants may have a very large influence on the inference results: good results cannot be guaranteed when one is working with limited data.
 
MDL attempts to remedy these problems by:
* Restricting the set of allowed codes in such a way that it becomes possible (computable) to find the shortest codelength of the data, relative to the allowed codes, and
* Choosing a code that is reasonably efficient, whatever the data at hand. This point is somewhat elusive and much research is still going on in this area.
 
Rather than "programs", in MDL theory one usually speaks of candidate hypotheses, models or codes. The set of allowed codes is then called the model class. (Some authors refer to the model class as the model.) The code is then selected for which the sum of the description of the code and the description of the data using the code is minimal.
 
One of the important properties of MDL methods is that they provide a natural safeguard against [[overfitting]], because they implement a tradeoff between the complexity of the hypothesis (model class) and the complexity of the data given the hypothesis.<ref name="mit"/> An illustration is given in the following example.
 
==Example of MDL==
A coin is flipped 1,000 times and the numbers of heads and tails are recorded. Consider two model classes:
*The first is a code that represents outcomes with a 0 for heads or a 1 for tails. This code represents the hypothesis that the coin is fair. The code length according to this code is always exactly 1,000 bits.
*The second consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair. Say that we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1,000 bits.
 
For this reason a naive statistical method might choose the second model as a better explanation for the data. However, an MDL approach would construct a single code for the whole model class, instead of simply using the best code in it. The simplest way to do this is to use a two-part code in which the best-performing element of the model class is specified first, and the data is then encoded using that code. A substantial number of bits is needed to specify which code to use; thus the total codelength based on the second model class could be larger than 1,000 bits. The conclusion when following an MDL approach is therefore that there is not enough evidence to support the hypothesis of the biased coin, even though the best element of the second model class provides a better fit to the data.
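
The calculation can be sketched numerically as follows; the <math>\tfrac{1}{2}\log_2 n</math> bits charged below for specifying the estimated bias is one common choice of parameter cost used purely for illustration, not the only possible one.

<syntaxhighlight lang="python">
from math import log2

n, heads = 1000, 510
p_hat = heads / n                      # maximum-likelihood bias, 0.51

def binary_entropy(p):
    """Binary entropy in bits."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

# Model class 1: the fair-coin code spends exactly 1 bit per flip.
fair_bits = n                          # 1000 bits

# Model class 2 (two-part code): first specify which biased-coin code is
# used (charged at 0.5 * log2(n) bits here), then encode the data with
# that code, which takes n * H(p_hat) bits.
param_bits = 0.5 * log2(n)             # about 5.0 bits
data_bits = n * binary_entropy(p_hat)  # about 999.7 bits

print(fair_bits, param_bits + data_bits)
# 1000 vs. about 1004.7: the fair-coin hypothesis gives the shorter
# total description, matching the conclusion above.
</syntaxhighlight>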
 
==MDL Notation==
Central to MDL theory is the [[one-to-one correspondence]] between code length [[function (mathematics)|functions]] and [[probability distribution]]s. (This follows from the [[Kraft-McMillan theorem|Kraft-McMillan inequality]].) For any probability distribution <math>P</math>, it is possible to construct a code <math>C</math> such that the length (in bits) of <math>C(x)</math> is equal to <math>-\log_2 P(x)</math>; this code minimizes the expected code length. Vice versa, given a code <math>C</math>, one can construct a probability distribution <math>P</math> such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code reduces to searching for a good probability distribution, and vice versa.
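
This correspondence can be made concrete with a small sketch (the four-symbol distribution below is arbitrary, and rounding is ignored as above):

<syntaxhighlight lang="python">
from math import log2

# An arbitrary probability distribution over a four-symbol alphabet.
P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Idealized code lengths L(x) = -log2 P(x), in bits.
L = {x: -log2(p) for x, p in P.items()}    # a:1, b:2, c:3, d:3

# These lengths satisfy the Kraft inequality (sum of 2^-L(x) <= 1),
# so a prefix code with exactly these lengths exists.
print(sum(2 ** -length for length in L.values()))   # 1.0

# Conversely, any code length function defines a (sub-)distribution.
Q = {x: 2 ** -length for x, length in L.items()}    # recovers P here
</syntaxhighlight>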
 
== Related concepts ==
MDL is very strongly connected to [[probability theory]] and [[statistics]] through the correspondence between codes and probability distributions mentioned above. This has led researchers such as David MacKay to view MDL as equivalent to [[Bayesian inference]]: the code length of the model and the code length of the model and data together in MDL correspond to the [[prior probability]] and the [[marginal likelihood]], respectively, in the Bayesian framework.<ref name="mackay">{{cite news
|url=http://www.inference.phy.cam.ac.uk/mackay/itila/
|title=Information Theory, Inference, and Learning Algorithms
|last=MacKay
|first=David
|year=2003
|publisher=[[Cambridge University Press]]
|accessdate=2010-07-03}}</ref>
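
One facet of this correspondence can be sketched numerically (the priors and likelihoods below are arbitrary illustrative values): the hypothesis with the shortest two-part code length <math>-\log_2 P(H) - \log_2 P(D\mid H)</math> is exactly the hypothesis with the highest posterior probability.

<syntaxhighlight lang="python">
from math import log2

# Two candidate hypotheses with prior probabilities and likelihoods
# for some observed data D (the numbers are purely illustrative).
prior      = {"H1": 0.7, "H2": 0.3}    # P(H)
likelihood = {"H1": 0.01, "H2": 0.05}  # P(D | H)

# Two-part code length: L(H) + L(D | H) = -log2 P(H) - log2 P(D | H).
codelength = {h: -log2(prior[h]) - log2(likelihood[h]) for h in prior}

# Unnormalized posterior: P(H) * P(D | H).
posterior = {h: prior[h] * likelihood[h] for h in prior}

# The shortest two-part code picks out the maximum a posteriori hypothesis.
print(min(codelength, key=codelength.get))   # H2
print(max(posterior, key=posterior.get))     # H2
</syntaxhighlight>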
 
While Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also accommodates other codes that are not Bayesian. An example is the Shtarkov ''normalized maximum likelihood code'', which plays a central role in current MDL theory, but has no equivalent in Bayesian inference. Furthermore, Rissanen stresses that we should make no assumptions about the ''true'' [[data generating process]]{{disambiguation needed|date=December 2013}}: in practice, a model class is typically a simplification of reality and thus does not contain any code or probability distribution that is true in any objective sense.<ref name="cwi">{{cite news
|url=http://www.mdl-research.org/jorma.rissanen/
|title=Homepage of Jorma Rissanen
|last=Rissanen
|first=Jorma
|date=
|publisher=
|accessdate=2010-07-03}}</ref><ref name="springer">{{cite news
|url=http://www.springer.com/computer/foundations/book/978-0-387-36610-4
|title=Information and Complexity in Statistical Modeling
|last=Rissanen
|first=J.
|year=2007
|publisher=Springer
|accessdate=2010-07-03}}</ref> In the latter reference, Rissanen bases the mathematical underpinning of MDL on the [[Kolmogorov structure function]].
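
For the Bernoulli model class the normalized maximum likelihood code length can be computed exactly, because sequences can be grouped by their number of ones; the following sketch (an illustration, not taken from the cited references) evaluates it for the coin example above.

<syntaxhighlight lang="python">
from math import comb, log2

def bernoulli_nml_bits(n, k):
    """NML code length, in bits, of a binary sequence of length n
    containing k ones, under the Bernoulli model class."""
    def max_likelihood(j):
        # Maximized likelihood of any particular sequence with j ones.
        if j == 0 or j == n:
            return 1.0
        p = j / n
        return p ** j * (1 - p) ** (n - j)

    # Normalizer (the "parametric complexity"): the sum of maximized
    # likelihoods over all 2^n sequences, grouped by the number of ones.
    normalizer = sum(comb(n, j) * max_likelihood(j) for j in range(n + 1))
    return -log2(max_likelihood(k) / normalizer)

# The 1,000-flip example with 510 heads from the section above:
print(bernoulli_nml_bits(1000, 510))   # roughly 1005 bits, i.e. more than 1000
</syntaxhighlight>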
 
According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe [[Prior probability|priors]] that would lead to poor results. The priors that are acceptable from an MDL point of view also tend to be favored in so-called ''[[objective Bayesian]]'' analysis; there, however, the motivation is usually different.<ref name="volker">{{cite news
|url=http://volker.nannen.com/pdf/short_introduction_to_model_selection.pdf
|title=A short introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length.
|last=Nannen
|first=Volker
|date=
|publisher=
|accessdate=2010-07-03}}</ref>
 
===Other Systems===
MDL was not the first [[information theory|information-theoretic]] approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called [[Minimum Message Length]] (MML). The difference between MDL and MML is a source of ongoing confusion. Superficially, the methods appear mostly equivalent, but there are some significant differences, especially in interpretation:
 
* MML is a fully subjective Bayesian approach: it starts from the idea that one represents one's beliefs about the data generating process in the form of a prior distribution. MDL avoids assumptions about the data generating process.
* Both methods make use of two-part codes: the first part always represents the information that one is trying to learn, such as the index of a model class ([[model selection]]), or parameter values ([[parameter estimation]]); the second part is an encoding of the data given the information in the first part. The difference between the methods is that, in the MDL literature, it is advocated that unwanted parameters should be moved to the second part of the code, where they can be represented with the data by using a so-called [[one-part code]], which is often more efficient than a two-part code. In the original description of MML, all parameters are encoded in the first part, so all parameters are learned.
 
==References==
{{Reflist}}
 
==Further reading==
* [http://www.mdl-research.org/ Minimum Description Length on the Web], by the University of Helsinki. Features readings, demonstrations, events and links to MDL researchers.
* [http://www.mdl-research.org/jorma.rissanen/ Homepage of Jorma Rissanen], containing lecture notes and other recent material on MDL.
* [http://www.cwi.nl/~pdg/ Homepage of Peter Grünwald], containing his tutorial on MDL.
*J. Rissanen, [http://www.springer.com/computer/foundations/book/978-0-387-36610-4 Information and Complexity in Statistical Modeling], Springer, 2007.
*[http://mitpress.mit.edu/catalog/item/default.asp?sid=4C100C6F-2255-40FF-A2ED-02FC49FEBE7C&ttype=2&tid=10478 Advances in Minimum Description Length: Theory and Applications], MIT Press, April 2005, ISBN 0-262-07262-9.
* [[David MacKay (scientist)|David MacKay]], [http://www.inference.phy.cam.ac.uk/mackay/itila/ Information Theory, Inference, and Learning Algorithms], Cambridge University Press, 2003.
 
{{Statistics}}
{{Least Squares and Regression Analysis}}
 
{{DEFAULTSORT:Minimum Description Length}}
[[Category:Algorithmic information theory]]
