Talk:Minimum description length
The following comment pertains to the previous version of this page, of which only traces remain in the current version. Any new comments are welcome! I'll check this discussion page now and then for the next couple of weeks (september 2004).
see the disccussion Minimum message length which may be relevant to stress the difference between MML and MDL (perhaps the easiest is in the historical sense / pratical issues) - Meduz 12:15, 4 August 2006 (UTC)
I'm sure this means something to somebody, but it needs more context to help out those of us that aren't in whatever field this is (the theory of computation, maybe?).
- what kind of "model" are we talking about?
- what does "in the light of data" mean?
- Who are Wallace and Boulton? Shannon?
- Lots of stuff needs to be wikified (i.e. turned into Wikipedia links)
- Can the second paragraph be clarified? What's the take-home message that it's trying to say?
- What's the "MDL community"?
One more question:
Is it really true that the same priors tend to be favored in "objective Bayesian" analysis as in MDLtheory?
Yes, this is true: The problem with Bayesian probability is that, since the prior is subjective, any probabilities you compute no longer necessarily correspond to frequencies in a repeated experiment, as in classical frequentist probability theory. Objective Bayesians try to get the best of both worlds by using priors that, while not in themselves corresponding to frequencies of any kind, converge as quickly as possible to posterior distributions under which any predictions closely correspond to actual frequencies that you observe in real life experiments. This means that priors should use as little assumptions about the data as possible. Also the prior should be invariant under the functional form of the distribution function, for example: you could parameterize the normal distribution using and , but that shouldn't change which distributions are more likely according to the prior. For these reasons, Jeffreys' prior is often advocated by Objective Bayesians. It turns out that the Bayesian probability using Jeffreys' prior behaves asymptotically the same as the NML probability that is favoured by MDL people (read: Rissanen).
All this is quite involved and I don't think people will be helped if stuff like this is included on the main page. Also, my apologies for the lack of references, I will try to provide a few on request...
I have slightly changed the introductory part of this article: it suggested, incorrectly, that MDL is basically the same thing as inference based on Kolmogorov complexity. While it is inspired on Solomonoff's theory, one of the basic properties of MDL is that it is computable, which it would not be if general computer programs were used to model the data.
This text in the article: "Central to MDL theory is the 1-1 correspondence between code length functions and probability distributions. (The lemma involved is the Kraft-McMillan inequality.)" linked to the function disambiguation page. I altered the link to point to function (mathematics). I'm fairly certain this is correct, but since the article does reference Turing machines, I'm hoping the text does not refer to subroutines by the word "function." Correct me if I am wrong. Volfy 02:07, 23 November 2005 (UTC)
No, I really meant ordinary mathematical functions, so the cross reference is correct.
I made a number of new changes to the page. Most changes are minor. The most important changes are: I included a bit about MDL's built-in protection from overfitting and I cleared up the relation to Bayesian inference and MML. I removed the cross-reference to 'reduction (mathematics)' because the phrase 'reduces to' was not intended to be taken in a technical-mathematical way. I also removed the suggestion to merge the topic with MML: the page already contains a very clear reference and the two subjects really are not the same thing. Although I do feel that the exact differences should be explained in more detail. I'll get one of my colleagues to help with this shortly.
another minor change
I added a little more on the differences between MML and MDL. I also reverted the first sentence to its previous state. Occam's razor is a general principle in the philosophy of science which has nothing to with Bayesian inference so I object to the term 'Bayesian Occam's razor', which suggests that MDL implements a specific kind of Occam's razor which has something to do with Bayes' rule. This is not the case. MDL is proposed directly as an implementation of Occam's razor and developed independently of Bayesian statistics. It is true that there is a very strong connection between MDL and Bayes; both implement Occam's razor for instance, but that doesn't mean that MDL should be defined in terms of Bayesian statistics. I welcome real improvements to the text, which is obviously far from perfect, but this doesn't really help. Finally, as a courtesy to me, if you change the first line, would you please consider motivating your changes on the discussion page?
Coin example, never selecting a biased coin model?
I think this section needs a citation, clarification, or confirmation by an expert (preferably by citation). It looks like original research as it stands, and I tend to think the conclusion is incorrect.
The simplest counter-example is that of a loss-less graphic compression algorithm. These map pixel data into a computably-feasible model space, and using techniques such as run-length encoding often effectively exploit the fact that not all images appear with equal likelihood, and that color bias and spacial consistency often allow a reduction. This amounts to a model selection. Even a random image of 1000 pixels which are selected according to a severe bias (95% black and 5% white) will be compressible to less than 1000 bits, because at those odds run-length encoding works well, as does an "all are black but these" model. With those languages, should you encounter a picture that is 50/50 random, you may need give up a constant number of bits to say "no better compression possible", but making these practical trade-offs is one of the advantages of MDL versus Kolomorov computation: as the article says, in MDL the actual constants matter.