|
|
| <!-- EDITORS! Please see [[Wikipedia:WikiProject Probability#Standards]] for a discussion of standards used for probability distribution articles such as this one.
| -->{{Probability distribution|
| |
| name =Zipf's law|
| |
| type =mass|
| |
| pdf_image =[[Image:Zipf distribution PMF.png|325px|Plot of the Zipf PMF for ''N'' = 10]]<br /><small>Zipf PMF for ''N'' = 10 on a log–log scale. The horizontal axis is the index ''k'' . (Note that the function is only defined at integer values of ''k''. The connecting lines do not indicate continuity.)</small>|
| |
| cdf_image =[[Image:Zipf distribution CMF.png|325px|Plot of the Zipf CDF for N=10]]<br /><small>Zipf CDF for ''N'' = 10. The horizontal axis is the index ''k'' . (Note that the function is only defined at integer values of ''k''. The connecting lines do not indicate continuity.)</small>|
| |
| parameters =<math>s>0\,</math> ([[real number|real]])<br /><math>N \in \{1,2,3\ldots\}</math> ([[integer]])|
| |
| support =<math>k \in \{1,2,\ldots,N\}</math>|
| |
| pdf =<math>\frac{1/k^s}{H_{N,s}}</math>|
| |
| cdf =<math>\frac{H_{k,s}}{H_{N,s}}</math>|
| |
| mean =<math>\frac{H_{N,s-1}}{H_{N,s}}</math>|
| |
| median =|
| |
| mode =<math>1\,</math>|
| |
| variance =|
| |
| skewness =|
| |
| kurtosis =|
| |
| entropy =<math>\frac{s}{H_{N,s}}\sum_{k=1}^N\frac{\ln(k)}{k^s}+\ln(H_{N,s})</math>|
| |
| mgf =<math>\frac{1}{H_{N,s}}\sum_{n=1}^N \frac{e^{nt}}{n^s}</math>|
| |
| char =<math>\frac{1}{H_{N,s}}\sum_{n=1}^N \frac{e^{int}}{n^s}</math>|
| |
| }}
| |
| | |
| '''Zipf's law''' {{IPAc-en|ˈ|z|ɪ|f}}, an [[empirical law]] formulated using [[mathematical statistics]], refers to the fact that many types of data studied in the [[physical science|physical]] and [[social science|social]] sciences can be approximated with a Zipfian distribution, one of a family of related discrete [[power law]] [[probability distribution]]s. The law is named after the American [[linguistics|linguist]] [[George Kingsley Zipf]] (1902–1950), who first proposed it (Zipf 1935, 1949), though the French stenographer [[Jean-Baptiste Estoup]] (1868–1950) appears to have noticed the regularity before Zipf.<ref>Christopher D. Manning, Hinrich Schütze ''Foundations of Statistical Natural Language Processing'', MIT Press (1999), ISBN 978-0-262-13360-9, p. 24</ref> It was also noted in 1913 by the German physicist [[Felix Auerbach]] (1856–1933).<ref name=Auerbach1913>Auerbach F (1913) Das Gesetz der Bevölkerungskonzentration. Petermanns Geogr Mitt 59: 74–76</ref>
| | |
| ==Motivation==
| |
| | |
| Zipf's law states that given some [[Text corpus|corpus]] of [[natural language]] utterances, the frequency of any word is [[inversely proportional]] to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the [[Brown Corpus]] of American English text, the word "[[English articles#Definite article|the]]" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.<ref>{{citation|contribution=An introduction to textual econometrics|first1=Stephen|last1=Fagan|first2=Ramazan|last2=Gençay|pages=133–153|title=Handbook of Empirical Economics and Finance|editor1-first=Aman|editor1-last=Ullah|editor2-first=David E. A.|editor2-last=Giles|publisher=CRC Press|year=2010|isbn=9781420070361}}. [http://books.google.com/books?hl=en&lr=&id=QAUv9R6bJzwC&oi=fnd&pg=PA139 P. 139]: "For example, in the Brown Corpus, consisting of over one million words, half of the word volume consists of repeated uses of only 135 words."</ref>
| |
| | |
| The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.<ref name="Auerbach1913"/> Empirically, a data set can be tested to see whether Zipf's law applies by running the regression log ''R'' = ''a'' − ''b'' log ''n'', where ''R'' is the rank of the datum, ''n'' is its value, and ''a'' and ''b'' are constants. Zipf's law applies when ''b'' = 1. When this regression is applied to cities, a better fit has been found with ''b'' = 1.07. While Zipf's law holds for the upper tail of the distribution, the entire distribution of cities is log-normal and follows [[Gibrat's law]].<ref name="Eeckhout2004">Eeckhout J. (2004), Gibrat's law for (All) Cities. American Economic Review 94(5), 1429-1451.</ref> The two laws are consistent because a log-normal tail typically cannot be distinguished from a [[Pareto distribution|Pareto]] (Zipf) tail.
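The regression test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name <code>fit_rank_size</code> is hypothetical, the fit is ordinary least squares, and the input is synthetic data constructed to satisfy the law exactly rather than real city populations.

```python
import math

def fit_rank_size(values):
    """Ordinary least-squares fit of log R = a - b log n; returns (a, b).

    `values` are positive sizes (for example city populations).
    Rank 1 is assigned to the largest value.
    """
    ordered = sorted(values, reverse=True)
    xs = [math.log(n) for n in ordered]                      # log of the value n
    ys = [math.log(r) for r in range(1, len(ordered) + 1)]   # log of the rank R
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope  # the model is written with a minus sign in front of b

# Synthetic (not real) data: the value of the item of rank r is 1,000,000 / r.
sizes = [1_000_000 / r for r in range(1, 101)]
a, b = fit_rank_size(sizes)
print(f"a = {a:.4f}, b = {b:.4f}")
```

On real data the estimate of ''b'' depends on how much of the lower tail is included, which is consistent with the remark above that the law describes the upper tail of the city-size distribution.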
| |
| | |
| ==Theoretical review==
| |
| Zipf's law is most easily observed by [[graph of a function|plotting]] the data on a [[log-log]] graph, with the axes being [[logarithm|log]] (rank order) and log (frequency). For example, the word "the" (as described above) would appear at ''x'' = log(1), ''y'' = log(69971). The data conform to Zipf's law to the extent that the plot is [[linear equation|linear]].
| |
| | |
| Formally, let:
| |
| * ''N'' be the number of elements;
| |
| * ''k'' be their rank;
| |
| * ''s'' be the value of the exponent characterizing the distribution.
| |
| Zipf's law then predicts that out of a population of ''N'' elements, the frequency of elements of rank ''k'', ''f''(''k'';''s'',''N''), is:
| |
| | |
| :<math>f(k;s,N)=\frac{1/k^s}{\sum_{n=1}^N (1/n^s)}.</math>
| |
| | |
| Zipf's law holds if the numbers of occurrences of the elements are independent and identically distributed random variables with power law distribution <math>p(f) = \alpha f^{-1-1/s}.</math><ref>Adamic, Lada A. [http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html "Zipf, Power-laws, and Pareto - a ranking tutorial"]</ref>
| |
| | |
| In the example of the frequency of words in the English language, ''N'' is the number of words in the English language and, if we use the classic version of Zipf's law, the exponent ''s'' is 1. ''f''(''k''; ''s'',''N'') will then be the fraction of the time the ''k''th most common word occurs.
| |
| | |
| The law may also be written:
| |
| | |
| :<math>f(k;s,N)=\frac{1}{k^s H_{N,s}}</math>
| |
| | |
| where ''H<sub>N,s</sub>'' is the ''N''th generalized [[harmonic number]].
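The definition above translates directly into code. The following Python sketch evaluates the generalized harmonic number by direct summation and uses it for the probability mass function and the mean <math>H_{N,s-1}/H_{N,s}</math>; the function names are illustrative and not taken from any library.

```python
def generalized_harmonic(n, s):
    """N-th generalized harmonic number H_{N,s} = sum_{k=1}^{N} 1/k^s."""
    return sum(1.0 / k ** s for k in range(1, n + 1))

def zipf_pmf(k, s, n):
    """f(k; s, N) = 1 / (k^s * H_{N,s}) for k in 1..N, and 0 otherwise."""
    if not 1 <= k <= n:
        return 0.0
    return 1.0 / (k ** s * generalized_harmonic(n, s))

def zipf_mean(s, n):
    """Mean of the distribution, H_{N,s-1} / H_{N,s}."""
    return generalized_harmonic(n, s - 1) / generalized_harmonic(n, s)

# Classic case s = 1 with N = 10 elements.
for k in range(1, 11):
    print(k, round(zipf_pmf(k, 1.0, 10), 4))
print("mean:", round(zipf_mean(1.0, 10), 4))
```

Direct summation costs ''O''(''N'') per call, so for large ''N'' it may be preferable to compute the normalizing constant once and reuse it.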
| |
| | |
| The simplest case of Zipf's law is a "<sup>1</sup>⁄<sub>''f''</sub> function". Given a set of Zipfian distributed frequencies, sorted from most common to least common, the second most common frequency will occur ½ as often as the first. The third most common frequency will occur ⅓ as often as the first. The ''n''<sup>th</sup> most common frequency will occur <sup>1</sup>⁄<sub>''n''</sub> as often as the first. However, this cannot hold exactly, because items must occur an integer number of times; there cannot be 2.5 occurrences of a word. Nevertheless, over fairly wide ranges, and to a fairly good approximation, many natural phenomena obey Zipf's law.
| |
| | |
| Mathematically, the classic law with ''s'' = 1 cannot hold exactly for an infinite number of elements, because the normalizing sum of the frequencies would then be the [[Harmonic series (mathematics)|harmonic series]], which diverges:
| |
| | |
| :<math>\sum_{n=1}^\infty \frac{1}{n}=\infty.\!</math>
| |
| | |
| In human languages, word frequencies have a very heavy-tailed distribution, and can therefore be modeled reasonably well by a Zipf distribution with an ''s'' close to 1.
| |
| | |
| As long as the exponent ''s'' exceeds 1, it is possible for such a law to hold with infinitely many words, since if ''s'' > 1 then
| |
| :<math>\zeta (s) = \sum_{n=1}^\infty \frac{1}{n^s}<\infty. \!</math>
| where ζ is [[Riemann zeta function|Riemann's zeta function]].
| |
| | |
| ==Statistical explanation==
| |
| It is not known why Zipf's law holds for most languages.<ref>[[Léon Brillouin]], ''La science et la théorie de l'information'', 1959, reissued 1988, English translation reissued 2004</ref> However, it may be partially explained by the statistical analysis of randomly generated texts. Wentian Li has shown that in a document in which each character has been chosen randomly from a uniform distribution of all letters (plus a space character), the "words" follow the general trend of Zipf's law (appearing approximately linear on a log-log plot).<ref>{{cite journal |author=Wentian Li |title=Random Texts Exhibit Zipf's-Law-Like Word Frequency Distribution |url=http://www.nslij-genetics.org/wli/pub/ieee92_pre.pdf |journal=[[IEEE Transactions on Information Theory]] |volume=38 |issue=6 |year=1992 |pages=1842–1845 |doi=10.1109/18.165464}}</ref> [[Vitold Belevitch]], in the paper ''On the Statistical Laws of Linguistic Distributions'', offered a mathematical derivation. He took a large class of well-behaved [[statistical distribution]]s (not only the [[normal distribution]]) and expressed them in terms of rank. He then expanded each expression into a [[Taylor series]]. In every case Belevitch obtained the remarkable result that a first-order truncation of the series resulted in Zipf's law. Further, a second-order truncation of the Taylor series resulted in [[Zipf–Mandelbrot law|Mandelbrot's law]].<ref>[[Peter G. Neumann|Neumann, Peter G.]] [http://www.csl.sri.com/users/neumann/#12a "Statistical metalinguistics and Zipf/Pareto/Mandelbrot"], ''SRI International Computer Science Laboratory'', accessed and [http://www.webcitation.org/5z2UByabR archived] 29 May 2011.</ref><ref>{{cite journal
| |
| | author = Belevitch V
| |
| | title = On the statistical laws of linguistic distributions
| |
| | journal = Annales de la Société Scientifique de Bruxelles
| |
| | volume = 73
| |
| | series = I
| |
| | date = 18 December 1959
| |
| | pages = 310–326.
| |
| }}</ref>
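The random-text experiment attributed to Li above can be imitated with a short simulation. The sketch below is not Li's procedure in detail: the four-letter alphabet, the text length, and the fixed seed are arbitrary assumptions chosen to keep the run small and repeatable, and the function name is hypothetical.

```python
import random
from collections import Counter

def random_text_word_counts(num_chars, alphabet="abcd", seed=0):
    """Count the "words" in a text whose characters are drawn uniformly
    from `alphabet` plus the space character."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    symbols = alphabet + " "
    text = "".join(rng.choice(symbols) for _ in range(num_chars))
    return Counter(text.split())        # split() discards runs of spaces

counts = random_text_word_counts(200_000)
ranked = counts.most_common()           # (word, count) pairs, most frequent first

# Print a few (rank, count) pairs; plotted on log-log axes they fall
# roughly on a line, with steps where the word length changes.
for rank in (1, 2, 4, 8, 16, 32, 64):
    if rank <= len(ranked):
        word, count = ranked[rank - 1]
        print(rank, word, count)
```

Because all words of the same length are equally likely in such a text, the rank-frequency curve is a staircase rather than a smooth line; the Zipf-like trend appears only when the steps are viewed on a log-log scale.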
| |
| | |
| Zipf himself proposed that neither speakers nor hearers using a given language want to work any harder than necessary to reach understanding, and the process that results in approximately equal distribution of effort leads to the observed Zipf distribution.<ref>{{cite book
| |
| | author = Zipf GK
| |
| | title = Human Behavior and the Principle of Least Effort
| |
| | location = Cambridge, Massachusetts
| |
| | publisher = Addison-Wesley
| |
| | year =1949
| |
| | page = 1
| |
| }}</ref><ref>{{cite journal
| |
| | author = Ramon Ferrer i Cancho and Ricard V. Sole
| |
| | year=2003
| |
| | title=Least effort and the origins of scaling in human language
| |
| | url=http://www.pnas.org/content/100/3/788.abstract?sid=cc7fae18-87c9-4b67-863a-4195bb47c1d1
| |
| | journal=[[Proceedings of the National Academy of Sciences of the United States of America]]
| |
| | volume=100 | pages=788–791| issue=3
| |
| | doi = 10.1073/pnas.0335980100
| |
| | pmid = 12540826 |pmc=298679}}</ref>
| |
| | |
| ==Related laws==
| |
| | |
| [[Image:Wikipedia-n-zipf.png|thumb|left|320px|A plot of word frequency in Wikipedia (November 27, 2006). The plot is in [[log-log]] coordinates. ''x'' is rank of a word in the frequency table; ''y'' is the total number of the word’s occurrences. Most popular words are "the", "of" and "and", as expected. Zipf's law corresponds to the upper linear portion of the curve, roughly following the green (1/''x'') line.]]
| |
| | |
| ''Zipf's law'' now refers more generally to frequency distributions of "rank data," in which the relative frequency of the ''n''th-ranked item is given by the [[Zeta distribution]], 1/(''n''<sup>''s''</sup>ζ(''s'')), where the parameter ''s'' > 1 indexes the members of this family of [[probability distribution]]s. Indeed, ''Zipf's law'' is sometimes synonymous with "zeta distribution," since probability distributions are sometimes called "laws". This distribution is sometimes called the '''Zipfian''' or '''Yule''' distribution.
| |
| | |
| A generalization of Zipf's law is the [[Zipf–Mandelbrot law]], proposed by [[Benoît Mandelbrot]], whose frequencies are:
| |
| | |
| :<math>f(k;N,q,s)=[\mbox{constant}]/(k+q)^s.\,</math>
| |
| | |
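| The "constant" is the normalizing factor that makes the frequencies sum to 1. In the limit as ''N'' approaches infinity (which requires ''s'' > 1), it is the reciprocal of the [[Hurwitz zeta function]] ζ(''s'', ''q'' + 1).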
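For a finite number of elements the normalizing constant can simply be summed. The following Python sketch does so; the function name is illustrative, and direct summation is an assumption made for clarity rather than a recommended numerical method.

```python
def zipf_mandelbrot_pmf(k, n, q, s):
    """f(k; N, q, s) = (1 / (k + q)^s) / sum_{i=1}^{N} 1 / (i + q)^s."""
    if not 1 <= k <= n:
        return 0.0
    norm = sum(1.0 / (i + q) ** s for i in range(1, n + 1))
    return 1.0 / ((k + q) ** s * norm)

# With q = 0 the formula reduces to Zipf's law.
print(zipf_mandelbrot_pmf(1, 10, 0.0, 1.0))
# A positive q flattens the head of the distribution.
print(zipf_mandelbrot_pmf(1, 10, 2.7, 1.0))
```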
| |
| | |
| Zipfian distributions can be obtained from [[Pareto distribution]]s by an exchange of variables.<ref name="Galien"/>
| |
| | |
| The Zipf distribution is sometimes called the '''discrete Pareto distribution'''<ref>{{cite book|title=Univariate Discrete Distributions|edition=second|year=1992|author=N. L. Johnson, S. Kotz, and A. W. Kemp|publisher=John Wiley & Sons, Inc.|location=New York|isbn=0-471-54897-9|ref=harv}}, p. 466.</ref> because it is analogous to the continuous [[Pareto distribution]] in the same way that the [[Uniform distribution (discrete)|discrete uniform distribution]] is analogous to the [[Uniform distribution (continuous)|continuous uniform distribution]].
| |
| | |
| The tail frequencies of the [[Yule–Simon distribution]] are approximately
| |
| | |
| :<math>f(k;\rho) \approx [\mbox{constant}]/k^{\rho+1}</math>
| |
| | |
| for any choice of ''ρ'' > 0.
| |
| | |
| In the [[parabolic fractal distribution]], the logarithm of the frequency is a quadratic polynomial of the logarithm of the rank. This can markedly improve the fit over a simple power-law relationship.<ref name="Galien">{{cite web |url=http://home.zonnet.nl/galien8/factor/factor.html |title=Factorial randomness: the Laws of Benford and Zipf with respect to the first digit distribution of the factor sequence from the natural numbers |author=Johan Gerard van der Galien |date=2003-11-08}}</ref> As with fractal dimension, it is possible to calculate a Zipf dimension, which is a useful parameter in the analysis of texts.<ref>Ali Eftekhari (2006) Fractal geometry of texts. ''Journal of Quantitative Linguistics'' 13(2-3): 177 – 193.</ref>
| |
| | |
| It has been argued that [[Benford's law]] is a special bounded case of Zipf's law,<ref name="Galien"/> with the connection between these two laws being explained by their both originating from scale invariant functional relations from statistical physics and critical phenomena.<ref>L. Pietronero, E. Tosatti, V. Tosatti, A. Vespignani (2001) Explaining the uneven distribution of numbers in nature: The laws of Benford and Zipf. ''Physica A'' 293: 297 – 304.</ref> The ratios of probabilities in Benford's law are not constant.
| |
| {| class="wikitable" style="text-align: center; font-size:8pt;"
| |
| |-
| |
| !<math>n</math>
| |
| !Benford's law: <math>P(n) = </math><br/><math>\log_{10}(n+1)-\log_{10}(n)</math>
| |
| !<math>\tfrac{\log(P(n)/P(n-1))}{\log(n/(n-1))}</math>
| |
| |-
| |
| | 1
| |
| | 0.30103000
| |
| |
| |
| |-
| |
| | 2
| |
| | 0.17609126
| |
| | -0.7735840
| |
| |-
| |
| | 3
| |
| | 0.12493874
| |
| | -0.8463832
| |
| |-
| |
| | 4
| |
| | 0.09691001
| |
| | -0.8830605
| |
| |-
| |
| | 5
| |
| | 0.07918125
| |
| | -0.9054412
| |
| |-
| |
| | 6
| |
| | 0.06694679
| |
| | -0.9205788
| |
| |-
| |
| | 7
| |
| | 0.05799195
| |
| | -0.9315169
| |
| |-
| |
| | 8
| |
| | 0.05115252
| |
| | -0.9397966
| |
| |-
| |
| | 9
| |
| | 0.04575749
| |
| | -0.9462848
| |
| |}
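The table can be reproduced with a few lines of Python. The sketch below computes the Benford probability ''P''(''n'') and the local exponent log(''P''(''n'')/''P''(''n''−1)) / log(''n''/(''n''−1)) from the column headers; the function names are illustrative.

```python
import math

def benford_p(n):
    """Benford probability of leading digit n: log10(n + 1) - log10(n)."""
    return math.log10(n + 1) - math.log10(n)

def local_exponent(n):
    """Slope between digits n - 1 and n on a log-log plot (n >= 2)."""
    return math.log(benford_p(n) / benford_p(n - 1)) / math.log(n / (n - 1))

for n in range(1, 10):
    slope = f"{local_exponent(n):.7f}" if n > 1 else ""
    print(n, f"{benford_p(n):.8f}", slope)
```

A pure Zipf law with ''s'' = 1 would give a slope of exactly −1 in every row; the computed slopes approach −1 as ''n'' grows but never reach it, which is the sense in which the ratios are not constant.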
| |
| Zipf's distribution is also applied to estimate the emergent value of networked systems and of service-oriented environments.
| |
| | |
| == See also ==
| |
| {{Div col}}
| |
| * [[Bradford's law]]
| |
| * [[Demographic gravitation]]
| |
| * [[Frequency list]]
| |
| * [[Heaps' law]]
| |
| * [[Hapax legomenon]]
| |
| * [[Lorenz curve]]
| |
| * [[Lotka's law]]
| |
| * [[Pareto distribution]]
| |
| * [[Pareto principle]], a.k.a. the "80-20 rule"
| |
| * [[Principle of least effort]]
| |
| * [[Rank-size distribution]]
| |
| * [[King effect]]
| |
| {{Div col end}}
| |
| | |
| ==References==
| |
| {{reflist}}
| |
| | |
| ==Further reading==
| |
| Primary:
| |
| * [[George K. Zipf]] (1949) ''Human Behavior and the Principle of Least Effort''. Addison-Wesley.
| |
| * George K. Zipf (1935) ''The Psychobiology of Language''. Houghton-Mifflin. (see citations at http://citeseer.ist.psu.edu/context/64879/0 )
| |
| Secondary:
| |
| * Lada Adamic. ''Zipf, Power-laws, and Pareto - a ranking tutorial''. http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
| |
| * Alexander Gelbukh and Grigori Sidorov (2001) [http://www.gelbukh.com/CV/Publications/2001/CICLing-2001-Zipf.htm "Zipf and Heaps Laws’ Coefficients Depend on Language"]. Proc. [[CICLing]]-2001, ''Conference on Intelligent Text Processing and Computational Linguistics'', February 18–24, 2001, Mexico City. Lecture Notes in Computer Science N 2004, ISSN 0302-9743, ISBN 3-540-41687-0, Springer-Verlag: 332–335.
| |
| * Damián H. Zanette (2006) "[http://xxx.arxiv.org/abs/cs.CL/0406015 Zipf's law and the creation of musical context,]" ''Musicae Scientiae 10'': 3-18.
| |
| * Kali R. (2003) "The city as a giant component: a random graph approach to Zipf's law," ''Applied Economics Letters 10'': 717-720(4)
| |
| *{{cite journal |last= Gabaix|first= Xavier|authorlink= Xavier Gabaix|date=August 1999|title= Zipf's Law for Cities: An Explanation |journal= Quarterly Journal of Economics|volume= 114|issue= 3|pages= 739–67|issn= 0033-5533|url= http://pages.stern.nyu.edu/~xgabaix/papers/zipf.pdf|doi= 10.1162/003355399556133}}
| |
| * Axtell, Robert L; [http://www.sciencemag.org/content/293/5536/1818.short Zipf distribution of US firm sizes], Science, 293, 5536, 1818, 2001, American Association for the Advancement of Science
| |
| | |
| ==External links==
| |
| {{commons category}}
| |
| *{{Cite news | last = Strogatz | first = Steven | authorlink = Steven Strogatz | title = Guest Column: Math and the City | date = 2009-05-19 | url = http://judson.blogs.nytimes.com/2009/05/19/math-and-the-city/ | accessdate = 2009-05-29 | postscript = <!--None--> | work=The New York Times}}, an article on Zipf's law applied to city populations
| |
| *[http://www.theatlantic.com/issues/2002/04/rauch.htm Seeing Around Corners (Artificial societies turn up Zipf's law)]
| |
| *[http://planetmath.org/encyclopedia/ZipfsLaw.html PlanetMath article on Zipf's law]
| |
| *[http://www.hubbertpeak.com/laherrere/fractal.htm Distributions de type "fractal parabolique" dans la Nature (French, with English summary)]
| |
| *[http://www.newscientist.com/article.ns?id=mg18524904.300 An analysis of income distribution]
| |
| *[http://www.lexique.org/listes/liste_mots.txt Zipf List of French words]
| |
| *[http://1.1o1.in/en/webtools/semantic-depth Zipf list for English, French, Spanish, Italian, Swedish, Icelandic, Latin, Portuguese and Finnish from Gutenberg Project and online calculator to rank words in texts]
| |
| *[http://uk.arxiv.org/abs/physics/9901035 Citations and the Zipf–Mandelbrot's law]
| |
| *[http://demonstrations.wolfram.com/ZipfsLawForUSCities/ Zipf's Law for U.S. Cities] by Fiona Maclachlan, [[Wolfram Demonstrations Project]].
| |
| * {{MathWorld |title=Zipf's Law |urlname=ZipfsLaw}}
| |
| *[http://www.geoffkirby.co.uk/ZIPFSLAW.pdf Zipf's Law examples and modelling (1985)]
| |
| *[http://www.nature.com/nature/journal/v474/n7350/full/474164a.html Complex systems: Unzipping Zipf's law (2011)]
| |
| *[http://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/ Benford’s law, Zipf’s law, and the Pareto distribution] by Terence Tao.
| |
| | |
| {{ProbDistributions|discrete-finite}}
| |
| | |
| [[Category:Discrete distributions]]
| |
| [[Category:Computational linguistics]]
| |
| [[Category:Power laws]]
| |
| [[Category:Statistical laws]]
| |
| [[Category:Empirical laws]]
| |
| [[Category:Tails of probability distributions]]
| |
| [[Category:Quantitative linguistics]]
| |
| [[Category:Bibliometrics]]
| |