{{Merge from|Density estimation|date=September 2013}}
{{Merge from|Multivariate kernel density estimation|date=September 2010}}

[[File:Kernel density.svg|thumb|right|250px|Kernel density estimation of 100 [[normal distribution|normally distributed]] [[random number generator|random numbers]] using different smoothing bandwidths.]]

In [[statistics]], '''kernel density estimation (KDE)''' is a [[non-parametric statistics|non-parametric]] way to [[density estimation|estimate]] the [[probability density function]] of a [[random variable]]. Kernel density estimation is a fundamental data smoothing problem where inferences about the [[statistical population|population]] are made, based on a finite data [[statistical sample|sample]]. In some fields such as [[signal processing]] and [[econometrics]] it is also termed the ''Parzen–Rosenblatt window'' method, after [[Emanuel Parzen]] and [[Murray Rosenblatt]], who are usually credited with independently creating it in its current form.<ref name="Ros1956">{{cite doi |10.1214/aoms/1177728190}}</ref><ref name="Par1962">{{cite doi |10.1214/aoms/1177704472}}</ref>

==Definition==

Let (''x''<sub>1</sub>, ''x''<sub>2</sub>, …, ''x<sub>n</sub>'') be an [[iid]] sample drawn from some distribution with an unknown [[probability density function|density]] ''ƒ''. We are interested in estimating the shape of this function ''ƒ''. Its ''kernel density estimator'' is

: <math>\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big),</math>

where ''K''(•) is the [[kernel (statistics)|kernel]] — a symmetric but not necessarily positive function that integrates to one — and {{nowrap|''h'' > 0}} is a [[smoothing]] parameter called the ''bandwidth''. A kernel with subscript ''h'' is called the ''scaled kernel'' and is defined as {{nowrap|''K<sub>h</sub>''(''x'') {{=}} 1/''h K''(''x/h'')}}. Intuitively one wants to choose ''h'' as small as the data allow; however, there is always a trade-off between the bias of the estimator and its variance. The choice of bandwidth is discussed in more detail below.

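The definition translates directly into code. The following sketch (the names are illustrative, not any library's API) evaluates the estimator with the normal kernel on a grid of points:

<source lang="python" style="overflow:auto;">
import numpy as np

def gaussian_kernel(u):
    """Standard normal kernel K(u)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(grid, data, h):
    """Evaluate the kernel density estimator at each point of `grid`."""
    u = (grid[:, None] - data[None, :]) / h   # (x - x_i)/h for every pair
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)
</source>
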
A range of [[kernel (statistics)|kernel function]]s are commonly used: [[Uniform kernel#Kernel functions in common use|uniform, triangular, biweight, triweight, Epanechnikov]], [[normal distribution|normal]], and others. The Epanechnikov kernel is optimal in a minimum variance sense,<ref>{{cite journal |doi=10.1137/1114019 |author=Epanechnikov, V.A. |title=Non-parametric estimation of a multivariate probability density |journal=Theory of Probability and its Applications |volume=14 |pages=153–158 |year=1969}}</ref> though the loss of efficiency is small for the kernels listed previously.<ref name="WJ1995">{{Cite book| author1=Wand, M.P |author2=Jones, M.C. |title=Kernel Smoothing |publisher=Chapman & Hall/CRC |location=London |year=1995 |isbn=0-412-55270-1}}</ref> Due to its convenient mathematical properties, the normal kernel {{nowrap|''K''(''x'') {{=}} ''ϕ''(''x'')}} is often used, where ''ϕ'' is the [[standard normal]] density function.

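For reference, the Epanechnikov kernel has the closed form

: <math>K(u) = \tfrac{3}{4}(1 - u^2) \quad \text{for } |u| \le 1</math>

and zero elsewhere; like the other kernels listed, it is symmetric and integrates to one.
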
The construction of a kernel density estimate finds interpretations in fields outside of density estimation. For example, in [[thermodynamics]], this is equivalent to the amount of heat generated when [[heat kernel]]s (the fundamental solution to the [[heat equation]]) are placed at each data point location ''x<sub>i</sub>''. Similar methods are used to construct [[discrete Laplace operator]]s on point clouds for [[manifold learning]].

===An example===

Kernel density estimates are closely related to [[histograms]], but can be endowed with properties such as smoothness or continuity by using a suitable kernel. To see this, we compare the construction of histogram and kernel density estimators using these 6 data points: ''x''<sub>1</sub> = −2.1, ''x''<sub>2</sub> = −1.3, ''x''<sub>3</sub> = −0.4, ''x''<sub>4</sub> = 1.9, ''x''<sub>5</sub> = 5.1, ''x''<sub>6</sub> = 6.2. For the histogram, first the horizontal axis is divided into sub-intervals or bins which cover the range of the data: in this case, 6 bins each of width 2. Whenever a data point falls inside a bin, we place a box of height 1/12 there. If more than one data point falls inside the same bin, we stack the boxes on top of each other.

For the kernel density estimate, we place a normal kernel with variance 2.25 (indicated by the red dashed lines) on each of the data points ''x<sub>i</sub>''. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.<ref>{{cite journal |author=Scott, D. |title=On optimal and data-based histograms |journal=Biometrika |year=1979 |volume=66 |pages=605–610 |doi=10.1093/biomet/66.3.605 |issue=3}}</ref>

[[File:Comparison of 1D histogram and KDE.png|thumb|center|500px|alt=Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the rug plot on the horizontal axis.|Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the rug plot on the horizontal axis.]]

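The construction above can be reproduced in a few lines. This sketch (the grid and names are illustrative) builds both estimates from the six data points, with a normal kernel of standard deviation ''h'' = 1.5, i.e. variance 2.25:

<source lang="python" style="overflow:auto;">
import numpy as np

data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
h = 1.5                                  # kernel standard deviation (variance 2.25)
grid = np.linspace(-6, 10, 400)

# Histogram: six bins of width 2; each point contributes a box of
# height 1/(6*2) = 1/12, so the total area is one.
counts, edges = np.histogram(data, bins=np.arange(-4, 9, 2))
hist_density = counts / (len(data) * 2)

# KDE: centre a normal bump on each data point and average.
bumps = np.exp(-0.5 * ((grid[:, None] - data) / h)**2) / (h * np.sqrt(2 * np.pi))
fhat = bumps.mean(axis=1)
</source>
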
==Bandwidth selection==

[[File:Comparison of 1D bandwidth selectors.png|thumb|Kernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with ''h'' = 0.05. Green: KDE with ''h'' = 2. Black: KDE with ''h'' = 0.337.]]

The bandwidth of the kernel is a [[free parameter]] which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated [[Random number generator|random sample]] from the standard [[normal distribution]] (plotted as the blue spikes in the [[Carpet plot|rug plot]] on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is undersmoothed since it contains too many spurious data artifacts arising from using a bandwidth ''h'' = 0.05, which is too small. The green curve is oversmoothed since using the bandwidth ''h'' = 2 obscures much of the underlying structure. The black curve with a bandwidth of ''h'' = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density.

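The three coloured estimates can be reproduced along the following lines. Note that <code>scipy.stats.gaussian_kde</code> expects the bandwidth as a multiple of the sample standard deviation, so an absolute bandwidth ''h'' must be rescaled; this is a sketch, not the code used to generate the figure.

<source lang="python" style="overflow:auto;">
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

x = np.random.normal(size=100)       # random sample from the standard normal
grid = np.linspace(-4, 4, 400)

for h, colour in [(0.05, 'red'), (0.337, 'black'), (2.0, 'green')]:
    f = gaussian_kde(x, bw_method=h / x.std(ddof=1))   # absolute bandwidth h
    plt.plot(grid, f(grid), color=colour, label='h = %.3f' % h)
plt.legend()
plt.show()
</source>
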
The most common optimality criterion used to select this parameter is the expected ''L''<sub>2</sub> [[risk function]], also termed the [[mean integrated squared error]]:

: <math>\operatorname{MISE} (h) = \operatorname{E} \int (\hat{f}_h(x) - f(x))^2 \, dx.</math>

Under weak assumptions on ''ƒ'' and ''K'',<ref name="Ros1956"/><ref name="Par1962"/> {{nowrap|MISE(''h'') {{=}} AMISE(''h'') + ''o''((''nh'')<sup>−1</sup> + ''h''<sup>4</sup>)}}, where ''o'' is the [[little o notation]]. The AMISE is the asymptotic MISE, which consists of the two leading terms

:<math>\operatorname{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4} m_2(K)^2 h^4 R(f''),</math>

where <math>R(g) = \int g(x)^2 \, dx</math> for a function ''g'', <math>m_2(K) = \int x^2 K(x) \, dx</math>, and ''ƒ″'' is the second derivative of ''ƒ''. The bandwidth minimising the AMISE is found by setting the derivative with respect to ''h'' to zero:

:<math> \frac{\partial}{\partial h} \operatorname{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2 h^3 R(f'') = 0, </math>

which gives

:<math>h_{\operatorname{AMISE}} = \frac{ R(K)^{1/5}}{m_2(K)^{2/5}R(f'')^{1/5} n^{1/5}}.</math>

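Substituting ''h''<sub>AMISE</sub> back into the AMISE expression shows the best value of the criterion attainable with a fixed kernel:

:<math>\operatorname{AMISE}(h_{\operatorname{AMISE}}) = \frac{5}{4} \left( m_2(K)^2 \, R(K)^4 \, R(f'') \right)^{1/5} n^{-4/5}.</math>
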
Neither the AMISE nor the ''h''<sub>AMISE</sub> formulas can be used directly since they involve the unknown density function ''ƒ'' or its second derivative ''ƒ″'', so a variety of automatic, data-based methods have been developed for selecting the bandwidth. Many review studies have been carried out to compare their effectiveness,<ref>{{cite journal |author1=Park, B.U. |author2=Marron, J.S. |year=1990 |title=Comparison of data-driven bandwidth selectors |journal=Journal of the American Statistical Association |volume=85 |issue=409 |pages=66–72 |jstor=2289526 |doi=10.1080/01621459.1990.10475307}}</ref><ref>{{cite journal |author1=Park, B.U. |author2=Turlach, B.A. |year=1992 |title=Practical performance of several data driven bandwidth selectors (with discussion) |journal=Computational Statistics |volume=7 |pages=251–270}}</ref><ref>{{cite journal|author1=Cao, R. |author2=Cuevas, A. |author3=Manteiga, W. G. |year=1994 |title=A comparative study of several smoothing methods in density estimation |journal=Computational Statistics and Data Analysis |volume=17 |pages=153–176 |doi=10.1016/0167-9473(92)00066-Z|issue=2}}</ref><ref>{{cite journal |doi=10.2307/2291420 |author1=Jones, M.C. |author2=Marron, J.S. |author3=Sheather, S. J. |year=1996 |title=A brief survey of bandwidth selection for density estimation| journal=Journal of the American Statistical Association |volume=91 |issue=433 |pages=401–407 |jstor=2291420}}</ref><ref>{{cite journal |author=Sheather, S.J. |year=1992 |title=The performance of six popular bandwidth selection methods on some real data sets (with discussion) |journal=Computational Statistics |volume=7 |pages=225–250, 271–281}}</ref> with the general consensus that the plug-in selectors<ref name="SJ91">{{cite journal |author1=Sheather, S.J. |author2=Jones, M.C. |year=1991 |title=A reliable data-based bandwidth selection method for kernel density estimation |journal=Journal of the Royal Statistical Society, Series B |volume=53 |issue=3 |pages=683–690 |jstor=2345597}}</ref> and [[cross validation]] selectors<ref>{{cite journal |author=Rudemo, M. |year=1982 |title=Empirical choice of histograms and kernel density estimators |journal=Scandinavian Journal of Statistics |volume=9 |issue=2 |pages=65–78 |jstor=4615859}}</ref><ref>{{cite journal |author=Bowman, A.W. |year=1984 |title=An alternative method of cross-validation for the smoothing of density estimates |journal=Biometrika |volume=71 |pages=353–360 |doi=10.1093/biomet/71.2.353 |issue=2}}</ref><ref>{{cite journal |author1=Hall, P. |author2=Marron, J.S. |author3=Park, B.U. |year=1992 |title=Smoothed cross-validation |journal=Probability Theory and Related Fields |volume=92 |pages=1–20 |doi=10.1007/BF01205233}}</ref> are the most useful over a wide range of data sets.

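As an illustration of the cross-validation family of selectors, the least-squares criterion of Rudemo and Bowman can be written in closed form for the normal kernel. The following sketch (the function name is illustrative) evaluates it; the selected bandwidth is the ''h'' minimising this score over a grid of candidate values.

<source lang="python" style="overflow:auto;">
import numpy as np
from scipy.stats import norm

def lscv_score(x, h):
    """Least-squares cross-validation criterion for a normal kernel."""
    n = len(x)
    d = np.subtract.outer(x, x)                  # pairwise differences x_i - x_j
    # integral of fhat^2: the convolution of two normal kernels is N(0, 2h^2)
    term1 = norm.pdf(d, scale=np.sqrt(2) * h).sum() / n**2
    # leave-one-out density estimates exclude the diagonal terms i = j
    loo = norm.pdf(d, scale=h)
    np.fill_diagonal(loo, 0.0)
    term2 = 2.0 * loo.sum() / (n * (n - 1))
    return term1 - term2
</source>
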
Substituting any bandwidth ''h'' which has the same asymptotic order ''n''<sup>−1/5</sup> as ''h''<sub>AMISE</sub> into the AMISE gives AMISE(''h'') = ''O''(''n''<sup>−4/5</sup>), where ''O'' is the [[big o notation]]. It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator.<ref>{{Cite journal|doi=10.1214/aos/1176342997|last=Wahba|first=G.|title=Optimal convergence properties of variable knot, kernel, and orthogonal series methods for density estimation|url=http://projecteuclid.org/euclid.aos/1176342997|journal=[[Annals of Statistics]]|year=1975|volume=3|issue=1|pages=15–29}}</ref> Note that the ''n''<sup>−4/5</sup> rate is slower than the typical ''n''<sup>−1</sup> convergence rate of parametric methods.

If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (pointwise estimator), this produces a particularly powerful method termed [[variable kernel density estimation|adaptive or variable bandwidth kernel density estimation]].

=== Practical estimation of the bandwidth ===

If Gaussian basis functions are used to approximate [[univariate]] data, and the underlying density being estimated is Gaussian, then it can be shown that the optimal choice for ''h'' is<ref name="SI1998">{{Cite book| last=Silverman |first= B.W. | authorlink = Bernard Silverman |title=Density Estimation for Statistics and Data Analysis |publisher=Chapman & Hall/CRC |location=London |year=1998 |isbn=0-412-24620-1| page=48}}</ref>

:<math>h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{\frac{1}{5}} \approx 1.06 \hat{\sigma} n^{-1/5},</math>

where <math>\hat{\sigma}</math> is the standard deviation of the samples. This approximation is termed the ''normal distribution approximation'', Gaussian approximation, or ''[[Bernard Silverman|Silverman]]'s rule of thumb''.

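This rule is ''h''<sub>AMISE</sub> with the normal-kernel constants substituted in: for the standard normal kernel <math>R(K) = 1/(2\sqrt{\pi})</math> and <math>m_2(K) = 1</math>, and for a normal density with standard deviation ''σ'', <math>R(f'') = 3/(8\sqrt{\pi}\sigma^5)</math>. A minimal sketch (the function name is illustrative):

<source lang="python" style="overflow:auto;">
import numpy as np

def silverman_bandwidth(x):
    """Rule-of-thumb bandwidth (4 sigma^5 / (3 n))^(1/5)."""
    x = np.asarray(x)
    sigma = x.std(ddof=1)                    # sample standard deviation
    return (4 * sigma**5 / (3 * len(x))) ** 0.2
</source>
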
== Relation to the characteristic function density estimator ==

Given the sample (''x''<sub>1</sub>, ''x''<sub>2</sub>, …, ''x<sub>n</sub>''), it is natural to estimate the [[characteristic function (probability theory)|characteristic function]] {{nowrap|''φ''(''t'') {{=}} E[''e''<sup>''itX''</sup>]}} as

: <math>\hat\varphi(t) = \frac{1}{n} \sum_{j=1}^n e^{itx_j}.</math>

Knowing the characteristic function, it is possible to find the corresponding probability density function through the [[inverse Fourier transform]] formula. One difficulty with applying this inversion formula is that it leads to a diverging integral, since the estimate <math style="vertical-align:-.3em">\scriptstyle\hat\varphi(t)</math> is unreliable for large ''t''’s. To circumvent this problem, the estimator <math style="vertical-align:-.3em">\scriptstyle\hat\varphi(t)</math> is multiplied by a damping function {{nowrap|''ψ<sub>h</sub>''(''t'') {{=}} ''ψ''(''ht'')}}, which is equal to 1 at the origin and falls to 0 at infinity. The “bandwidth parameter” ''h'' controls how fast we try to dampen the function <math style="vertical-align:-.3em">\scriptstyle\hat\varphi(t)</math>. In particular, when ''h'' is small, ''ψ<sub>h</sub>''(''t'') will be approximately one for a large range of ''t''’s, which means that <math style="vertical-align:-.3em">\scriptstyle\hat\varphi(t)</math> remains practically unaltered in the most important region of ''t''’s.

The most common choice for function ''ψ'' is either the uniform function {{nowrap|''ψ''(''t'') {{=}} '''1'''{−1 ≤ ''t'' ≤ 1}}}, which effectively means truncating the interval of integration in the inversion formula to {{nowrap|[−1/''h'', 1/''h'']}}, or the [[gaussian function]] {{nowrap|''ψ''(''t'') {{=}} ''e''<sup>''−π t''<sup>2</sup></sup>}}. Once the function ''ψ'' has been chosen, the inversion formula may be applied, and the density estimator will be

: <math>\begin{align}
    \hat{f}(x) &= \frac{1}{2\pi} \int_{-\infty}^{+\infty} \hat\varphi(t)\psi_h(t) e^{-itx}\,dt
                = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \frac{1}{n} \sum_{j=1}^n e^{it(x_j-x)} \psi(ht) \,dt \\
               &= \frac{1}{nh} \sum_{j=1}^n \frac{1}{2\pi} \int_{-\infty}^{+\infty} e^{-i(ht)\frac{x-x_j}{h}} \psi(ht) \,d(ht)
                = \frac{1}{nh} \sum_{j=1}^n K\Big(\frac{x-x_j}{h}\Big),
  \end{align}</math>

where ''K'' is the inverse Fourier transform of the damping function ''ψ''. Thus the kernel density estimator coincides with the characteristic function density estimator.

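The equivalence can be checked numerically. The sketch below (the names and the truncation of the integration range are illustrative) inverts the damped empirical characteristic function by direct quadrature, using the damping ''ψ''(''t'') = ''e''<sup>−''t''<sup>2</sup>/2</sup>, whose inverse Fourier transform is the standard normal kernel; the result agrees with the ordinary Gaussian kernel density estimate up to the numerical integration error.

<source lang="python" style="overflow:auto;">
import numpy as np

def kde_via_cf(data, h, grid):
    """Damp and invert the empirical characteristic function."""
    t = np.linspace(-20.0 / h, 20.0 / h, 2001)         # integration grid in t
    ecf = np.exp(1j * np.outer(t, data)).mean(axis=1)  # empirical phi_hat(t)
    damped = ecf * np.exp(-0.5 * (h * t) ** 2)         # multiply by psi(h t)
    integrand = damped[:, None] * np.exp(-1j * np.outer(t, grid))
    return np.trapz(integrand, t, axis=0).real / (2 * np.pi)
</source>
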
==Statistical implementation==

A non-exhaustive list of software implementations of kernel density estimators includes:

* In [[Analytica (software)|Analytica]] release 4.4, the ''Smoothing'' option for PDF results uses KDE, and from expressions it is available via the built-in <code>Pdf</code> function.
* In [[C (programming language)|C]]/[[C++]], [http://www.umiacs.umd.edu/~morariu/figtree/ FIGTree] is a library that can be used to compute kernel density estimates using normal kernels. A MATLAB interface is available.
* In [[C++]], [http://libagf.sf.net libagf] is a library for [[variable kernel density estimation]].
* In [[CrimeStat]], kernel density estimation is implemented using five different kernel functions – normal, uniform, quartic, negative exponential, and triangular. Both single- and dual-kernel density estimate routines are available. Kernel density estimation is also used in interpolating a Head Bang routine, in estimating a two-dimensional Journey-to-crime density function, and in estimating a three-dimensional Bayesian Journey-to-crime estimate.
* In [[ESRI]] products, kernel density mapping is managed out of the Spatial Analyst toolbox and uses the Epanechnikov kernel.
* In [[Microsoft Excel|Excel]], the Royal Society of Chemistry has created an add-in to run kernel density estimation based on their [http://www.rsc.org/Membership/Networking/InterestGroups/Analytical/AMC/Software/kerneldensities.asp Analytical Methods Committee Technical Brief 4].
* In [[gnuplot]], kernel density estimation is implemented by the <code>smooth kdensity</code> option; the datafile can contain a weight and bandwidth for each point, or the bandwidth can be set automatically<ref>{{cite book |last=Janert |first=Philipp K |title=Gnuplot in action : understanding data with graphs |year=2009 |publisher=Manning Publications |location=Connecticut, USA |isbn=978-1-933988-39-9 }} See section 13.2.2 entitled ''Kernel density estimates''.</ref> according to "Silverman's rule of thumb" (see above).
* In [[Haskell (programming language)|Haskell]], kernel density is implemented in the [http://hackage.haskell.org/package/statistics statistics] package.
* In [[Java (programming language)|Java]], the [[Weka (machine learning)|Weka]] package provides [http://weka.sourceforge.net/doc.stable/weka/estimators/KernelEstimator.html weka.estimators.KernelEstimator], among others.
* In [[JavaScript]], the visualization package [[D3js|D3.js]] offers a KDE package in its science.stats package.
* In [[JMP (statistical software)|JMP]], the Fit Y by X platform can be used to estimate univariate and bivariate kernel densities.
* In [[MATLAB]], kernel density estimation is implemented through the <code>ksdensity</code> function (Statistics Toolbox). This function does not provide an automatic data-driven bandwidth but uses a [[rule of thumb]], which is optimal only when the target density is normal. A free MATLAB software package which implements an automatic bandwidth selection method<ref name="bo10">{{Cite journal |author1=Botev, Z.I. |author2=Grotowski, J.F. |author3=Kroese, D.P. |title=Kernel density estimation via diffusion |journal=[[Annals of Statistics]] |volume= 38 |issue=5 |pages=2916–2957 |year=2010 |doi=10.1214/10-AOS799}}</ref> is available from the MATLAB Central File Exchange for [http://www.mathworks.com/matlabcentral/fileexchange/14034 1-dimensional data] and for [http://www.mathworks.com/matlabcentral/fileexchange/17204 2-dimensional data].
* In [[Mathematica]], numeric kernel density estimation is implemented by the function [http://reference.wolfram.com/mathematica/ref/SmoothKernelDistribution.html <code>SmoothKernelDistribution</code>] and symbolic estimation is implemented by the function [http://reference.wolfram.com/mathematica/ref/KernelMixtureDistribution.html <code>KernelMixtureDistribution</code>], both of which provide data-driven bandwidths.
* In [[Minitab]], the Royal Society of Chemistry has created a macro to run kernel density estimation based on their [http://www.rsc.org/Membership/Networking/InterestGroups/Analytical/AMC/Software/kerneldensities.asp Analytical Methods Committee Technical Brief 4].
* In the [[NAG Numerical Library|NAG Library]], kernel density estimation is implemented via the <code>g10ba</code> routine (available in both the Fortran<ref>{{cite web |last=The Numerical Algorithms Group |title=NAG Library Routine Document: nagf_smooth_kerndens_gauss (g10baf) |work=NAG Library Manual, Mark 23 |url=http://www.nag.co.uk/numeric/fl/nagdoc_fl23/pdf/G10/g10baf.pdf |accessdate=2012-02-16 }}</ref> and the C<ref>{{cite web |last=The Numerical Algorithms Group |title=NAG Library Routine Document: nag_kernel_density_estim (g10bac) |work=NAG Library Manual, Mark 9 |url=http://www.nag.co.uk/numeric/CL/nagdoc_cl09/pdf/G10/g10bac.pdf |accessdate=2012-02-16 }}</ref> versions of the Library).
* In [[GNU Octave|Octave]], kernel density estimation is implemented by the <code>kernel_density</code> option (econometrics package).
* In [[Perl]], an implementation can be found in the [http://search.cpan.org/~janert/Statistics-KernelEstimation-0.05 Statistics-KernelEstimation module].
* In [[Python (programming language)|Python]], the [[SciPy]] function [http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html#scipy.stats.gaussian_kde scipy.stats.gaussian_kde] performs Gaussian kernel density estimation in arbitrary dimensions, including automatic bandwidth estimation.
* In [[R (programming language)|R]], it is implemented through the <code>density</code> function and the <code>bkde</code> function in the [http://cran.r-project.org/web/packages/KernSmooth/index.html KernSmooth library] (both included in the base distribution), the <code>kde</code> function in the [http://cran.r-project.org/web/packages/ks/index.html ks library], the <code>dkden</code> and <code>dbckden</code> functions in the [http://cran.r-project.org/web/packages/evmix/index.html evmix library] (the latter for boundary-corrected kernel density estimation for bounded support), the <code>npudens</code> function in the [http://cran.r-project.org/web/packages/np/index.html np library] (numeric and categorical data), and the <code>sm.density</code> function in the [http://cran.r-project.org/web/packages/sm/index.html sm library]. For an implementation of the <code>kde.R</code> function, which does not require installing any packages or libraries, see [http://www-etud.iro.umontreal.ca/~botev/kde.R kde.R].
* In [[SAS (software)|SAS]], <code>proc kde</code> can be used to estimate univariate and bivariate kernel densities.
* In [[Stata]], it is implemented through <code>kdensity</code>; for example <code>histogram x, kdensity</code>. Alternatively a free Stata module, KDENS, is available from [http://ideas.repec.org/c/boc/bocode/s456410.html here] allowing a user to estimate 1D or 2D density functions.

===Example in MATLAB-Octave===

[[File:1D kernel density estimate.svg|thumb|alt=Kernel density estimate of synthetic data|Kernel density estimate of synthetic data.]]

For this example, the data are a synthetic sample of 50 points drawn from the standard normal and 50 points from a normal distribution with mean 3.5 and variance 1. The automatic bandwidth selection and density estimation with normal kernels is carried out by [http://www.mathworks.com/matlabcentral/fileexchange/14034 kde.m]. This function implements an automatic bandwidth selector that does not rely on the commonly used Gaussian plug-in [[rule of thumb]] heuristic.<ref name="bo10"/>

<source lang="matlab" style="overflow:auto;">
randn('seed', 8192);                  % fix the seed for reproducibility
x = [randn(50,1); randn(50,1)+3.5];   % 50 points from N(0,1), 50 from N(3.5,1)
[h, fhat, xgrid] = kde(x, 401);       % bandwidth, density estimate and grid
figure;
hold on;
plot(xgrid, fhat, 'linewidth', 2, 'color', 'black');   % density estimate
plot(x, zeros(100,1), 'b+');                           % data points
xlabel('x')
ylabel('Density function')
hold off;
</source>

===Example in R===

[[File:Old Faithful Geyser (waiting time) KDE with plugin bandwidth.png|thumb|alt=Kernel density estimate of waiting times of the Old Faithful Geyser|Kernel density estimate of waiting times of the Old Faithful Geyser.]]

This example is based on the [[Old Faithful Geyser]], a tourist attraction located in Yellowstone National Park. This famous dataset, included in the base distribution of R, contains 272 records with two variables: eruption duration and waiting time until the next eruption, both in minutes. We analyse the waiting times using the <code>KernSmooth</code> library: its <code>dpik</code> function implements the plug-in bandwidth selector,<ref name="SJ91"/> and its <code>bkde</code> function computes the kernel density estimate with the normal kernel. The <code>rug</code> function adds the data points as a rug plot on the horizontal axis. The bimodal structure in the density estimate of the waiting times is clearly seen, in contrast to the rug plot where this structure is not apparent.

<source lang="rsplus" style="overflow:auto;">
library(KernSmooth)
attach(faithful)
h <- dpik(waiting)                      # plug-in bandwidth selector
fhat <- bkde(x=waiting, bandwidth=h)    # density estimate with normal kernel
plot(fhat, xlab="x", ylab="Density function", type="l")
rug(waiting)                            # data points as a rug plot
</source>

===Example in Python===

[[File:Plot of kernel density estimate using SciPy.png|thumb|alt=Using <code>gaussian_kde</code> in the SciPy package for Python to generate a kernel density estimate of data sampled from a mixture of normals |Kernel density estimate of a mixture of normals.]]

To demonstrate how kernel density estimation is performed in Python, we simulate some data from a mixture of normals, where 50 observations are generated from a normal distribution with mean zero and standard deviation 3 and another 50 from a normal with mean 4 and standard deviation 1.

<source lang="python" style="overflow:auto;">
import numpy as np

x1 = np.random.normal(0, 3, 50)   # 50 points with mean 0, standard deviation 3
x2 = np.random.normal(4, 1, 50)   # 50 points with mean 4, standard deviation 1
x = np.r_[x1, x2]                 # concatenate into one sample of size 100
</source>

| | |
| The <code>gaussian_kde</code> function from the SciPy package implements a kernel-density estimate using Gaussian kernels, and includes automatic determination of bandwidth. By default, <code>gaussian_kde</code> uses Scott's rule to select the appropriate bandwidth.<ref>[http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html scipy.stats.gaussian_kde], SciPy.org</ref>
| |
| | |
<source lang="python" style="overflow:auto;">
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

density = gaussian_kde(x)                     # bandwidth chosen by Scott's rule
xgrid = np.linspace(x.min(), x.max(), 100)    # evaluation grid
plt.hist(x, bins=8, density=True)             # normalised histogram of the sample
plt.plot(xgrid, density(xgrid), 'r-')         # kernel density estimate
plt.show()
</source>

The plot shows a histogram of the simulated data together with a red line showing the Gaussian KDE.

==See also==
{{Commons category|Kernel density estimation}}
*[[Kernel (statistics)]]
*[[Kernel smoothing]]
*[[Kernel regression]]
*[[Density estimation]] (with presentation of other examples)
*[[Mean-shift]]
*[[Scale space]]: the triplets {(''x'', ''h'', KDE with bandwidth ''h'' evaluated at ''x''): all ''x'', ''h'' > 0} form a [[scale space]] representation of the data.
*[[Multivariate kernel density estimation]]
*[[Variable kernel density estimation]]

==References==
{{Reflist}}

==External links==
* [http://www.mvstat.net/tduong/research/seminars/seminar-2001-05 Introduction to kernel density estimation] – a short tutorial which motivates kernel density estimators as an improvement over histograms.
* [http://2000.jukuin.keio.ac.jp/shimazaki/res/kernel.html Kernel Bandwidth Optimization] – a free online tool that generates an optimized kernel density estimate of your data.
* [http://www.wessa.net/rwasp_density.wasp Free Online Software (Calculator)] – computes the kernel density estimation for any data series according to the following kernels: Gaussian, Epanechnikov, Rectangular, Triangular, Biweight, Cosine, and Optcosine.
* [http://pcarvalho.com/things/kerneldensityestimation/index.html Kernel Density Estimation Applet] – an online interactive example of kernel density estimation. Requires .NET 3.0 or later.

{{DEFAULTSORT:Kernel density estimation}}
[[Category:Estimation of densities]]
[[Category:Non-parametric statistics]]