| {{for|the signal processing concept|spectral density estimation}}
| {{merge to|Kernel density estimation|date=September 2013}}
{{Multiple issues|
{{howto|date=August 2012}}
{{refimprove|date=August 2012}}
}}
[[File:KernelDensityGaussianAnimated.gif|thumb|350px|Demonstration of density estimation using [[kernel smoothing]]: The true density is a mixture of two Gaussians centered around 0 and 3, shown with a solid blue curve. In each frame, 100 samples are generated from the distribution, shown in red. Centered on each sample, a Gaussian kernel is drawn in gray. Averaging the Gaussians yields the density estimate shown in the dashed black curve.]]

In [[probability]] and [[statistics]], '''density estimation''' is the construction of an estimate, based on observed [[data]], of an unobservable underlying [[probability density function]]. The unobservable density function is thought of as the density according to which a large population is distributed; the data are usually thought of as a random sample from that population.

A variety of approaches to density estimation are used, including [[Parzen window]]s and a range of [[data clustering]] techniques, including [[vector quantization]]. The most basic form of density estimation is a rescaled [[histogram]].
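
As a minimal sketch of the rescaled histogram, bin counts can be divided by the sample size times the bin width so that the bar areas sum to one. The simulated sample, the bin width of 0.5, and all variable names below are illustrative assumptions, not part of the example in this article.

<source lang="rsplus">
set.seed(1)
# Hypothetical sample from a mixture of two Gaussians centered at 0 and 3
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))

width  <- 0.5
breaks <- seq(floor(min(x)), ceiling(max(x)), by = width)
counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))

# Rescale the counts so that the histogram integrates to one
dens <- as.numeric(counts) / (length(x) * width)
sum(dens * width)            # total area under the histogram (1 up to rounding)

# hist() reports the same rescaled values in its 'density' component
h <- hist(x, breaks = breaks, plot = FALSE)
all.equal(dens, h$density)
</source>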
== Example of density estimation ==

We will consider records of the incidence of [[diabetes]]. The following is quoted verbatim from the [[data set]] description:

:A population of women who were at least 21 years old, of [[Pima people|Pima]] Indian heritage and living near Phoenix, Arizona, was tested for [[diabetes mellitus]] according to [[World Health Organization]] criteria. The data were collected by the US National Institute of Diabetes and Digestive and Kidney Diseases. We used the 532 complete records.<ref>{{cite web|url=http://stat.ethz.ch/R-manual/R-patched/library/MASS/html/Pima.tr.html|title=Diabetes in Pima Indian Women - R documentation}}</ref><ref>{{cite journal|author=Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S.|year=1988|title=Using the ADAP learning algorithm to forecast the onset of diabetes mellitus|journal=Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988)|editor=R. A. Greenes|pages=261–265|place=Los Alamitos, CA|publisher=IEEE Computer Society Press|url=http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2245318/}}</ref>

In this example, we construct three density estimates for "glu" ([[Blood plasma|plasma]] [[glucose]] concentration), one [[Conditional probability|conditional]] on the presence of diabetes, the second conditional on the absence of diabetes, and the third not conditional on diabetes. The conditional density estimates are then used to construct the probability of diabetes conditional on "glu".

The "glu" data were obtained from the MASS package<ref>{{cite web|url=http://cran.r-project.org/web/packages/MASS/index.html|title=Support Functions and Datasets for Venables and Ripley's MASS}}</ref> of the [[R programming language]]. Within R, <tt>?Pima.tr</tt> and <tt>?Pima.te</tt> give a fuller account of the data.

The [[mean]] of "glu" in the diabetes cases is 143.1 and the standard deviation is 31.26. The mean of "glu" in the non-diabetes cases is 110.0 and the standard deviation is 24.29. From this we see that, in this data set, diabetes cases are associated with greater levels of "glu". This will be made clearer by plots of the estimated density functions.
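
The group summaries quoted above can be recomputed from the data. The following sketch assumes that the MASS package is installed; the variable names <tt>glu.mean</tt> and <tt>glu.sd</tt> are illustrative.

<source lang="rsplus">
library(MASS)                       # provides Pima.tr and Pima.te
Pima <- rbind(Pima.tr, Pima.te)     # combine the two subsets

# Mean and standard deviation of glu within each diabetes group
glu.mean <- tapply(Pima$glu, Pima$type, mean)
glu.sd   <- tapply(Pima$glu, Pima$type, sd)
round(cbind(mean = glu.mean, sd = glu.sd), 2)
</source>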

The first figure shows density estimates of ''p''(glu | diabetes=1), ''p''(glu | diabetes=0), and ''p''(glu). The density estimates are kernel density estimates using a Gaussian kernel. That is, a Gaussian density function is placed at each data point, and the sum of the density functions, divided by the number of data points so that the estimate integrates to one, is computed over the range of the data.
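
This construction can be sketched directly. The code below is illustrative only: the simulated sample, the evaluation grid, and the helper name <tt>kde.gauss</tt> are assumptions made for this sketch. It places a Gaussian density at each data point, averages them, and compares the result with R's built-in <tt>density()</tt> at the same bandwidth.

<source lang="rsplus">
set.seed(1)
x <- rnorm(100)          # hypothetical sample
h <- bw.nrd0(x)          # default bandwidth rule used by density()

# Average of Gaussian densities centered at the data points, each with sd = h
kde.gauss <- function(grid, x, h)
    sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))

grid   <- seq(min(x), max(x), length.out = 50)
manual <- kde.gauss(grid, x, h)

# density() computes a binned approximation on its own grid, so
# interpolate it onto 'grid' before comparing
d       <- density(x, bw = h)
builtin <- approx(d$x, d$y, xout = grid)$y
max(abs(manual - builtin))   # largest discrepancy between the two estimates
</source>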

[[File:P glu given diabetes.png|thumb|center|360px|Estimated density of ''p'' (glu | diabetes=1) (red), ''p'' (glu | diabetes=0) (blue), and ''p'' (glu) (black)]]

From the density of "glu" conditional on diabetes, we can obtain the probability of diabetes conditional on "glu" via [[Bayes' rule]]. For brevity, "diabetes" is abbreviated "db." in this formula.

:<math> p(\mbox{diabetes}=1|\mbox{glu})
= \frac{p(\mbox{glu}|\mbox{db.}=1)\,p(\mbox{db.}=1)}{p(\mbox{glu}|\mbox{db.}=1)\,p(\mbox{db.}=1) + p(\mbox{glu}|\mbox{db.}=0)\,p(\mbox{db.}=0)}
</math>
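As a rough numerical illustration of this formula, the sketch below replaces the kernel estimates with normal densities that have the group means and standard deviations quoted above, and assumes a hypothetical base rate of one third. Both substitutions are simplifying assumptions made here for illustration; the figures in this article use kernel density estimates and the observed base rate.

<source lang="rsplus">
# Normal approximations to p(glu | db.), using the summaries quoted above
p.glu.d1 <- function(glu) dnorm(glu, mean = 143.1, sd = 31.26)
p.glu.d0 <- function(glu) dnorm(glu, mean = 110.0, sd = 24.29)

prior.d1 <- 1/3   # hypothetical base rate p(db. = 1), assumed for illustration

# Bayes' rule: posterior probability of diabetes given glu
posterior.d1 <- function(glu) {
    num <- p.glu.d1(glu) * prior.d1
    num / (num + p.glu.d0(glu) * (1 - prior.d1))
}

posterior.d1(c(100, 150, 200))
</source>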
The second figure shows the estimated posterior probability ''p''(diabetes=1 | glu). From these data, it appears that an increased level of "glu" is associated with diabetes.

[[File:P diabetes given glu.png|thumb|center|360px|Estimated probability of ''p''(diabetes=1 | glu)]]

=== Script for example ===

The following R commands will create the figures shown above. These commands can be entered at the command prompt by using copy and paste.
| <source lang="rsplus">
library(MASS)   # provides the Pima.tr and Pima.te data sets
data(Pima.tr)
data(Pima.te)

# Combine the two subsets into a single data frame
Pima <- rbind(Pima.tr, Pima.te)
glu  <- Pima[, 'glu']

# Logical indicators for absence (d0) and presence (d1) of diabetes
d0 <- Pima[, 'type'] == 'No'
d1 <- Pima[, 'type'] == 'Yes'
base.rate.d1 <- sum(d1) / (sum(d1) + sum(d0))

# Gaussian kernel density estimates (Gaussian is the default kernel)
glu.density    <- density(glu)
glu.d0.density <- density(glu[d0])
glu.d1.density <- density(glu[d1])

# Interpolating functions for the two conditional estimates;
# they return NA outside the range on which each estimate was computed
glu.d0.f <- approxfun(glu.d0.density$x, glu.d0.density$y)
glu.d1.f <- approxfun(glu.d1.density$x, glu.d1.density$y)

# Posterior probability of diabetes given glu, via Bayes' rule
p.d.given.glu <- function(glu, base.rate.d1)
{
    p1 <- glu.d1.f(glu) * base.rate.d1
    p0 <- glu.d0.f(glu) * (1 - base.rate.d1)
    p1 / (p0 + p1)
}

# Second figure: estimated p(diabetes | glu)
x <- 1:250
y <- p.d.given.glu(x, base.rate.d1)
plot(x, y, type='l', col='red', xlab='glu', ylab='estimated p(diabetes|glu)')

# First figure: estimated densities of glu
plot(glu.d0.density, col='blue', xlab='glu',
     ylab='estimate p(glu), p(glu|diabetes), p(glu|not diabetes)', main=NA)
lines(glu.d1.density, col='red')
lines(glu.density, col='black')
| </source>
Note that the above conditional density estimator uses bandwidths that are optimal for unconditional densities. Alternatively, one could use the method of Hall, Racine and Li (2004)<ref name=hallracineli/> and the R np package<ref>{{cite web|url=http://cran.r-project.org/web/packages/np/index.html|title=The np package - An R package that provides a variety of nonparametric and semiparametric kernel methods that seamlessly handle a mix of continuous, unordered, and ordered factor data types}}</ref> for automatic (data-driven) bandwidth selection that is optimal for conditional density estimates; see the np vignette<ref>{{cite web|url=http://cran.r-project.org/web/packages/np/vignettes/np.pdf|title=The np Package|author=Tristen Hayfield and Jeffrey S. Racine}}</ref> for an introduction to the np package. The following R commands use the <tt>npcdens()</tt> function to deliver optimal smoothing. Note that the response "Yes"/"No" is a factor.

| <source lang="rsplus">
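library(np)

# Conditional density of type given glu, with data-driven bandwidths;
# uses the Pima data frame built in the previous script
fy.x <- npcdens(type~glu, nmulti=1, data=Pima)

# Evaluation grid: type fixed at "Yes", glu spanning its observed range
Pima.eval <- data.frame(type=factor("Yes"),
                        glu=seq(min(Pima$glu), max(Pima$glu), length.out=250))

# Estimate from the previous script (x and y), drawn as a dashed red line
plot(x, y, type='l', lty=2, col='red', xlab='glu',
     ylab='estimated p(diabetes|glu)')

# Estimate from npcdens(), drawn as a solid blue line
lines(Pima.eval$glu, predict(fy.x, newdata=Pima.eval), col="blue")

legend(0, 1, c("Unconditional bandwidth", "Conditional bandwidth"),
       col=c("red","blue"), lty=c(2,1))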
| </source>
The third figure uses optimal smoothing via the method of Hall, Racine, and Li,<ref name=hallracineli>{{cite journal|author=Peter Hall, Jeffrey S. Racine and Qi Li|title=Cross-Validation and the Estimation of Conditional Probability Densities|journal=Journal of the American Statistical Association|volume=99|issue=468|pages=1015–1026|year=2004|url=http://econpapers.repec.org/article/besjnlasa/v_3a99_3ay_3a2004_3ap_3a1015-1026.htm}}</ref> indicating that the unconditional density bandwidth used in the second figure above yields a conditional density estimate that may be somewhat undersmoothed.

[[File:Glu opt.png|thumb|center|360px|Estimated probability of ''p'' (diabetes=1 | glu)]]
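The effect of the bandwidth on smoothness can be explored with the <tt>adjust</tt> argument of <tt>density()</tt>, which multiplies the selected bandwidth. The sketch below is illustrative only: it assumes that the MASS package is installed, the factor of 2 is an arbitrary choice rather than an optimal bandwidth, and it is not the method of Hall, Racine and Li.

<source lang="rsplus">
library(MASS)
glu <- rbind(Pima.tr, Pima.te)$glu

d.default <- density(glu)               # default bandwidth
d.smooth  <- density(glu, adjust = 2)   # twice the default bandwidth

c(default = d.default$bw, smoother = d.smooth$bw)

plot(d.default, main = NA, xlab = 'glu')
lines(d.smooth, lty = 2)                # smoother estimate, dashed
</source>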

== See also ==
* [[Kernel density estimation]]
* [[Mean integrated squared error]]
* [[Histogram]]
* [[Multivariate kernel density estimation]]
* [[Spectral density estimation]]
* [[Kernel embedding of distributions]]

== References ==
{{reflist}}
'''Sources'''
* {{cite book|author=Brian D. Ripley|title=Pattern Recognition and Neural Networks|place=Cambridge|publisher=Cambridge University Press|year=1996|url=http://books.google.de/books/about/Pattern_Recognition_and_Neural_Networks.html?hl=de&id=2SzT2p8vP1oC|isbn=978-0521460866}}
* [[Trevor Hastie]], [[Robert Tibshirani]], and Jerome Friedman. ''The Elements of Statistical Learning''. New York: Springer, 2001. ISBN 0-387-95284-5. ''(See Chapter 6.)''
* Qi Li and Jeffrey S. Racine. ''Nonparametric Econometrics: Theory and Practice''. Princeton University Press, 2007, ISBN 0-691-12161-3. ''(See Chapter 1.)''
* D.W. Scott. ''Multivariate Density Estimation. Theory, Practice and Visualization''. New York: Wiley, 1992.
* [[Bernard Silverman|B.W. Silverman]]. ''Density Estimation''. London: Chapman and Hall, 1986. ISBN 978-0-412-24620-3

== External links ==
* [http://www.creem.st-and.ac.uk/software.php CREEM: Centre for Research Into Ecological and Environmental Modelling] Downloads for free density estimation software packages [http://www.ruwpa.st-and.ac.uk/distance/ ''Distance 4''] (from Research Unit for Wildlife Population Assessment "RUWPA") and [http://www.ruwpa.st-and.ac.uk/estimating.abundance/ ''WiSP''].
* [http://www.ics.uci.edu/~mlearn/MLSummary.html UCI Machine Learning Repository Content Summary] ''(See "Pima Indians Diabetes Database" for the original data set of 768 records, and additional notes.)''
* [http://www.mathworks.com/matlabcentral/fileexchange/authors/27236 Free MATLAB code for one and two dimensional density estimation]
* [http://libagf.sourceforge.net libAGF] C++ software for [[variable kernel density estimation]].

[[Category:Estimation of densities]]
[[Category:Non-parametric statistics]]