|
|
| {{context|date=May 2012}}
| | This is a preview for the new '''MathML rendering mode''' (with SVG fallback), which is available in production for registered users. |
| In [[statistics]] and [[machine learning]], a '''Bayesian interpretation of regularization''' for [[kernel methods]] is often useful. Kernel methods are central to both the [[Regularization (mathematics)|regularization]] and the [[Bayesian probability|Bayesian]] point of view in machine learning. In regularization they are a natural choice for the [[Statistical learning theory#Formal Description|hypothesis space]] and the regularization functional through the notion of [[reproducing kernel Hilbert space]]s. In Bayesian probability they are a key component of [[Gaussian processes]], where the kernel function is known as the covariance function. Kernel methods have traditionally been used in [[supervised learning]] problems where the ''input space'' is usually a ''space of vectors'' while the ''output space'' is a ''space of scalars''. More recently these methods have been extended to problems that deal with [[Kernel methods for vector output|multiple outputs]] such as in [[multi-task learning]].<ref name=AlvRosLaw11>{{cite journal|last=Álvarez|first=Mauricio A.|author2=Rosasco, Lorenzo |author3=Lawrence, Neil D. |title=Kernels for Vector-Valued Functions: A Review|journal=ArXiv e-prints|date=June 2011}}</ref>
| |
|
| |
|
| In this article we analyze the connections between the regularization and the Bayesian point of view for kernel methods in the case of scalar outputs. A mathematical equivalence between the regularization and the Bayesian point of view is easily proved in cases where the reproducing kernel Hilbert space is ''finite-dimensional''. The infinite-dimensional case raises subtle mathematical issues; we will consider here the finite-dimensional case. We start with a brief review of the main ideas underlying kernel methods for scalar learning, and briefly introduce the concepts of regularization and Gaussian processes. We then show how both points of view arrive at essentially equivalent estimators, and show the connection that ties them together.
| | If you would like to use the '''MathML''' rendering mode, you need a Wikipedia user account, which can be registered here: [[https://en.wikipedia.org/wiki/Special:UserLogin/signup]] |
| | * Only registered users will be able to execute this rendering mode. |
| | * Note: you need not enter an email address (nor any other private information). Please do not use a password that you use elsewhere. |
|
| |
|
| ==The Supervised Learning Problem==
| | Registered users will be able to choose between the following three rendering modes: |
|
| |
|
| The classical [[supervised learning]] problem requires estimating the output for some new input point <math>\mathbf{x}'</math> by learning a scalar-valued estimator <math>\hat{f}(\mathbf{x}')</math> on the basis of a training set <math>S</math> consisting of <math>n</math> input-output pairs, <math>S = (\mathbf{X},\mathbf{Y}) = (\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_n,y_n)</math>.<ref name=Vap98>{{cite book|last=Vapnik|first=Vladimir|title=Statistical learning theory|year=1998|publisher=Wiley|isbn=9780471030034|url=http://books.google.com/books?id=GowoAQAAMAAJ&q=statistical+learning+theory&dq=statistical+learning+theory&hl=en&sa=X&ei=HruyT66kOoKhgwf3reSXCQ&ved=0CDsQ6AEwAA}}</ref> Given a symmetric, positive-definite bivariate function <math>k(\cdot,\cdot)</math> called a ''kernel'', one of the most popular estimators in machine learning is given by
| | '''MathML''' |
| | :<math forcemathmode="mathml">E=mc^2</math> |
|
| |
|
| {{NumBlk|:|<math>
| | <!--'''PNG''' (currently default in production) |
| \hat{f}(\mathbf{x}') = \mathbf{k}^\top(\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y},
| | :<math forcemathmode="png">E=mc^2</math> |
| </math>|{{EquationRef|1}}}} | |
|
| |
|
| where <math>\mathbf{K} \equiv k(\mathbf{X},\mathbf{X})</math> is the [[Gramian matrix|kernel matrix]] with entries <math>\mathbf{K}_{ij} = k(\mathbf{x}_i,\mathbf{x}_j)</math>, <math> \mathbf{k} = [k(\mathbf{x}_1,\mathbf{x}'),\ldots,k(\mathbf{x}_n,\mathbf{x}')]^\top</math>, and <math>\mathbf{Y} = [y_1,\ldots,y_n]^\top</math>. We will see how this estimator can be derived both from a regularization and a Bayesian perspective.
| | '''source''' |
| | :<math forcemathmode="source">E=mc^2</math> --> |
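Equation (1) can be evaluated directly once a kernel is chosen. The following is a minimal sketch, assuming a Gaussian (RBF) kernel and synthetic one-dimensional data; the names <code>rbf_kernel</code> and <code>kernel_estimator</code>, the length scale, and the value of <math>\lambda</math> are illustrative choices, not part of the article.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    The RBF kernel is an assumed example of a symmetric positive-definite kernel."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by rounding before exponentiating.
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / length_scale**2)

def kernel_estimator(X, Y, X_new, lam, kernel=rbf_kernel):
    """Evaluate k^T (K + lam * n * I)^{-1} Y at every row of X_new."""
    n = X.shape[0]
    K = kernel(X, X)            # n x n kernel matrix
    k_new = kernel(X, X_new)    # column j holds the vector k for X_new[j]
    c = np.linalg.solve(K + lam * n * np.eye(n), Y)
    return k_new.T @ c

# Synthetic training set (assumption: noisy samples of sin).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(30, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

X_new = np.linspace(-3.0, 3.0, 5)[:, None]
f_hat = kernel_estimator(X, Y, X_new, lam=1e-3)
print(f_hat)
```

Solving the linear system with <code>np.linalg.solve</code> avoids forming the matrix inverse explicitly, which is usually preferable numerically.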
|
| |
|
| ==A Regularization Perspective== | | <span style="color: red">Follow this [https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-rendering link] to change your Math rendering settings.</span> You can also add a [https://en.wikipedia.org/wiki/Special:Preferences#mw-prefsection-rendering-skin Custom CSS] to force the MathML/SVG rendering or select different font families. See [https://www.mediawiki.org/wiki/Extension:Math#CSS_for_the_MathML_with_SVG_fallback_mode these examples]. |
|
| |
|
| The main assumption in the regularization perspective is that the set of functions <math>\mathcal{F}</math> belongs to a reproducing kernel Hilbert space <math>\mathcal{H}_k</math>.<ref name=Vap98 /><ref name=Wah90 /><ref name=SchSmo02>{{cite book|last=Schölkopf|first=Bernhard|title=Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond|year=2002|publisher=MIT Press|isbn=9780262194754|author2=Smola, Alexander J.}}</ref><ref name=GirPog90>{{cite journal|last=Girosi|first=F.|author2=Poggio, T.|title=Networks and the best approximation property|journal=Biological Cybernetics|year=1990|volume=63|issue=3|pages=169–176|publisher=Springer|doi=10.1007/bf00195855}}</ref>
| | ==Demos== |
|
| |
|
| ===Reproducing Kernel Hilbert Space=== | | Here are some [https://commons.wikimedia.org/w/index.php?title=Special:ListFiles/Frederic.wang demos]: |
|
| |
|
| A [[reproducing kernel Hilbert space]] (RKHS) <math>\mathcal{H}_k</math> is a [[Hilbert space]] of functions defined by a [[Symmetry in mathematics|symmetric]], [[positive-definite function]] <math>k : \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}</math> called the ''reproducing kernel'' such that the function <math>k(\mathbf{x},\cdot)</math> belongs to <math>\mathcal{H}_k</math> for all <math>\mathbf{x} \in \mathcal{X}</math>.<ref name=Aro50>{{cite journal|last=Aronszajn|first=N|title=Theory of Reproducing Kernels|journal=Transactions of the American Mathematical Society|date=May 1950|volume=68|issue=3|pages=337–404|doi=10.2307/1990404}}</ref><ref name=Sch64>{{cite journal|last=Schwartz|first=Laurent|title=Sous-espaces hilbertiens d’espaces vectoriels topologiques et noyaux associés (noyaux reproduisants)|journal=Journal d'analyse mathématique|year=1964|volume=13|issue=1|pages=115–256|publisher=Springer|doi=10.1007/bf02786620}}</ref><ref name=CucSma01>{{cite journal|last=Cucker|first=Felipe|author2=Smale, Steve|title=On the mathematical foundations of learning|journal=Bulletin of the American Mathematical Society|date=October 5, 2001|volume=39|issue=1|pages=1–49|doi=10.1090/s0273-0979-01-00923-5}}</ref> Three main properties make an RKHS appealing:
| |
|
| |
|
| 1. The ''reproducing property'', which gives name to the space, | | * accessibility: |
| | ** Safari + VoiceOver: [https://commons.wikimedia.org/wiki/File:VoiceOver-Mac-Safari.ogv video only], [[File:Voiceover-mathml-example-1.wav|thumb|Voiceover-mathml-example-1]], [[File:Voiceover-mathml-example-2.wav|thumb|Voiceover-mathml-example-2]], [[File:Voiceover-mathml-example-3.wav|thumb|Voiceover-mathml-example-3]], [[File:Voiceover-mathml-example-4.wav|thumb|Voiceover-mathml-example-4]], [[File:Voiceover-mathml-example-5.wav|thumb|Voiceover-mathml-example-5]], [[File:Voiceover-mathml-example-6.wav|thumb|Voiceover-mathml-example-6]], [[File:Voiceover-mathml-example-7.wav|thumb|Voiceover-mathml-example-7]] |
| | ** [https://commons.wikimedia.org/wiki/File:MathPlayer-Audio-Windows7-InternetExplorer.ogg Internet Explorer + MathPlayer (audio)] |
| | ** [https://commons.wikimedia.org/wiki/File:MathPlayer-SynchronizedHighlighting-WIndows7-InternetExplorer.png Internet Explorer + MathPlayer (synchronized highlighting)] |
| | ** [https://commons.wikimedia.org/wiki/File:MathPlayer-Braille-Windows7-InternetExplorer.png Internet Explorer + MathPlayer (braille)] |
| | ** NVDA+MathPlayer: [[File:Nvda-mathml-example-1.wav|thumb|Nvda-mathml-example-1]], [[File:Nvda-mathml-example-2.wav|thumb|Nvda-mathml-example-2]], [[File:Nvda-mathml-example-3.wav|thumb|Nvda-mathml-example-3]], [[File:Nvda-mathml-example-4.wav|thumb|Nvda-mathml-example-4]], [[File:Nvda-mathml-example-5.wav|thumb|Nvda-mathml-example-5]], [[File:Nvda-mathml-example-6.wav|thumb|Nvda-mathml-example-6]], [[File:Nvda-mathml-example-7.wav|thumb|Nvda-mathml-example-7]]. |
| | ** Orca: There is ongoing work, but no support at all at the moment [[File:Orca-mathml-example-1.wav|thumb|Orca-mathml-example-1]], [[File:Orca-mathml-example-2.wav|thumb|Orca-mathml-example-2]], [[File:Orca-mathml-example-3.wav|thumb|Orca-mathml-example-3]], [[File:Orca-mathml-example-4.wav|thumb|Orca-mathml-example-4]], [[File:Orca-mathml-example-5.wav|thumb|Orca-mathml-example-5]], [[File:Orca-mathml-example-6.wav|thumb|Orca-mathml-example-6]], [[File:Orca-mathml-example-7.wav|thumb|Orca-mathml-example-7]]. |
| | ** From our testing, ChromeVox and JAWS are not able to read the formulas generated by the MathML mode. |
|
| |
|
| <math>
| | ==Test pages == |
| f(\mathbf{x}) = \langle f,k(\mathbf{x},\cdot) \rangle_k, \quad \forall \ f \in \mathcal{H}_k,
| |
| </math>
| |
|
| |
|
| where <math>\langle \cdot,\cdot \rangle_k</math> is the inner product in <math>\mathcal{H}_k</math>.
| | To test the '''MathML''', '''PNG''', and '''source''' rendering modes, please go to one of the following test pages: |
| | *[[Displaystyle]] |
| | *[[MathAxisAlignment]] |
| | *[[Styling]] |
| | *[[Linebreaking]] |
| | *[[Unique Ids]] |
| | *[[Help:Formula]] |
|
| |
|
| 2. Functions in an RKHS are in the closure of the linear combination of the kernel at given points,
| | *[[Inputtypes|Inputtypes (private Wikis only)]] |
| | | *[[Url2Image|Url2Image (private Wikis only)]] |
| <math>
| | ==Bug reporting== |
| f(\mathbf{x}) = \sum_i k(\mathbf{x}_i,\mathbf{x})c_i
| | If you find any bugs, please report them at [https://bugzilla.wikimedia.org/enter_bug.cgi?product=MediaWiki%20extensions&component=Math&version=master&short_desc=Math-preview%20rendering%20problem Bugzilla], or write an email to math_bugs (at) ckurs (dot) de . |
| </math>.
| |
| | |
| This allows the construction in a unified framework of both linear and generalized linear models.
| |
| | |
| 3. The squared norm of such a function in an RKHS can be written as
| |
| | |
| <math>\|f\|_k^2 = \sum_{i,j} k(\mathbf{x}_i,\mathbf{x}_j) c_i c_j
| |
| </math>
| |
| | |
| and is a natural measure of how ''complex'' the function is.
| |
| | |
| ===The Regularized Functional===
| |
| | |
| The estimator is derived as the minimizer of the regularized functional
| |
| | |
| {{NumBlk|:|<math>
| |
| \frac{1}{n} \sum_{i=1}^{n}(f(\mathbf{x}_i)-y_i)^2 + \lambda \|f\|_k^2,
| |
| </math>|{{EquationRef|2}}}}
| |
| | |
| where <math>f \in \mathcal{H}_k</math> and <math>\|\cdot\|_k</math> is the norm in <math>\mathcal{H}_k</math>. The first term in this functional, which measures the average of the squares of the errors between the <math>f(\mathbf{x}_i)</math> and the <math>y_i</math>, is called the ''empirical risk'' and represents the cost we pay by predicting <math>f(\mathbf{x}_i)</math> for the true value <math>y_i</math>. The second term in the functional is the squared norm in an RKHS multiplied by a weight <math>\lambda</math>; it stabilizes the problem<ref name=Wah90 /><ref name=GirPog90 /> and introduces a trade-off between fitting the data and the complexity of the estimator.<ref name=Vap98 /> The weight <math>\lambda</math>, called the ''regularizer'', determines the degree to which instability and complexity of the estimator should be penalized (higher penalty for increasing value of <math>\lambda</math>).
| |
| | |
| ===Derivation of the Estimator===
| |
| | |
| The explicit form of the estimator in equation ({{EquationNote|1}}) is derived in two steps. First, the representer theorem<ref name=KimWha70>{{cite journal|last=Kimeldorf|first=George S.|author2=Wahba, Grace|title=A correspondence between Bayesian estimation on stochastic processes and smoothing by splines|journal=The Annals of Mathematical Statistics|year=1970|volume=41|issue=2|pages=495–502|doi=10.1214/aoms/1177697089}}</ref><ref name=SchHerSmo01>{{cite journal|last=Schölkopf|first=Bernhard|author2=Herbrich, Ralf |author3=Smola, Alex J. |title=A Generalized Representer Theorem|journal=COLT/EuroCOLT 2001, LNCS|year=2001|volume=2111/2001|pages=416–426|doi=10.1007/3-540-44581-1_27}}</ref><ref name=DevEtal04>{{cite journal|last=De Vito|first=Ernesto|author2=Rosasco, Lorenzo |author3=Caponnetto, Andrea |author4=Piana, Michele |author5= Verri, Alessandro |title=Some Properties of Regularized Kernel Methods|journal=Journal of Machine Learning Research|date=October 2004|volume=5|pages=1363–1390}}</ref> states that the minimizer of the functional ({{EquationNote|2}}) can always be written as a linear combination of the kernels centered at the training-set points,
| |
| | |
| {{NumBlk|:|<math>
| |
| \hat{f}(\mathbf{x}') = \sum_{i=1}^n c_i k(\mathbf{x}_i,\mathbf{x}') = \mathbf{k}^\top \mathbf{c},
| |
| </math>|{{EquationRef|3}}}}
| |
| | |
| for some <math>\mathbf{c} \in \mathbb{R}^n</math>. The explicit form of the coefficients <math>\mathbf{c} = [c_1,\ldots,c_n]^\top</math> can be found by substituting for <math>f(\cdot)</math> in the functional ({{EquationNote|2}}). For a function of the form in equation ({{EquationNote|3}}), we have that
| |
| | |
| <math>\begin{align}
| |
| \|f\|_k^2 & = \langle f,f \rangle_k, \\
| |
| & = \left\langle \sum_{i=1}^n c_i k(\mathbf{x}_i,\cdot), \sum_{j=1}^n c_j k(\mathbf{x}_j,\cdot) \right\rangle_k, \\
| |
| & = \sum_{i=1}^n \sum_{j=1}^n c_i c_j \langle k(\mathbf{x}_i,\cdot), k(\mathbf{x}_j,\cdot) \rangle_k, \\
| |
| & = \sum_{i=1}^n \sum_{j=1}^n c_i c_j k(\mathbf{x}_i,\mathbf{x}_j), \\
| |
| & = \mathbf{c}^\top \mathbf{K} \mathbf{c}.
| |
| \end{align}</math>
| |
| | |
| We can rewrite the functional ({{EquationNote|2}}) as
| |
| | |
| <math>
| |
| \frac{1}{n} \| \mathbf{Y} - \mathbf{K} \mathbf{c} \|^2 + \lambda \mathbf{c}^\top \mathbf{K} \mathbf{c}.
| |
| </math>
| |
| | |
| This functional is convex in <math>\mathbf{c}</math> and therefore we can find its minimum by setting the gradient with respect to <math>\mathbf{c}</math> to zero,
| |
| | |
| <math>\begin{align}
| |
| -\frac{1}{n} \mathbf{K} (\mathbf{Y} - \mathbf{K} \mathbf{c}) + \lambda \mathbf{K} \mathbf{c} & = 0, \\
| |
| (\mathbf{K} + \lambda n \mathbf{I}) \mathbf{c} & = \mathbf{Y}, \\
| |
| \mathbf{c} & = (\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}.
| |
| \end{align}</math>
| |
| | |
| Substituting this expression for the coefficients in equation ({{EquationNote|3}}), we obtain the estimator stated previously in equation ({{EquationNote|1}}),
| |
| | |
| <math>
| |
| \hat{f}(\mathbf{x}') = \mathbf{k}^\top(\mathbf{K} + \lambda n \mathbf{I})^{-1} \mathbf{Y}.
| |
| </math>
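The derivation above can be checked numerically. The sketch below is an illustration under stated assumptions (a Gaussian kernel, random synthetic data, and hypothetical helper names <code>functional</code> and <code>half_gradient</code>): it confirms that the closed-form coefficients make the gradient vanish and that random perturbations of them do not decrease the regularized functional.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 8, 0.1
X = rng.standard_normal((n, 2))
Y = rng.standard_normal(n)

# Gaussian kernel matrix on the training inputs (assumed kernel choice).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq_dists)

def functional(c):
    """(1/n) ||Y - K c||^2 + lam * c^T K c, the functional written in terms of c."""
    residual = Y - K @ c
    return residual @ residual / n + lam * c @ K @ c

def half_gradient(c):
    """Half of the gradient with respect to c, as in the first line of the derivation."""
    return -K @ (Y - K @ c) / n + lam * K @ c

# Closed-form coefficients c = (K + lam * n * I)^{-1} Y.
c_hat = np.linalg.solve(K + lam * n * np.eye(n), Y)

grad_norm = np.linalg.norm(half_gradient(c_hat))
perturbed = [functional(c_hat + 0.1 * rng.standard_normal(n)) for _ in range(5)]

print("gradient norm at c_hat:", grad_norm)
print("functional at c_hat:", functional(c_hat))
print("smallest perturbed value:", min(perturbed))
```

Because the functional is convex in <math>\mathbf{c}</math>, a vanishing gradient is sufficient for a global minimum, which is what the perturbation check illustrates.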
| |
| | |
| ==A Bayesian Perspective==
| |
| | |
| The notion of a kernel plays a crucial role in Bayesian probability as the covariance function of a stochastic process called the ''[[Gaussian process]]''.
| |
| | |
| ===A Review of Bayesian Probability===
| |
| | |
| As part of the Bayesian framework, the Gaussian process specifies the [[Prior probability|''prior distribution'']] that describes the prior beliefs about the properties of the function being modeled. These beliefs are updated after taking into account observational data by means of a [[Likelihood function|''likelihood function'']] that relates the prior beliefs to the observations. Taken together, the prior and likelihood lead to an updated distribution called the [[Posterior probability|''posterior distribution'']] that is customarily used for predicting test cases.
| |
| | |
| ===The Gaussian Process=== | |
| | |
| A [[Gaussian process]] (GP) is a stochastic process in which any finite number of random variables that are sampled follow a joint [[Multivariate normal distribution|Normal distribution]].<ref name=RasWil06 /> The mean vector and covariance matrix of the Gaussian distribution completely specify the GP. GPs are usually used as prior distributions over functions, and as such the mean vector and covariance matrix can be viewed as functions, where the covariance function is also called the ''kernel'' of the GP. Let a function <math>f</math> follow a Gaussian process with mean function <math>m</math> and kernel function <math>k</math>,
| |
| | |
| <math>
| |
| f \sim \mathcal{GP}(m,k).
| |
| </math>
| |
| | |
| In terms of the underlying Gaussian distribution, we have that for any finite set <math>\mathbf{X} = \{\mathbf{x}_i\}_{i=1}^{n}</math> if we let <math>f(\mathbf{X}) = [f(\mathbf{x}_1),\ldots,f(\mathbf{x}_n)]^\top</math> then
| |
| | |
| <math>
| |
| f(\mathbf{X}) \sim \mathcal{N}(\mathbf{m},\mathbf{K}),
| |
| </math>
| |
| | |
| where <math>\mathbf{m} = m(\mathbf{X}) = [m(\mathbf{x}_1),\ldots,m(\mathbf{x}_n)]^\top</math> is the mean vector and <math>\mathbf{K} = k(\mathbf{X},\mathbf{X})</math> is the covariance matrix of the multivariate Gaussian distribution.
| |
| | |
| ===Derivation of the Estimator===
| |
| {{see|Minimum mean square error#Linear MMSE estimator for linear observation process}}
| |
| In a regression context, the likelihood function is usually assumed to be a Gaussian distribution and the observations to be independent and identically distributed (iid),
| |
| | |
| <math>
| |
| p(y|f,\mathbf{x},\sigma^2) = \mathcal{N}(f(\mathbf{x}),\sigma^2).
| |
| </math>
| |
| | |
| This assumption corresponds to the observations being corrupted with zero-mean Gaussian noise with variance <math>\sigma^2</math>. The iid assumption makes it possible to factorize the likelihood function over the data points given the set of inputs <math>\mathbf{X}</math> and the variance of the noise <math>\sigma^2</math>, and thus the posterior distribution can be computed analytically. For a test input vector <math>\mathbf{x}'</math>, given the training data <math>S = \{\mathbf{X},\mathbf{Y}\}</math>, the posterior distribution is given by
| |
| | |
| <math>
| |
| p(f(\mathbf{x}')|S,\mathbf{x}',\boldsymbol{\phi}) = \mathcal{N}(m(\mathbf{x}'),\sigma^2(\mathbf{x}')),
| |
| </math>
| |
| | |
| where <math>\boldsymbol{\phi}</math> denotes the set of parameters which include the variance of the noise <math>\sigma^2</math> and any parameters from the covariance function <math>k</math> and where
| |
| | |
| <math>\begin{align}
| |
| m(\mathbf{x}') & = \mathbf{k}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{Y}, \\
| |
| \sigma^2(\mathbf{x}') & = k(\mathbf{x}',\mathbf{x}') - \mathbf{k}^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}.
| |
| \end{align}</math>
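The posterior mean and variance can be computed with a Cholesky factorization of <math>\mathbf{K} + \sigma^2 \mathbf{I}</math>. The following is a minimal sketch assuming a zero-mean prior, a Gaussian (RBF) kernel, and synthetic data; the function names and the noise level are illustrative assumptions. It also compares the posterior mean with the regularization estimator of equation (1) for <math>\lambda = \sigma^2 / n</math>.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Gaussian (RBF) covariance function (an assumed choice of kernel)."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_posterior(X, Y, X_new, noise_var, kernel=rbf_kernel):
    """Posterior mean and variance of f at each row of X_new (zero-mean prior)."""
    n = X.shape[0]
    K = kernel(X, X)
    k_new = kernel(X, X_new)                       # n x m
    L = np.linalg.cholesky(K + noise_var * np.eye(n))
    # alpha = (K + noise_var * I)^{-1} Y via two solves with the Cholesky factor
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, Y))
    mean = k_new.T @ alpha
    v = np.linalg.solve(L, k_new)                  # L^{-1} k for every test point
    var = np.diag(kernel(X_new, X_new)) - np.sum(v**2, axis=0)
    return mean, var

rng = np.random.default_rng(3)
n, noise_var = 20, 0.01
X = rng.uniform(-2.0, 2.0, size=(n, 1))
Y = np.cos(X[:, 0]) + np.sqrt(noise_var) * rng.standard_normal(n)
X_new = np.linspace(-2.0, 2.0, 7)[:, None]

mean, var = gp_posterior(X, Y, X_new, noise_var)

# Regularization estimator of equation (1) with lam chosen so that lam * n = noise_var.
lam = noise_var / n
K = rbf_kernel(X, X)
mean_reg = rbf_kernel(X, X_new).T @ np.linalg.solve(K + lam * n * np.eye(n), Y)

print(np.max(np.abs(mean - mean_reg)))
```

The posterior variance has no counterpart in the regularization estimator, which returns only a point prediction.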
| |
| | |
| ==The Connection Between Regularization and Bayes==
| |
| | |
| A connection between regularization theory and Bayesian theory can only be achieved in the case of ''finite dimensional RKHS''. Under this assumption, regularization theory and Bayesian theory are connected through Gaussian process prediction.<ref name=Wah90>{{cite book|last=Wahba|first=Grace|title=Spline models for observational data|year=1990|publisher=SIAM}}</ref><ref name=RasWil06>{{cite book|last=Rasmussen|first=Carl Edward|title=Gaussian Processes for Machine Learning|year=2006|publisher=The MIT Press|isbn=0-262-18253-X|url=http://www.gaussianprocess.org/gpml/|author2=Williams, Christopher K. I.}}</ref>
| |
| | |
| In the finite dimensional case, every RKHS can be described in terms of a feature map <math>\Phi : \mathcal{X} \rightarrow \mathbb{R}^p</math> such that<ref name=Vap98 />
| |
| | |
| <math>
| |
| k(\mathbf{x},\mathbf{x}') = \sum_{i=1}^p \Phi^i(\mathbf{x})\Phi^i(\mathbf{x}').
| |
| </math>
| |
| | |
| Functions in the RKHS with kernel <math>k</math> can then be written as
| |
| | |
| <math>
| |
| f_{\mathbf{w}}(\mathbf{x}) = \sum_{i=1}^p \mathbf{w}^i \Phi^i(\mathbf{x}) = \langle \mathbf{w},\Phi(\mathbf{x}) \rangle,
| |
| </math>
| |
| | |
| and we also have that
| |
| | |
| <math>
| |
| \|f_{\mathbf{w}} \|_k = \|\mathbf{w}\|.
| |
| </math>
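As a small numerical illustration of this finite-dimensional description, the sketch below assumes the polynomial kernel <math>k(x,x') = (1 + x x')^2</math> on scalar inputs, whose explicit feature map is <math>\Phi(x) = (1, \sqrt{2}x, x^2)</math>. The kernel, the data, and the name <code>feature_map</code> are assumptions chosen for illustration. It checks that the kernel is the inner product of the features and that the squared norm <math>\mathbf{c}^\top \mathbf{K} \mathbf{c}</math> of a kernel expansion equals <math>\|\mathbf{w}\|^2</math> for the corresponding weight vector.

```python
import numpy as np

def feature_map(x):
    """Explicit features with Phi(x) . Phi(x') = (1 + x * x')**2 for scalar x."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), np.sqrt(2.0) * x, x**2], axis=-1)

rng = np.random.default_rng(1)
x_train = rng.standard_normal(6)
c = rng.standard_normal(6)            # coefficients of f = sum_i c_i k(x_i, .)

Phi = feature_map(x_train)            # shape (6, 3), one feature vector per row
K = (1.0 + np.outer(x_train, x_train)) ** 2

# Weight vector of the same function: f(x) = <w, Phi(x)> with w = sum_i c_i Phi(x_i).
w = Phi.T @ c

norm_sq_kernel = c @ K @ c            # squared RKHS norm via the kernel matrix
norm_sq_weights = w @ w               # squared Euclidean norm of the weights

print(norm_sq_kernel, norm_sq_weights)
```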
| |
| | |
| We can now build a Gaussian process by assuming <math> \mathbf{w} = [w^1,\ldots,w^p]^\top </math> to be distributed according to a multivariate Gaussian distribution with zero mean and identity covariance matrix,
| |
| | |
| <math>
| |
| \mathbf{w} \sim \mathcal{N}(0,\mathbf{I}) \propto \exp\left(-\frac{1}{2}\|\mathbf{w}\|^2\right).
| |
| </math>
| |
| | |
| If we assume a Gaussian likelihood we have
| |
| | |
| <math>
| |
| P(\mathbf{Y}|\mathbf{X},f) = \mathcal{N}(f(\mathbf{X}),\sigma^2 \mathbf{I}) \propto \exp\left(-\frac{1}{2\sigma^2} \| f_{\mathbf{w}}(\mathbf{X}) - \mathbf{Y} \|^2\right),
| |
| </math>
| |
| | |
| where <math> f_{\mathbf{w}}(\mathbf{X}) = (\langle\mathbf{w},\Phi(\mathbf{x}_1)\rangle,\ldots,\langle\mathbf{w},\Phi(\mathbf{x}_n)\rangle) </math>. The resulting posterior distribution is then given by
| |
| | |
| <math>
| |
| P(f|\mathbf{X},\mathbf{Y}) \propto \exp\left(-\frac{1}{2\sigma^2} \|f_{\mathbf{w}}(\mathbf{X}) - \mathbf{Y}\|^2 - \frac{1}{2}\|\mathbf{w}\|^2\right).
| |
| </math>
| |
| | |
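| We can see that a ''maximum a posteriori (MAP)'' estimate is equivalent to the minimization problem defining [[Tikhonov regularization]], where in the Bayesian case the regularization parameter is related to the noise variance: comparing the posterior mean <math>m(\mathbf{x}')</math> with equation ({{EquationNote|1}}) shows that the two estimators coincide when <math>\lambda n = \sigma^2</math>.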
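The equivalence can be illustrated numerically in a small finite-dimensional example. The sketch below assumes a cubic polynomial feature map on scalar inputs (so <math>p = 4</math>), synthetic data, and a fixed noise variance; all names and values are illustrative assumptions. It computes the MAP weight vector by minimizing <math>\|\Phi \mathbf{w} - \mathbf{Y}\|^2 + \sigma^2 \|\mathbf{w}\|^2</math> in weight space and compares its predictions with the kernel form <math>\mathbf{k}^\top(\mathbf{K} + \sigma^2 \mathbf{I})^{-1}\mathbf{Y}</math>.

```python
import numpy as np

def feature_map(x):
    """Assumed feature map Phi(x) = (1, x, x^2, x^3) for scalar inputs."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=-1)

rng = np.random.default_rng(4)
n, noise_var = 12, 0.25
x_train = rng.uniform(-1.0, 1.0, size=n)
Y = x_train**3 - x_train + np.sqrt(noise_var) * rng.standard_normal(n)
x_new = np.linspace(-1.0, 1.0, 5)

Phi = feature_map(x_train)            # n x p design matrix
Phi_new = feature_map(x_new)          # m x p
p = Phi.shape[1]

# Weight-space MAP estimate: minimizer of ||Phi w - Y||^2 + noise_var * ||w||^2.
w_map = np.linalg.solve(Phi.T @ Phi + noise_var * np.eye(p), Phi.T @ Y)
pred_weight = Phi_new @ w_map

# Kernel form with k(x, x') = <Phi(x), Phi(x')>.
K = Phi @ Phi.T                       # n x n kernel matrix
k_new = Phi @ Phi_new.T               # n x m
pred_kernel = k_new.T @ np.linalg.solve(K + noise_var * np.eye(n), Y)

print(np.max(np.abs(pred_weight - pred_kernel)))
```

The weight-space solve involves a <math>p \times p</math> system and the kernel form an <math>n \times n</math> system, so which one is cheaper depends on whether there are more features or more training points.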
| |
| | |
| From a philosophical perspective, the loss function in a regularization setting plays a different role than the likelihood function in the Bayesian setting. Whereas the loss function measures the error that is incurred when predicting <math>f(\mathbf{x})</math> in place of <math>y</math>, the likelihood function measures how likely the observations are from the model that was assumed to be true in the generative process. From a mathematical perspective, however, the formulations of the regularization and Bayesian frameworks give the loss function and the likelihood function the same mathematical role: promoting the inference of functions <math>f</math> that approximate the labels <math>y</math> as closely as possible.
| |
| | |
| ==References==
| |
| {{Reflist}}
| |
| | |
| [[Category:Bayesian statistics]]
| |
| [[Category:Machine learning]]
| |
| [[Category:Probability theory]]
| |