[[File:Svm 8 polinomial.JPG|thumb|300px|Illustration of the mapping <math>\varphi</math>. On the left a set of samples in the input space, on the right the same samples in the feature space where the polynomial kernel <math>K(x,y)</math> (for some values of the parameters <math>c</math> and <math>d</math>) is the inner product. The hyperplane learned in feature space by an SVM is an ellipse in the input space.]]
In [[machine learning]], the '''polynomial kernel''' is a [[kernel function]] commonly used with [[support vector machine]]s (SVMs) and other [[Kernel trick|kernelized]] models. It represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing non-linear models to be learned.

Intuitively, the polynomial kernel determines the similarity of input samples by looking not only at their given features, but also at combinations of these.

==Definition==
For degree-''d'' polynomials, the polynomial kernel is defined as<ref>http://www.cs.tufts.edu/~roni/Teaching/CLT/LN/lecture18.pdf</ref>

:<math>K(x,y) = (x^\top y + c)^d</math>

where <math>x</math> and <math>y</math> are vectors in the ''input space'', i.e. vectors of features computed from training or test samples, and <math>c \geq 0</math> is a constant trading off the influence of higher-order versus lower-order terms in the polynomial. When <math>c = 0</math>, the kernel is called homogeneous.<ref>{{cite arXiv
|last=Shashua
|first=Amnon
|eprint=0904.3664
|title=Introduction to Machine Learning: Class Notes 67577
|class=cs.LG
|year=2009
|version=v1
|accessdate=26 March 2013
}}</ref> (A further generalized polykernel divides <math>x^\top y</math> by a user-specified scalar parameter <math>a</math>.<ref name="lin2012"/>)
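
The definition can be evaluated directly. The following is a minimal sketch, assuming NumPy is available; the function name <code>polynomial_kernel</code> and its default parameter values are illustrative choices, not part of any particular library.

```python
import numpy as np

def polynomial_kernel(x, y, c=1.0, d=2):
    """Return (x^T y + c)^d for two 1-D feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (x @ y + c) ** d

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Here x^T y = 1*3 + 2*0.5 = 4.
print(polynomial_kernel(x, y, c=1.0, d=2))  # (4 + 1)^2 = 25.0
print(polynomial_kernel(x, y, c=0.0, d=2))  # homogeneous case: 4^2 = 16.0
```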

As a kernel, <math>K</math> corresponds to an inner product in a feature space based on some mapping <math>\varphi</math>:

:<math>K(x,y) = \langle \varphi(x), \varphi(y) \rangle</math>

The nature of <math>\varphi</math> can be seen from an example. Let <math>d=2</math>, so we get the special case of the quadratic kernel. Then

:<math>K(x,y) = \left(\sum_{i=1}^n x_i y_i + c\right)^2 = \sum_{i=1}^n \left(x_i^2\right) \left(y_i^2\right) + \sum_{i=2}^n \sum_{j=1}^{i-1} \left(\sqrt{2} x_i x_j\right) \left(\sqrt{2} y_i y_j\right) + \sum_{i=1}^n \left(\sqrt{2c} x_i\right) \left(\sqrt{2c} y_i\right) + c^2
</math>

From this it follows that the feature map is given by:

:<math>
\varphi(x) = \langle x_n^2, \ldots, x_1^2, \sqrt{2} x_n x_{n-1}, \ldots, \sqrt{2} x_n x_1, \sqrt{2} x_{n-1} x_{n-2}, \ldots, \sqrt{2} x_{n-1} x_{1}, \ldots, \sqrt{2} x_{2} x_{1}, \sqrt{2c} x_n, \ldots, \sqrt{2c} x_1, c \rangle
</math>
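
The identity <math>K(x,y) = \langle \varphi(x), \varphi(y) \rangle</math> for <math>d=2</math> can be checked numerically. The sketch below assumes NumPy; the helper name <code>quadratic_feature_map</code> is hypothetical, and the components are emitted in the same order as in the formula above.

```python
import numpy as np

def quadratic_feature_map(x, c=1.0):
    """Explicit feature map of the degree-2 polynomial kernel (x^T y + c)^2."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Squared terms: x_n^2, ..., x_1^2
    squares = x[::-1] ** 2
    # Cross terms: sqrt(2) x_i x_j for i > j, from x_n x_{n-1} down to x_2 x_1
    cross = [np.sqrt(2.0) * x[i] * x[j]
             for i in range(n - 1, 0, -1)
             for j in range(i - 1, -1, -1)]
    # Linear terms: sqrt(2c) x_n, ..., sqrt(2c) x_1
    linear = np.sqrt(2.0 * c) * x[::-1]
    # Constant term: c
    return np.concatenate([squares, cross, linear, [c]])

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
c = 0.5

lhs = (x @ y + c) ** 2                                          # kernel in input space
rhs = quadratic_feature_map(x, c) @ quadratic_feature_map(y, c)  # inner product in feature space
print(np.isclose(lhs, rhs))
```

For <math>n</math> input features the mapped vector has <math>n + n(n-1)/2 + n + 1</math> components, which is why evaluating the kernel in the input space is usually cheaper than forming <math>\varphi</math> explicitly.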

When the input features are binary-valued (booleans), <math>c = 1</math>, and the <math>\sqrt{2}</math> factors are ignored, the mapped features correspond to [[Logical conjunction|conjunction]]s of input features.<ref name="Goldberg2008">Yoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.</ref>
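
For boolean inputs, <math>x_i^2 = x_i</math> and <math>x_i x_j</math> equals the logical AND of <math>x_i</math> and <math>x_j</math>. A small sketch of the resulting degree-2 conjunction features follows; the function name <code>conjunction_features</code> is hypothetical, and duplicate components (each <math>x_i</math> appears both as a square and as a linear term in <math>\varphi</math>) are listed only once here as a simplification.

```python
from itertools import combinations

def conjunction_features(x):
    """Map boolean features to single features, pairwise ANDs and a constant."""
    singles = [int(a) for a in x]                          # x_i (equal to x_i^2 for booleans)
    pairs = [int(a and b) for a, b in combinations(x, 2)]  # x_i AND x_j for each pair
    return singles + pairs + [1]                           # constant term for c = 1

print(conjunction_features((True, False, True)))
# [1, 0, 1, 0, 1, 0, 1]
```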

==Practical use==
Although the [[RBF kernel]] is more popular in SVM classification than the polynomial kernel, the latter is quite popular in [[natural language processing]] (NLP).<ref name="Goldberg2008"/><ref name="Chang2010">Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard and Chih-Jen Lin (2010). [http://jmlr.csail.mit.edu/papers/v11/chang10a.html Training and testing low-degree polynomial data mappings via linear SVM]. J. Machine Learning Research '''11''':1471–1490.</ref>
The most common degree is ''d''=2, since larger degrees tend to [[overfitting|overfit]] on NLP problems.

Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:

* full expansion of the kernel prior to training/testing with a linear SVM,<ref name="Chang2010"/> i.e. full computation of the mapping <math>\varphi</math>
* [[Association rule learning|basket mining]] (using a variant of the [[apriori algorithm]]) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion<ref name="Kudo2003">T. Kudo and Y. Matsumoto (2003). Fast methods for kernel-based text analysis. Proc. ACL 2003.</ref>
* [[inverted index]]ing of support vectors<ref name="Kudo2003"/><ref name="Goldberg2008"/>
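
The first approach in the list above, full expansion followed by a linear model, can be sketched as follows for <math>d=2</math>. This is an illustration under stated assumptions rather than the method of the cited papers: it assumes NumPy, the helper name <code>expand_degree2</code> is hypothetical, and an ordinary least-squares fit stands in for a linear SVM to keep the example self-contained.

```python
import numpy as np

def expand_degree2(X, c=1.0):
    """Apply the degree-2 feature map to every row of X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    cols = [X[:, i] ** 2 for i in range(n)]                       # squared terms
    cols += [np.sqrt(2.0) * X[:, i] * X[:, j]
             for i in range(n) for j in range(i)]                 # cross terms
    cols += [np.sqrt(2.0 * c) * X[:, i] for i in range(n)]        # linear terms
    cols.append(np.full(X.shape[0], c))                           # constant term
    return np.column_stack(cols)

# XOR-style data: not linearly separable in the input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, -1.0])

Phi = expand_degree2(X, c=1.0)

# The expanded data reproduces the kernel (Gram) matrix exactly.
print(np.allclose(Phi @ Phi.T, (X @ X.T + 1.0) ** 2))

# A linear model on the expanded features (least squares, not an SVM).
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.sign(Phi @ w))
```

The trade-off is that the expanded representation grows quadratically with the number of input features, so this route is mainly attractive for low degrees and sparse data.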

One problem with the polynomial kernel is that it may suffer from [[Numerical stability|numerical instability]]: when <math>x^\top y + c < 1</math>, <math>K(x, y) = (x^\top y + c)^d</math> tends to zero as <math>d</math> is increased, whereas when <math>x^\top y + c > 1</math>, <math>K(x, y)</math> tends to infinity.<ref name="lin2012">Chih-Jen Lin (2012). [http://www.csie.ntu.edu.tw/~cjlin/talks/mlss_kyoto.pdf Machine learning software: design and practical use]. Talk at Machine Learning Summer School, Kyoto.</ref>
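
The effect is easy to observe with plain floating-point arithmetic. In the sketch below the input values are made up for illustration, and the function name <code>polynomial_kernel_value</code> is hypothetical; it takes the inner product <math>x^\top y</math> directly as a number.

```python
def polynomial_kernel_value(dot, c, d):
    """Return (dot + c)^d, where dot stands for the inner product x^T y."""
    return (dot + c) ** d

for d in (2, 10, 50):
    small = polynomial_kernel_value(0.3, 0.2, d)  # base 0.5 < 1: shrinks towards zero
    large = polynomial_kernel_value(1.5, 0.5, d)  # base 2.0 > 1: grows without bound
    print(d, small, large)
```

Already at <math>d = 50</math> the two values differ by roughly thirty orders of magnitude, which is one reason low degrees are preferred in practice.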

==References==
<references/>

[[Category:Kernel methods for machine learning]]
