<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://en.formulasearchengine.com/index.php?action=history&amp;feed=atom&amp;title=Hyper-Erlang_distribution</id>
	<title>Hyper-Erlang distribution - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://en.formulasearchengine.com/index.php?action=history&amp;feed=atom&amp;title=Hyper-Erlang_distribution"/>
	<link rel="alternate" type="text/html" href="https://en.formulasearchengine.com/index.php?title=Hyper-Erlang_distribution&amp;action=history"/>
	<updated>2026-05-31T06:45:14Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.0-wmf.28</generator>
	<entry>
		<id>https://en.formulasearchengine.com/index.php?title=Hyper-Erlang_distribution&amp;diff=30234&amp;oldid=prev</id>
		<title>en&gt;BeyondNormality: /* References */</title>
		<link rel="alternate" type="text/html" href="https://en.formulasearchengine.com/index.php?title=Hyper-Erlang_distribution&amp;diff=30234&amp;oldid=prev"/>
		<updated>2013-12-10T03:52:06Z</updated>

		<summary type="html">&lt;p&gt;&lt;span class=&quot;autocomment&quot;&gt;References&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Proximal gradient&amp;#039;&amp;#039;&amp;#039; (forward backward splitting) &amp;#039;&amp;#039;&amp;#039;methods for learning&amp;#039;&amp;#039;&amp;#039; is an area of research in [[optimization]] and [[statistical learning theory]] which studies algorithms for a general class of [[Convex function#Definition|convex]] [[Regularization (mathematics)|regularization]] problems where the regularization penalty may not be [[Differentiable function|differentiable]]. One such example is &amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt; regularization (also known as Lasso) of the form&lt;br /&gt;
:&amp;lt;math&amp;gt;\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, \quad \text{ where } x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application.&amp;lt;ref name=combettes&amp;gt;{{cite journal|last=Combettes|first=Patrick L.|coauthors=Wajs, Valérie R.|title=Signal Recovering by Proximal Forward-Backward Splitting|journal=Multiscale Model. Simul.|year=2005|volume=4|issue=4|pages=1168–1200|url=http://epubs.siam.org/doi/abs/10.1137/050626090}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=structSparse&amp;gt;{{cite journal|last=Mosci|first=S.|coauthors=Rosasco, L., Matteo, S., Verri, A., and Villa, S.|title=Solving Structured Sparsity Regularization with Proximal Methods|journal=Machine Learning and Knowledge Discovery in Databases|year=2010|volume=6322|pages=418–433}}&amp;lt;/ref&amp;gt; Such customized penalties can help to induce certain structure in problem solutions, such as &amp;#039;&amp;#039;sparsity&amp;#039;&amp;#039; (in the case of [[#Lasso regularization|lasso]]) or &amp;#039;&amp;#039;group structure&amp;#039;&amp;#039; (in the case of  [[#Exploiting group structure|group lasso]]).&lt;br /&gt;
&lt;br /&gt;
== Relevant background ==&lt;br /&gt;
&lt;br /&gt;
[[Proximal Gradient Methods|Proximal gradient methods]] are applicable in a wide variety of scenarios for solving [[convex optimization]] problems of the form&lt;br /&gt;
:&amp;lt;math&amp;gt; \min_{x\in \mathcal{H}} F(x)+R(x),&amp;lt;/math&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;F&amp;lt;/math&amp;gt; is [[Convex function|convex]] and differentiable with [[Lipschitz continuity|Lipschitz continuous]] [[gradient]], &amp;lt;math&amp;gt; R&amp;lt;/math&amp;gt; is a [[Convex function|convex]], [[Semicontinuous function|lower semicontinuous]] function which is possibly nondifferentiable, and &amp;lt;math&amp;gt;\mathcal{H}&amp;lt;/math&amp;gt; is some set, typically a [[Hilbert space]]. The usual criterion of &amp;lt;math&amp;gt; x&amp;lt;/math&amp;gt; minimizes &amp;lt;math&amp;gt; F(x)+R(x)&amp;lt;/math&amp;gt; if and only if &amp;lt;math&amp;gt; \nabla (F+R)(x) = 0&amp;lt;/math&amp;gt; in the convex, differentiable setting is now replaced by&lt;br /&gt;
:&amp;lt;math&amp;gt; 0\in \partial (F+R)(x), &amp;lt;/math&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;\partial \varphi&amp;lt;/math&amp;gt; denotes the [[subdifferential]] of a real-valued, convex function &amp;lt;math&amp;gt; \varphi&amp;lt;/math&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
Given a convex function &amp;lt;math&amp;gt;\varphi:\mathcal{H} \to \mathbb{R}&amp;lt;/math&amp;gt; an important operator to consider is its &amp;#039;&amp;#039;&amp;#039;proximity operator&amp;#039;&amp;#039;&amp;#039; &amp;lt;math&amp;gt;\operatorname{prox}_{\varphi}:\mathcal{H}\to\mathcal{H} &amp;lt;/math&amp;gt; defined by&lt;br /&gt;
:&amp;lt;math&amp;gt; \operatorname{prox}_{\varphi}(u) = \operatorname{arg}\min_{x\in\mathcal{H}} \varphi(x)+\frac{1}{2}\|u-x\|_2^2,&amp;lt;/math&amp;gt;&lt;br /&gt;
which is well-defined because of the strict convexity of the &amp;lt;math&amp;gt; \ell_2&amp;lt;/math&amp;gt; norm. The proximity operator can be seen as a generalization of a [[Projection (linear algebra)|projection]].&amp;lt;ref name=combettes /&amp;gt;&amp;lt;ref name=moreau /&amp;gt;&amp;lt;ref name=bauschke&amp;gt;{{cite book|last=Bauschke|first=H.H., and Combettes, P.L.|title=Convex analysis and monotone operator theory in Hilbert spaces|year=2011|publisher=Springer}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
We see that the proximity operator is important because &amp;lt;math&amp;gt; x^* &amp;lt;/math&amp;gt; is a minimizer to the problem &amp;lt;math&amp;gt; \min_{x\in\mathcal{H}} F(x)+R(x)&amp;lt;/math&amp;gt; if and only if&lt;br /&gt;
:&amp;lt;math&amp;gt;x^* = \operatorname{prox}_{\gamma R}\left(x^*-\gamma\nabla F(x^*)\right),&amp;lt;/math&amp;gt; where &amp;lt;math&amp;gt;\gamma&amp;gt;0&amp;lt;/math&amp;gt; is any positive real number.&amp;lt;ref name=combettes /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Moreau decomposition ===&lt;br /&gt;
&lt;br /&gt;
One important technique related to proximal gradient methods is the &amp;#039;&amp;#039;&amp;#039;Moreau decomposition,&amp;#039;&amp;#039;&amp;#039; which decomposes the identity operator as the sum of two proximity operators.&amp;lt;ref name=combettes /&amp;gt; Namely, let &amp;lt;math&amp;gt;\varphi:\mathcal{X}\to\mathbb{R}&amp;lt;/math&amp;gt; be a [[Semi-continuity|lower semicontinuous]], convex function on a vector space &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt;. We define its [[Convex conjugate|Fenchel conjugate]] &amp;lt;math&amp;gt;\varphi^*:\mathcal{X}\to\mathbb{R}&amp;lt;/math&amp;gt; to be the function&lt;br /&gt;
:&amp;lt;math&amp;gt;\varphi^*(u) := \sup_{x\in\mathcal{X}} \langle x,u\rangle - \varphi(x).&amp;lt;/math&amp;gt;&lt;br /&gt;
The general form of Moreau&amp;#039;s decomposition states that for any &amp;lt;math&amp;gt;x\in\mathcal{X}&amp;lt;/math&amp;gt; and any &amp;lt;math&amp;gt;\gamma&amp;gt;0&amp;lt;/math&amp;gt; that&lt;br /&gt;
:&amp;lt;math&amp;gt;x = \operatorname{prox}_{\gamma \varphi}(x) + \gamma\operatorname{prox}_{\varphi^*/\gamma}(x/\gamma),&amp;lt;/math&amp;gt;&lt;br /&gt;
which for &amp;lt;math&amp;gt;\gamma=1&amp;lt;/math&amp;gt; implies that &amp;lt;math&amp;gt;x = \operatorname{prox}_{\varphi}(x)+\operatorname{prox}_{\varphi^*}(x)&amp;lt;/math&amp;gt;.&amp;lt;ref name=combettes /&amp;gt;&amp;lt;ref name=moreau&amp;gt;{{cite journal|last=Moreau|first=J.-J.|title=Fonctions convexes duales et points proximaux dans un espace hilbertien|journal=C. R. Acad. Sci. Paris Ser. A Math.|year=1962|volume=255|pages=2987-2899}}&amp;lt;/ref&amp;gt; The Moreau decomposition can be seen to be a generalization of the usual orthogonal decomposition of a vector space, analogous with the fact that proximity operators are generalizations of projections.&amp;lt;ref name=combettes /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
In certain situations it may be easier to compute the proximity operator for the conjugate &amp;lt;math&amp;gt;\varphi^*&amp;lt;/math&amp;gt; instead of the function &amp;lt;math&amp;gt;\varphi&amp;lt;/math&amp;gt;, and therefore the Moreau decomposition can be applied. This is the case for  [[#Exploiting group structure|group lasso]].&lt;br /&gt;
&lt;br /&gt;
== Lasso regularization ==&lt;br /&gt;
&lt;br /&gt;
Consider the [[Regularization (mathematics)|regularized]] [[empirical risk minimization]] problem with square loss and with the [[L1-norm|&amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt; norm]] as the regularization penalty:&lt;br /&gt;
:&amp;lt;math&amp;gt;\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_1, &amp;lt;/math&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.&amp;lt;/math&amp;gt; The &amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt; regularization problem is sometimes referred to as &amp;#039;&amp;#039;lasso&amp;#039;&amp;#039; ([[Least squares#Lasso method|least absolute shrinkage and selection operator]]).&amp;lt;ref name=tibshirani /&amp;gt; Such &amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt; regularization problems are interesting because they induce &amp;#039;&amp;#039; sparse&amp;#039;&amp;#039; solutions, that is, solutions &amp;lt;math&amp;gt;w&amp;lt;/math&amp;gt; to the minimization problem have relatively few nonzero components. Lasso can be seen to be a convex relaxation of the non-convex problem&lt;br /&gt;
:&amp;lt;math&amp;gt;\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \|w\|_0, &amp;lt;/math&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;\|w\|_0&amp;lt;/math&amp;gt; denotes the &amp;lt;math&amp;gt;\ell_0&amp;lt;/math&amp;gt; &amp;quot;norm&amp;quot;, which is the number of nonzero entries of the vector &amp;lt;math&amp;gt;w&amp;lt;/math&amp;gt;. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.&amp;lt;ref name=tibshirani&amp;gt;{{cite journal|last=Tibshirani|first=R.|title=Regression shrinkage and selection via the lasso|journal=J. R. Stat. Soc., Ser. B|year=1996|volume=58|series=1|issue=1|pages=267–288}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Solving for &amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt; proximity operator ===&lt;br /&gt;
&lt;br /&gt;
For simplicity we restrict our attention to the problem where &amp;lt;math&amp;gt;\lambda=1&amp;lt;/math&amp;gt;. To solve the problem&lt;br /&gt;
:&amp;lt;math&amp;gt;\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+  \|w\|_1, &amp;lt;/math&amp;gt;&lt;br /&gt;
we consider our objective function in two parts: a convex, differentiable term &amp;lt;math&amp;gt;F(w) = \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2&amp;lt;/math&amp;gt; and a convex function &amp;lt;math&amp;gt;R(w) = \|w\|_1&amp;lt;/math&amp;gt;. Note that &amp;lt;math&amp;gt;R&amp;lt;/math&amp;gt; is not strictly convex.&lt;br /&gt;
&lt;br /&gt;
Let us compute the proximity operator for &amp;lt;math&amp;gt;R(w)&amp;lt;/math&amp;gt;. First we find an alternative characterization of the proximity operator &amp;lt;math&amp;gt;\operatorname{prox}_{R}(x)&amp;lt;/math&amp;gt; as follows:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;math&amp;gt;&lt;br /&gt;
\begin{align}&lt;br /&gt;
u = \operatorname{prox}_R(x) \iff &amp;amp; 0\in \partial \left(R(u)+\frac{1}{2}\|u-x\|_2^2\right)\\&lt;br /&gt;
\iff &amp;amp; 0\in \partial R(u) + u-x\\&lt;br /&gt;
\iff &amp;amp; x-u\in \partial R(u).&lt;br /&gt;
\end{align}&lt;br /&gt;
&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
For &amp;lt;math&amp;gt;R(w) = \|w\|_1&amp;lt;/math&amp;gt; it is easy to compute &amp;lt;math&amp;gt;\partial R(w)&amp;lt;/math&amp;gt;: the &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;th entry of &amp;lt;math&amp;gt;\partial R(w)&amp;lt;/math&amp;gt; is precisely&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt; \partial |w_i| = \begin{cases}&lt;br /&gt;
1,&amp;amp;w_i&amp;gt;0\\&lt;br /&gt;
-1,&amp;amp;w_i&amp;lt;0\\&lt;br /&gt;
\left[-1,1\right],&amp;amp;w_i = 0.&lt;br /&gt;
\end{cases}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Using the recharacterization of the proximity operator given above, for the choice of &amp;lt;math&amp;gt;R(w) = \|w\|_1&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\gamma&amp;gt;0&amp;lt;/math&amp;gt; we have that &amp;lt;math&amp;gt;\operatorname{prox}_{\gamma R}(x)&amp;lt;/math&amp;gt; is defined entrywise by&lt;br /&gt;
&lt;br /&gt;
::&amp;lt;math&amp;gt;\left(\operatorname{prox}_{\gamma R}(x)\right)_i = \begin{cases}&lt;br /&gt;
x_i-\gamma,&amp;amp;x_i&amp;gt;\gamma\\&lt;br /&gt;
0,&amp;amp;|x_i|\leq\gamma\\&lt;br /&gt;
x_i+\gamma,&amp;amp;x_i&amp;lt;-\gamma,&lt;br /&gt;
\end{cases}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which is known as the [[Thresholding (image processing)|soft thresholding]] operator &amp;lt;math&amp;gt;S_{\gamma}(x)=\operatorname{prox}_{\gamma \|\cdot\|_1}(x)&amp;lt;/math&amp;gt;.&amp;lt;ref name=combettes /&amp;gt;&amp;lt;ref name=daubechies&amp;gt;{{cite journal|last=Daubechies|first=I.|coauthors=Defrise, M., and De Mol, C.|title=An iterative thresholding algorithm for linear inverse problem with a sparsity constraint|journal=Comm. Pure Appl. Math|year=2004|volume=57|issue=11|pages=1413–1457}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Fixed point iterative schemes ===&lt;br /&gt;
&lt;br /&gt;
To finally solve the lasso problem we consider the fixed point equation shown earlier:&lt;br /&gt;
:&amp;lt;math&amp;gt;x^* = \operatorname{prox}_{\gamma R}\left(x^*-\gamma\nabla F(x^*)\right).&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Given that we have computed the form of the proximity operator explicitly, then we can define a standard fixed point iteration procedure. Namely, fix some initial &amp;lt;math&amp;gt;w^0\in\mathbb{R}^d&amp;lt;/math&amp;gt;, and for &amp;lt;math&amp;gt;k=1,2,\ldots&amp;lt;/math&amp;gt; define&lt;br /&gt;
:&amp;lt;math&amp;gt;w^{k+1} = S_{\gamma}\left(w^k - \gamma \nabla F\left(w^k\right)\right).&amp;lt;/math&amp;gt;&lt;br /&gt;
Note here the effective trade-off between the empirical error term &amp;lt;math&amp;gt;F(w) &amp;lt;/math&amp;gt; and the regularization penality &amp;lt;math&amp;gt;R(w)&amp;lt;/math&amp;gt;. This  fixed point method has decoupled the effect of the two different convex functions which comprise the objective function into a gradient descent step (&amp;lt;math&amp;gt; w^k - \gamma \nabla F\left(w^k\right)&amp;lt;/math&amp;gt;) and a soft thresholding step (via &amp;lt;math&amp;gt;S_\gamma&amp;lt;/math&amp;gt;).&lt;br /&gt;
&lt;br /&gt;
Convergence of this fixed point scheme is well-studied in the literature&amp;lt;ref name=combettes /&amp;gt;&amp;lt;ref name=daubechies /&amp;gt; and is guaranteed under appropriate choice of step size &amp;lt;math&amp;gt;\gamma&amp;lt;/math&amp;gt; and loss function (such as the square loss taken here). [[Gradient descent#Extensions|Accelerated methods]] were introduced by Nesterov in 1983 which improve the rate of convergence under certain regularity assumptions on &amp;lt;math&amp;gt;F&amp;lt;/math&amp;gt;.&amp;lt;ref name=nesterov&amp;gt;{{cite journal|last=Nesterov|first=Yurii|title=A method of solving a convex programming problem with convergence rate &amp;lt;math&amp;gt;O(1/k^2)&amp;lt;/math&amp;gt;|journal=Soviet Math. Doklady|year=1983|volume=27|issue=2|pages=372–376}}&amp;lt;/ref&amp;gt; Such methods have been studied extensively in previous years.&amp;lt;ref&amp;gt;{{cite book|last=Nesterov|first=Yurii|title=Introductory Lectures on Convex Optimization|year=2004|publisher=Kluwer Academic Publisher}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
For more general learning problems where the proximity operator cannot be computed explicitly for some regularization term &amp;lt;math&amp;gt;R&amp;lt;/math&amp;gt;, such fixed point schemes can still be carried out using approximations to both the gradient and the proximity operator.&amp;lt;ref name=bauschke /&amp;gt;&amp;lt;ref&amp;gt;{{cite journal|last=Villa|first=S.|coauthors=Salzo, S., Baldassarre, L., and Verri, A.|title=Accelerated and inexact forward-backward algorithms|journal=SIAM J. Optim.|year=2013|volume=23|issue=3|pages=1607–1633|doi=10.1137/110844805}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Practical considerations ==&lt;br /&gt;
&lt;br /&gt;
There have been numerous developments within the past decade in [[convex optimization]] techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.&amp;lt;ref name=structSparse /&amp;gt;&amp;lt;ref name=bach&amp;gt;{{cite journal|last=Bach|first=F.|coauthors=Jenatton, R., Mairal, J., and Obozinski, Gl.|title=Optimization with sparsity-inducing penalties|journal=Found. &amp;amp; Trends Mach. Learn.|year=2011|volume=4|issue=1|pages=1–106|doi=10.1561/2200000015}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Adaptive step size ===&lt;br /&gt;
&lt;br /&gt;
In the fixed point iteration scheme&lt;br /&gt;
:&amp;lt;math&amp;gt;w^{k+1} = \operatorname{prox}_{\gamma R}\left(w^k-\gamma \nabla F\left(w^k\right)\right),&amp;lt;/math&amp;gt;&lt;br /&gt;
one can allow variable step size &amp;lt;math&amp;gt;\gamma_k&amp;lt;/math&amp;gt; instead of a constant &amp;lt;math&amp;gt;\gamma&amp;lt;/math&amp;gt;. Numerous adaptive step size schemes have been proposed throughout the literature.&amp;lt;ref name=combettes /&amp;gt;&amp;lt;ref name=bauschke /&amp;gt;&amp;lt;ref&amp;gt;{{cite journal|last=Loris|first=I.|coauthors=Bertero, M., De Mol, C., Zanella, R., and Zanni, L.|title=Accelerating gradient projection methods for &amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt;-constrained signal recovery by steplength selection rules|journal=Applied &amp;amp; Comp. Harmonic Analysis|volume=27|issue=2|pages=247–254|year=2009}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{cite journal|last=Wright|first=S.J.|coauthors=Nowak, R.D., and Figueiredo, M.A.T.|title=Sparse reconstruction by separable approximation|journal=IEEE Trans. Image Process.|year=2009|volume=57|issue=7|pages=2479–2493|doi=10.1109/TSP.2009.2016892}}&amp;lt;/ref&amp;gt; Applications of these schemes&amp;lt;ref name=structSparse /&amp;gt;&amp;lt;ref&amp;gt;{{cite journal|last=Loris|first=Ignace|title=On the performance of algorithms for the minimization of &amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt;-penalized functionals|journal=Inverse Problems|year=2009|volume=25|issue=3|doi=10.1088/0266-5611/25/3/035008}}&amp;lt;/ref&amp;gt;  suggest that these can offer substantial improvement in number of iterations required for fixed point convergence.&lt;br /&gt;
&lt;br /&gt;
=== Elastic net (mixed norm regularization) ===&lt;br /&gt;
&lt;br /&gt;
[[Elastic net regularization]] offers an alternative to pure &amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt; regularization. The problem of lasso (&amp;lt;math&amp;gt;\ell_1&amp;lt;/math&amp;gt;) regularization involves the penalty term &amp;lt;math&amp;gt;R(w) = \|w\|_1&amp;lt;/math&amp;gt;, which is not strictly convex. Hence, solutions to &amp;lt;math&amp;gt;\min_w F(w) + R(w),&amp;lt;/math&amp;gt; where &amp;lt;math&amp;gt;F&amp;lt;/math&amp;gt; is some empirical loss function, need not be unique. This is often avoided by the inclusion of an additional strictly convex term, such as an &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt; norm regularization penalty. For example, one can consider the problem&lt;br /&gt;
:&amp;lt;math&amp;gt;\min_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i- \langle w,x_i\rangle)^2+ \lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2\right), &amp;lt;/math&amp;gt;&lt;br /&gt;
where &amp;lt;math&amp;gt;x_i\in \mathbb{R}^d\text{ and } y_i\in\mathbb{R}.&amp;lt;/math&amp;gt;&lt;br /&gt;
For &amp;lt;math&amp;gt;0&amp;lt;\mu\leq 1&amp;lt;/math&amp;gt; the penalty term &amp;lt;math&amp;gt;\lambda \left((1-\mu)\|w\|_1+\mu \|w\|_2\right)&amp;lt;/math&amp;gt; is now strictly convex, and hence the minimization problem now admits a unique solution. It has been observed that for sufficiently small &amp;lt;math&amp;gt;\mu &amp;gt; 0&amp;lt;/math&amp;gt;, the additional penalty term &amp;lt;math&amp;gt;\mu \|w\|_2&amp;lt;/math&amp;gt; acts as a preconditioner and can substantially improve convergence while not adversely affecting the sparsity of solutions.&amp;lt;ref name=structSparse /&amp;gt;&amp;lt;ref name=deMolElasticNet&amp;gt;{{cite journal|last=De Mol|first=C.|coauthors=De Vito, E., and Rosasco, L.|title=Elastic-net regularization in learning theory|journal=J. Complexity|year=2009|volume=25|issue=2|pages=201–230|doi=10.1016/j.jco.2009.01.002}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Exploiting group structure ==&lt;br /&gt;
&lt;br /&gt;
Proximal gradient methods provide a general framework which is applicable to a wide variety of problems in [[statistical learning theory]]. Certain problems in learning can often involve data which has additional structure that is known &amp;#039;&amp;#039; a priori&amp;#039;&amp;#039;. In the past several years there have been new developments which incorporate information about group structure to provide methods which are tailored to different applications. Here we survey a few such methods.&lt;br /&gt;
&lt;br /&gt;
=== Group lasso ===&lt;br /&gt;
&lt;br /&gt;
Group lasso is a generalization of the [[#Lasso regularization|lasso method]] when features are grouped into disjoint blocks.&amp;lt;ref name=groupLasso&amp;gt;{{cite journal|last=Yuan|first=M.|coauthors=Lin, Y.|title=Model selection and estimation in regression with grouped variables|journal=J. R. Stat. Soc. B|year=2006|volume=68|issue=1|pages=49–67|doi=10.1111/j.1467-9868.2005.00532.x}}&amp;lt;/ref&amp;gt; Suppose the features are grouped into blocks &amp;lt;math&amp;gt;\{w_1,\ldots,w_G\}&amp;lt;/math&amp;gt;. Here we take as a regularization penalty&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;R(w) =\sum_{g=1}^G \|w_g\|_2,&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
which is the sum of the &amp;lt;math&amp;gt;\ell_2&amp;lt;/math&amp;gt; norm on corresponding feature vectors for the different groups. A similar proximity operator analysis as above can be used to compute the proximity operator for this penalty. Where the lasso penalty has a proximity operator which is soft thresholding on each individual component, the proximity operator for the group lasso is soft thresholding on each group. For the group &amp;lt;math&amp;gt;w_g&amp;lt;/math&amp;gt; we have that proximity operator of &amp;lt;math&amp;gt;\lambda\gamma\left(\sum_{g=1}^G \|w_g\|_2\right) &amp;lt;/math&amp;gt; is given by&lt;br /&gt;
&lt;br /&gt;
:&amp;lt;math&amp;gt;\widetilde{S}_{\lambda\gamma }(w_g) =  \begin{cases}&lt;br /&gt;
w_g-\lambda\gamma \frac{w_g}{\|w_g\|_2}, &amp;amp; \|w_g\|_2&amp;gt;\lambda\gamma \\&lt;br /&gt;
0, &amp;amp; \|w_g\|_2\leq \lambda\gamma&lt;br /&gt;
\end{cases}&amp;lt;/math&amp;gt;&lt;br /&gt;
&lt;br /&gt;
where &amp;lt;math&amp;gt;w_g&amp;lt;/math&amp;gt; is the &amp;lt;math&amp;gt;g&amp;lt;/math&amp;gt;th group.&lt;br /&gt;
&lt;br /&gt;
In contrast to lasso, the derivation of the proximity operator for group lasso relies on the [[#Moreau decomposition|Moreau decomposition]]. Here the proximity operator of the conjugate of the group lasso penalty becomes a projection onto the [[Ball (mathematics)|ball]] of a [[dual norm]].&amp;lt;ref name=structSparse /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
=== Other group structures ===&lt;br /&gt;
&lt;br /&gt;
In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure.  Such generalizations of group lasso have been considered in a variety of contexts.&amp;lt;ref&amp;gt;{{cite journal|last=Chen|first=X.|coauthors=Lin, Q., Kim, S., Carbonell, J.G., and Xing, E.P.|title=Smoothing proximal gradient method for general structured sparse regression|journal=Ann. Appl. Stat.|year=2012|volume=6|issue=2|pages=719–752|doi=10.1214/11-AOAS514}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{cite journal|last=Mosci|first=S.|coauthors=Villa, S., Verri, A., and Rosasco, L.|title=A primal-dual algorithm for group sparse regularization with overlapping groups|journal=NIPS|year=2010|volume=23|pages=2604–2612}}&amp;lt;/ref&amp;gt;&amp;lt;ref name=nest&amp;gt;{{cite journal|last=Jenatton|first=R.|coauthors=Audibert, J.-Y., and Bach, F.|title=Structured variable selection with sparsity-inducing norms|journal=J. Mach. Learn. Res.|year=2011|volume=12|pages=2777–2824}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{cite journal|last=Zhao|first=P.|coauthors=Rocha, G., and Yu, B.|title=The composite absolute penalties family for grouped and hierarchical variable selection|journal=Ann. Statist.|year=2009|volume=37|issue=6A|pages=3468–3497|doi=10.1214/07-AOS584}}&amp;lt;/ref&amp;gt; For overlapping groups one common approach is known as &amp;#039;&amp;#039;latent group lasso&amp;#039;&amp;#039; which introduces latent variables to account for overlap.&amp;lt;ref&amp;gt;{{cite journal|last=Obozinski|first=G.|coauthors=Laurent, J., and Vert, J.-P.|title=Group lasso with overlaps: the latent group lasso approach|journal=INRIA Technical Report|year=2011|url=http://hal.inria.fr/inria-00628498/en/}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{cite journal|last=Villa|first=S.|coauthors=Rosasco, L., Mosci, S., and Verri, A.|title=Proximal methods for the latent group lasso penalty|journal=preprint|year=2012|url=http://arxiv.org/abs/1209.0368}}&amp;lt;/ref&amp;gt; Nested group structures are studied in &amp;#039;&amp;#039;hierarchical structure prediction&amp;#039;&amp;#039; and with [[directed acyclic graph]]s.&amp;lt;ref name=nest /&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== See also ==&lt;br /&gt;
&lt;br /&gt;
* [[Proximal Gradient Methods|Proximal gradient methods]]&lt;br /&gt;
* * [[Statistical learning theory]]&lt;br /&gt;
* [[Regularization (mathematics)#Regularization in statistics and machine learning|Regularization]]&lt;br /&gt;
* [[Convex analysis]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
&lt;br /&gt;
{{reflist}}&lt;br /&gt;
&lt;br /&gt;
[[Category:First order methods|First order methods]]&lt;br /&gt;
[[Category:Convex optimization]]&lt;br /&gt;
[[Category:Machine learning]]&lt;/div&gt;</summary>
		<author><name>en&gt;BeyondNormality</name></author>
	</entry>
</feed>