Singular spectrum analysis: Difference between revisions

{{Regression bar}}
'''Non-linear least squares''' is the form of [[least squares]] analysis used to fit a set of ''m'' observations with a model that is non-linear in ''n'' unknown parameters (''m'' > ''n''). It is used in some forms of [[non-linear regression]]. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to [[linear least squares (mathematics)|linear least squares]], but also some [[least squares#Differences between linear and non-linear least squares|significant differences]].


== Theory ==
Consider a set of <math>m</math> data points, <math>(x_1, y_1), (x_2, y_2),\dots,(x_m, y_m),</math> and a curve (model function) <math>y=f(x, \boldsymbol \beta),</math> that in addition to the variable <math>x</math> also depends on <math>n</math> parameters, <math>\boldsymbol \beta = (\beta_1, \beta_2, \dots, \beta_n),</math> with <math>m\ge n.</math> It is desired to find the vector  <math>\boldsymbol \beta</math> of parameters such that the curve fits best the given data in the least squares sense, that is, the sum of squares
:<math>S=\sum_{i=1}^{m}r_i^2</math>
is minimized, where the [[errors and residuals in statistics|residuals]] (errors) ''r<sub>i</sub>'' are given by
:<math>r_i= y_i - f(x_i, \boldsymbol \beta) </math>
for <math>i=1, 2,\dots, m.</math>


The [[Maxima and minima|minimum]] value of ''S'' occurs when the [[gradient]] is zero. Since the model contains ''n'' parameters there are ''n'' gradient equations:


:<math>\frac{\partial S}{\partial \beta_j}=2\sum_i r_i\frac{\partial r_i}{\partial \beta_j}=0 \quad (j=1,\ldots,n).</math>


In a non-linear system, the derivatives <math>\frac{\partial r_i}{\partial \beta_j}</math> are functions of both the independent variable and the parameters, so these gradient equations do not, in general, have a closed-form solution. Instead, initial values must be chosen for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation,


:<math>\beta_j \approx \beta_j^{k+1} =\beta^k_j+\Delta \beta_j. \, </math>
 
Here, ''k'' is an iteration number and the vector of increments, <math>\Delta \boldsymbol \beta\,</math>, is known as the shift vector. At each iteration the model is linearized by approximation to a first-order [[Taylor series]] expansion about <math> \boldsymbol \beta^k\!</math>:
:<math>f(x_i,\boldsymbol \beta)\approx f(x_i,\boldsymbol \beta^k) +\sum_j \frac{\partial f(x_i,\boldsymbol \beta^k)}{\partial \beta_j} \left(\beta_j -\beta^{k}_j \right) \approx f(x_i,\boldsymbol \beta^k) +\sum_j J_{ij} \,\Delta\beta_j. </math>
The [[Jacobian]], '''J''', is a function of constants, the independent variable ''and'' the parameters, so it changes from one iteration to the next. Thus, in terms of the linearized model, <math>\frac{\partial r_i}{\partial \beta_j}=-J_{ij}</math> and the residuals are given by
 
:<math>r_i=\Delta y_i- \sum_{s=1}^{n} J_{is}\ \Delta\beta_s; \ \Delta y_i=y_i- f(x_i,\boldsymbol \beta^k).</math>
 
Substituting these expressions into the gradient equations, they become
 
:<math>-2\sum_{i=1}^{m}J_{ij} \left( \Delta y_i-\sum_{s=1}^{n} J_{is}\ \Delta \beta_s \right)=0</math>
 
which, on rearrangement, become ''n'' simultaneous linear equations, the '''normal equations'''
 
:<math>\sum_{i=1}^{m}\sum_{s=1}^{n} J_{ij}J_{is}\ \Delta \beta_s=\sum_{i=1}^{m} J_{ij}\ \Delta y_i \qquad (j=1,\dots,n).\,</math>
 
The normal equations are written in matrix notation as
 
:<math>\mathbf{\left(J^TJ\right)\Delta \boldsymbol \beta=J^T\ \Delta y}.</math>
 
When the observations are not equally reliable, a weighted sum of squares may be minimized,
 
:<math>S=\sum_{i=1}^m W_{ii}r_i^2.</math>
 
Each element of the [[diagonal matrix|diagonal]] weight matrix '''W''' should, ideally, be equal to the reciprocal of the error [[variance]] of the measurement.<ref>This implies that the observations are uncorrelated. If the observations are [[correlated]], the expression
 
:<math>S=\sum_k \sum_j r_k W_{kj} r_j\,</math>
 
applies. In this case the weight matrix should ideally be equal to the inverse of the error [[variance-covariance matrix]] of the observations.</ref>  
The normal equations are then
 
:<math>\mathbf{\left(J^TWJ\right)\Delta \boldsymbol \beta=J^TW\ \Delta y}.</math>
 
These equations form the basis for the [[Gauss–Newton algorithm]] for a non-linear least squares problem.
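
The iteration described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not a reference implementation: the function name <code>gauss_newton</code>, the exponential test model and the fixed iteration count are all assumptions made for the example, and no protection against divergence is included.

```python
import numpy as np

def gauss_newton(f, jac, x, y, beta0, w=None, n_iter=20):
    """Plain Gauss-Newton iteration on the (weighted) normal equations."""
    beta = np.asarray(beta0, dtype=float)
    W = np.eye(len(x)) if w is None else np.diag(w)
    for _ in range(n_iter):
        dy = y - f(x, beta)      # Delta y_i = y_i - f(x_i, beta^k)
        J = jac(x, beta)         # J_ij = d f(x_i, beta) / d beta_j at beta^k
        # Normal equations: (J^T W J) dbeta = J^T W dy
        dbeta = np.linalg.solve(J.T @ W @ J, J.T @ W @ dy)
        beta = beta + dbeta      # beta^{k+1} = beta^k + dbeta
    return beta

# Hypothetical test model: f(x, beta) = beta_1 * exp(beta_2 * x)
def model(x, beta):
    return beta[0] * np.exp(beta[1] * x)

def model_jac(x, beta):
    e = np.exp(beta[1] * x)
    return np.column_stack([e, beta[0] * x * e])

x = np.linspace(0.0, 1.0, 10)
y = model(x, np.array([2.0, -1.0]))   # noise-free synthetic data
beta_hat = gauss_newton(model, model_jac, x, y, beta0=[1.5, -0.5])
```

With noise-free data and a starting point this close to the true values, the iteration recovers the generating parameters; with real data a convergence test and a safeguard against divergence would be needed.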
 
<!-- === Differences between linear and non-linear least squares ===
*NLLSQ (Non-linear least squares) requires initial estimates of the parameters, LLSQ (linear least squares) does not.
*NLLSQ requires that the Jacobian be calculated. Analytical expressions for the partial derivatives can be complicated. If analytical expressions are impossible to obtain the partial derivatives must be calculated by numerical approximation.
*In NLLSQ divergence is a common phenomenon whereas in LLSQ it is quite rare. Divergence occurs when the sum of squares increases from one iteration to the next. It is caused by the inadequacy of the approximation that the Taylor series can be truncated at the first term.
*NLLSQ is an iterative process, LLSQ is not. The iterative process has to be terminated when a convergence criterion is satisfied.
*In LLSQ the solution is unique, but in NLLSQ there may be multiple minima in the sum of squares.
*In NLLSQ estimates of the parameter errors are [[biased]], but in LLSQ they are not.
These differences must be considered whenever the solution to a non-linear least squares problem is being sought. -->
=== Geometrical interpretation ===
In linear least squares the [[Optimization (mathematics)|objective function]], ''S'', is a [[quadratic function#Bivariate quadratic function|quadratic function]] of the parameters.
:<math>S=\sum_i W_{ii} \left(y_i-\sum_jX_{ij}\beta_j \right)^2</math>
When there is only one parameter the graph of ''S'' with respect to that parameter will be a [[parabola]]. With two or more parameters the contours of ''S'' with respect to any pair of parameters will be concentric [[ellipse]]s (assuming that the normal equations matrix <math>\mathbf{X^TWX}</math> is [[positive-definite matrix|positive definite]]). The minimum parameter values are to be found at the centre of the ellipses. The geometry of the general objective function can be described as an elliptic paraboloid.
In NLLSQ the objective function is quadratic with respect to the parameters only in a region close to its minimum value, where the truncated Taylor series is a good approximation to the model.
:<math>S \approx\sum_i W_{ii} \left(y_i-\sum_j J_{ij}\beta_j \right)^2</math>
The more the parameter values differ from their optimal values, the more the contours deviate from elliptical shape. A consequence of this is that initial parameter estimates should be as close as practicable to their (unknown!) optimal values. It also explains how divergence can come about as the Gauss–Newton algorithm is convergent only when the objective function is approximately quadratic in the parameters.
 
== Computation ==
 
=== Initial parameter estimates ===
Problems of ill-conditioning and divergence can be ameliorated by finding initial parameter estimates that are near to the optimal values. A good way to do this is by [[computer simulation]]. Both the observed and calculated data are displayed on a screen. The parameters of the model are adjusted by hand until the agreement between observed and calculated data is reasonably good. Although this will be a subjective judgment, it is sufficient to find a good starting point for the non-linear refinement.
 
=== Solution ===
Any of the methods described [[#Algorithms|below]] can be applied to find a solution.
 
=== Convergence criteria ===
The common-sense criterion for convergence is that the sum of squares does not decrease from one iteration to the next. However, this criterion is often difficult to implement in practice, for various reasons. A useful convergence criterion is
:<math>\left|\frac{S^k-S^{k+1}}{S^k}\right|<0.0001.</math>
The value 0.0001 is somewhat arbitrary and may need to be changed. In particular it may need to be increased when experimental errors are large. An alternative criterion is
 
:<math>\left|\frac{\Delta \beta_j}{\beta_j}\right|<0.001, \qquad j=1,\dots,n.</math>
 
Again, the numerical value is somewhat arbitrary; 0.001 is equivalent to specifying that each parameter should be refined to 0.1% precision. This is reasonable when it is less than the largest relative standard deviation on the parameters.
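
The two criteria can be checked together as in the sketch below. The function name <code>converged</code> and its argument names are hypothetical; the tolerances default to the values quoted above. The sketch assumes that <math>S^k</math> and every <math>\beta_j</math> are non-zero, since both tests divide by them.

```python
import numpy as np

def converged(S_old, S_new, beta, dbeta, tol_S=1e-4, tol_beta=1e-3):
    """Return (relative-S test, relative-shift test) as two booleans."""
    rel_S = abs((S_old - S_new) / S_old)                      # |(S^k - S^{k+1}) / S^k|
    rel_beta = np.abs(np.asarray(dbeta) / np.asarray(beta))   # |dbeta_j / beta_j|
    return bool(rel_S < tol_S), bool(np.all(rel_beta < tol_beta))

# The sum of squares has settled, but the second parameter is still moving by 0.5%.
s_ok, b_ok = converged(10.0, 9.9995, beta=[2.0, -1.0], dbeta=[1e-4, 5e-3])
```

In this example the first test passes and the second does not, which shows that the two criteria are not interchangeable.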
 
===Calculation of the Jacobian by numerical approximation===
{{main|Numerical differentiation}}
There are models for which it is either very difficult or even impossible to derive analytical expressions for the elements of the Jacobian. Then, the numerical approximation
:<math>\frac{\partial f(x_i, \boldsymbol \beta)}{\partial \beta_j} \approx \frac{\delta f(x_i, \boldsymbol \beta)}{\delta \beta_j}</math>
is obtained by calculation of <math>f(x_i, \boldsymbol \beta)\,</math> for <math>\beta_j\,</math> and <math>\beta_j+\delta \beta_j\,</math>. The size of the increment, <math>\delta \beta_j\,</math>, should be chosen so that the numerical derivative is subject neither to approximation error, from the increment being too large, nor to [[round-off]] error, from its being too small.
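
A forward-difference version of this approximation might look as follows. The function name <code>numerical_jacobian</code> and the relative step of <math>10^{-6}</math> are assumptions for illustration; a suitable step depends on the model and on the floating-point precision in use.

```python
import numpy as np

def numerical_jacobian(f, x, beta, rel_step=1e-6):
    """Forward-difference estimate of J_ij = d f(x_i, beta) / d beta_j."""
    beta = np.asarray(beta, dtype=float)
    f0 = f(x, beta)
    J = np.empty((len(x), len(beta)))
    for j in range(len(beta)):
        # Increment scaled to the parameter; absolute fallback when beta_j is zero.
        h = rel_step * abs(beta[j]) if beta[j] != 0 else rel_step
        shifted = beta.copy()
        shifted[j] += h
        J[:, j] = (f(x, shifted) - f0) / h
    return J

# Hypothetical model used only to compare against an analytical Jacobian.
def model(x, beta):
    return beta[0] * np.exp(beta[1] * x)

x = np.linspace(0.0, 1.0, 5)
beta = np.array([2.0, -1.0])
J_num = numerical_jacobian(model, x, beta)
e = np.exp(beta[1] * x)
J_exact = np.column_stack([e, beta[0] * x * e])
```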
 
=== Parameter errors, confidence limits, residuals etc. ===
Some information is given in [[linear least squares (mathematics)#Parameter errors, correlation and confidence limits|the section]] on the [[linear least squares (mathematics)|linear least squares]] page.
 
=== Multiple minima ===
Multiple minima can occur in a variety of circumstances, some of which are:
*A parameter is raised to a power of two or more. For example, when fitting data to a [[Lorentzian]] curve
:: <math>f(x_i, \boldsymbol \beta)=\frac{\alpha}{1+\left(\frac{\gamma-x_i}{\beta} \right)^2}</math>
where <math>\alpha</math> is the height, <math>\gamma</math> is the position and <math>\beta</math> is the half-width at half height, there are two solutions for the half-width, <math>\hat \beta</math> and <math>-\hat \beta</math> which give the same optimal value for the objective function.
*Two parameters can be interchanged without changing the value of the model. A simple example is when the model contains the product of two parameters, since <math>\alpha \beta</math> will give the same value as <math>\beta \alpha</math>.
*A parameter is in a trigonometric function, such as <math>\sin \beta\,</math>, which has identical values at <math>\hat \beta +2n \pi</math>. See [[Levenberg–Marquardt algorithm#Example|Levenberg&ndash;Marquardt algorithm]] for an example.
Not all multiple minima have equal values of the objective function. False minima, also known as local minima, occur when the objective function value is greater than its value at the so-called global minimum. To be certain that the minimum found is the global minimum, the refinement should be started with widely differing initial values of the parameters. When the same minimum is found regardless of starting point, it is likely to be the global minimum.
 
When multiple minima exist there is an important consequence: the objective function will have a maximum value somewhere between two minima. The normal equations matrix is not positive definite at a maximum in the objective function, as the gradient is zero and no unique direction of descent exists. Refinement from a point (a set of parameter values) close to a maximum will be ill-conditioned, so such a point should be avoided as a starting point. For example, when fitting a Lorentzian the normal equations matrix is not positive definite when the half-width of the band is zero.<ref>In the absence of [[round-off error]] and of experimental error in the independent variable, the normal equations matrix would be singular.</ref>
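
The sign ambiguity in the Lorentzian half-width can be verified numerically. In the sketch below the function names, the data grid and the parameter values are all made up for the illustration.

```python
import numpy as np

def lorentzian(x, alpha, gamma, beta):
    """Height alpha, position gamma, half-width at half height beta."""
    return alpha / (1.0 + ((gamma - x) / beta) ** 2)

def sum_of_squares(x, y, alpha, gamma, beta):
    return np.sum((y - lorentzian(x, alpha, gamma, beta)) ** 2)

x = np.linspace(-3.0, 3.0, 25)
y = lorentzian(x, 2.0, 0.5, 0.8)                 # synthetic data
S_pos = sum_of_squares(x, y, 2.0, 0.5, 0.7)      # trial half-width +0.7
S_neg = sum_of_squares(x, y, 2.0, 0.5, -0.7)     # trial half-width -0.7
```

Because the half-width enters only through its square, <code>S_pos</code> and <code>S_neg</code> agree for any trial value, so every minimum at a positive half-width has a mirror image at the negative one.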
 
=== Transformation to a linear model ===
{{main|Nonlinear regression#Transformation}}
{{move section portions|Nonlinear regression#Transformation|date=August 2013}}
A non-linear model can sometimes be transformed into a linear one. For example, when the model is a simple exponential function,
:<math>f(x_i,\boldsymbol \beta)= \alpha e^{\beta x_i}</math>
it can be transformed into a linear model by taking logarithms.
:<math>\log f(x_i,\boldsymbol \beta)=\log \alpha + \beta x_i</math>
Graphically this corresponds to working on a [[semi-log plot]]. The sum of squares becomes
:<math>S=\sum_i (\log y_i-\log \alpha - \beta x_i)^2.\!</math>
This procedure should be avoided unless the errors are multiplicative and [[log normal distribution|log-normally distributed]] because it can give misleading results. This comes from the fact that whatever the experimental errors on '''y''' might be, the errors on '''log y''' are different. Therefore, when the transformed sum of squares is minimized different results will be obtained both for the parameter values and their calculated standard deviations. However, with multiplicative errors that are log-normally distributed, this procedure gives unbiased and consistent parameter estimates.
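
For noise-free data the transformed fit recovers the parameters exactly, as the following sketch shows. It uses <code>numpy.polyfit</code> for the straight-line fit; the data and parameter values are invented for the example. With noisy data the same code minimizes the transformed sum of squares, not the original one, so the caveats above apply.

```python
import numpy as np

alpha_true, beta_true = 3.0, 0.7
x = np.linspace(0.0, 2.0, 8)
y = alpha_true * np.exp(beta_true * x)      # noise-free synthetic data

# Straight-line fit of log y against x: log y = log(alpha) + beta * x
slope, intercept = np.polyfit(x, np.log(y), 1)
alpha_hat = np.exp(intercept)
beta_hat = slope
```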
 
Another example is furnished by [[Michaelis&ndash;Menten kinetics#Equation optimization|Michaelis&ndash;Menten kinetics]], used to determine two parameters <math>V_{\max}</math> and <math>K_m</math>: 
:<math> v = \frac{V_{\max}[S]}{K_{m} + [S]}</math>.
The [[Lineweaver–Burk plot]]
:<math> \frac{1}{v} = \frac{1}{V_\max} + \frac{K_m}{V_{\max}[S]}</math>
of <math>\frac{1}{v}</math> against <math>\frac{1}{[S]}</math> is linear in the parameters <math>\frac{1}{V_\max}</math> and <math>\frac{K_m}{V_\max}</math>, but very sensitive to data error and strongly biased toward fitting the data in a particular range of the independent variable <math>[S]</math>.
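
The double-reciprocal fit can be sketched in the same way. The concentrations and parameter values below are hypothetical, and the data are noise-free, so the sensitivity to data error mentioned above does not show up here.

```python
import numpy as np

V_max_true, K_m_true = 10.0, 2.0
S_conc = np.array([0.5, 1.0, 2.0, 5.0, 10.0])       # substrate concentrations [S]
v = V_max_true * S_conc / (K_m_true + S_conc)       # Michaelis-Menten rates

# Lineweaver-Burk: 1/v = 1/V_max + (K_m / V_max) * (1/[S])
slope, intercept = np.polyfit(1.0 / S_conc, 1.0 / v, 1)
V_max_hat = 1.0 / intercept
K_m_hat = slope * V_max_hat
```

Taking reciprocals stretches the low-concentration points far along the horizontal axis, which is one way to see why noisy measurements there tend to dominate the fitted line.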
 
== Solution ==
{{split section|Non-linear least squares algorithms|date=August 2013}}
 
=== Gauss–Newton method ===
{{main|Gauss–Newton algorithm}}
The normal equations
:<math>\mathbf{\left( J^TWJ \right)\Delta \boldsymbol\beta=\left( J^TW \right) \Delta y}</math>
may be solved for <math>\Delta \boldsymbol\beta</math> by [[Cholesky decomposition]], as described in [[linear least squares (mathematics)#Computation|linear least squares]]. The parameters are updated iteratively
:<math>\boldsymbol\beta^{k+1}=\boldsymbol\beta^k+\Delta \boldsymbol\beta</math>
where ''k'' is an iteration number. While this method may be adequate for simple models, it will fail if divergence occurs. Therefore, protection against divergence is essential.
 
==== Shift-cutting ====
If divergence occurs, a simple expedient is to reduce the length of the shift vector, <math>\Delta \boldsymbol\beta</math>, by a fraction, ''f'':
:<math>\boldsymbol\beta^{k+1}=\boldsymbol\beta^k+f\ \Delta \boldsymbol\beta.</math>
For example, the length of the shift vector may be successively halved until the new value of the objective function is less than its value at the last iteration. The fraction ''f'' could be optimized by a [[line search]].<ref name=BDS>M.J. Box, D. Davies and W.H. Swann, Non-Linear optimisation Techniques, Oliver & Boyd, 1969</ref>  As each trial value of ''f'' requires the objective function to be re-calculated, it is not worth optimizing its value too stringently.
 
When using shift-cutting, the direction of the shift vector remains unchanged. This limits the applicability of the method to situations where the direction of the shift vector is not very different from what it would be if the objective function were approximately quadratic in the parameters, <math>\boldsymbol\beta^k.</math>
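
Successive halving of the shift vector can be sketched as below. The function name <code>shift_cut</code>, the limit of ten halvings and the choice to return the old parameters when no acceptable fraction is found are assumptions of this example.

```python
import numpy as np

def shift_cut(S, beta, dbeta, max_halvings=10):
    """Halve the fraction f until S(beta + f * dbeta) < S(beta)."""
    S0 = S(beta)
    f = 1.0
    for _ in range(max_halvings):
        trial = beta + f * dbeta
        if S(trial) < S0:
            return trial, f
        f *= 0.5
    return beta, 0.0     # no acceptable fraction found; keep the old parameters

# One-parameter example with an overshooting shift: S(beta) = beta^2.
S = lambda b: float(np.sum(b ** 2))
new_beta, f_used = shift_cut(S, np.array([1.0]), np.array([-4.0]))
```

Here the full shift and the half shift both fail to reduce ''S'', and the quarter shift lands on the minimum, so two halvings are needed.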
 
==== Marquardt parameter ====
{{main|Levenberg–Marquardt algorithm}}
If divergence occurs and the direction of the shift vector is so far from its "ideal" direction that shift-cutting is not very effective, that is, the fraction ''f'' required to avoid divergence is very small, the direction must be changed. This can be achieved by using the [[Levenberg–Marquardt algorithm|Marquardt]] parameter.<ref>This technique was proposed independently by Levenberg (1944), Girard (1958), Wynne (1959), Morrison (1960) and Marquardt (1963). Marquardt's name alone is used for it in much of the scientific literature.</ref> In this method the normal equations are modified
:<math>\mathbf{\left( J^TWJ +\lambda I \right)\Delta \boldsymbol \beta=\left( J^TW \right) \Delta y}</math>
where <math>\lambda</math> is the Marquardt parameter and '''I''' is an identity matrix. Increasing the value of <math>\lambda</math> has the effect of changing both the direction and the length of the shift vector. The shift vector is rotated towards the direction of [[steepest descent]]
:when <math>\lambda \mathbf{I\gg{}J^TWJ}, \  \mathbf{\Delta \boldsymbol \beta} \approx 1/\lambda \mathbf{J^TW\  \Delta y}.</math>
<math>\mathbf{J^TW\  \Delta y}</math> is the steepest descent vector. So, when <math>\lambda</math> becomes very large, the shift vector becomes a small fraction of the steepest descent vector.
 
Various strategies have been proposed for the determination of the Marquardt parameter. As with shift-cutting, it is wasteful to optimize this parameter too stringently. Rather, once a value has been found that brings about a reduction in the value of the objective function, that value of the parameter is carried to the next iteration, reduced if possible, or increased if need be. When reducing the value of the Marquardt parameter, there is a cut-off value below which it is safe to set it to zero, that is, to continue with the unmodified Gauss–Newton method. The cut-off value may be set equal to the smallest singular value of the Jacobian.<ref name=LH/> A bound for this value is given by <math>1/\mbox{trace} \mathbf{\left(J^TWJ \right)^{-1}}</math>.<ref>R. Fletcher, UKAEA Report AERE-R 6799, H.M. Stationery Office, 1971</ref>
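
The effect of the Marquardt parameter on the shift vector can be illustrated with a small sketch. The function name <code>marquardt_shift</code> and the example matrices are hypothetical, and no strategy for updating <math>\lambda</math> between iterations is shown.

```python
import numpy as np

def marquardt_shift(J, dy, lam, W=None):
    """Solve (J^T W J + lam * I) dbeta = J^T W dy for the shift vector."""
    W = np.eye(J.shape[0]) if W is None else W
    A = J.T @ W @ J
    g = J.T @ W @ dy          # steepest descent vector
    return np.linalg.solve(A + lam * np.eye(A.shape[0]), g)

J = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
dy = np.array([1.0, 2.0, 3.0])
g = J.T @ dy

shift_gn = marquardt_shift(J, dy, 0.0)      # lam = 0: unmodified Gauss-Newton shift
shift_big = marquardt_shift(J, dy, 1e6)     # large lam: approximately g / lam
```

With <math>\lambda = 0</math> the result is the Gauss–Newton shift, while for very large <math>\lambda</math> it approaches a small multiple of the steepest descent vector, as stated above.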
 
=== QR decomposition ===
The minimum in the sum of squares can be found by a method that does not involve forming the normal equations. The residuals with the linearized model can be written as
:<math>\mathbf{r=\Delta y-J\ \Delta\boldsymbol\beta}.</math>
The Jacobian is subjected to an orthogonal decomposition; the [[QR decomposition]] will serve to illustrate the process.
 
:<math>\mathbf{J=QR}</math>
 
where '''Q''' is an [[Orthogonal matrix|orthogonal]] <math>m \times m</math> matrix and '''R''' is an <math>m \times n</math> matrix which is [[block matrix|partitioned]] into an <math>n \times n</math> block, <math>\mathbf{R}_n</math>, and an <math>(m-n) \times n</math> zero block. <math>\mathbf{R}_n</math> is upper triangular.
 
:<math>\mathbf{R}= \begin{bmatrix}
\mathbf{R}_n \\
\mathbf{0}\end{bmatrix}</math>
 
The residual vector is left-multiplied by <math>\mathbf Q^T</math>.
 
:<math>\mathbf{Q^Tr=Q^T\ \Delta y -R\ \Delta\boldsymbol\beta}= \begin{bmatrix}
\mathbf{\left(Q^T\ \Delta y -R\ \Delta\boldsymbol\beta \right)}_n \\
\mathbf{\left(Q^T\ \Delta y  \right)}_{m-n}\end{bmatrix}</math>
 
This has no effect on the sum of squares, since <math>S=\mathbf{r^T Q Q^Tr = r^Tr}</math> because '''Q''' is [[orthogonal]].
The lower block does not depend on the shift vector, so the minimum value of ''S'' is attained when the upper block is zero. Therefore the shift vector is found by solving
 
:<math>\mathbf{R_n\ \Delta\boldsymbol\beta =\left(Q^T\ \Delta y \right)_n}. \, </math>
 
These equations are easily solved by back-substitution, as <math>\mathbf{R}_n</math> is upper triangular.
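
The QR route to the shift vector can be sketched with <code>numpy.linalg.qr</code>. The function name <code>qr_shift</code> and the example data are assumptions. For brevity the triangular system is passed to a general solver; a dedicated back-substitution routine would exploit the triangular structure.

```python
import numpy as np

def qr_shift(J, dy):
    """Shift vector from J = QR without forming the normal equations."""
    m, n = J.shape
    Q, R = np.linalg.qr(J, mode="complete")   # Q is m x m, R is m x n
    R_n = R[:n, :]                            # upper-triangular n x n block
    rhs = (Q.T @ dy)[:n]                      # first n elements of Q^T dy
    return np.linalg.solve(R_n, rhs)          # general solver used for simplicity

J = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, -1.0]])
dy = np.array([1.0, 2.0, 3.0, 0.5])
dbeta_qr = qr_shift(J, dy)
```

The result agrees with the solution of the unweighted normal equations for the same <code>J</code> and <code>dy</code>.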
 
=== Singular value decomposition ===
A variant of the method of orthogonal decomposition involves [[singular value decomposition]], in which '''R''' is diagonalized by further orthogonal transformations.
 
:<math>\mathbf{J=U \boldsymbol\Sigma V^T} \, </math>
 
where <math>\mathbf U</math> is orthogonal, <math>\boldsymbol\Sigma </math> is a diagonal matrix of singular values and <math>\mathbf V</math> is the orthogonal matrix of the eigenvectors of <math>\mathbf {J^TJ}</math> or equivalently the right singular vectors of <math>\mathbf{J}</math>. In this case the shift vector is given by
 
:<math>\mathbf{\Delta\boldsymbol\beta=V \boldsymbol\Sigma^{-1}\left( U^T\  \Delta y \right)}_n. \, </math>
 
The relative simplicity of this expression is very useful in theoretical analysis of non-linear least squares. The application of singular value decomposition is discussed in detail in Lawson and Hanson.<ref name=LH>C.L. Lawson and R.J. Hanson, Solving Least Squares Problems, Prentice–Hall, 1974</ref>
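
A sketch of the same calculation with <code>numpy.linalg.svd</code> follows. The reduced ("economy") decomposition is used, so that only the first ''n'' columns of '''U''' appear; the function name <code>svd_shift</code> and the data are hypothetical, and all singular values are assumed to be non-zero.

```python
import numpy as np

def svd_shift(J, dy):
    """Shift vector dbeta = V Sigma^{-1} (U^T dy)_n from the SVD of J."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)   # U: m x n, s: n, Vt: n x n
    return Vt.T @ ((U.T @ dy) / s)                     # assumes no zero singular values

J = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, -1.0]])
dy = np.array([1.0, 2.0, 3.0, 0.5])
dbeta_svd = svd_shift(J, dy)
```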
 
=== Gradient methods ===
There are many examples in the scientific literature where different methods have been used for non-linear data-fitting problems.
 
*Inclusion of second derivatives in the Taylor series expansion of the model function. This is [[Newton's method in optimization]].
:: <math>f(x_i, \boldsymbol \beta)=f^k(x_i, \boldsymbol \beta) +\sum_j J_{ij} \, \Delta \beta_j + \frac{1}{2}\sum_j\sum_k \Delta\beta_j \, \Delta\beta_k \,H_{jk_{(i)}},\ H_{jk_{(i)}}=\frac{\partial^2 f(x_i, \boldsymbol \beta)}{\partial \beta_j \, \partial \beta_k }. </math>
: The matrix '''H''' is known as the [[Hessian matrix]]. Although this model has better convergence properties near to the minimum, it is much worse when the parameters are far from their optimal values. Calculation of the Hessian adds to the complexity of the algorithm. This method is not in general use.
*[[Davidon–Fletcher–Powell formula|Davidon–Fletcher–Powell method]]. This method, a form of quasi-Newton method, is similar to the one above but calculates the Hessian by successive approximation, to avoid having to use analytical expressions for the second derivatives.
*[[Steepest descent]]. Although a reduction in the sum of squares is guaranteed when the shift vector points in the direction of steepest descent, this method often performs poorly. When the parameter values are far from optimal the direction of the steepest descent vector, which is normal (perpendicular) to the contours of the objective function, is very different from the direction of the Gauss–Newton vector. This makes divergence much more likely, especially as the minimum along the direction of steepest descent may correspond to a small fraction of the length of the steepest descent vector. When the contours of the objective function are very eccentric, due to there being high correlation between parameters, the steepest descent iterations, with shift-cutting, follow a slow, zig-zag trajectory towards the minimum.
*[[Conjugate gradient method|Conjugate gradient search]]. This is an improved steepest descent based method with good theoretical convergence properties, although it can fail on finite-precision digital computers even when used on quadratic problems.<ref>M. J. D. Powell, Computer Journal, (1964), '''7''', 155.</ref>
 
=== Direct search methods ===
Direct search methods depend on evaluations of the objective function at a variety of parameter values and do not use derivatives at all. They offer alternatives to the use of numerical derivatives in the Gauss–Newton method and gradient methods.
* Alternating variable search.<ref name=BDS/> Each parameter is varied in turn by adding a fixed or variable increment to it and retaining the value that brings about a reduction in the sum of squares. The method is simple and effective when the parameters are not highly correlated. It has very poor convergence properties, but may be useful for finding initial parameter estimates.
 
*[[Nelder–Mead method|Nelder–Mead (simplex) search]]. A [[simplex]] in this context is a [[polytope]] of ''n''&nbsp;+&nbsp;1 vertices in ''n'' dimensions: a triangle on a plane, a tetrahedron in three-dimensional space and so forth. Each vertex corresponds to a value of the objective function for a particular set of parameters. The shape and size of the simplex are adjusted by varying the parameters in such a way that the value of the objective function at the highest vertex always decreases. Although the sum of squares may initially decrease rapidly, the search can converge to a nonstationary point on quasiconvex problems, as shown by an example of M. J. D. Powell.
 
More detailed descriptions of these and other methods are available in ''[[Numerical Recipes]]'', together with computer code in various languages.
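
As an illustration of a derivative-free approach, the alternating variable search can be sketched as follows. This is one possible variant, not a definitive form of the method: the function name <code>alternating_search</code>, the rule of halving the increment when a full sweep brings no improvement, and the stopping tolerance are all assumptions of the example.

```python
import numpy as np

def alternating_search(S, beta0, step=0.5, shrink=0.5, tol=1e-8, max_sweeps=1000):
    """Vary one parameter at a time, keeping any change that reduces S."""
    beta = np.asarray(beta0, dtype=float).copy()
    best = S(beta)
    for _ in range(max_sweeps):
        improved = False
        for j in range(len(beta)):
            for delta in (step, -step):
                trial = beta.copy()
                trial[j] += delta
                value = S(trial)
                if value < best:
                    beta, best, improved = trial, value, True
                    break               # move on to the next parameter
        if not improved:
            step *= shrink              # no parameter helped: reduce the increment
            if step < tol:
                break
    return beta, best

# Hypothetical objective with uncorrelated parameters and minimum at (1, -2).
S = lambda b: (b[0] - 1.0) ** 2 + (b[1] + 2.0) ** 2
beta_min, S_min = alternating_search(S, [0.0, 0.0])
```

No derivatives are evaluated anywhere in the search; only values of the objective function are compared.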
 
== See also ==
* [[Least squares support vector machine]]
* [[Curve fitting]]
* [[Nonlinear programming]]
* [[Optimization (mathematics)]]
* [[Levenberg&ndash;Marquardt algorithm]]
 
== Notes ==
<references/>
 
== References ==
*C. T. Kelley, ''Iterative Methods for Optimization'', SIAM Frontiers in Applied Mathematics, no 18, 1999, ISBN 0-89871-433-8. [http://www.siam.org/books/textbooks/fr18_book.pdf  Online copy]
* T. Strutz: ''Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond)''. Vieweg+Teubner, ISBN 978-3-8348-1022-9.
 
{{Least Squares and Regression Analysis}}
 
[[Category:Numerical analysis]]
[[Category:Mathematical optimization]]
[[Category:Regression analysis]]
[[Category:Least squares]]

Revision as of 01:31, 23 January 2014

Template:Regression bar Non-linear least squares is the form of least squares analysis used to fit a set of m observations with a model that is non-linear in n unknown parameters (m > n). It is used in some forms of non-linear regression. The basis of the method is to approximate the model by a linear one and to refine the parameters by successive iterations. There are many similarities to linear least squares, but also some significant differences.

Theory

Consider a set of m data points, (x1,y1),(x2,y2),,(xm,ym), and a curve (model function) y=f(x,β), that in addition to the variable x also depends on n parameters, β=(β1,β2,,βn), with mn. It is desired to find the vector β of parameters such that the curve fits best the given data in the least squares sense, that is, the sum of squares

S=i=1mri2

is minimized, where the residuals (errors) ri are given by

ri=yif(xi,β)

for i=1,2,,m.

The minimum value of S occurs when the gradient is zero. Since the model contains n parameters there are n gradient equations:

Sβj=2iririβj=0(j=1,,n).

In a non-linear system, the derivatives riβj are functions of both the independent variable and the parameters, so these gradient equations do not have a closed solution. Instead, initial values must be chosen for the parameters. Then, the parameters are refined iteratively, that is, the values are obtained by successive approximation,

βjβjk+1=βjk+Δβj.

Here, k is an iteration number and the vector of increments, Δβ is known as the shift vector. At each iteration the model is linearized by approximation to a first-order Taylor series expansion about βk

f(xi,β)f(xi,βk)+jf(xi,βk)βj(βjβjk)f(xi,βk)+jJijΔβj.

The Jacobian, J, is a function of constants, the independent variable and the parameters, so it changes from one iteration to the next. Thus, in terms of the linearized model, riβj=Jij and the residuals are given by

ri=Δyis=1nJisΔβs;Δyi=yif(xi,βk).

Substituting these expressions into the gradient equations, they become

2i=1mJij(Δyis=1nJisΔβs)=0

which, on rearrangement, become n simultaneous linear equations, the normal equations

i=1ms=1nJijJisΔβs=i=1mJijΔyi(j=1,,n).

The normal equations are written in matrix notation as

(JTJ)Δβ=JTΔy.

When the observations are not equally reliable, a weighted sum of squares may be minimized,

S=i=1mWiiri2.

Each element of the diagonal weight matrix W should, ideally, be equal to the reciprocal of the error variance of the measurement.[1] The normal equations are then

(JTWJ)Δβ=JTWΔy.

These equations form the basis for the Gauss–Newton algorithm for a non-linear least squares problem.

Geometrical interpretation

In linear least squares the objective function, S, is a quadratic function of the parameters.

S=iWii(yijXijβj)2

When there is only one parameter the graph of S with respect to that parameter will be a parabola. With two or more parameters the contours of S with respect to any pair of parameters will be concentric ellipses (assuming that the normal equations matrix XTWX is positive definite). The minimum parameter values are to be found at the centre of the ellipses. The geometry of the general objective function can be described as paraboloid elliptical. In NLLSQ the objective function is quadratic with respect to the parameters only in a region close to its minimum value, where the truncated Taylor series is a good approximation to the model.

SiWii(yijJijβj)2

The more the parameter values differ from their optimal values, the more the contours deviate from elliptical shape. A consequence of this is that initial parameter estimates should be as close as practicable to their (unknown!) optimal values. It also explains how divergence can come about as the Gauss–Newton algorithm is convergent only when the objective function is approximately quadratic in the parameters.

Computation

Initial parameter estimates

Problems of ill-conditioning and divergence can be ameliorated by finding initial parameter estimates that are near to the optimal values. A good way to do this is by computer simulation. Both the observed and calculated data are displayed on a screen. The parameters of the model are adjusted by hand until the agreement between observed and calculated data is reasonably good. Although this will be a subjective judgment, it is sufficient to find a good starting point for the non-linear refinement.

Solution

Any of the methods described below can be applied to find a solution.

Convergence criteria

The common sense criterion for convergence is that the sum of squares does not decrease from one iteration to the next. However, this criterion is often difficult to implement in practice, for various reasons. A useful convergence criterion is

<math>\left|\frac{S^k - S^{k+1}}{S^k}\right| < 0.0001.</math>

The value 0.0001 is somewhat arbitrary and may need to be changed. In particular it may need to be increased when experimental errors are large. An alternative criterion is

<math>\left|\frac{\Delta\beta_j}{\beta_j}\right| < 0.001, \qquad j = 1, \dots, n.</math>

Again, the numerical value is somewhat arbitrary; 0.001 is equivalent to specifying that each parameter should be refined to 0.1% precision. This is reasonable when it is less than the largest relative standard deviation on the parameters.
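The two criteria might be coded as follows. This is a minimal sketch: the function names are hypothetical, and the handling of a zero sum of squares or a zero parameter value is an assumption of the sketch rather than something the criteria themselves prescribe.

```python
def sum_of_squares_converged(s_old, s_new, tol=1e-4):
    """Test |(S^k - S^{k+1}) / S^k| < tol."""
    if s_old == 0.0:
        return True  # assumption: an exact fit is treated as converged
    return abs((s_old - s_new) / s_old) < tol

def parameters_converged(beta, dbeta, tol=1e-3):
    """Test |dbeta_j / beta_j| < tol for every parameter j.

    Written as |dbeta_j| < tol * |beta_j| to avoid dividing by zero;
    a parameter that is exactly zero therefore never passes the test.
    """
    return all(abs(d) < tol * abs(b) for b, d in zip(beta, dbeta))
```

In practice the tolerances would be chosen with the size of the experimental errors in mind, as discussed above.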

Calculation of the Jacobian by numerical approximation

There are models for which it is either very difficult or even impossible to derive analytical expressions for the elements of the Jacobian. Then, the numerical approximation

<math>\frac{\partial f(x_i, \boldsymbol\beta)}{\partial \beta_j} \approx \frac{\delta f(x_i, \boldsymbol\beta)}{\delta \beta_j}</math>

is obtained by calculation of <math>f(x_i, \boldsymbol\beta)</math> for <math>\beta_j</math> and <math>\beta_j + \delta\beta_j</math>. The size of the increment, <math>\delta\beta_j</math>, should be chosen so that the numerical derivative is subject neither to approximation error by being too large nor to round-off error by being too small.
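A forward-difference approximation of this kind can be sketched as below. The relative increment of 1e-6, the absolute fallback for a zero parameter and the exponential test model are illustrative assumptions; a suitable increment depends on the model and on the floating-point precision in use.

```python
import math

def model(x, beta):
    """Illustrative model a * exp(b * x) (an assumption of this sketch)."""
    return beta[0] * math.exp(beta[1] * x)

def numerical_jacobian(f, xs, beta, rel_step=1e-6):
    """Forward-difference estimate of J[i][j] = df(x_i, beta) / dbeta_j."""
    J = [[0.0] * len(beta) for _ in xs]
    for j in range(len(beta)):
        # Increment scaled to the parameter; absolute fallback when beta_j is zero.
        delta = rel_step * abs(beta[j]) if beta[j] != 0.0 else rel_step
        shifted = list(beta)
        shifted[j] += delta
        for i, x in enumerate(xs):
            J[i][j] = (f(x, shifted) - f(x, beta)) / delta
    return J
```

Each column costs one extra evaluation of the model at every data point, so the total cost is n + 1 model evaluations per data point per iteration.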

Parameter errors, confidence limits, residuals etc.

Some information is given in the section on the linear least squares page.

Multiple minima

Multiple minima can occur in a variety of circumstances some of which are:

  • A parameter is raised to a power of two or more. For example, when fitting data to a Lorentzian curve
<math>f(x_i, \boldsymbol\beta) = \frac{\alpha}{1 + \left(\frac{\gamma - x_i}{\beta}\right)^2}</math>

where α is the height, γ is the position and β is the half-width at half height, there are two solutions for the half-width, <math>\hat\beta</math> and <math>-\hat\beta</math>, which give the same optimal value for the objective function.

  • Two parameters can be interchanged without changing the value of the model. A simple example is when the model contains the product of two parameters, since αβ will give the same value as βα.
  • A parameter is in a trigonometric function, such as <math>\sin\beta</math>, which has identical values at <math>\hat\beta + 2n\pi</math>. See Levenberg–Marquardt algorithm for an example.

Not all multiple minima have equal values of the objective function. False minima, also known as local minima, occur when the objective function value is greater than its value at the so-called global minimum. To be certain that the minimum found is the global minimum, the refinement should be started with widely differing initial values of the parameters. When the same minimum is found regardless of starting point, it is likely to be the global minimum.

When multiple minima exist there is an important consequence: the objective function will have a maximum value somewhere between two minima. The normal equations matrix is not positive definite at a maximum in the objective function, as the gradient is zero and no unique direction of descent exists. A point (a set of parameter values) close to a maximum should be avoided as a starting point, because refinement from it will be ill-conditioned. For example, when fitting a Lorentzian the normal equations matrix is not positive definite when the half-width of the band is zero.[2]
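The Lorentzian case mentioned above can be checked numerically. In this sketch the function names and the data, which are generated from the model itself, are purely illustrative: the sum of squares is unchanged when the sign of the half-width is reversed, so a minimum at one value implies an equal minimum at its negative.

```python
def lorentzian(x, alpha, gamma, beta):
    """Lorentzian of height alpha, position gamma and half-width beta."""
    return alpha / (1.0 + ((gamma - x) / beta) ** 2)

def sum_of_squares(xs, ys, alpha, gamma, beta):
    """Unweighted sum of squared residuals for the Lorentzian model."""
    return sum((y - lorentzian(x, alpha, gamma, beta)) ** 2
               for x, y in zip(xs, ys))

# Synthetic, noise-free data from a hypothetical parameter set (3, 1, 0.5).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [lorentzian(x, 3.0, 1.0, 0.5) for x in xs]

# The half-width enters only through its square, so +0.7 and -0.7 are equivalent.
s_plus = sum_of_squares(xs, ys, 3.0, 1.0, 0.7)
s_minus = sum_of_squares(xs, ys, 3.0, 1.0, -0.7)
```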

Transformation to a linear model

A non-linear model can sometimes be transformed into a linear one. For example, when the model is a simple exponential function,

<math>f(x_i, \boldsymbol\beta) = \alpha e^{\beta x_i}</math>

it can be transformed into a linear model by taking logarithms.

<math>\log f(x_i, \boldsymbol\beta) = \log\alpha + \beta x_i</math>

Graphically this corresponds to working on a semi-log plot. The sum of squares becomes

<math>S = \sum_i \left(\log y_i - \log\alpha - \beta x_i\right)^2.</math>

This procedure should be avoided unless the errors are multiplicative and log-normally distributed because it can give misleading results. This comes from the fact that whatever the experimental errors on y might be, the errors on log y are different. Therefore, when the transformed sum of squares is minimized different results will be obtained both for the parameter values and their calculated standard deviations. However, with multiplicative errors that are log-normally distributed, this procedure gives unbiased and consistent parameter estimates.
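For illustration, the transformed fit reduces to an ordinary straight-line regression of log y on x, as in the sketch below (the function name is hypothetical). It minimizes the transformed sum of squares, so with additive errors on y its estimates will generally differ from those of a direct non-linear refinement.

```python
import math

def fit_exponential_by_logs(xs, ys):
    """Fit y = alpha * exp(beta * x) by linear least squares on (x, log y).

    Requires every y to be positive.
    """
    n = len(xs)
    logs = [math.log(y) for y in ys]
    x_mean = sum(xs) / n
    l_mean = sum(logs) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxl = sum((x - x_mean) * (l - l_mean) for x, l in zip(xs, logs))
    beta = sxl / sxx                          # slope of the semi-log line
    alpha = math.exp(l_mean - beta * x_mean)  # intercept is log(alpha)
    return alpha, beta
```

The result may still serve as a starting point for a subsequent non-linear refinement on the untransformed data.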

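Another example is furnished by Michaelis–Menten kinetics, used to determine two parameters <math>V_{\max}</math> and <math>K_m</math>:

<math>v = \frac{V_{\max}[S]}{K_m + [S]}.</math>

The Lineweaver–Burk plot

<math>\frac{1}{v} = \frac{1}{V_{\max}} + \frac{K_m}{V_{\max}[S]}</math>

of <math>\frac{1}{v}</math> against <math>\frac{1}{[S]}</math> is linear in the parameters <math>\frac{1}{V_{\max}}</math> and <math>\frac{K_m}{V_{\max}}</math>, but very sensitive to data error and strongly biased toward fitting the data in a particular range of the independent variable [S].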
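The double-reciprocal fit can be sketched as a straight-line regression, with the intercept giving 1/Vmax and the slope giving Km/Vmax. The function name is hypothetical, and the sketch inherits the sensitivity to data error described above, since small values of v are inflated by taking reciprocals.

```python
def lineweaver_burk_fit(s_values, v_values):
    """Estimate (V_max, K_m) from a straight-line fit of 1/v against 1/[S]."""
    xs = [1.0 / s for s in s_values]
    ys = [1.0 / v for v in v_values]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # K_m / V_max
    intercept = y_mean - slope * x_mean  # 1 / V_max
    return 1.0 / intercept, slope / intercept
```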

Solution

Gauss–Newton method

The normal equations

<math>\left(\mathbf{J}^\mathsf{T}\mathbf{W}\mathbf{J}\right)\Delta\boldsymbol\beta = \left(\mathbf{J}^\mathsf{T}\mathbf{W}\right)\Delta\mathbf{y}</math>

may be solved for Δβ by Cholesky decomposition, as described in linear least squares. The parameters are updated iteratively

<math>\boldsymbol\beta^{k+1} = \boldsymbol\beta^k + \Delta\boldsymbol\beta</math>

where k is an iteration number. While this method may be adequate for simple models, it will fail if divergence occurs. Therefore protection against divergence is essential.
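The iteration can be sketched as follows for the unweighted case (W equal to the identity matrix). The exponential model, the function names and the fixed iteration count are assumptions of the sketch; no protection against divergence is included, so it is only suitable for starting values close to the minimum.

```python
import math

def model(x, beta):
    """Illustrative model a * exp(b * x) (an assumption of this sketch)."""
    return beta[0] * math.exp(beta[1] * x)

def model_jacobian(x, beta):
    """Analytical derivatives (df/da, df/db) of the illustrative model."""
    e = math.exp(beta[1] * x)
    return [e, beta[0] * x * e]

def cholesky_solve(A, b):
    """Solve A x = b for symmetric positive-definite A via A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    z = [0.0] * n
    for i in range(n):                       # forward substitution: L z = b
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):             # back substitution: L^T x = z
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gauss_newton(f, jac, xs, ys, beta, iterations=20):
    """Repeat beta <- beta + dbeta with (J^T J) dbeta = J^T dy."""
    beta = list(beta)
    n = len(beta)
    for _ in range(iterations):
        A = [[0.0] * n for _ in range(n)]    # J^T J
        g = [0.0] * n                        # J^T dy
        for x, y in zip(xs, ys):
            row = jac(x, beta)
            dy = y - f(x, beta)
            for j in range(n):
                g[j] += row[j] * dy
                for s in range(n):
                    A[j][s] += row[j] * row[s]
        dbeta = cholesky_solve(A, g)
        beta = [b + d for b, d in zip(beta, dbeta)]
    return beta
```

In this sketch `math.sqrt` raises an error when the normal equations matrix is not positive definite, which is one way in which the unprotected method can fail.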

Shift-cutting

If divergence occurs, a simple expedient is to reduce the length of the shift vector, Δβ, by a fraction, f:

<math>\boldsymbol\beta^{k+1} = \boldsymbol\beta^k + f\,\Delta\boldsymbol\beta.</math>

For example, the length of the shift vector may be successively halved until the new value of the objective function is less than its value at the last iteration. The fraction f could be optimized by a line search.[3] As each trial value of f requires the objective function to be re-calculated, it is not worth optimizing its value too stringently.

When using shift-cutting, the direction of the shift vector remains unchanged. This limits the applicability of the method to situations where the direction of the shift vector is not very different from what it would be if the objective function were approximately quadratic in the parameters, <math>\boldsymbol\beta^k</math>.
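The successive-halving variant can be sketched as below. The function name, the limit on the number of halvings and the convention of returning the unchanged parameters when no acceptable fraction is found are assumptions of the sketch.

```python
def shift_cut(objective, beta, dbeta, max_halvings=10):
    """Halve the shift vector until the objective function decreases.

    Returns the accepted parameters and the fraction f that was used.
    The direction of the shift vector is never changed.
    """
    s_old = objective(beta)
    f = 1.0
    for _ in range(max_halvings):
        trial = [b + f * d for b, d in zip(beta, dbeta)]
        if objective(trial) < s_old:
            return trial, f
        f *= 0.5
    return list(beta), 0.0  # no acceptable fraction found
```

For example, with the objective (β − 1)², a starting value of 0 and a shift of 4, the full shift and the half shift are both rejected and the quarter shift is accepted.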

Marquardt parameter

If divergence occurs and the direction of the shift vector is so far from its "ideal" direction that shift-cutting is not very effective, that is, the fraction f required to avoid divergence is very small, the direction must be changed. This can be achieved by using the Marquardt parameter.[4] In this method the normal equations are modified

<math>\left(\mathbf{J}^\mathsf{T}\mathbf{W}\mathbf{J} + \lambda\mathbf{I}\right)\Delta\boldsymbol\beta = \left(\mathbf{J}^\mathsf{T}\mathbf{W}\right)\Delta\mathbf{y}</math>

where λ is the Marquardt parameter and I is an identity matrix. Increasing the value of λ has the effect of changing both the direction and the length of the shift vector. The shift vector is rotated towards the direction of steepest descent

when <math>\lambda\mathbf{I} \gg \mathbf{J}^\mathsf{T}\mathbf{W}\mathbf{J}, \quad \Delta\boldsymbol\beta \approx \frac{1}{\lambda}\mathbf{J}^\mathsf{T}\mathbf{W}\,\Delta\mathbf{y}.</math>

<math>\mathbf{J}^\mathsf{T}\mathbf{W}\,\Delta\mathbf{y}</math> is the steepest descent vector. So, when λ becomes very large, the shift vector becomes a small fraction of the steepest descent vector.

Various strategies have been proposed for the determination of the Marquardt parameter. As with shift-cutting, it is wasteful to optimize this parameter too stringently. Rather, once a value has been found that brings about a reduction in the value of the objective function, that value of the parameter is carried to the next iteration, reduced if possible, or increased if need be. When reducing the value of the Marquardt parameter, there is a cut-off value below which it is safe to set it to zero, that is, to continue with the unmodified Gauss–Newton method. The cut-off value may be set equal to the smallest singular value of the Jacobian.[5] A bound for this value is given by <math>1/\operatorname{trace}\left(\mathbf{J}^\mathsf{T}\mathbf{W}\mathbf{J}\right)^{-1}</math>.[6]
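A damped step and one possible update rule for λ are sketched below for a two-parameter problem. The function names and the factor of 10 are assumptions of the sketch; the factor is only one of the strategies that have been proposed, and the cut-off to zero described above is not implemented.

```python
def marquardt_step(A, g, lam):
    """Solve (A + lam * I) dbeta = g, where A = J^T W J and g = J^T W dy (2x2 case)."""
    a11 = A[0][0] + lam
    a22 = A[1][1] + lam
    det = a11 * a22 - A[0][1] * A[1][0]
    return [(g[0] * a22 - A[0][1] * g[1]) / det,
            (a11 * g[1] - A[1][0] * g[0]) / det]

def update_lambda(lam, improved, factor=10.0):
    """Reduce lam after a successful step, increase it after a failed one."""
    return lam / factor if improved else lam * factor
```

With λ = 0 the step is the Gauss–Newton shift vector; as λ grows the step approaches the steepest descent vector divided by λ.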

QR decomposition

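The minimum in the sum of squares can be found by a method that does not involve forming the normal equations. The residuals with the linearized model can be written as

<math>\mathbf{r} = \Delta\mathbf{y} - \mathbf{J}\,\Delta\boldsymbol\beta.</math>

The Jacobian is subjected to an orthogonal decomposition; the QR decomposition will serve to illustrate the process.

<math>\mathbf{J} = \mathbf{Q}\mathbf{R}</math>

where Q is an orthogonal m×m matrix and R is an m×n matrix which is partitioned into an n×n block, <math>\mathbf{R}_n</math>, and an (m−n)×n zero block. <math>\mathbf{R}_n</math> is upper triangular.

<math>\mathbf{R} = \begin{bmatrix} \mathbf{R}_n \\ \mathbf{0} \end{bmatrix}</math>

The residual vector is left-multiplied by <math>\mathbf{Q}^\mathsf{T}</math>.

<math>\mathbf{Q}^\mathsf{T}\mathbf{r} = \mathbf{Q}^\mathsf{T}\Delta\mathbf{y} - \mathbf{R}\,\Delta\boldsymbol\beta = \begin{bmatrix} \left(\mathbf{Q}^\mathsf{T}\Delta\mathbf{y} - \mathbf{R}\,\Delta\boldsymbol\beta\right)_n \\ \left(\mathbf{Q}^\mathsf{T}\Delta\mathbf{y}\right)_{m-n} \end{bmatrix}</math>

This has no effect on the sum of squares since <math>S = \mathbf{r}^\mathsf{T}\mathbf{Q}\mathbf{Q}^\mathsf{T}\mathbf{r} = \mathbf{r}^\mathsf{T}\mathbf{r}</math> because Q is orthogonal. The minimum value of S is attained when the upper block is zero. Therefore the shift vector is found by solving

<math>\mathbf{R}_n\,\Delta\boldsymbol\beta = \left(\mathbf{Q}^\mathsf{T}\Delta\mathbf{y}\right)_n.</math>

These equations are easily solved as <math>\mathbf{R}_n</math> is upper triangular.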
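The procedure can be sketched as follows. For brevity the sketch computes only the first n columns of Q (a "thin" factorization) by modified Gram–Schmidt orthogonalization, which is a substitution made here; library routines more commonly use Householder reflections. The Jacobian is assumed to have full column rank, and the function names are hypothetical.

```python
def thin_qr(J):
    """Factor J (m x n) as Q1 * Rn by modified Gram-Schmidt.

    Q1 is returned as a list of n orthonormal columns; Rn is n x n upper triangular.
    """
    m, n = len(J), len(J[0])
    Q = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = [J[i][j] for i in range(m)]
        for k in range(j):
            # Project out the component along each earlier column in turn.
            R[k][j] = sum(Q[k][i] * v[i] for i in range(m))
            v = [v[i] - R[k][j] * Q[k][i] for i in range(m)]
        R[j][j] = sum(vi * vi for vi in v) ** 0.5
        Q.append([vi / R[j][j] for vi in v])
    return Q, R

def qr_shift_vector(J, dy):
    """Solve Rn * dbeta = (Q^T dy)_n by back substitution."""
    Q, R = thin_qr(J)
    n = len(R)
    rhs = [sum(q[i] * dy[i] for i in range(len(dy))) for q in Q]
    dbeta = [0.0] * n
    for j in reversed(range(n)):
        tail = sum(R[j][k] * dbeta[k] for k in range(j + 1, n))
        dbeta[j] = (rhs[j] - tail) / R[j][j]
    return dbeta
```

Because the product of the Jacobian with its transpose is never formed, the condition number of the problem is not squared, which is the usual motivation for preferring this route to the normal equations.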

Singular value decomposition

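A variant of the method of orthogonal decomposition involves singular value decomposition, in which R is diagonalized by further orthogonal transformations.

<math>\mathbf{J} = \mathbf{U}\boldsymbol\Sigma\mathbf{V}^\mathsf{T}</math>

where U is orthogonal, Σ is a diagonal matrix of singular values and V is the orthogonal matrix of the eigenvectors of <math>\mathbf{J}^\mathsf{T}\mathbf{J}</math> or equivalently the right singular vectors of J. In this case the shift vector is given by

<math>\Delta\boldsymbol\beta = \mathbf{V}\boldsymbol\Sigma^{-1}\left(\mathbf{U}^\mathsf{T}\Delta\mathbf{y}\right)_n.</math>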

The relative simplicity of this expression is very useful in theoretical analysis of non-linear least squares. The application of singular value decomposition is discussed in detail in Lawson and Hanson.[5]
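The expression for the shift vector follows by the same argument as for the QR decomposition. The steps are sketched below, assuming that U is m×m, that all n singular values are non-zero, and writing Σ_n for the n×n diagonal block of Σ (a notation introduced here for clarity).

```latex
\begin{aligned}
\mathbf{U}^\mathsf{T}\mathbf{r}
  &= \mathbf{U}^\mathsf{T}\Delta\mathbf{y}
     - \boldsymbol\Sigma\mathbf{V}^\mathsf{T}\Delta\boldsymbol\beta,
  \qquad
  S = \mathbf{r}^\mathsf{T}\mathbf{U}\mathbf{U}^\mathsf{T}\mathbf{r}
    = \mathbf{r}^\mathsf{T}\mathbf{r}, \\
\boldsymbol\Sigma_n\mathbf{V}^\mathsf{T}\Delta\boldsymbol\beta
  &= \left(\mathbf{U}^\mathsf{T}\Delta\mathbf{y}\right)_n
  \;\Longrightarrow\;
  \Delta\boldsymbol\beta
  = \mathbf{V}\boldsymbol\Sigma_n^{-1}
    \left(\mathbf{U}^\mathsf{T}\Delta\mathbf{y}\right)_n .
\end{aligned}
```

The last step uses the orthogonality of V, so that the inverse of its transpose is V itself.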

Gradient methods

There are many examples in the scientific literature where different methods have been used for non-linear data-fitting problems.

  • Inclusion of second derivatives in the Taylor series expansion of the model function. This is Newton's method in optimization.
<math>f(x_i, \boldsymbol\beta) = f^k(x_i, \boldsymbol\beta) + \sum_j J_{ij}\,\Delta\beta_j + \frac{1}{2}\sum_j\sum_k \Delta\beta_j\,\Delta\beta_k\,H_{jk_{(i)}}, \qquad H_{jk_{(i)}} = \frac{\partial^2 f(x_i, \boldsymbol\beta)}{\partial\beta_j\,\partial\beta_k}.</math>
The matrix H is known as the Hessian matrix. Although this model has better convergence properties near to the minimum, it is much worse when the parameters are far from their optimal values. Calculation of the Hessian adds to the complexity of the algorithm. This method is not in general use.
  • Davidon–Fletcher–Powell method. This method, a form of quasi-Newton method, is similar to the one above but calculates the Hessian by successive approximation, to avoid having to use analytical expressions for the second derivatives.
  • Steepest descent. Although a reduction in the sum of squares is guaranteed when the shift vector points in the direction of steepest descent, this method often performs poorly. When the parameter values are far from optimal the direction of the steepest descent vector, which is normal (perpendicular) to the contours of the objective function, is very different from the direction of the Gauss–Newton vector. This makes divergence much more likely, especially as the minimum along the direction of steepest descent may correspond to a small fraction of the length of the steepest descent vector. When the contours of the objective function are very eccentric, due to there being high correlation between parameters, the steepest descent iterations, with shift-cutting, follow a slow, zig-zag trajectory towards the minimum.
  • Conjugate gradient search. This is an improved steepest descent based method with good theoretical convergence properties, although it can fail on finite-precision digital computers even when used on quadratic problems.[7]

Direct search methods

Direct search methods depend on evaluations of the objective function at a variety of parameter values and do not use derivatives at all. They offer alternatives to the use of numerical derivatives in the Gauss–Newton method and gradient methods.

  • Alternating variable search.[3] Each parameter is varied in turn by adding a fixed or variable increment to it and retaining the value that brings about a reduction in the sum of squares. The method is simple and effective when the parameters are not highly correlated. It has very poor convergence properties, but may be useful for finding initial parameter estimates.
  • Nelder–Mead (simplex) search. A simplex in this context is a polytope of n + 1 vertices in n dimensions: a triangle on a plane, a tetrahedron in three-dimensional space and so forth. Each vertex corresponds to a value of the objective function for a particular set of parameters. The shape and size of the simplex are adjusted by varying the parameters in such a way that the value of the objective function at the highest vertex always decreases. Although the sum of squares may initially decrease rapidly, the search can converge to a nonstationary point on quasiconvex problems, as shown by an example of M. J. D. Powell.
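The alternating variable search described in the first item can be sketched as follows. The function name, the initial increment, the halving of the increment when no parameter can be improved and the stopping tolerance are all assumptions of the sketch.

```python
def alternating_variable_search(objective, beta, step=0.5, shrink=0.5,
                                tol=1e-8, max_sweeps=10000):
    """Vary each parameter in turn, keeping any change that lowers the objective."""
    beta = list(beta)
    best = objective(beta)
    for _ in range(max_sweeps):
        improved = False
        for j in range(len(beta)):
            for delta in (step, -step):
                trial = list(beta)
                trial[j] += delta
                value = objective(trial)
                if value < best:
                    beta, best, improved = trial, value, True
                    break
        if not improved:
            step *= shrink   # no parameter could be improved: use a finer increment
            if step < tol:
                break
    return beta, best
```

No derivatives are required, only repeated evaluations of the objective function, which is why such a search may be convenient for finding initial parameter estimates.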

More detailed descriptions of these and other methods are available in Numerical Recipes, together with computer code in various languages.

Notes

  1. This implies that the observations are uncorrelated. If the observations are correlated, the expression
    <math>S = \sum_k \sum_j r_k W_{kj} r_j</math>
    applies. In this case the weight matrix should ideally be equal to the inverse of the error variance-covariance matrix of the observations.
  2. In the absence of round-off error and of experimental error in the independent variable the normal equations matrix would be singular.
  3. M.J. Box, D. Davies and W.H. Swann, Non-Linear Optimisation Techniques, Oliver & Boyd, 1969.
  4. This technique was proposed independently by Levenberg (1944), Girard (1958), Wynne (1959), Morrison (1960) and Marquardt (1963). Marquardt's name alone is used for it in much of the scientific literature.
  5. C.L. Lawson and R.J. Hanson, Solving Least Squares Problems, Prentice–Hall, 1974.
  6. R. Fletcher, UKAEA Report AERE-R 6799, H.M. Stationery Office, 1971.
  7. M. J. D. Powell, Computer Journal, (1964), 7, 155.

References

  • C. T. Kelley, Iterative Methods for Optimization, SIAM Frontiers in Applied Mathematics, no 18, 1999, ISBN 0-89871-433-8.
  • T. Strutz: Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Vieweg+Teubner, ISBN 978-3-8348-1022-9.
