# Talk:Gauss–Newton algorithm

## Name?

I always thought that Newton's method, when applied to systems of equations, is still called Newton's method (or Newton-Raphson) and that Gauss-Newton is a modified Newton's method for solving least squares problems. To wit, let f(x) = sum_i r_i(x)^2 be the least squares problem (x is a vector, and I hope you don't mind my being too lazy to properly type-set the maths). Then we need to solve 2 A(x) r(x) = 0, where A(x) is the Jacobian matrix, so A_ij = derivative of r_j w.r.t. x_i. In my understanding, Newton's method is as described in the article: f'(x_k) (x_{k+1} - x_k) = - f(x_k) with f(x) = 2 A(x) r(x). The derivative f'(x) can be calculated as f'(x) = 2 A(x) A(x)^\top + 2 sum_i r_i(x) nabla^2 r_i(x). On the other hand, the Gauss-Newton method neglects the second term, so we get the iteration A(x_k) A(x_k)^\top (x_{k+1} - x_k) = - A(x_k) r(x_k).

Could you please tell me whether I am mistaken, preferably giving a reference if I am? Thanks. -- Jitse Niesen 18:33, 18 Nov 2004 (UTC)
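The distinction described above can be checked numerically. A minimal sketch, assuming NumPy and a made-up two-component residual (note it uses the more common convention J[i][j] = ∂r_i/∂x_j, the transpose of the A above): full Newton on the sum of squares keeps the term with the residual Hessians, while Gauss-Newton drops it, so the two steps differ.

```python
import numpy as np

# Toy residuals: r1 = x1 - 1, r2 = x1*x2 - 2 (made up for illustration).
def r(x):
    return np.array([x[0] - 1.0, x[0] * x[1] - 2.0])

def J(x):
    # J[i, j] = dr_i/dx_j
    return np.array([[1.0, 0.0],
                     [x[1], x[0]]])

def hess_terms(x):
    # Hessians of each residual component.
    H1 = np.zeros((2, 2))                       # r1 is linear
    H2 = np.array([[0.0, 1.0], [1.0, 0.0]])    # second derivatives of x1*x2
    return [H1, H2]

x = np.array([2.0, 2.0])
Jx, rx = J(x), r(x)

# Gauss-Newton: drop the sum of r_i * Hess(r_i).
gn_step = np.linalg.solve(Jx.T @ Jx, -Jx.T @ rx)

# Full Newton on S(x) = sum r_i^2 keeps that second-order term.
H_full = Jx.T @ Jx + sum(ri * Hi for ri, Hi in zip(rx, hess_terms(x)))
newton_step = np.linalg.solve(H_full, -Jx.T @ rx)
```

For this particular residual the Gauss-Newton step happens to land exactly on the root, while the Newton step on the sum of squares goes elsewhere.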

It is quite possible that you are right. Please feel free to improve both this article and the corresponding section in Newton's method. I do not have an appropriate reference at hand, therefore I am unable to contribute to a clarification. - By the way, thanks for the personal message on my talk page. -- Frau Holle 22:23, 20 Nov 2004 (UTC)
I moved the description of the general Newton's method in R^n back to Newton's method and wrote here a bit about the modified method for least squares problems. -- Jitse Niesen 00:45, 5 Dec 2004 (UTC)

## Transform of the Jacobians in the wrong place?

Should the formula not be p(t)=p(t+1)-(J'J)^(-1)J'f?

I thought the RHS was supposed to look like an OLS regression estimate... —The preceding unsigned comment was added by 131.111.8.96 (talkcontribs) 15:48, 10 January 2006 (UTC)

Please be more precise. Do you want to change the formula
${\displaystyle p^{k+1}=p^{k}-(J_{f}(p^{k})\,J_{f}(p^{k})^{\top })^{-1}J_{f}(p^{k})\,f(p^{k})}$
into
${\displaystyle p^{k}=p^{k+1}-(J_{f}(p^{k})\,J_{f}(p^{k})^{\top })^{-1}J_{f}(p^{k})\,f(p^{k})?}$
If so, can you give some more justification than "I thought the RHS was supposed to look like an OLS regression estimate", like a reference? If not, what do you want? -- Jitse Niesen (talk) 18:12, 10 January 2006 (UTC)


The term that is neglected in the Newton step to obtain the Gauss-Newton step is not the Laplacian. It also contains mixed second-order derivatives, which is not the case for the Laplacian! Georg

You're right. Many thanks for your comment. It should be fixed now. -- Jitse Niesen (talk) 03:10, 28 October 2006 (UTC)

## no inversion?

What is meant by "The matrix inverse is never computed explicitly in practice"? Right after that, it is said that a certain linear system should be solved. How does that system get solved if not by inverting that matrix?

It seems to me also that there is a pseudoinversion going on there, and this should be made explicit in the article. -- NIC1138 18:17, 21 June 2007 (UTC)

I have added more details on how to solve that equation without inverting the full matrix. -- Tim314 19:46, 26 September 2007 (UTC)
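To illustrate "no explicit inversion" concretely: the normal equations can be solved directly with a linear solver, or bypassed entirely with a least-squares solve on the Jacobian itself, which avoids forming ${\displaystyle J^{\top }J}$ (and thereby squaring the condition number). A sketch with made-up random data, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((20, 3))   # tall Jacobian, m > n (made up)
r = rng.standard_normal(20)        # made-up residual vector

# Explicit inversion -- what the article warns against:
step_inv = -np.linalg.inv(J.T @ J) @ (J.T @ r)

# Better: solve the normal equations directly...
step_solve = np.linalg.solve(J.T @ J, -J.T @ r)

# ...or use a least-squares (QR-based) solver on J itself,
# never forming the normal equations at all:
step_lstsq = np.linalg.lstsq(J, -r, rcond=None)[0]
```

All three agree on a well-conditioned problem; the latter two are preferred numerically.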

### Re: Computing the step ${\displaystyle \delta ^{k}}$

Invertible refers to a square matrix. Since ${\displaystyle J_{f}(p^{k})}$ is ${\displaystyle m\times n}$, it would be more exact to say it is of rank ${\displaystyle n}$ rather than invertible. Note that this might invalidate the following argument (using the inverse matrix), so some comment like "for the square case ..., and proper generalization is available for the rectangular case" might be needed. --AmitAronovitch (talk) 11:42, 10 January 2008 (UTC)

## Missing Assumptions - Difference between Gauss-Newton and Newton

I would like to see an explicit statement about the different assumptions behind the Gauss-Newton and Newton methods. The Newton's method article is quite clear about its assumptions.

Now, in the Gauss-Newton method, why can the Hessian (implicitly) be approximated by ${\displaystyle J_{f}(p)^{\top }J_{f}(p)}$? What is the geometric interpretation? Is the reason a special assumption, e.g. that the function is very close to zero near the true solution and therefore ${\displaystyle \sum _{i=1}^{m}f_{i}(p)\,H(f_{i})(p)}$ is negligible? That would mean that Gauss-Newton (and also Levenberg-Marquardt!) is not suitable for finding minima where the "residual" function value is quite large, e.g. ${\displaystyle f_{i}(p)=p^{2}+100000000}$

Or from another viewpoint: given the function value and the Jacobian, there are infinitely many parabolas which are tangent to the function; what's the justification for guessing the Hessian to be (close to) ${\displaystyle J_{f}(p)^{\top }J_{f}(p)}$?

Without this explanation, the update rule given in the article seems unjustified to me; where does it come from? I do believe Gauss had good reasons for it :-)

--89.53.92.248 17:18, 6 September 2007 (UTC)

Meanwhile I do believe that G.N. is the special case where we have a sum of squared differences (ssd) between functions and "desired function values". In the pure Newton method, we would compute the Jacobian and Hessian of the overall ssd. In Gauss-Newton we don't approximate the ssd but we linearize each of the functions (1st-order Taylor series), which implies that the Hessian of these linearized functions is zero and the resulting "modified" ssd is quadratic in the original parameters. However, I would still like to see an explicit derivation here. --89.53.75.102 10:17, 16 September 2007 (UTC)
I have made additions to the "Derivation" section (also retitled by me), which hopefully address these issues. In answer to the above question, it's true that Gauss-Newton may have problems in the case of large residuals. Levenberg-Marquardt is more robust in those cases (although I believe Gauss-Newton is usually faster when the minimum is zero). --Tim314 19:54, 26 September 2007 (UTC)
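The large-residual concern raised above can be made numerical. A tiny sketch (pure Python, values made up) using the suggested single residual ${\displaystyle f(p)=p^{2}+c}$ with a large constant c: near the minimum p = 0, the term Gauss-Newton neglects dwarfs the term it keeps.

```python
# Single residual f(p) = p**2 + c, echoing the example above.
# The Hessian of S = f**2 is 2*((f')**2 + f*f''); Gauss-Newton keeps
# only the (f')**2 part and drops f*f''.  The shared factor 2 cancels
# in the ratio below.
c = 1.0e8
p = 0.1             # made-up trial point near the minimum p = 0

f = p**2 + c
fp = 2.0 * p        # f'
fpp = 2.0           # f''

kept = fp**2            # what Gauss-Newton retains
neglected = f * fpp     # what Gauss-Newton drops
ratio = neglected / kept
```

Here the dropped term is larger than the retained one by many orders of magnitude, so the Gauss-Newton Hessian approximation is very poor for this problem.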

## Derivation

Much like the comment above, I am looking for more of a derivation of the Gauss-Newton algorithm. This article is useful in telling us what the algorithm is, what it uses as an update equation, but not why it works. —Preceding unsigned comment added by 138.64.2.77 (talk) 14:15, 13 September 2007 (UTC)

Thanks also to Jitse Niesen for his positive feedback on my talk page. --Tim314 19:57, 26 September 2007 (UTC)

## Merger proposal

I propose that the article Levenberg-Marquardt algorithm be merged into this one.Petergans (talk) 08:28, 4 January 2008 (UTC)

I do not think they should be merged. I view them as separate. fnielsen (talk) 09:21, 4 January 2008 (UTC)

The basis of my proposal is that the derivation of the normal equations is the same whether or not a Marquardt parameter is used. I see the Marquardt parameter as a "damping" device to overcome problems of divergence in "un-damped" non-linear least-squares calculations. The issue is discussed in detail in my book "Data Fitting in the Chemical Sciences", Wiley, 1992. Petergans (talk) 11:11, 4 January 2008 (UTC)

I agree with Petergans in at least one respect. The Levenberg-Marquardt technique is a modification of the Gauss-Newton method, but has the same mathematical basis. If not merged into one, the Levenberg-Marquardt article should branch from the Gauss-Newton article and not reproduce the same build-up. At the very least, each article should refer to the other. Pgpotvin (talk) 05:03, 22 February 2008 (UTC)

## Comment that is maybe worth adding

Note that the Newton method can also be considered a special case of the GNA: for minimizing a general function ${\displaystyle {\tilde {S}}}$, choose ${\displaystyle f_{i}={\frac {\partial {\tilde {S}}}{\partial p_{i}}}}$, then ${\displaystyle S=\sum {f_{i}}^{2}}$ has as its minima (which are exactly 0) the extrema of ${\displaystyle {\tilde {S}}}$, ${\displaystyle J_{f}(p)}$ is the Hessian of ${\displaystyle {\tilde {S}}}$, and the step formula reduces to the Newton step. —Preceding unsigned comment added by AmitAronovitch (talkcontribs) 12:17, 10 January 2008 (UTC)
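The reduction described above can be verified numerically. A small sketch, assuming NumPy, with a made-up function ${\displaystyle {\tilde {S}}}$: taking f = grad ${\displaystyle {\tilde {S}}}$ as the residual vector, ${\displaystyle J_{f}}$ is the (here symmetric and invertible) Hessian of ${\displaystyle {\tilde {S}}}$, and the Gauss-Newton step on f coincides with the Newton step on ${\displaystyle {\tilde {S}}}$.

```python
import numpy as np

# Made-up smooth function S~(p) = p1**4 + 3*p1*p2 + p2**2.
def grad(p):
    return np.array([4*p[0]**3 + 3*p[1], 3*p[0] + 2*p[1]])

def hess(p):
    return np.array([[12*p[0]**2, 3.0],
                     [3.0,        2.0]])

p = np.array([1.0, -1.0])   # made-up trial point
f, Jf = grad(p), hess(p)    # residual vector and its Jacobian

# Gauss-Newton step on the residuals f:
gn_step = np.linalg.solve(Jf.T @ Jf, -Jf.T @ f)

# Newton step on S~ itself:
newton_step = np.linalg.solve(Jf, -f)
```

Since ${\displaystyle J_{f}}$ is square and invertible here, ${\displaystyle (J_{f}^{\top }J_{f})^{-1}J_{f}^{\top }=J_{f}^{-1}}$, so the two steps are identical.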

## Note on a section in the article

The section

==Computing the step ${\displaystyle \delta ^{k}}$==


assumes that the Jacobian is a square matrix, which is seldom the case (its dimensions are m×n, where n is the size of the vector p of variables and m is the number of terms in the sum of squares). I suggest we either specify that the Jacobian must be square for what is in that section to work, or, even better, cut this section out altogether. Comments? Oleg Alexandrov (talk) 23:09, 18 January 2008 (UTC)

I cut out that section. Is there anything in this article which is wrong? It looks reasonably good to me. Oleg Alexandrov (talk) 05:50, 20 January 2008 (UTC)

I have begun the process mentioned above. If you look at my (in progress) draft re-write of linear least squares in User:Petergans/a/sandbox you will see why this article misses the point completely. In non-linear least squares the system is approximated to a linear one so that the theory of linear least squares can be applied, but further complications arise because of the approximation. Petergans (talk) 08:30, 20 January 2008 (UTC)

Mmm. I find the current linear least squares article much easier to understand than your proposed rewrite. I'd very much prefer to keep the first part of the current article the way it is, and only later go into the facts that linear least squares is about fitting experimental data, and that the least squares solution is the minimal variance solution.
In other words, the article the way it is, it requires from reader just a background in linear algebra and multivariate calculus. The way you propose to write it people would need a more serious background to understand what the article is about.
By the way, this discussion belongs at talk:linear least squares. Any objections if we copy it and continue there? Oleg Alexandrov (talk) 17:07, 20 January 2008 (UTC)

Please don't do that just at the moment - I'm consulting elsewhere by e-mail as to how to proceed. When the time is ripe a discussion will be opened there. In the meantime, thanks for your comments. It's difficult to find the right level at which to pitch an article. I usually err on the side of being too rigorous! Petergans (talk) 18:45, 20 January 2008 (UTC)

I have a question for you as a mathematician. Are there instances where a sum of squares is minimized other than the minimization of a sum of squared residuals? Your comments, and this article itself, seem to suggest that there are, but I don't know of any. Petergans (talk) 16:29, 21 January 2008 (UTC)

I guess most applications are indeed for some kind of sum of squares of differences of things, at least that's all I know too. However, the algorithm holds for a general sum of squares, and I think that's the most natural way of stating the algorithm. Oleg Alexandrov (talk) 04:31, 22 January 2008 (UTC)

I now understand your position. Looked at as an abstract concept this article is fine. I shall have to argue that the practical aspect is more important, even if it means that the concept becomes more complicated. After all, Gauss first developed the method of least squares to predict the orbit position of Ceres. This has been a useful discussion. Many thanks. Petergans (talk) 10:49, 22 January 2008 (UTC)

## A major proposal

Please see talk:least squares#A major proposal for details. Petergans (talk) 17:35, 22 January 2008 (UTC)

I have prepared a new article in User:Petergans/b/sandbox. I propose that this article should replace the present one and be renamed Non-linear least squares. Petergans (talk) 09:09, 31 January 2008 (UTC)

I am an engineer with a PhD in math and use Gauss-Newton in my daily work. With your rewrite, I would not be able to understand what the article is about and how to use the method.
I suggest you do not attempt to integrate things like that. Wikipedia is not a book for the specialist, it is a loose collection of quasi-independent articles written as simply as possible. I object to the rewrite in the strongest terms. Oleg Alexandrov (talk) 15:54, 31 January 2008 (UTC)
If you want, please create the new Non-linear least squares article, but leave the original articles unmodified, so that the reader can get the theory from your article, but the individual applications from the simpler separate articles. Oleg Alexandrov (talk) 15:55, 31 January 2008 (UTC)
See Wikipedia talk:WikiProject Mathematics#Least squares objectionable rewrite. Oleg Alexandrov (talk) 16:02, 31 January 2008 (UTC)
I think the new article is great -- it really brings together a lot of NLLS material -- but I'd prefer that it not overwrite this article. Why don't you move it to Nonlinear least squares (or recreate it there)? CRGreathouse (t | c) 17:41, 31 January 2008 (UTC)
I agree with CRGreathouse. The new article is great, move it to Non-linear least squares, but leave the current article too. Wikipedia is not paper, and there is plenty of room for two presentations of related material from different viewpoints. The current article is short, clear, and humble, so that humble editors can still contribute to it. Your new article is comprehensive, detailed, and quite bold, making it harder for timid editors like me to contribute. I had a year or two of numerical at the graduate level in math, and would feel comfortable editing the current article with only a single reference in hand, but I'd probably have to consult my whole library to feel sure I was providing edits that kept the perspective of the new article. For instance, I have only a few references that pay more than lip service to the statistical distribution of the errors in the stability analysis.
At any rate, good work. Your experience is invaluable, and your articles are bold. Your steps "talk page, sandbox, implement" are an ideal model. In this case, I think a new article is in order. JackSchmidt (talk) 18:40, 31 January 2008 (UTC)
Minor point, purely wiki technical, nonlinear least squares became an article last night. I don't have the admin move tool, where you don't leave behind a redir with empty history, so someone might need to assist the move, depending on which punctuation is preferred. JackSchmidt (talk) 18:44, 31 January 2008 (UTC)
Petergans prefers Non-linear least squares from what I see in the text, so I moved User:Petergans/b/sandbox there and redirected Nonlinear least squares to it. We can discuss further at Talk:Non-linear least squares whether there should be a dash or not. Oleg Alexandrov (talk) 04:13, 1 February 2008 (UTC)

## Least squares: implementation of proposal

which contain more technical details, but it has sufficient detail to stand on its own.

In addition Gauss-Newton algorithm (this article) has been revised. The earlier article contained a serious error regarding the validity of setting second derivatives to zero. Points to notice include:

• Adoption of a standard notation in all four articles mentioned above. This makes for easy cross-referencing. The notation also agrees with many of the articles on regression
• Weighted least squares should be deleted. The first section is adequately covered in Linear least squares and Non-linear least squares. The second section (Linear Algebraic Derivation) is rubbish.

This completes the first phase of restructuring of the topic of least squares analysis. From now on I envisage only minor revision of related articles. May I suggest that comments relating to more than one article be posted on talk: least squares and that comments relating to a specific article be posted on the talk page of that article. This note is being posted on all four talk pages and Wikipedia talk:WikiProject Mathematics.

Petergans (talk) 09:22, 8 February 2008 (UTC)

## Response to rewrite

First, the article is written from a narrow regression theory perspective. Gauss-Newton is a general purpose optimization algorithm, and has to be treated as such. In my work I use it to solve an inverse problem, which has nothing to do with fitting parameters.

Second, Petergans removed some very useful information about how to solve the obtained linear system and how to do line search. Granted, this information may be available in other articles somewhere, but Wikipedia is not a book, and it is good to include relevant material for the benefit of the reader, even at the risk of repetition.

I propose we go back to the original notation and text, which treats the algorithm in its proper context, in optimization. If Petergans wants to add an application to regression theory, he's more than welcome to. Oleg Alexandrov (talk) 03:23, 9 February 2008 (UTC)

Reversion is not the way forward. The algorithm described here is the one developed by Gauss and used in the vast majority of cases. See, for example, nonlinear regression. What is described by Oleg is a completely separate application and as such would merit a reference or even an additional section. Please reply here, indicating sources so that the scope of the optimization application can be assessed.
The article backpropagation contains the following: "The backpropagation algorithm for calculating a gradient has been rediscovered a number of times, and is a special case of a more general technique called automatic differentiation in the reverse accumulation mode. It is also closely related to the Gauss-Newton algorithm, and is also part of continuing research in neural backpropagation." Is this what Oleg is referring to as an inverse problem? Petergans (talk) 08:34, 9 February 2008 (UTC)
What I am saying is that the residuals need not be of the form
${\displaystyle r_{i}=y_{i}-f(x_{i},{\boldsymbol {\beta }}),}$
rather, anything like
${\displaystyle r_{i}=y_{i}-f_{i}({\boldsymbol {\beta }})}$
where ${\displaystyle f_{i}}$ are completely independent functions and don't come from fitting a regression with independent variables ${\displaystyle x_{i}}$. See here for a description in that language.
As such, there is no need to discuss the article in terms of fitting parameters, Gauss-Newton is an optimization algorithm that is used for any least square minimization. Oleg Alexandrov (talk) 16:36, 9 February 2008 (UTC)
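To make the generality concrete, here is a Gauss-Newton loop written purely in terms of residuals ${\displaystyle r({\boldsymbol {\beta }})}$ and their Jacobian, with no ${\displaystyle y_{i}-f(x_{i},{\boldsymbol {\beta }})}$ structure anywhere. The residuals below are made up (an approximate-solution problem for a small nonlinear system), assuming NumPy:

```python
import numpy as np

# Made-up residuals with no regression structure: find b minimizing
# r1**2 + r2**2 where
#   r1 = b1**2 + b2**2 - 1   (distance from the unit circle)
#   r2 = b1 - b2**3          (a second nonlinear condition)
def residuals(b):
    return np.array([b[0]**2 + b[1]**2 - 1.0,
                     b[0] - b[1]**3])

def jacobian(b):
    return np.array([[2*b[0], 2*b[1]],
                     [1.0,   -3*b[1]**2]])

b = np.array([0.5, 0.5])     # made-up starting guess
for _ in range(20):
    r, J = residuals(b), jacobian(b)
    # Gauss-Newton step: least-squares solve of J * step = -r
    step = np.linalg.lstsq(J, -r, rcond=None)[0]
    b = b + step
```

The loop never asks where the residuals came from, which is the point being made above.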
Peter, instead of stating that reversion is not the way forward, it would be more convincing to explain where your version is better than the previous version. You make the connection with regression by saying that the residuals have a particular form, which is nice but it doesn't change the algorithm. For instance, the ${\displaystyle x_{i}}$ and ${\displaystyle y_{i}}$ do not appear in the article except in the definition for the residual. That's what Oleg means when he says that there is no reason to restrict ourselves to regression. Optimization is a more general setting that encompasses the regression problem. As for references, both books listed in the article will do.
What is the "serious error" that you said the previous version of this article contained? Why did you remove the section on line search? You say yourself that the sum of squared residuals may increase; that is exactly the problem that line search solves.
I don't think there is any connection with backpropagation. -- Jitse Niesen (talk) 17:59, 9 February 2008 (UTC)

I quote from the version of 20 Jan.

${\displaystyle H_{S}(p)=2J_{f}(p)^{\top }J_{f}(p)+2\sum _{i=1}^{m}f_{i}(p)\,H_{f_{i}}(p)}$
where ${\displaystyle H_{f_{i}}}$ is the Hessian of ${\displaystyle f_{i}}$.
Note that the second term in this expression for ${\displaystyle H_{S}}$ tends to zero as ${\displaystyle S(p)}$ tends to zero (because :then all the ${\displaystyle f_{i}(p)}$ approach zero). Thus if the minimum value of S(p) is close to zero, and our trial value of p is close to the minimum, then we can approximate the Hessian by ${\displaystyle H_{S}(p)=2J_{f}(p)^{\top }J_{f}(p)}$

The error is that the absolute value of S is immaterial. The approximation is valid when S is near its minimum value, regardless of what that value is, because, on expanding the model to the second term, it becomes

${\displaystyle f_{i}\approx f_{i}^{k}+\sum _{j}J_{ij}\Delta \beta _{j}+1/2\sum _{j}\sum _{k}H_{ijk}\Delta \beta _{j}\Delta \beta _{k}}$

It is clear that the second-order term becomes much smaller than the first-order term as ${\displaystyle \mathbf {\Delta } \beta }$ goes to zero and S goes to its minimum value. The model then becomes approximately linear in the parameters.

What I have written is consistent with non-linear least squares, where both line search and Marquardt parameter are discussed. Line search is pretty much obsolete in data fitting as the Marquardt parameter is now generally accepted as being much better at handling divergence. As mentioned above, I have presented the historical Gauss-Newton method which was, from its inception, a data fitting method.

We don't disagree that GN involves minimization of a sum of squared residuals. It's just that residuals do not necessarily involve an independent variable. I think it would be appropriate for others to add a new section on use of GN in optimization but I strongly urge that it be given an experimental context, preferably an actual application. Petergans (talk) 20:33, 9 February 2008 (UTC)

I'm not convinced that what you call an error really is an error. I'd argue that the convergence of the algorithm is dictated by how well it approximates S and not how well it approximates f. I am also very surprised by your statement that the algorithm converges quadratically. Could you please point me to a proof for this?
Not everybody will come to this article after reading non-linear least squares. This article should have all information about the Gauss-Newton algorithm as this term is understood nowadays, which includes line search. If Gauss originally didn't use line search, this can of course be mentioned. Generally it would be nice if you could write a bit about the history. And while Levenberg-Marquardt is more robust, Gauss-Newton is sometimes faster and easier to implement.
It would be silly to write another section about the use of Gauss-Newton in optimization because it would be exactly the same, except that the line
${\displaystyle r_{i}=y_{i}-f(x_{i},{\boldsymbol {\beta }})}$
is replaced by
${\displaystyle r_{i}=y_{i}-f_{i}({\boldsymbol {\beta }}),}$
as Oleg says. An application you're probably aware of is non-linear total least squares or orthogonal distance regression. -- Jitse Niesen (talk) 17:31, 10 February 2008 (UTC)
(To Peter.) Note that the minimum of ${\displaystyle S(p)}$ is not necessarily the minimum of each individual ${\displaystyle f_{i}}$ or ${\displaystyle f_{i}^{2}.}$ I fail to see from your analysis why the terms ${\displaystyle H_{ijk}}$ (without multiplying them by ${\displaystyle \Delta \beta _{j}\Delta \beta _{k}}$) would go to zero, and as such, why the hessian terms could be ignored. Could you please be more clear here? Thanks. Oleg Alexandrov (talk) 18:28, 10 February 2008 (UTC)

To me it still appears that the rewrite made this article worse, not better. I have decided not to revert however, since in spite of chopping this article Petergans did add the nice non-linear least squares article containing a lot of good info in one place.

I put back at Gauss-Newton method the more general form of the residues used in other areas beyond parameter fitting. I will add back the line search paragraph too, some time later. I will keep Petergans's notation. Oleg Alexandrov (talk) 21:17, 10 February 2008 (UTC)

Maybe the sum of squares can go to zero in some optimization scenario. It is certain that, in data fitting, it does not do so. In fact, the expectation value of a unit-weighted sum of squares is ${\displaystyle \sigma ^{2}(m-n)}$. Also, S belongs to a ${\displaystyle \chi ^{2}}$ distribution. The contour map on p62 of my book shows how the contours of S for the non-linear function of two parameters, ${\displaystyle f=\alpha e^{-\beta x}}$ become elliptical near the minimum, that is, S becomes quadratic in the parameters. (I've e-mailed a copy of the diagram to Jitse). As to why the Hessian term can be neglected, I may not have got it right because I'm not a mathematician and I don't have any reference books on optimization to hand to check it out. Petergans (talk) 22:38, 10 February 2008 (UTC)

## bookmark

Peter you are right that the residuals need not be always close to 0. I found out Nocedal's book from the references on google books[1] and added from there the precise situations when the second-order term can be ignored. Hopefully this settles the issue. Oleg Alexandrov (talk) 01:15, 11 February 2008 (UTC)
Peter, you're right that S becomes quadratic near the optimal β. This quadratic term contains two contributions, one involving the linear term in the model f and one involving the quadratic term in f. The question is whether the latter one of these is so much smaller that it can be ignored. Just out of interest, which optimization books do you like? I'm interested in what books people in applications actually read.
Oleg, I added a bit about nonlinear regression as that is the main application. The term least squares is often used synonymous with regression, even though we may not like it. -- Jitse Niesen (talk) 14:04, 11 February 2008 (UTC)
People in applications don't generally read stuff in the mathematics literature. I'm an exception because I have taken a particular interest in numerical analysis of chemistry data. As regards secondary sources, the best reference for least squares is, in my opinion, W. C. Hamilton, Statistics in Physical Science, The Ronald Press, New York, 1964. Hamilton was a crystallographer and introduced the r-factor (crystallography). His approach is a rigorous one. Other references in linear least squares are generally less rigorous, but useful nonetheless. Another standard is P.R. Bevington, Data Reduction and Error Analysis for the Physical Sciences, McGraw-Hill, 1969. Petergans (talk) 09:28, 25 February 2008 (UTC)

Why, on the first line, do you put ${\displaystyle m\geq n}$? When m=n the Jacobian is square, so the normal equations reduce to a set of n simultaneous equations in n unknowns, ${\displaystyle \mathbf {J{\boldsymbol {\Delta }}{\boldsymbol {\beta }}=r} }$. Surely this is not the Gauss-Newton algorithm? Petergans (talk) 09:07, 11 February 2008 (UTC)

There is nothing preventing the algorithm from working for ${\displaystyle m=n}$ (just as you can do a linear least squares fit of a line through only two data points). From what I heard, least squares is not so stable for ${\displaystyle m=n,}$ but again, Gauss-Newton does work in this situation, as long as the Jacobian (which is a square matrix in this case) has full rank. Oleg Alexandrov (talk) 15:47, 11 February 2008 (UTC)

## Latest changes

The same error is repeated when

${\displaystyle {\boldsymbol {\beta }}^{k+1}={\boldsymbol {\beta }}^{k}-\mathbf {H} ^{-1}\mathbf {g} \,}$ and ${\displaystyle \mathbf {g} =-2\mathbf {J^{T}r;H=2J^{T}J} \,}$ becomes
${\displaystyle {\boldsymbol {\beta }}^{k+1}={\boldsymbol {\beta }}^{k}-\mathbf {\left(J^{T}J\right)^{-1}J^{T}r} .}$

???? Petergans (talk) 14:11, 24 February 2008 (UTC)

This is not an error. The Jacobian is of the vector of residuals, not of the derivatives of f. Since the residuals are ${\displaystyle r_{i}=y_{i}-f(x_{i},\beta ),}$ you did not have a minus before; it was coming for free from the minus in front of f. Oleg Alexandrov (talk) 17:35, 24 February 2008 (UTC)
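The sign bookkeeping here can be checked directly: with ${\displaystyle r=y-f}$, the Jacobian of r in ${\displaystyle {\boldsymbol {\beta }}}$ is minus the Jacobian of f, so the optimization form (with the minus) and the regression form (without it) give identical steps. A toy check with a made-up straight-line model, assuming NumPy:

```python
import numpy as np

# Made-up data and model f = b0 + b1*x.
y = np.array([1.0, 2.0, 4.0])
x = np.array([0.0, 1.0, 2.0])
beta = np.array([0.5, 1.0])

def model(b):
    return b[0] + b[1] * x

Jf = np.column_stack([np.ones_like(x), x])   # Jacobian of f in beta

r = y - model(beta)      # residuals, r = y - f
Jr = -Jf                 # Jacobian of r in beta is minus that of f

# Optimization form:  (Jr^T Jr) step = -Jr^T r
step_r = np.linalg.solve(Jr.T @ Jr, -Jr.T @ r)
# Regression form:    (Jf^T Jf) step = +Jf^T r
step_f = np.linalg.solve(Jf.T @ Jf, Jf.T @ r)
```

The two minus signs cancel term by term, so the formulas in the two articles are consistent despite appearances.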

You are creating confusion by your change of definitions. The Jacobian should relate to the derivatives with respect to f, as it is so defined in non-linear least squares, and as it is so used in applications. Anyone coming from non-linear least squares for more information will be totally thrown by the changes of sign. Petergans (talk) 21:22, 24 February 2008 (UTC)

I am using the notation from here. Gauss-Newton is a general purpose algorithm, not used only in the context of what is at non-linear least squares.
A simple solution to this problem is to remove the formula
${\displaystyle r_{i}=y_{i}-f(x_{i},{\boldsymbol {\beta }})}$
from this article. It is unreasonable to ask that the Jacobian be defined in terms of derivatives of this f, Gauss-Newton can be used for residuals which do not necessarily have this form. Oleg Alexandrov (talk) 22:17, 24 February 2008 (UTC)

No, Oleg, what is unreasonable is that in non-linear least squares it says

"${\displaystyle \mathbf {\left(J^{T}WJ\right)\Delta {\hat {\boldsymbol {\beta }}}=J^{T}W\Delta y} }$

These equations form the basis for the Gauss-Newton algorithm for a non-linear least squares problem."

"${\displaystyle (\mathbf {J} ^{T}\mathbf {J} )\delta {\boldsymbol {\beta }}^{k}=-\mathbf {J} ^{T}\mathbf {r} }$"

Any reader seeing these two expressions will instantly think that one of them has the wrong sign on the right-hand side. The derivative can be taken of either ${\displaystyle f(x_{i},{\boldsymbol {\beta }})}$ or ${\displaystyle f({\boldsymbol {\beta }})}$; it makes no difference here. The point is this. The fact that any WP article should not depend on another does not mean that related topics need not be consistent with each other. If they are inconsistent, both become more difficult to read.

Why do you raise the issue of the independent variable again? That issue has been resolved. Move on. Petergans (talk) 18:24, 25 February 2008 (UTC)

Sorry, I was not clear enough. It is not about the independent variables. I am trying to say that the residual can have different form, for example,
${\displaystyle r_{i}=f(x_{i},{\boldsymbol {\beta }})-y_{i}.}$
The Gauss-Newton iteration should be expressed in terms of the residuals, not in terms of some particular representation of the residuals. The book in the references does that. I also performed a google search, and all the results I got on the first page of search results use the formula with a minus: [2] [3] [4] [5], [6]. Oleg Alexandrov (talk) 04:19, 26 February 2008 (UTC)

No, no, Oleg, you are missing the point. It's not about external references. It's about internal consistency in Wikipedia. If you don't change your presentation to be consistent with the articles that link to this one, I will do it myself. Petergans (talk) 08:16, 26 February 2008 (UTC)

Per Wikipedia:Verifiability and WP:NOR, we must follow existing external references. Most readers visiting this article will expect to see the same formulation as in the existing optimization literature. Oleg Alexandrov (talk) 16:01, 26 February 2008 (UTC)

I think we should represent both points of view, with the one in the literature going first. I'll delay this issue for a while. Oleg Alexandrov (talk) 03:33, 29 February 2008 (UTC)

## Objective of Gauss-Newton

The objective of the Gauss-Newton method is to model a set of ${\displaystyle m}$ numbers, the dependent variables, ${\displaystyle y_{1},y_{2},\dots ,y_{m},}$ with a function, ${\displaystyle f({\boldsymbol {\beta }})}$, of ${\displaystyle n}$ adjustable parameters, ${\displaystyle \beta _{1},\beta _{2},\dots ,\beta _{n}}$, ${\displaystyle m>n}$.

This is not correct. The objective of Gauss-Newton is to find the minimum, not to model things. Regression is about modeling things. Gauss-Newton is just a tool for finding the minimum of a sum of squared terms, whether coming from data fitting or from other source.

Also, the independent variables play no role in the Gauss-Newton method. It was not clear from the article what their role was, nor is that relevant in this article, rather only to specific applications. I removed their mention. I would welcome their addition back if it is well-explained. Oleg Alexandrov (talk) 03:33, 29 February 2008 (UTC)

Oleg, you simply don't know what you are talking about. There are always independent variables, otherwise the function ${\displaystyle f_{i}(\beta )}$ would be the same for all i. We could write ${\displaystyle \mathbf {r=y-f(X,{\boldsymbol {\beta }})} }$. There are two cases
1. All ${\displaystyle X_{ij}}$ are functions of an independent variable, x. This is the traditional application of GN.
2. Each ${\displaystyle X_{ij}}$ is an independent variable in its own right. This is a relatively recent development. This discussion would be much clearer if you cited an application for it, something which you have, so far, refused to do.
The role of x is detailed in non-linear least squares. Indeed, the whole theoretical background is there. My position is that the GN article should have been merged with non-linear least squares, as I proposed. This article adds nothing of substance to what is contained there. I defer to you in leaving it as a separate article, but that does not mean that all the background has to be duplicated.
My introduction indicated the two types of application. Yours does not. It is not acceptable. Petergans (talk) 10:42, 29 February 2008 (UTC)

Peter, you should step out of your narrow area of specialization. You are identifying Gauss-Newton with regression. Gauss-Newton does not model anything, it is an algorithm to find the minimum of a sum of squared residuals, whether those residuals come from what you are familiar with, or whether they don't.

Let us delay the discussion of independent variables for later. I think it is just a misunderstanding of terminology. Let us focus on what the purpose of Gauss-Newton is. See what I wrote in the article. Oleg Alexandrov (talk) 15:34, 29 February 2008 (UTC)

It is not a narrow area of specialization. There is a vast literature where GN is used for data fitting. You have not come up with a single example of another use. Your edit has removed essential material. Please take more care and try to see that I am seeking a compromise that covers both our points of view. Petergans (talk) 16:41, 29 February 2008 (UTC)

Here is an application along the lines of what I use in my work: find the approximate solution of the system

${\displaystyle \sin \beta _{1}+\beta _{1}\cos \beta _{2}=1}$
${\displaystyle e^{\beta _{1}}-\beta _{2}/\beta _{1}=-1.}$

The residuals are the right-hand side minus the left-hand side.

There is no modeling involved here. Modeling has been done beforehand, and what you model (data fitting) is different than what I model (finding the shape of electronic circuits). Gauss-Newton is about finding the minimum. Please don't confuse an application with a tool (out of many) used for that application. Oleg Alexandrov (talk) 16:56, 29 February 2008 (UTC)

What you describe is a problem of solving a set of nonlinear simultaneous equations, ${\displaystyle f_{1}({\boldsymbol {\beta }})=0}$ and ${\displaystyle f_{2}({\boldsymbol {\beta }})=0}$. It is true that one way to do this is to use the GN formalism. We do the same thing in Determination of equilibrium constants#Speciation calculations. However, the minimum value of the objective function is zero in this case, so in reality this is Newton's method for finding the zeros of the functions, with first derivatives only. In fact, the normal equations collapse to ${\displaystyle \mathbf {J\delta \beta =r} \!}$, which makes this clear: ${\displaystyle \beta ^{k+1}=\beta ^{k}+J^{-1}r}$. The only advantage (that I know of) of using the normal equations is that ${\displaystyle \mathbf {J^{T}J} }$ is symmetric and should be positive definite, which can be useful if J is ill-conditioned with respect to inversion. (I avoid using terms like singular and rank when dealing with numerical matrices).
I suggest that the application of GN to the solution of nonlinear simultaneous equations be treated in a separate section from the application to least squares systems. The implementation is different because an "exact" solution exists, at least in principle and the residuals that you defined above go to zero. The alternatives to GN for least squares systems are discussed in nonlinear least squares. Petergans (talk) 22:28, 29 February 2008 (UTC)

That the number of equations equals the number of variables above was not intentional. Here is one more equation, for a system of three equations:

${\displaystyle \sin \beta _{1}+\beta _{1}\cos \beta _{2}=1}$
${\displaystyle e^{\beta _{1}}-\beta _{2}/\beta _{1}=-1.}$
${\displaystyle \beta _{1}-\beta _{1}\beta _{2}=2.}$

The point is still the same. The objective of Gauss-Newton is not to model anything, the objective is to find the minimum of a sum of squared residuals. You are confusing an application (data fitting, a very important one), with an algorithm.

The way you state the algorithm is confusing for people who don't use Gauss-Newton for data fitting. You must separate the algorithm from its application. Oleg Alexandrov (talk) 03:24, 1 March 2008 (UTC)
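For concreteness, a Gauss-Newton pass on the three-equation system above can be sketched in a few lines of NumPy. The starting point and iteration count are arbitrary illustrative choices, and no claim is made that the undamped iteration converges from every start; the residuals are taken as right-hand side minus left-hand side, as stated above.

```python
import numpy as np

# Residuals of the three equations above: RHS minus LHS.
def residuals(b):
    b1, b2 = b
    return np.array([
        1.0 - (np.sin(b1) + b1 * np.cos(b2)),
        -1.0 - (np.exp(b1) - b2 / b1),
        2.0 - (b1 - b1 * b2),
    ])

# Analytic Jacobian, J_ij = d r_i / d beta_j.
def jacobian(b):
    b1, b2 = b
    return np.array([
        [-(np.cos(b1) + np.cos(b2)),  b1 * np.sin(b2)],
        [-(np.exp(b1) + b2 / b1**2),  1.0 / b1],
        [-(1.0 - b2),                 b1],
    ])

beta = np.array([1.0, -1.0])
for _ in range(8):
    r, J = residuals(beta), jacobian(beta)
    # Shift vector from the normal equations (J^T J) delta = -J^T r;
    # np.linalg.lstsq(J, -r) would give the same step for a full-rank J.
    try:
        delta = np.linalg.solve(J.T @ J, -J.T @ r)
    except np.linalg.LinAlgError:
        break
    if not np.all(np.isfinite(delta)):
        break
    beta = beta + delta
```

Note that flipping the sign convention of the residuals flips the Jacobian as well, so the computed step is the same either way.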

OK, I gave it one more try. Please discuss here a bit rather than reverting as you've been doing a lot lately.
I have stated the objective of the algorithm in a language free from terminology from data fitting, since that one is alien to somebody not in your area. I put the application to data fitting right below. I hope that will make you happy. Oleg Alexandrov (talk) 16:15, 1 March 2008 (UTC)
Acceptable. I've just cleaned up the English a bit and used m>n, because the essence of GN is that it is used for overdetermined systems. The WP equation interpreter was malfunctioning when I did it, but hopefully the text is alright.
Incidentally, Nocedal is wrong about why the second term in the Hessian can be ignored. It has nothing to do with linearity of residuals as such. Any function which has a minimum can be approximated by a parabola with respect to each parameter in the vicinity of the minimum, just as any curve can be approximated by a sequence of straight-line segments. This is the same thing as saying that the model function is linear in the parameters at the minimum: linear model -> quadratic SSQ. In fact, if the model function is linear in the parameters, so are the residuals. Therefore, being close to the minimum is a sufficient condition by itself. When the parabolic approximation is valid GN converges in one iteration, just like linear least squares. Petergans (talk) 08:29, 2 March 2008 (UTC)
No Peter, it is not so simple. You are right that any function ${\displaystyle g(x)}$ can be approximated by a parabola around the minimum ${\displaystyle x_{0}}$. That parabola approximation must involve up to second derivatives, ${\displaystyle g(x_{0}),}$ ${\displaystyle g'(x_{0}),}$ ${\displaystyle g''(x_{0})}$ (see Taylor expansion).
Note however that Gauss-Newton does not use second derivatives, only first derivatives. So, you first approximate the function by a parabola, which is good, no problem here, but then you further approximate the parabola by throwing away second order derivatives. Here is where you can make a big error. Oleg Alexandrov (talk) 16:04, 2 March 2008 (UTC)
Another misunderstanding. When the model is linear in the parameters, the 2nd derivative of f with respect to a parameter is zero. The sum of squares, S, is quadratic in the parameters, the gradient of S is zero at the minimum, but obviously the curvature of S is not zero. Don't jump so hastily to the conclusion that I am in error. I've been doing this stuff for centuries. Petergans (talk) 09:55, 4 March 2008 (UTC)
I am not jumping to any conclusion. My comment was carefully worded and respectful.
The model is not always linear in the parameters. And then the second order derivatives may not be small.
I think you are assuming that the second order terms are less relevant than the first order terms. That is true if you multiply them by ${\displaystyle \Delta \beta ^{2}}$ but this is not the case when you drop the second derivative terms from Gauss-Newton.
I suggest you do a careful calculation, intuition is not a good guide here. Oleg Alexandrov (talk) 16:59, 4 March 2008 (UTC)
Intuition has nothing to do with it. I did the calculations long ago. Look at the diagram (copyright material) from my book at Hyperquad downloads. This shows contours of the sum of squares for the model ${\displaystyle y=pe^{-qx}}$. As you get close to the minimum the contours tend towards being ellipses, that is, S tends towards being quadratic in the parameters. In other words, the system approximates to a linear least squares system in the vicinity of the minimum. Petergans (talk) 18:32, 4 March 2008 (UTC)
Peter, exactly what are you claiming? Of course you can ignore the second term in the Hessian and the method will converge. You can also ignore both terms in the Hessian and get gradient descent which also converges. In what sense is it okay to drop the second term, but not the first term? -- Jitse Niesen (talk) 22:48, 4 March 2008 (UTC)
That the contours are elliptical just proves that Newton's method converges. What you need is a proof that Gauss-Newton converges without the second-derivative terms. Until then, I think we can leave the article the way it is, following the Nocedal reference which is very respectable. Oleg Alexandrov (talk) 04:03, 5 March 2008 (UTC)

## Small recent notation changes

Peter, I wonder if you agree with what I changed recently in the article (see the history and explanations there of what I did). I plan to further work on it, but want to make sure there are no disagreements so far. Oleg Alexandrov (talk) 04:35, 4 March 2008 (UTC)

I've simplified the notation by making it clearer that the normal equations etc. apply at a particular iteration. The principle is that when something applies over a range of equations it simplifies the notation by making this explicit at the beginning, rather than repeat it in each equation. I hope you will agree that as it stands there is no scope for misunderstanding. Petergans (talk) 09:55, 4 March 2008 (UTC)

## Second derivatives

I see that the relevant text has been changed to

This approximation is justified when the residuals ${\displaystyle r_{i}}$ are close to zero (which is true when close to the minimum in many applications) or when the residuals are almost linear, as then the second-order partial derivatives are close to zero.

We've been through this already, and it was accepted that the magnitude of the residuals is irrelevant.

My "claim" is that the linearity of the residuals is a consequence, not an explanation. The fundamental reason why the second derivatives ${\displaystyle {\frac {\partial ^{2}f}{\partial \beta _{i}\partial \beta _{j}}}}$ and ${\displaystyle {\frac {\partial ^{2}r}{\partial \beta _{i}\partial \beta _{j}}}}$ go to zero is that, in the vicinity of the minimum the model is approximately linear in the parameters and S is approximately quadratic in each parameter. These are two aspects of the same condition. If the model is approximately linear in the parameters, then so are the residuals. Therefore the only condition required for the second derivatives to be negligible is that the parameter values are close to their optimal values, that is, S is close to its minimum value. Petergans (talk) 09:01, 5 March 2008 (UTC)

I never accepted that the magnitude of the residuals is irrelevant, and I don't think Oleg has either. I don't believe that ${\displaystyle {\frac {\partial ^{2}r}{\partial \beta _{i}\partial \beta _{j}}}}$ goes to zero when you approach the minimum; what is the basis for that claim? -- Jitse Niesen (talk) 12:23, 5 March 2008 (UTC)

Jitse, this is what you said above: Peter, you're right that S becomes quadratic near the optimal β. I took that to mean that you accepted the irrelevancy of the magnitude of the residuals. Sorry if I misunderstood. Oleg accepted the proposition explicitly. In any case, if S has become quadratic, the model must have become linear.

${\displaystyle \mathbf {r=y-X\beta ;\ r^{T}r=y^{T}y-y^{T}X\beta -\beta ^{T}X^{T}y+\beta ^{T}(X^{T}X)\beta } }$

This is a particular property of a least squares system and does not apply to general optimization problems. There's nothing highfaluting here. The model can be expanded about the optimal β as a Taylor series to first order only; it is then linear in the parameters and S is quadratic. The reason that the nonlinear system converges is that it approximates to a linear one at convergence and we know that, in a linear least squares system, residual magnitudes are irrelevant and

${\displaystyle {\frac {\partial ^{2}r_{i}}{\partial \beta _{j}\partial \beta _{k}}}={\frac {\partial \left(-X_{ij}\right)}{\partial \beta _{k}}}=0}$

Petergans (talk) 17:04, 5 March 2008 (UTC)

Peter, I did not say the second-order terms go to zero. I wrote there that there are conditions when they go to zero, those conditions are not always true.
Look at the problem of minimizing ${\displaystyle S(x)=(1+x^{2})^{2},}$ so the residual is ${\displaystyle r(x)=1+x^{2}.}$ It is obvious that the minimum of ${\displaystyle S(x)}$ is at ${\displaystyle x=0,}$ yet, ${\displaystyle r(0)r''(0)=2,}$ which is not zero. Oleg Alexandrov (talk) 04:15, 6 March 2008 (UTC)
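A quick numeric check of this example (illustrative): with scalar Jacobian ${\displaystyle r'(x)=2x}$ the Gauss-Newton update is ${\displaystyle x-r/r'=(x^{2}-1)/(2x)}$, which has no real fixed point (a fixed point would require ${\displaystyle x^{2}=-1}$), so the iteration cannot settle at the minimizer ${\displaystyle x=0}$, while the dropped term ${\displaystyle rr''}$ equals 2 there.

```python
# Residual r(x) = 1 + x**2; cost S(x) = r(x)**2 has its minimum at x = 0.
def gn_step(x):
    r = 1.0 + x**2
    J = 2.0 * x                    # scalar Jacobian r'(x)
    return x - (J * r) / (J * J)   # delta = -(J^T J)^{-1} J^T r = -r/J

x = 0.1
trace = []
for _ in range(5):
    x = gn_step(x)
    trace.append(x)

# The term r * r'' that Gauss-Newton drops, evaluated at the minimizer x = 0:
dropped = (1.0 + 0.0**2) * 2.0     # equals 2, not close to zero
```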
One possible source of confusion is that when Nocedal and Wright say that the residuals are almost linear near the minimum, they mean that the second and higher derivatives are small. On the other hand, when we are saying that the sum of squares is approximately quadratic near the minimum, we mean that the first derivative is small (and zero at the minimum). So these are subtly different statements.
Peter, I'm not sure what you mean when you say that the "model can be expanded about the optimal β as a Taylor series to first order only; it is then linear in the parameters". Are you saying that this cannot be done at any other β? If so, why?
To be honest, I do think there is more to it than the comment by Nocedal and Wright shows. I think that another effect which makes the second term in the Hessian smaller is that the residual changes sign which leads to cancellations. But I'm not sure, and it would be original research to put that in. -- Jitse Niesen (talk) 11:59, 6 March 2008 (UTC)

Yes, the first-order expansion of the model about ${\displaystyle {\hat {\boldsymbol {\beta }}}}$ linearizes it and is a good approximation because it is consistent with the fact that S is quadratic at its minimum. Those second derivatives do tend to zero. Also, as the shift vector gets very small, all terms in the expansion of the model in terms of ${\displaystyle \delta \beta }$ above the first term become negligible.

${\displaystyle f({\boldsymbol {\beta }})=f({\boldsymbol {\beta }}^{k})+J\ \delta \beta +{1 \over 2}\delta \beta ^{T}H^{f}\delta \beta \dots \approx f({\boldsymbol {\beta }}^{k})+J\ \delta \beta }$

Far from the minimum S is not quadratic, so the linearized model cannot be a good approximation. Once the refinement reaches a region where ${\displaystyle \delta \beta }$ is small, the linearized model is a good approximation and GN converges rapidly, as the system appears to be nearly a linear least squares one. I revised the article in this sense earlier today. I'll have to go to the library to get a citation, but apart from that can we please close this discussion? Petergans (talk) 16:25, 6 March 2008 (UTC)

Peter, please derive Gauss-Newton from Newton's method for the very simple residual ${\displaystyle r(\beta )=1+\beta ^{2},}$ for which the cost function is ${\displaystyle S(\beta )=(1+\beta ^{2})^{2}.}$
The minimum is at ${\displaystyle \beta =0.}$ The term
${\displaystyle r_{i}{\frac {\partial ^{2}r_{i}}{\partial \beta _{j}\partial \beta _{k}}}}$
you claim is close to zero actually equals 2. Oleg Alexandrov (talk) 01:26, 7 March 2008 (UTC)

The theory applies to a sum of squares. The cost function ${\displaystyle S(\beta )=(1+\beta ^{2})^{2}}$ does not qualify. Petergans (talk) 09:21, 7 March 2008 (UTC)

Here is a sum of three terms in two variables:

${\displaystyle S(\beta _{1},\beta _{2})=(1+\beta _{1}^{2})^{2}+(1+2\beta _{1}^{2})^{2}+(1+5\beta _{2}^{2})^{2}.}$
The same argument stands.
You now simply claim that the second term is ignored, but you removed the explanation about when this is so. That term cannot always be ignored. I'll put the explanation back. Oleg Alexandrov (talk) 15:50, 7 March 2008 (UTC)

## Optimization

Don't be so hasty. I've got Björck's book out of the library and this has helped to clarify the nature of your objections. Essentially, in data fitting the theory stands as it is. However, in optimization there is no model function. Instead of a model function, functions are defined whose sum of squares is to be minimized, so the theory won't apply. Björck gives the even simpler example ${\displaystyle r_{1}=\beta +1,\ r_{2}=\lambda \beta ^{2}+\beta -1}$ which is cited in R. Fletcher, Practical Methods of Optimization, Wiley, 1987. Björck states that the GN method is not locally convergent for |${\displaystyle \lambda }$| >1. I've plotted the sum of squares and for λ>1 there is a maximum at β=0, which certainly complicates matters. The sum of squares is quartic in β so it's not surprising that there are multiple minima. See non-linear least squares#multiple minima.

What the GN article needs is a clear distinction (separate sections?) between modeling and optimization since the same criteria cannot be applied in both applications. Let me think about it and come up with a proposal. Petergans (talk) 17:35, 7 March 2008 (UTC)

Please note that Gauss-Newton is an optimization algorithm, not a data-fitting algorithm. I do not mind if you add here some theory of what happens in the data fitting case, but that should not obscure the fact that Gauss-Newton is a general algorithm used in plenty of other applications. Oleg Alexandrov (talk) 19:11, 7 March 2008 (UTC)

## Gauss-Newton used in the literature

I made a google books search for the Gauss-Newton method. Here are the first ten results, with the book titles.

At references 1, 2, 3, 6, 7, 10, the search direction is computed in terms of residuals,

${\displaystyle (\mathbf {J_{r}} ^{T}\mathbf {J_{r}} )\delta {\boldsymbol {\beta }}^{k}=-\mathbf {J_{r}} ^{T}\mathbf {r} ,}$

I was not able to find a formula at references 4, 5, 8, 9.

Nowhere in those references have I seen model functions as in this article.

With due respect to Petergans, who is surely an expert in his own area, I do not believe it is appropriate to treat this article using model functions. Model functions will show up in applications of Gauss-Newton to data fitting, but they may not show up in other applications.

We need to state this article in terms of residuals. Any particular application will have its particular residuals. I do not think it is right to state this article only in terms of one particular application. Oleg Alexandrov (talk) 17:07, 8 March 2008 (UTC)

To be more precise, the derivation of the algorithm should be in terms of residuals, not model functions. To help the readers at non-linear least squares, an expression for the shift vector using model functions can be stated. I hope this will satisfy Petergans's concern for compatibility of those articles, while stating the article in its proper framework, to be usable for all applications. Oleg Alexandrov (talk) 17:49, 8 March 2008 (UTC)

## A new beginning

This article has hitherto been based on a misleading premise, namely, that there is a single algorithm which we call Gauss-Newton. In fact there are two areas of application which have nothing to do with each other. It is an instance of convergent evolution where a similar end-product is produced by different pathways. This has been brought out into the open by the discussion above concerning the approximation to the Hessian, where two quite different criteria apply.

1. Data modeling owes nothing to Newton’s method of function minimization. It is motivated by the Gauss-Markov-Aitken theorem and by the maximum likelihood criterion. Convergence is guaranteed by the fact that the linearization process is valid at the minimum of S. A consequence of this is that the second derivative term in the Newton’s method Hessian tends to zero, though this is immaterial to the refinement process.
2. In optimization the least squares criterion is used as a criterion of best fit, without external motivation. Gauss-Newton is used as a simplification of Newton’s method in order to avoid the need to calculate second derivatives. This method of linearizing the system depends on the (arbitrary) magnitudes of the function values and of the first and second derivative terms in the Hessian.

The descriptions of the two applications are to some extent incompatible. In data modeling a sum of squared residuals is minimized, but in optimization it is a sum of squared function values. In data modeling there is always an independent variable, but in optimization there may not be an independent variable.

It follows that while the two applications share some things in common, they also need to be treated separately when it comes to the detail. I have tried to do this in the revised article, giving both applications approximately equal importance, as is done in Björck's book. Note also a reference to Quasi-Newton methods has been added.

Regarding expertise, no-one is an expert in everything. A great strength of WP is that it brings together editors of every conceivable kind and that we respect and learn from each other. Petergans (talk) 20:02, 9 March 2008 (UTC)

Good you tried. But they are not two different algorithms. The goal is always the same, to find ${\displaystyle \beta }$ where some sum of squares is smallest. The shift vector equation is also identical. The way you adjust the parameters is identical. The way you solve the normal equations is also identical.
Currently, it is impossible to understand from the article which is the algorithm, which is the derivation, and which is the application. Perhaps you could give it another try. :) Oleg Alexandrov (talk) 03:56, 10 March 2008 (UTC)
The new try is based on chapter 9 of Björck's book which is thoroughly sourced (860 references to the original literature). It removes all possible confusion such as arose previously when a "single algorithm" notation was used for incompatible applications. Although the updating formula is the same for both applications, the properties of the algorithm are different because of the different way the formula is obtained. Björck goes into much more detail than is suitable for WP. Any proposal for a substantial change should be equally well sourced. Petergans (talk) 09:43, 12 March 2008 (UTC)

The book you use may have cited 860 references, but the way the article looks is not good, so let's focus on that.

Peter, the Gauss-Newton algorithm is very simple

1. Residuals are given
2. Normal equations are written
3. Normal equations are solved
4. Parameters are updated by the shift vector

You are forcing any reader looking for the algorithm to read through two derivations and data fitting theory, just to get there. This is bad style and poor writing.

Please state the algorithm first. Its properties and uses in data fitting belong to a separate section, after the algorithm. Oleg Alexandrov (talk) 15:40, 12 March 2008 (UTC)
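The four steps fit in about a dozen lines of NumPy; a generic sketch (the function and variable names are illustrative), applied here to a small zero-residual fitting problem:

```python
import numpy as np

def gauss_newton(residual, jac, beta0, iters=20):
    """Minimize sum(residual(beta)**2) by Gauss-Newton iteration."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(iters):
        r = residual(beta)                 # 1. residuals are given
        J = jac(beta)                      #    Jacobian of the residuals
        A, g = J.T @ J, J.T @ r            # 2. normal equations A @ delta = -g
        delta = np.linalg.solve(A, -g)     # 3. normal equations are solved
        beta = beta + delta                # 4. parameters updated by the shift vector
    return beta

# Example: recover b in y = exp(b*x) from exact data, so r(beta*) = 0.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.exp(0.5 * x)
beta = gauss_newton(
    residual=lambda b: np.exp(b[0] * x) - y,
    jac=lambda b: (x * np.exp(b[0] * x)).reshape(-1, 1),
    beta0=[1.0],
)
```

The same loop applies unchanged whether the residuals come from data fitting or from any other source; only `residual` and `jac` change.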

I based my article on a well-sourced text by an independent expert. You are trying to force your point of view. I have no further comment. Petergans (talk) 16:31, 12 March 2008 (UTC)
I am sorry you think like that. I believe both of us have the best interests of the reader in mind. I have also pointed above to ten book expositions of how Gauss-Newton is presented in the literature.
I have raised this edit conflict at Wikipedia talk:WikiProject Mathematics#Request for outside editorial review on the Gauss-Newton algorithm article. Oleg Alexandrov (talk) 02:43, 13 March 2008 (UTC)

## Note

I have added the formula

${\displaystyle \delta {\boldsymbol {\beta }}=-\mathbf {\left(J_{r}^{T}J_{r}\right)^{-1}J_{r}^{T}r} ,}$

before the derivations, because it is a good teaching practice to state a result before doing two derivations to get it, I think.

There is no confusion with its usage at non-linear least squares, because here the Jacobian is explicitly that of the residuals r, while there it is taken of the model function, which accounts for the sign difference.

Is that reasonable? Oleg Alexandrov (talk) 05:31, 14 March 2008 (UTC)

Oleg, we have been over this before. The minus sign will create confusion for readers who come to this page from one of the least squares or regression pages. What you have added is mathematically correct, but potentially misleading because the reader will not necessarily be aware that there are alternative ways of specifying the Jacobian. The alternatives are spelled out later in the article. Petergans (talk) 09:57, 14 March 2008 (UTC)

Peter, I will very respectfully ask you to consider this matter very carefully. I agree that even though data fitting has not yet been mentioned up to that point in the article, and even though the formula very explicitly says that the Jacobian is of r, and there were no model functions mentioned so far, the formula may still have the potential to cause some confusion for data fitting readers who are accustomed to seeing the formula in one way.
However, I hope you also agree that, unlike in a data fitting book, we need to strive to aim at the generic reader who visits this page. There is a very vast literature using Gauss-Newton for arbitrary residuals. I hope you are able to see the following book pages in your browser: 1 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, (these came from a google books search).
You also need to think of the reader who does not have the time to read two technical derivations in order to find the normal equations describing the method. That reader may not even have the data fitting skills and mathematical background to read all that.
Also, the data fitting reader can easily derive the formula for model functions by plugging in the residuals used in data fitting into the formula for the shift vector.
It is my sincere hope that you will agree that there is more than one factor to be considered when deciding whether the formula should be there or not. I would argue that on the whole we should err on the side of putting the formula in, as the advantages outweigh the disadvantages. I would like to challenge you to consider this. Thanks. Oleg Alexandrov (talk) 12:58, 14 March 2008 (UTC)

Oleg, I have never doubted the importance of optimization applications. The issue in question is how to address the two constituencies of readers. My compromise was based on the principle that each constituency should be addressed individually in so far as there are differences between them, even-handedly, that is, without prejudice as to which is "more important". By explicitly labelling sections for each constituency any reader can skip to the section of interest. The reader should not be expected to derive anything of substance.

I worked very hard to find a compromise that I thought would deal with all the issues raised earlier on this talk page. What is so unacceptable about it? Now you are going over the same ground again and again. I am fed up to the back teeth with this constant bickering. It has to stop! Let's move on. Petergans (talk) 14:52, 14 March 2008 (UTC)

I would like to first point out that "optimization" means "find a minimum". It is not an application. As such, it is incorrect to say data fitting is one application and optimization is another.
I have added the data fitting formula too, as it is an important case. I don't mind you talking about data fitting, all I've been trying to do is avoid using specialized data fitting terminology where it can make the article hard to read by other people. Oleg Alexandrov (talk) 16:04, 14 March 2008 (UTC)

## Cleanup

I have cleaned up the first half a bit [7] and removed some redundancies. I did not change the meaning of any math, hopefully. I did add at the beginning that the r's are residuals, because that's what they seem to be at the end of the algorithm section; otherwise the "residuals" are pulled out of thin air. I have renamed "Data modelling" to "Data fitting" and "Optimization" to "General case" because that's what it is. But I had to stop in the middle because of the following confusion. I did not want to step on any toes, as the difference might be dear to some people who come to this from different angles.

There is really no difference between the "Data fitting" a.k.a. "Data modelling" and "General case" a.k.a. "Optimization" at all other than notation, at least in the article as it is. You can write any function ${\displaystyle r(\beta )}$ in the form ${\displaystyle r(\beta )=y-f(\beta )}$. Yet the discussion of "convergence properties" under data fitting alludes to global convergence (without ever saying so fully, and actually contradicting itself... it starts with "convergence guaranteed" and ends with "chaotic"??) while the discussion in the second case is much less assertive about convergence. In any case, it seems all you get is local convergence: if you are close enough to the solution and the functions r are close to linear, the algorithm converges; otherwise it is a matter of luck. Are there really some distinctive convergence properties in the "data fitting" case, and if so, what are the forgotten assumptions on the functions f that guarantee those properties? Jmath666 (talk) 08:18, 15 March 2008 (UTC)

Never mind, arbitrary ${\displaystyle r_{i}(\beta )}$ is more general than ${\displaystyle r_{i}(\beta )=y_{i}-f(x_{i},\beta )}$ with one ${\displaystyle f}$. Still, does this special form guarantee global convergence and/or uniqueness, under what assumptions? Jmath666 (talk) 08:29, 15 March 2008 (UTC)

The statement of quadratic convergence starting close to the minimum is indeed very dubious, even assuming the particular residuals with one ${\displaystyle f}$. Peter, do you have a reference for it? Oleg Alexandrov (talk) 15:37, 15 March 2008 (UTC)
No problem about quadratic convergence. However, it is not global, which should be clear. Jmath666 (talk) 18:39, 15 March 2008 (UTC)
Also, I do not see why exactly the convergence arguments or derivation should be different in the two cases. Jmath666 (talk) 19:19, 15 March 2008 (UTC)
As I said before, I too want to see some evidence for (local) quadratic convergence. On the other hand, the section "Convergence properties (optimization)" says that you don't get local convergence in some situations. I think that the method always converges locally, but it converges faster in the two cases mentioned in that section. Incidentally, the method is globally convergent (modulo technicalities) if a suitable line search is adopted (Nocedal & Wright).
I would be very interested in what Åke Björck has to say about the difference between data fitting (where there is a model function) and general optimization problems. Nothing I read makes this distinction and I don't see why it would have any effect on the derivation and convergence theory of the method. -- Jitse Niesen (talk) 19:21, 15 March 2008 (UTC) (via edit conflict); wrong comments struck out 01:25, 16 March 2008 (UTC)
It makes no difference mathematically, but it will make a difference in what the "usual" applications are and thus the "usual" behavior is, which is the angle I think Petergans is coming from. Jmath666 (talk) 19:27, 15 March 2008 (UTC)

## Convergence?

I am sorry, but on second thought I must join the voices calling for a proof of quadratic convergence. Even worse, I do not see local convergence at all.

### Derivation of the algorithm

We want to solve

${\displaystyle S=\sum \limits _{i=1}^{m}r_{i}^{2}\left(\beta _{1},\ldots ,\beta _{n}\right)\rightarrow \min }$

At a local minimum,

${\displaystyle {\frac {\partial S}{\partial \beta _{j}}}=0\forall j\Longleftrightarrow \sum \limits _{i=1}^{m}r_{i}{\frac {\partial r_{i}}{\partial \beta _{j}}}=0\forall j\Longleftrightarrow r^{T}J=0}$
where

${\displaystyle J_{ij}={\frac {\partial r_{i}}{\partial \beta _{j}}}.}$

We seek an increment ${\displaystyle \delta }$ to the current iterate ${\displaystyle \beta ^{t}}$ such that

${\displaystyle r^{T}\left(\beta ^{t}+\delta \right)J\left(\beta ^{t}+\delta \right)\approx 0.}$

The Gauss-Newton method is obtained from the approximations (Taylor expansions)

{\displaystyle {\begin{aligned}r\left(\beta ^{t}+\delta \right)&\approx r\left(\beta ^{t}\right)+J\left(\beta ^{t}\right)\delta \\J\left(\beta ^{t}+\delta \right)&\approx J\left(\beta ^{t}\right)\end{aligned}}}

which gives (all quantities are evaluated at ${\displaystyle \textstyle \beta ^{t}}$)

${\displaystyle \left(r+J\delta \right)^{T}J=0,}$

which is equivalent to ${\displaystyle \textstyle J^{T}\left(r+J\delta \right)=0}$ (which happen to be the normal equations for solving ${\displaystyle \textstyle r+J\delta =0}$ by least squares), which gives the increment (all quantities are evaluated at ${\displaystyle \textstyle \beta ^{t}}$)

${\displaystyle \delta =-\left(J^{T}J\right)^{-1}J^{T}r.}$
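For concreteness, the increment formula translates directly into code. A minimal sketch (my own, not taken from any reference; `r` and `J` are user-supplied callables returning the residual vector and its Jacobian):

```python
import numpy as np

def gauss_newton(r, J, beta, steps=20):
    """Plain Gauss-Newton: solve the normal equations
    J^T J delta = -J^T r at the current beta, then update beta."""
    beta = np.asarray(beta, dtype=float)
    for _ in range(steps):
        Jb = J(beta)   # m-by-n Jacobian of the residuals at beta
        rb = r(beta)   # m-vector of residuals at beta
        delta = np.linalg.solve(Jb.T @ Jb, -(Jb.T @ rb))
        beta = beta + delta
    return beta
```

In serious code one would solve the linearized least-squares subproblem by a QR factorization of `Jb` rather than forming `Jb.T @ Jb`, which squares the condition number.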

### Convergence

To analyze convergence, let ${\displaystyle \textstyle \beta ^{\ast }}$ be an exact solution, that is, ${\displaystyle \textstyle r^{T}\left(\beta ^{\ast }\right)J\left(\beta ^{\ast }\right)=0}$, and let ${\displaystyle \textstyle \beta ^{t}=\beta ^{\ast }+\varepsilon }$. Now from Taylor expansion,

{\displaystyle {\begin{aligned}J\left(\beta ^{t}\right)&=J\left(\beta ^{\ast }+\varepsilon \right)=J\left(\beta ^{\ast }\right)+O\left(\left\Vert \varepsilon \right\Vert \right)\\r\left(\beta ^{t}\right)&=r\left(\beta ^{\ast }+\varepsilon \right)=r\left(\beta ^{\ast }\right)+J\left(\beta ^{\ast }\right)\varepsilon +O\left(\left\Vert \varepsilon \right\Vert ^{2}\right)\end{aligned}}}

so from the update formula,

${\displaystyle \delta =-\left(J^{T}J\right)^{-1}J^{T}r=\left[-\left(J\left(\beta ^{\ast }\right)^{T}J\left(\beta ^{\ast }\right)\right)^{-1}J\left(\beta ^{\ast }\right)^{T}+O\left(\left\Vert \varepsilon \right\Vert \right)\right]\left[r\left(\beta ^{\ast }\right)+J\left(\beta ^{\ast }\right)\varepsilon +O\left(\left\Vert \varepsilon \right\Vert ^{2}\right)\right]}$
${\displaystyle \delta =-\varepsilon +O\left(\left\Vert \varepsilon \right\Vert \right)r\left(\beta ^{\ast }\right)+O\left(\left\Vert \varepsilon \right\Vert ^{2}\right),}$

which gives

${\displaystyle \beta ^{t+1}-\beta ^{\ast }=\varepsilon +\delta =O\left(\left\Vert \varepsilon \right\Vert \right)r\left(\beta ^{\ast }\right)+O\left(\left\Vert \varepsilon \right\Vert ^{2}\right)}$

So, unless ${\displaystyle r=0}$ at the solution, I do not see a reason to expect convergence at all. The verbal argument in Convergence properties (data modeling) sounds nice, but without an actual proof it is just words. Where did I go wrong? Jmath666 (talk) 21:38, 15 March 2008 (UTC)

I realize the above is essentially the same as said in Convergence properties (optimization). I just do not see what advantage one gets from having the function f, which would bypass this difficulty. Jmath666 (talk) 21:51, 15 March 2008 (UTC)

Also, I realize that the point I raised is similar to, if not the same as, "ignoring the Hessian", which was discussed above but did not seem to have been resolved.

Two trivial observations:

1. If the method converges it will converge to a stationary point of S because ${\displaystyle \delta \to 0}$ implies the stationarity condition ${\displaystyle J^{T}r=0}$ for the limit. (Under the technical assumption that ${\displaystyle \|(J^{T}J)^{-1}\|}$ is bounded in a neighborhood of the limit, of course.)

2. The general "Optimization" case is really not any more general than "Data modeling" (or fitting). Consider arbitrary ${\displaystyle r_{i}(\beta )}$, let ${\displaystyle y_{i}=0,x_{i}=i}$, and define ${\displaystyle f(x_{i},\beta )=r_{i}(\beta )}$, interpolating ${\displaystyle f}$ between the values ${\displaystyle x_{i}}$ for constant ${\displaystyle \beta }$. This interpolation can be made as smooth as desired (splines) and the resulting function ${\displaystyle f}$ will be smooth if the ${\displaystyle r_{i}}$ are smooth. Thus anything that is true about the "optimization" is true also about the "data fitting", particularly convergence properties or the lack of them.

Again, there seem to be some tacit assumptions in the "data fitting" part about what f may be for all those statements about convergence to be true. If there is no mathematical basis for the method to converge for an arbitrary f (or for arbitrary ${\displaystyle r_{i}}$, which is, as we have seen, the same thing) but the method does converge in many examples seen in applications, it should be said so clearly. Jmath666 (talk) 23:19, 15 March 2008 (UTC)
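Observation 2 above is easy to check numerically. A sketch (mine; `np.interp` gives only piecewise-linear interpolation, whereas splines would provide the smoothness mentioned above, but it illustrates the construction; the residual pair is just an arbitrary example):

```python
import numpy as np

# Arbitrary residual functions r_i(beta), here a simple pair for scalar beta.
def residuals(beta):
    return np.array([beta + 1.0, 2.0 * beta**2 + beta - 1.0])

x_nodes = np.array([1.0, 2.0])    # x_i = i
y = np.zeros_like(x_nodes)        # y_i = 0

def f(x, beta):
    # Model function with f(x_i, beta) = r_i(beta), interpolated between nodes.
    return np.interp(x, x_nodes, residuals(beta))
```

By construction the data-fitting residuals y_i − f(x_i, β) equal −r_i(β), so the two least-squares objectives coincide.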

Yes, I was too hasty when I said that the method always converges locally. In fact, it looks like Peter Gans mentions under #Optimization an example from Fletcher's book where the method does not converge. If this is indeed the case then that should of course be mentioned. Any counterexample can be translated to a data fitting setting, as you say. -- Jitse Niesen (talk) 01:25, 16 March 2008 (UTC)
Nice observation with manufacturing a data fitting function out of arbitrary residuals. Having another derivation for the case of data fitting parameters then makes no sense (it did not make sense even so far, but now at least there is a solid argument against it). Oleg Alexandrov (talk) 05:32, 16 March 2008 (UTC)

To make it easier for Peter, who is not a mathematician, here's an example for which Gauss-Newton fails, which contradicts what he claims in the article for data fitting. Take the functions

${\displaystyle r_{1}=\beta +1,\ r_{2}=\lambda \beta ^{2}+\beta -1}$

mentioned above, for which Gauss-Newton does not converge when ${\displaystyle \lambda >1}$ (according to Peter). Consider the model function

${\displaystyle f(x,\beta )=x\beta ^{2}+\beta +1-x.\,}$

Look at the data points ${\displaystyle x_{1}=0,}$ ${\displaystyle x_{2}=2}$ with ${\displaystyle y_{1}=0,}$ ${\displaystyle y_{2}=0.}$ A perfectly valid setting for data fitting. But then,

${\displaystyle f(x_{1},\beta )=\beta +1\,}$
${\displaystyle f(x_{2},\beta )=2\beta ^{2}+\beta -1\,}$

(so ${\displaystyle \lambda =2>1}$), and Gauss-Newton fails to converge, let alone converge quadratically. Oleg Alexandrov (talk) 05:42, 16 March 2008 (UTC)
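The failure is easy to reproduce numerically. A throwaway sketch (mine) of plain Gauss-Newton on the two residuals above: near β = 0 the iteration behaves like β ← λβ, so for λ = 2 the iterates are driven away from the stationary point at β = 0 (they end up at a different stationary point), while for λ = 1/2 they converge to it linearly.

```python
import numpy as np

def gn_scalar(lam, beta, steps):
    """Plain Gauss-Newton for r1 = beta + 1, r2 = lam*beta^2 + beta - 1."""
    for _ in range(steps):
        r = np.array([beta + 1.0, lam * beta**2 + beta - 1.0])
        J = np.array([1.0, 2.0 * lam * beta + 1.0])   # dr_i / dbeta
        beta = beta - (J @ r) / (J @ J)
    return beta
```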

(edit conflict) Oleg, I think what you put there now, with the Jacobian of f vs. the Jacobian of r, took care of the dual derivation nicely. Now we know that the dual "convergence properties" can go too, at least from the mathematical point of view. In light of the simple analysis I wrote above (sorry about the OR, but it was easier than digging in the books), some comments there start making sense to me.
Perhaps the discussion of convergence in the article could go somehow like this: Convergence can be quadratic in the regime when the iterations are far enough away for the last term to dominate (or is always quadratic when r=0 at the solution), but the iterates still need to be close enough to the solution to converge to it. When very close to the solution and r is not zero at the solution, the linear term will dominate and the iterations may converge linearly, or maybe not at all. The smaller the residual is at the solution and the closer r is to linear, the faster the linear convergence will be. It may also happen that the iterations jump around until eventually they get caught in some basin of attraction. In many cases of practical importance convergence is quadratic first and then fast linear because f is "nice" (= close enough to linear in beta) and/or the data fit is good.
From a note Peter left on my talk page [8]: "... The sections on convergence properties are intentionally vague about the nature of the minimum. This, too, is a case where data fitting and what you call general optimization are different. In data fitting multiple minima are rare except when fitting trigonometrical models. In general optimization they seem to be the rule rather than the exception...". Clearly, his POV is pragmatic, he cares about what happens in practice not about the analysis as much, and that is just as important to cover as the merciless math. Jmath666 (talk) 06:11, 16 March 2008 (UTC)

I took all the material concerning convergence from Björck. And yes, he includes the general case ${\displaystyle r_{1}=\beta +1,r_{2}=\lambda \beta ^{2}+\beta -1\,}$. I even went to the trouble of graphing the objective function in EXCEL. Try it for λ=2 and you might be surprised! However, one can't include everything in WP, one has to simplify...

The statement that "The GN method fails to converge" is the reason why I maintain that it is not a single algorithm. Jmath666 is right, I am an empiricist and I have to say that GN is used for data fitting of many different kinds without any concern about [ultimate] convergence except in ill-founded cases. Refinement of crystal structures in X-ray crystallography is an example; the use of GN is routine and many hundreds of structures are published every year.

The question of local and global minima is a general one for all optimization problems. If multiple minima exist, the one found by minimization will depend on the initial parameter estimates. This is not an issue specific to GN. I also show in my book that when multiple minima are present there is a region of parameter space where the normal equations matrix is singular, so GN will fail with some initial parameter estimates. I take the view that interested readers will read the cited publications for more detailed information than can, or should, be given in WP. Petergans (talk) 08:52, 16 March 2008 (UTC)

Petergans, I think you would agree that one could use the same GN computer code (with subroutines computing r and Jr as input) for data fitting as well as for optimization. That makes it the same algorithm by any definition I can imagine. If you use the word "algorithm" in a different meaning, please explain your definition. We know that convergence of GN is not guaranteed, both for application to data fitting and to optimization. However for many important applications in data fitting, GN tends to converge well and failure is rare, while in optimization applications, which tend to be more arbitrary, failure is more frequent. Why not just write it up like that? Jmath666 (talk) 15:00, 16 March 2008 (UTC)

## Merging the convergence and derivation sections

I removed the section on convergence for data fitting. Its main focus was the "region of quadratic convergence", and as it was shown by Jmath nicely above, there is no such thing. We do not include incorrect information in a Wikipedia article just because "in practice" it happens to work. The claim of quadratic convergence fails for very simple and nice model functions as shown above, that makes the claim incorrect for data fitting, as it is for the general case.

I suggest we also merge the derivations sections. Currently we have two derivation sections. One of them does explain when we can drop the second derivative term, the second one just drops it without explanation (and we see now that that term can't be dropped, even for data fitting).

I like what Jmath is proposing above for the merged convergence section. Oleg Alexandrov (talk) 15:38, 16 March 2008 (UTC)

Actually, for some problems (with small residuals at solution and/or close to linear; the same ones where the second derivative term can be dropped), convergence would be quadratic first and then linear convergence takes over; by that time the solution may be good enough so the user would see the quadratic convergence only. So perhaps the "region of quadratic convergence" can stay in some form. I wonder what the references have to say about that. Jmath666 (talk) 16:56, 16 March 2008 (UTC)
What the references say can be seen here (turn to page 260).
In order to get quadratic (or even superlinear) convergence, you must be able to really ignore the term dropped from Newton's method (residual times second-order derivative). I don't think you can count on that too often. As such, it is hard to pin down when you get quadratic convergence, and I think most of the time you get only superlinear.
Either we do a serious analysis here, or we don't mention a "region of quadratic convergence"; you must be lucky to ever be in that mode, and it is not well known when that happens. Oleg Alexandrov (talk) 18:40, 16 March 2008 (UTC)
Why should we do serious analysis here? Practical performance of a method and provable results are often quite different things. We just need to make sure we do not present as a mathematical statement something that is not. Thanks for the reference. He makes much the same observations I did, and on (page 260) says that "in many situations" "performance is comparable to Newton's method" etc. Why don't we just distill the article from pages 259-260 there? It's clear enough for me. Jmath666 (talk) 21:21, 16 March 2008 (UTC)
Either way the two convergence sections need to be merged. They say the same thing. And we should make it clear that "almost quadratic" is the "often" behavior, rather than something that can be proved. Oleg Alexandrov (talk) 00:44, 17 March 2008 (UTC)
Exactly. Ditto for local convergence itself. Jmath666 (talk) 01:19, 17 March 2008 (UTC)

## Downhill?

The article states that the increment direction goes downhill. That would usually be called a "descent direction" and would mean that ${\displaystyle \nabla S\cdot \delta \beta <0}$. I wonder if this is actually true (can be proved) or if it is folklore again. Here folklore = hopeful wish, often seen in practice, true for many problems of interest, justified by heuristic thinking, etc. These have their place but should be identified as such and not confused with a mathematical statement. The word "tend" tends to be useful for that purpose. Jmath666 (talk) 17:07, 16 March 2008 (UTC)

Never mind, this can be proved trivially, ${\displaystyle \nabla S\cdot \delta \beta =-2r^{T}J(J^{T}J)^{-1}J^{T}r<0}$, so we are fine there. Jmath666 (talk) 17:10, 16 March 2008 (UTC)
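For what it's worth, the sign is also easy to sanity-check numerically (a throwaway sketch of mine with a random full-rank Jacobian and residual):

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.standard_normal((8, 3))   # random m-by-n Jacobian, full column rank a.s.
r = rng.standard_normal(8)        # random residual vector

grad_S = 2.0 * (J.T @ r)                        # gradient of S = ||r||^2
delta = np.linalg.solve(J.T @ J, -(J.T @ r))    # Gauss-Newton increment
# grad_S . delta = -2 r^T J (J^T J)^{-1} J^T r, which should be negative.
```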

## Convergence, again

Since the increment is a descent direction and the values of S are bounded below, the values of S have to converge. Now to show that this implies that the iterations converge is a different matter. Things like coercivity and bounding the descent direction away from orthogonal to the gradient come to mind. There are some standard arguments about convergence of descent methods that go like that. Something like that may apply to the line search version at least. Anyway, what do the references say? Jmath666 (talk) 02:26, 17 March 2008 (UTC)

I believe the book by Nocedal and Wright (see Google Books link provided by Oleg) has a theorem proving global convergence for the line search method under a coercivity condition on J^T J and some conditions on the line search (probably Wolfe conditions). I'll check it tonight and add it in if I find the time. -- Jitse Niesen (talk) 11:18, 17 March 2008 (UTC)
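Pending the reference check, here is what a line-search variant would look like. A sketch (mine), using a simple Armijo backtracking rule rather than the Wolfe conditions Nocedal & Wright analyse:

```python
import numpy as np

def gauss_newton_armijo(r, J, beta, steps=30, c=1e-4):
    """Gauss-Newton with Armijo backtracking: halve the step length until
    S decreases by at least a small fraction of the predicted amount."""
    beta = np.asarray(beta, dtype=float)
    for _ in range(steps):
        rb, Jb = r(beta), J(beta)
        delta = np.linalg.solve(Jb.T @ Jb, -(Jb.T @ rb))
        S0 = rb @ rb
        slope = 2.0 * (rb @ Jb @ delta)   # directional derivative of S (negative)
        alpha = 1.0
        for _ in range(50):               # cap the backtracking
            trial = r(beta + alpha * delta)
            if trial @ trial <= S0 + c * alpha * slope:
                break
            alpha *= 0.5
        beta = beta + alpha * delta
    return beta
```

Since the Gauss-Newton increment is a descent direction (see the "Downhill?" section above), the backtracking loop always finds an acceptable step when the gradient is nonzero.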
I'd suggest we first agree to merge the two convergence sections, and two derivations, as the current situation makes it rather hard to focus on presenting things properly. Oleg Alexandrov (talk) 15:36, 17 March 2008 (UTC)

From my talk page: "The plan is to merge the two sections on convergence, as the properties of the algorithm are the same in data fitting as in the general case. Please state your objections there. Oleg Alexandrov (talk) 15:23, 17 March 2008 (UTC)"

• Convergence in data fitting does not depend on the magnitude of the residuals. Maybe this is an empirical statement, but it is nevertheless true.
• Failure to converge indicates that the model is invalid. The usual cause of failure is that ${\displaystyle J^{T}J}$ is not positive definite, resulting from there being a linear relationship between parameters.
• The Gauss-Newton method is not just a piece of pure mathematics. The different applications of it merit separate treatment. Nothing would be gained by trying to squeeze a quart into a pint pot. (For you non-Brits, 1 quart = 2 pints). Petergans (talk) 20:07, 17 March 2008 (UTC)
If the statements are properly qualified as being empirical, fine. Unqualified statements are too easily confused with mathematically correct statements, especially the math part of Wikipedia. Hence:
• In many problems in data fitting, convergence tends not to depend on the magnitude of the residuals.
• Failure to converge commonly indicates that the model is invalid. The usual cause of failure is that ${\displaystyle J^{T}J}$ is not positive definite, resulting from a linear relationship between the parameters.
• So, let's present the mathematics as plainly as possible, without any false statements, then deal with what tends to happen in different applications.
We got pints and quarts here in the US too :-) Jmath666 (talk) 20:23, 17 March 2008 (UTC)

Peter, Gauss-Newton fails for the model function

${\displaystyle f(x,\beta )=x\beta ^{2}+\beta +1-x\,}$

even though ${\displaystyle J^{T}J}$ is positive definite (for the data points ${\displaystyle x_{1}=0,x_{2}=2}$). This example is again based on the example you provided for optimization.

As such, your claim that Gauss-Newton behaves in a special way for data fitting has been reduced to "usually, with carefully chosen model functions, the method converges". The same is true for any other application, doing a good job in setting up the problem usually results in good convergence properties of Gauss-Newton.

Either way, I'll cut out the "data fitting derivation" (while keeping the "convergence properties" for the moment). This derivation is identical to the general "optimization" case, except that in the "data fitting derivation" no attempt is made to justify ignoring the second derivative, and now we know there are situations when it can't be ignored, even in data fitting. Oleg Alexandrov (talk) 03:15, 18 March 2008 (UTC)

I don't think that the derivations are the same. The idea behind the "data fitting derivation" is that you linearize the residuals r, while the "optimization derivation" starts with linearizing the sum of squares, S. I think linearizing the residuals is a very natural idea and easier to understand, so I would like to keep it in the article. I don't see any reason why the first derivation is particular to a data fitting setting though.
Peter says that failure to converge indicates that the model is invalid. I think this is basically what we have been saying: if the model is invalid, then the residuals are necessarily large, and that's the reason why the method does not converge. This is also what Nocedal and Wright say: "[Some statisticians] argue that if the residuals are large at the solution, then the model is not up to the task of matching the observations." [9] Perhaps it's a question of how large is "large". If the model is basically okay but fits the data poorly, because the measurement errors are large, then the residual will also be fairly large but perhaps, empirically, not so large that the method fails. Only when the residuals are humongous does the method fail to converge, and this happens only when the model is invalid. I find it hard to believe that the existence of multiple minima (which has to do with how S behaves away from the minimum we're interested in) prevents local convergence of the method.
And please don't get me started about UK pints vs US pints. It's one thing to use silly sizes, but at least agree about how big a pint is. :) -- Jitse Niesen (talk) 12:24, 18 March 2008 (UTC)
OK, you start by linearizing different things, but then the calculation still proceeds the same way, by ignoring the second-order derivatives, which is ultimately the point. I would doubt we'd need two (very similar) derivations, and either way, the derivations should be together, since indeed, there is nothing special to data-fitting in the derivation I cut. Oleg Alexandrov (talk) 15:00, 18 March 2008 (UTC)
Don't be so pig-headed, Oleg. The linearization processes are different and they need to be specified separately. Petergans (talk) 15:40, 18 March 2008 (UTC)
Peter, I think we better keep a civil atmosphere here. If you want to do a second derivation, please don't do it for the data fitting case, as that derivation is not special to data fitting.
Also, the process of linearizing the residuals amounts to re-deriving Newton's method as one goes along, so it is at its heart the same as the second derivation, where Newton's method is used ready-made. Oleg Alexandrov (talk) 15:44, 18 March 2008 (UTC)
If you want to keep a civil atmosphere you have to listen to what others are saying, not carry on as if you are the only person who knows how it is. You must appreciate, for example, that GN is derived and used without reference to Newton's method. "At its heart" is your spin, not what you will find in any book on data fitting. Jitse makes a good point. You can't just dismiss it by bringing up another argument. Petergans (talk) 17:38, 18 March 2008 (UTC)
Peter, I would suggest you re-read my answer to Jitse's comment. There I acknowledge that the methods have differences, and I suggest that if the alternative derivation is included, it should be stated for general residuals. This more-than-month-long argument has been very frustrating for me too, in case you did not notice that. We should, however, stay civil in our arguments. Oleg Alexandrov (talk) 20:09, 18 March 2008 (UTC)
Your usual style of answer - I'm right and you're wrong; suggesting that I have not read what you wrote carefully enough is not civility as I know it. It's deplorable. Petergans (talk) 23:04, 18 March 2008 (UTC)
The method ends up the same, regardless. Therefore the terms being dropped are exactly the same. If you look at my derivation above, it is the term ${\displaystyle O(\|\varepsilon \|)r}$. There are several ways to arrive at that term, but it is still the same term. Some authors have chosen one and some have chosen another. Just in case: the method selected to get that term has nothing to do with the intended application area, be it data fitting or more general optimization. I actually do not like the way Björck does it (the "optimization" derivation here, with approximating the Hessian); I think it is a bit counterintuitive. Jmath666 (talk) 20:43, 18 March 2008 (UTC)
The reason why I want to add both derivations is that I think that it helps to understand the method and its behaviour. Like you, I find the derivation where you start by linearizing the residuals more natural. However, the derivation via Newton's method gives some information on the behaviour of the method: if you know that Newton converges, and the second term in the Hessian is small, then Gauss-Newton is almost Newton and hence it will have rapid local convergence. In fact, eqn 10.30 in Nocedal & Wright says that the ${\displaystyle O(\|\varepsilon \|)r}$ term in your text is bounded by
${\displaystyle \|(J^{T}J)^{-1}r\nabla ^{2}r\|\varepsilon .}$
If the second term in the Hessian is relatively small, then the factor with which the error decreases at every step is small, and hence the method converges rapidly (though usually not quadratically). This connection between the derivation via approximating the Hessian and the convergence analysis is why I'd like to see it included. -- Jitse Niesen (talk) 12:06, 19 March 2008 (UTC)

### sorry to butt in ...

maybe i am missing something but, skimming through the back and forth, i don't see how Petergans has addressed any mathematical issue brought up by Oleg and Jmath. claiming something is "often" true, without specifying what "often" means exactly, means nothing in mathematics. citing the fact that "GN is used for data fitting of many different kinds without any concern about [ultimate] convergence...", apparently a common practice among non-math folks using this particular piece of math, also means nothing. Mct mht (talk) 00:52, 19 March 2008 (UTC)

It may mean nothing in mathematics, but there is more in the world than mathematics :) Many algorithms are used even though we cannot prove they work. I've no problems with saying that something works "often" — in fact, I think we should say this when appropriate — but it should be made clear that this is a practical statement, not a theoretical one (a point that Jmath666 makes above), and it should also ideally carry a reference. -- Jitse Niesen (talk) 12:06, 19 March 2008 (UTC)

## Example

It is of course very nice to include an example. But before more work is spent on it, I would like to say that I prefer it very much if the example came from a data fitting application. Firstly, it's nice to have a more practical example instead of some equations plucked out of nowhere. Secondly, this is the context in which Gauss-Newton is treated in most of the literature. Fletcher (op. cit.) says that most problems in the whole of optimization, not just least squares, probably come from data fitting. -- Jitse Niesen (talk) 12:06, 19 March 2008 (UTC)

My primary goal was to give a concrete example showing the computed Jacobian and going a bit over the fact that Gauss-Newton is iterative. I don't mind if this example is replaced with an example coming from data fitting, as long as the example is down to earth, rather than going into the issues of standard derivations, etc. Oleg Alexandrov (talk) 15:06, 19 March 2008 (UTC)

The article incorrectly says that the final values shown are reached after the "fifth" iteration. These values occur after the FIFTIETH iteration. —Preceding unsigned comment added by 128.158.56.170 (talk) 03:09, 11 December 2008 (UTC)

### A possible example

The example of Michaelis-Menten kinetics is illustrated in nonlinear regression. The defining equation is usually written as

${\displaystyle {\text{rate}}={\frac {V_{max}[S]}{K_{M}+[S]}}}$

Rate is the dependent variable, ${\displaystyle V_{max}{\text{ and }}K_{M}}$ are the parameters, and [S] is the independent variable. Thus, for this problem

${\displaystyle f(x_{i},{\boldsymbol {\beta }})={\frac {\beta _{1}x_{i}}{\beta _{2}+x_{i}}}}$
${\displaystyle {\frac {\partial f(x_{i},{\boldsymbol {\beta }})}{\partial \beta _{1}}}={\frac {x_{i}}{\beta _{2}+x_{i}}},\ {\frac {\partial f(x_{i},{\boldsymbol {\beta }})}{\partial \beta _{2}}}={\frac {-\beta _{1}x_{i}}{\left(\beta _{2}+x_{i}\right)^{2}}}}$

Data were digitized from the image

[S]/mM rate
0.038 0.033
0.076 0.060
0.158 0.102
0.310 0.157
0.626 0.215
1.253 0.263
2.500 0.295
3.740 0.307

[S] was re-scaled from microMolar to milliMolar to give better conditioned normal equations. Optimum values (from SOLVER in EXCEL) ${\displaystyle {\hat {\beta }}_{1}}$=0.337, ${\displaystyle {\hat {\beta }}_{2}}$=0.356

How much of the GN calculation to show? Petergans (talk) 22:39, 19 March 2008 (UTC)
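The numbers can be checked with a short plain Gauss-Newton run on the digitized data, using the analytic derivatives above (a sketch; the starting guess β = (0.3, 0.3) is an assumption, not from the text):

```python
import numpy as np

S_conc = np.array([0.038, 0.076, 0.158, 0.310, 0.626, 1.253, 2.500, 3.740])
rate = np.array([0.033, 0.060, 0.102, 0.157, 0.215, 0.263, 0.295, 0.307])

beta = np.array([0.3, 0.3])   # starting guess (assumed)
for _ in range(10):
    f = beta[0] * S_conc / (beta[1] + S_conc)
    r = rate - f   # residuals y_i - f(x_i, beta)
    # J_ij = dr_i/dbeta_j = -df/dbeta_j, from the partial derivatives above
    J = np.column_stack([-S_conc / (beta[1] + S_conc),
                         beta[0] * S_conc / (beta[1] + S_conc) ** 2])
    beta = beta + np.linalg.solve(J.T @ J, -(J.T @ r))
# beta is now approximately (0.3369, 0.3560), matching the SOLVER values above.
```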

I'd just substitute into the formulas, no derivations or other calculations. But it would be interesting to see how big the dropped term is. Jmath666 (talk) 05:31, 20 March 2008 (UTC)
Looks OK. I hope the scalings from microMolar to milliMolar need not be mentioned. Having the table transposed to take less room may be good too. Oleg Alexandrov (talk) 06:35, 20 March 2008 (UTC)
Peter, nice job on the example. Jmath, thank you for cleaning up the first part of the article. I think the only issues left is how to present the convergence properties and derivation(s) (next section). Oleg Alexandrov (talk) 16:35, 20 March 2008 (UTC)

To satisfy my curiosity, I programmed this example. It converges as follows

           beta1               beta2         residual   ||H1||    ||H2||
0   0.300000000000000   0.300000000000000   0.04696   6.67667   0.24588
1   0.336165494023700   0.355695276193520   0.00232   6.20372   0.00760
2   0.336881256152115   0.355991132583091   0.00203   6.20231   0.00000
3   0.336881131839563   0.355990005582743   0.00203   6.20232   0.00000
4   0.336881132385477   0.355990008156416   0.00203   6.20232   0.00000
5   0.336881132384231   0.355990008150544   0.00203   6.20232   0.00000
6   0.336881132384234   0.355990008150557   0.00203   6.20232   0.00000
7   0.336881132384234   0.355990008150557   0.00203   6.20232   0.00000


The columns are: iteration number, the parameters ${\displaystyle \beta _{1}}$ and ${\displaystyle \beta _{2}}$, the sum of squared residuals ${\displaystyle \sum r_{i}^{2}}$, and the 2-norms of the matrices ${\displaystyle \left[2\sum _{i=1}^{m}{\frac {\partial r_{i}}{\partial \beta _{j}}}{\frac {\partial r_{i}}{\partial \beta _{k}}}\right]_{jk}}$ and ${\displaystyle \left[2\sum _{i=1}^{m}r_{i}{\frac {\partial ^{2}r_{i}}{\partial \beta _{j}\partial \beta _{k}}}\right]_{jk}}$. The second term can obviously be ignored in this example. -- Jitse Niesen (talk) 17:30, 20 March 2008 (UTC)

Here's another set of data for you to try.
 x      1      2      3      4      5      6      7      8      9     10
 y    0.84  0.872  0.711  0.756  0.525  0.492  0.594  0.37  0.348  0.418
These data were synthesized for the model ${\displaystyle y=\beta _{1}e^{-\beta _{2}x}}$ using ${\displaystyle \beta _{1}=1{\mbox{ and }}\beta _{2}=0.1}$ and noise added from a uniform distribution of mean zero. It will be more difficult to refine (expect divergence) from "poor" starting values because i) the residuals are rather large and ii) the exponential model is "more nonlinear". Good luck! Petergans (talk) 20:42, 20 March 2008 (UTC)

I'm not sure how this helps to improve the article, but here are the results for the initial guess ${\displaystyle \beta _{1}=5}$ and ${\displaystyle \beta _{2}=0}$:

           beta1               beta2         residual    ||H1||     ||H2||
0   5.000000000000000   0.000000000000000   13.9494   19265.717   17499.6
1   0.915600000000002   0.011745454545455   0.97658   551.05816   245.881
2   0.931186014791968   0.072267573153623   0.28672   231.73121   30.5807
3   0.982492756960484   0.098378745482316   0.21511   178.53533   1.91448
4   0.987855647264506   0.100364473759534   0.21482   175.59871   0.46849
5   0.987816731112011   0.100354903584378   0.21482   175.60823   0.47303
6   0.987817072970896   0.100354991156257   0.21482   175.60814   0.47298
7   0.987817069849500   0.100354990355595   0.21482   175.60814   0.47298
8   0.987817069878039   0.100354990362915   0.21482   175.60814   0.47298
9   0.987817069877779   0.100354990362848   0.21482   175.60814   0.47298
10   0.987817069877781   0.100354990362849   0.21482   175.60814   0.47298
11   0.987817069877781   0.100354990362849   0.21482   175.60814   0.47298


Again, the second term in the Hessian is much smaller than the first at the minimum. It's not hard to get the method to diverge if the initial guess is sufficiently bad, e.g., ${\displaystyle \beta _{1}=-0.1}$ and ${\displaystyle \beta _{2}=0}$. -- Jitse Niesen (talk) 16:26, 21 March 2008 (UTC)

## Convergence properties and derivation(s)

Sorry to raise this again, but the sentence

"For data modeling applications, the method usually converges to a minimum of S"

has zero information value. Gauss-Newton can fail just as easily for data fitting as in the general case (recall Jmath's construction showing how one can manufacture a data fitting function from arbitrary residuals).

There are a few empirical tricks that help good convergence, such as the residuals being not too big, ${\displaystyle J^{T}J}$ being positive definite, and starting close to the minimum, but they are also applicable to general optimization.

Any ideas of how to present best the convergence properties and derivations? Oleg Alexandrov (talk) 16:35, 20 March 2008 (UTC)

OK, for typical data modeling applications? We often see methods that are useful yet little of value can be proved. For example, look at the chasm between theory and practice in multigrid methods, esp. algebraic multigrid. Terms like typical and often are by necessity vague. Jmath666 (talk) 23:32, 20 March 2008 (UTC)
Actually, we already have usually there so all is fine. True, you can construct a data fitting function from any general residuals, but I would argue these may not be the kind of data fitting problems for which the method is useful in practice. (Yes, I used circular reasoning here...) The point is that well set up data fitting applications are nice numerically; at least that's what I understood was the point Petergans tried to get across. Jmath666 (talk) 23:36, 20 March 2008 (UTC)
Are "well set up data fitting applications" nice numerically because they are "well set up", or because they are "data fitting applications"? Oleg Alexandrov (talk) 02:31, 21 March 2008 (UTC)
No question. They are well set up when the model is based on some physical law and the data are collected with that law (and perhaps having estimates of probable parameter values) in mind in order that the parameters should be determined by the data and have small standard deviations. In our own work we use a simulation program to find suitable conditions for data collection. Just look at curve fitting#External links to get some idea of the vast number of applications where GN is used successfully. Also many nonlinear data fitting packages are sold commercially. This would not be so if convergence were not assured, if not guaranteed in all cases. Petergans (talk) 10:07, 21 March 2008 (UTC)
There is no argument about Gauss-Newton being an awesome algorithm. It is the core solver on which the simulation software made by my company is based. The whole point I am trying to make is that the nice properties of Gauss-Newton are not due to its application to data fitting, but rather to doing a good job at setting up the problem (based on physical laws, etc.,) for any application.
I have made an attempt to clarify the nice properties of Gauss-Newton a bit. I'd like to note that the formula for the data fitting residuals I removed with this edit already shows up earlier, so my goal was not to minimize the usefulness of Gauss-Newton in data fitting; I am trying to give a general picture. Comments? Oleg Alexandrov (talk) 15:23, 21 March 2008 (UTC)
(The example I removed towards the end of this edit already shows up later, as put in by Jitse. So my edit is really much smaller than it looks in the diff.) Oleg Alexandrov (talk) 15:26, 21 March 2008 (UTC)
The impression I get from the books by Nocedal & Wright and Fletcher is that Gauss-Newton should always be implemented with a line search, and then the method has good convergence properties. Peter, do you know whether these commercial packages you mention do a line search? -- Jitse Niesen (talk) 16:26, 21 March 2008 (UTC)
Not specifically. There is no doubt that GN should always be protected against possible divergence. My impression from general reading of the literature is that use of the Marquardt parameter is now, and has been for some time, the preferred method of doing it, for this reason: divergence occurs because the shift vector is too long and/or does not point towards the minimum. The Marquardt parameter affects both the length and direction of the shift vector, whereas line search affects only its length. In the particular instance that ${\displaystyle J^{T}J}$ is close to being singular, the angle between the shift vector and the descent vector is close to 90° and line search is not very effective in that circumstance. Petergans (talk) 18:10, 21 March 2008 (UTC)
The way Gauss-Newton is used in my work, we both use a Marquardt-like parameter (which you don't want either too big or too small), and we also decrease the step size (shift vector) if the cost function fails to decrease along it, so we use line search. Oleg Alexandrov (talk) 19:58, 21 March 2008 (UTC)

## Conclusion

When ${\displaystyle J^{T}J}$ is positive definite the shift vector points downhill. Using the Marquardt parameter and/or line search to avoid divergence, S is reduced at each iteration. Therefore, when ${\displaystyle J^{T}J}$ is positive definite, the "damped" Gauss-Newton method is guaranteed to converge to a local minimum (Björck, p. 344). Convergence is independent of the magnitudes of the residuals and of the second derivatives. On the same page Björck states that "the rate of convergence may be slow on large residual problems and very nonlinear problems". I believe that statement to be incorrect, as detailed below. The rate of convergence may approach quadratic towards the minimum.
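For concreteness, the "damped" Gauss-Newton iteration described above can be sketched in Python. This is an illustrative sketch rather than a reference implementation: the backtracking rule (halving the step until S decreases) is one simple choice of damping, and the data are the Michaelis-Menten numbers from the article's example.

```python
import numpy as np

# Data from the article's Michaelis-Menten example (substrate concentration, rate)
x = np.array([0.038, 0.194, 0.425, 0.626, 1.253, 2.500, 3.740])
y = np.array([0.050, 0.127, 0.094, 0.2122, 0.2729, 0.2665, 0.3317])

def residuals(b):
    # r_i = y_i - f(x_i, b) for the model f = b1*x/(b2 + x)
    return y - b[0] * x / (b[1] + x)

def jacobian(b):
    # Jacobian of the residuals with respect to (b1, b2)
    return np.column_stack([-x / (b[1] + x),
                            b[0] * x / (b[1] + x) ** 2])

def damped_gauss_newton(b, n_iter=20):
    for _ in range(n_iter):
        r, J = residuals(b), jacobian(b)
        # Gauss-Newton step from the normal equations (J^T J) d = -J^T r
        d = np.linalg.solve(J.T @ J, -J.T @ r)
        # "Damping": halve the step length until S = r^T r decreases
        alpha, S = 1.0, r @ r
        while alpha > 1e-8 and residuals(b + alpha * d) @ residuals(b + alpha * d) >= S:
            alpha /= 2
        b = b + alpha * d
    return b

b = damped_gauss_newton(np.array([0.9, 0.2]))
```

On this well-behaved problem the iteration settles near β ≈ (0.362, 0.556); on a badly set up problem the same loop would simply take very short steps instead of diverging.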

${\displaystyle J^{T}J}$ may approach singularity if the data do not adequately define the parameters.

• For each parameter, the set of calculated residuals, or a sub-set of them, should be sensitive to changes in the value of that parameter. When this condition is satisfied the problem is well-founded.
• Noise in the data will be significant only when it is so large that the trend in the data predicted by the model is barely discernible.

Convergence can become very slow when parameters are highly correlated. When a pair of parameters are highly correlated, ρ>0.999, say, the model is indeterminate. When ${\displaystyle J^{T}J}$ is positive definite, all correlation coefficients have |value| < 1. ${\displaystyle J^{T}J}$ is singular when there is a linear relationship between columns of J, that is, when the partial derivatives of f with respect to one parameter are linear combinations of the partial derivatives of one or more other parameters. This condition will obtain when one parameter is a linear combination of other parameters.

For example, when fitting with a model which is the sum of two Lorentzians,

${\displaystyle f(x,[h_{1},h_{2},p_{1},p_{2},w_{1},w_{2}])={\frac {h_{1}}{1+\left({\frac {p_{1}-x}{w_{1}/2}}\right)^{2}}}+{\frac {h_{2}}{1+\left({\frac {p_{2}-x}{w_{2}/2}}\right)^{2}}}}$

when ${\displaystyle |p_{1}-p_{2}|>{1 \over z}{\frac {w_{1}+w_{2}}{2}}}$, that is, when the separation of the Lorentzian bands is larger than some fraction, 1/z, of the average half-width, refinements converge in a few iterations, but as the bands get closer together the number of iterations required increases. When the bands are very close together the data can be fitted almost equally well by a single Lorentzian,${\displaystyle h_{1}=h_{2}}$ etc., so the parameters of the two-band model become indeterminate. This happened with S/N > 100. (P. Gans and J.B. Gill, On the analysis of Raman spectra by curve resolution, Applied Spectroscopy, (1977), 31, pp 451-455)
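The near-singularity described above is easy to demonstrate numerically: as the two bands coincide, the Jacobian columns belonging to the two sets of parameters become linearly dependent and the condition number of ${\displaystyle J^{T}J}$ blows up. A Python sketch (the grid, heights, and half-widths are arbitrary choices for illustration):

```python
import numpy as np

# Illustration (made-up grid and parameters): sum of two Lorentzians,
# f(x) = h1/(1 + ((p1-x)/(w1/2))^2) + h2/(1 + ((p2-x)/(w2/2))^2)
x = np.linspace(0.0, 10.0, 200)

def jtj_condition(p1, p2, h=1.0, w=1.0):
    # Columns of J: partial derivatives of f w.r.t. h, p, w of one band
    def band_columns(p):
        u = (p - x) / (w / 2)
        d = 1.0 + u ** 2
        df_dh = 1.0 / d
        df_dp = -2.0 * h * u / (d ** 2 * (w / 2))
        df_dw = 2.0 * h * u ** 2 / (d ** 2 * w)
        return [df_dh, df_dp, df_dw]
    J = np.column_stack(band_columns(p1) + band_columns(p2))
    return np.linalg.cond(J.T @ J)

well_separated = jtj_condition(4.0, 6.0)   # well separated bands
nearly_merged = jtj_condition(4.99, 5.01)  # almost coincident bands
```

The condition number for the nearly merged bands is orders of magnitude larger, which is the numerical face of the indeterminacy of the two-band model.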

Thanks to Jitse for the calculations on the exponential model. I chose this model because the second derivative might be important, since ${\displaystyle {\frac {\partial ^{2}\left(ae^{\beta x}\right)}{\partial \beta ^{2}}}=ax^{2}e^{\beta x}}$, but his calculations show that GN can nevertheless converge in a few iterations, even with noisy data. Petergans (talk) 10:59, 22 March 2008 (UTC)

So, from what you say, Björck p344 has the statement that "the rate of convergence may be slow on large residual problems and very nonlinear problems". You doubt this statement. I think it is valid, and it is already described in the article, at the derivation from Newton's method. Since you are a practitioner, rather than a theoretician, a simple way to check this is to run Gauss-Newton on a few problems you believe to be "well-posed" and study the rate of convergence. I believe it can be below quadratic in many cases. Oleg Alexandrov (talk) 16:28, 22 March 2008 (UTC)
Before we get (again!) into the issue of the rate of convergence can we agree that the "fact of convergence" issue is resolved in terms of the damped GN algorithm and modify the article accordingly? Petergans (talk) 20:47, 22 March 2008 (UTC)
I hope we all agree that plain Gauss-Newton fails to converge sometimes, or that its convergence can sometimes be slow. I am not sure what you mean by the "damped" Gauss-Newton, and how it relates to the material already in the "Divergence" section. Can you state precisely the theorem guaranteeing convergence? Oleg Alexandrov (talk) 00:25, 23 March 2008 (UTC)
Under suitable technical assumptions there exists a damping coefficient, i.e. a step size (the ${\displaystyle \alpha }$ at the top of the divergence section), that guarantees convergence. The coefficient is of course problem dependent; when things go bad it can be very small. There is a general result like that about descent methods. One (theoretical!) technique I recall was to use ${\displaystyle \alpha _{t}=1/t}$. I studied things like that years ago. I can look into it again or ask some people in optimization (I did not stay in that field), if this is of real interest. Jmath666 (talk) 02:38, 23 March 2008 (UTC)
I agree with this. An informal explanation for the very small step size goes as follows. As ${\displaystyle J^{T}J}$ tends towards singularity, Det(${\displaystyle J^{T}J}$) tends to zero and elements of ${\displaystyle (J^{T}J)^{-1}}$ tend to become very large, causing the shift vector to become very long. Also, ${\displaystyle \delta \beta }$ tends to being orthogonal to the descent vector. In those circumstances a very small step size is needed to get any reduction in S. Petergans (talk) 15:40, 23 March 2008 (UTC)

A proof is in Björck, together with a detailed discussion of the choice of step size. Björck also gives a proof for the "trust region method". The term "damped GN" (used by Björck for GN + line search) could be replaced by "GN with protection against divergence". I am suggesting that we move away from the pure GN algorithm, which is subject to divergence problems that can cause a refinement to fail, to the divergence-protected GN algorithm which is now (2008) the basis for the successful use in pretty much all applications. The key point is that when J has full column rank, both ${\displaystyle J^{T}J}$ and ${\displaystyle J^{T}J+\lambda D\ (\lambda D>0)}$ are positive definite, which ensures that the shift vector is descent-directed in both cases. Petergans (talk) 09:01, 23 March 2008 (UTC)

What do you mean by "we move away from the pure GN algorithm"? This article must still discuss the "pure" Gauss-Newton, as that's what all references mean by this method (including your Björck book I think), and Wikipedia articles must follow existing references per policy.
To me, the natural place to discuss issues of Gauss-Newton is the current "Divergence" section (which can be renamed to something better, if necessary). So, I think the way things should go is the following:
1. State Gauss-Newton (the plain one, as used everywhere in the literature)
2. Example
3. Convergence properties
4. Issues with failure, and possible fixes (line search, Marquardt)
Other proposals? Oleg Alexandrov (talk) 16:12, 23 March 2008 (UTC)
OK. This is in line with the "start simple, grow more complex" philosophy. The "Notes" section contains the snippets of text that were left over when I simplified the "Algorithm" section and was meant as a temporary holding area. It could be merged into "Convergence properties" or improved and made into a "Derivation" section. There ought to be some derivation, either separate or at the beginning of "Convergence". Either way it should prove that ${\displaystyle \delta \beta }$ is a descent direction, especially as it costs just half a line. Jmath666 (talk) 20:34, 23 March 2008 (UTC)
I am actually fine with the "Notes" section where several things could be mentioned briefly. I meant more to say that we should not jump right to the "generalized" GN, in line with what you said about simple to complex. Oleg Alexandrov (talk) 21:20, 23 March 2008 (UTC)

## Normal equations

As far as I know,

${\displaystyle \delta \beta =-(J^{T}J)^{-1}J^{T}r}$

are not the normal equations. It is the formula for the solution of the normal equations

${\displaystyle J^{T}(r-J\beta )=0}$,

which are the conditions for the solution of the overdetermined system ${\displaystyle J\beta =r}$ in the least-squares sense. Jmath666 (talk) 05:41, 24 March 2008 (UTC)

That is correct.
I have decided not take any further part in these discussions. Petergans (talk) 06:45, 24 March 2008 (UTC)

Jmath, you are right, I did not pay attention to the fact that the matrix was already inverted in that formula.

However, I prefer that the normal equations be in the form

${\displaystyle \left(\mathbf {J_{r}^{T}J_{r}} \right)\delta {\boldsymbol {\beta }}=-\mathbf {J_{r}^{T}r} .}$

as those are simpler to invert (either directly, or using an iterative method). Would that be OK? Oleg Alexandrov (talk) 15:20, 24 March 2008 (UTC)

It's the same thing or better, I was just too lazy to write ${\displaystyle J}$ three times. Jmath666 (talk) 23:33, 24 March 2008 (UTC)
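For the record, the three formulations discussed in this section (the normal equations, the explicit inverse formula, and the underlying overdetermined system solved in the least-squares sense) give the same step. A quick numerical check in Python, with a random Jacobian standing in for a real one:

```python
import numpy as np

# Random stand-ins for the residual Jacobian J (m=7, n=2) and residual vector r
rng = np.random.default_rng(0)
J = rng.standard_normal((7, 2))
r = rng.standard_normal(7)

# Normal equations: (J^T J) delta = -J^T r, solved without forming an inverse
delta_normal = np.linalg.solve(J.T @ J, -J.T @ r)

# Explicit formula delta = -(J^T J)^{-1} J^T r (same solution, worse practice)
delta_formula = -np.linalg.inv(J.T @ J) @ (J.T @ r)

# Least-squares solution of the overdetermined system J delta = -r (QR/SVD-based)
delta_lstsq = np.linalg.lstsq(J, -r, rcond=None)[0]
```

The least-squares route avoids forming ${\displaystyle J^{T}J}$ at all, which matters when the Jacobian is poorly conditioned.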

## Excessive noise?

Regarding this revert of Peter: I will argue that your picture (top) gives the reader the misleading impression that data obtained in practice is very accurate and that there is always a model function to fit that data very accurately. I believe that is not always the case, and one of the key properties of least squares is robustness to noise. My picture (bottom) will serve the article better. Oleg Alexandrov (talk) 15:27, 21 April 2008 (UTC)

The answer to the apparent good fit is to show the residuals, which will demonstrate what you want. This level of noise for an analysis (e.g. alkaline phosphatase#diagnostic use) which is regularly performed in hospitals on samples from real patients is not acceptable. The measurements are made by clinical analyzer (do a Google search!), designed to provide good data so that calculated parameters have small errors, which is essential when it comes to making clinical decisions in the real world. That's so what! Petergans (talk) 21:34, 21 April 2008 (UTC)

Perhaps an example could be found where larger residuals come up naturally? Jmath666 (talk) 21:42, 21 April 2008 (UTC)
The example in the text is about the reaction rate in an enzyme-mediated reaction. Nowhere in the text does it say that this is clinical data for actual patients. You could as well assume that the reaction is studied in a classroom experiment and the measuring equipment is only accurate to 10%.
The example is purely illustrative. What matters most is for readers to understand the use of Gauss-Newton in fitting experimental data, which are not always accurate. Oleg Alexandrov (talk) 01:48, 22 April 2008 (UTC)
Even first year students would not produce such bad data. Measuring equipment usually has error less than 1% these days. I suggest you give a table of observed data, calculated data and residuals if you don't want to do a graphic. Petergans (talk) 09:01, 22 April 2008 (UTC)
Suppose you teach this. Which picture would you rather have on the board, one with visually perfect fit or one with the data points visibly away from the line to illustrate the principle? Jmath666 (talk) 13:47, 22 April 2008 (UTC)
I guess it can't be stated clearer than how Jmath put it. Oleg Alexandrov (talk) 15:26, 22 April 2008 (UTC)

I would use genuine experimental data obtained on the equipment that students will use. In my computer programs I always plot the residuals as well as the fit because they cannot normally both be shown on the same scale. For example, in GLEE data of high precision are needed in order to obtain small standard deviations for the calibration constants (parameters), E0 and slope, but the residuals are plotted to see if they show a systematic trend which might invalidate the results.

The principle is that a sum of squared residuals is minimized, so showing the residuals illustrates the principle.

For a plot of Michaelis-Menten kinetics similar to the first one above see E.J. Billo, "Excel for Scientists and Engineers", Wiley, 2007, p330. The context is the use of SOLVER to minimize a sum of squared residuals. Petergans (talk) 18:02, 22 April 2008 (UTC)

It should be easy to smoothly vary the amount of noise. I agree that the "pink" graph is too close of a fit, but if the "blue" graph is too loose, it should be possible to compromise, or even just mention "3% noise added for emphasis". If there is room in the article, a graph of the residuals would be very useful. In fact, if someone is feeling particularly graphy, I would love to see a graph showing how the residuals behave with respect to the intensity of the noise: so, one graph of the residuals, with three datasets on it, "clinical" (pink), "3% noise" (new), "10% noise" (blue), where the percentages should be adjusted to reflect what is actually used (Oleg mentioned 10% in a related context, so I am just pretending the blue graph has "10%" noise). JackSchmidt (talk) 18:36, 22 April 2008 (UTC)
The purpose of the graph to show that it is possible to fit a non-linear function to data, even if the data is not completely reliable. If Peter has a problem with using a Michaelis-Menten kinetics model, it can be simply mentioned that data comes from some measurements, without mentioning the exact source.
I will argue that there is no need to mention what the percentage of noise is. This is an article about the Gauss-Newton algorithm, and what we need is a very simple example and a very simple diagram illustrating an application. Talking about noise and varying the amount of noise would be distracting from the goal of the article, I think. Oleg Alexandrov (talk) 20:25, 22 April 2008 (UTC)
Stability is important, but I think the article is more about an algorithm to converge to a least squares solution, not the stability of least squares itself (so sorry, the second graph idea was probably useless). The "stability" concerns of the algorithm seem to be more about convergence or non-convergence.
• Do you think enough noise could cause the G-N algorithm not to converge at all?
If so, it still might make some sense to discuss varying levels of noise, but rather than residuals, it would make more sense to compare the number of iterations before a reasonable level of convergence (or divergence). Perhaps this sort of radical instability is what is worrying Petergans? The second table would fit well in the following section. JackSchmidt (talk) 21:18, 22 April 2008 (UTC)
I think the Gauss-Newton algorithm can fail to converge even with minimal noise, and it can converge (very fast sometimes) even with a lot of noise. I believe convergence has more to do with the model function one is trying to fit than with the noise in the data. Oleg Alexandrov (talk) 22:34, 22 April 2008 (UTC)
PS In the second graph above, I chose the distance between the red diamonds and the blue curve to be just big enough so that you don't have to strain your eyes to see it. Oleg Alexandrov (talk) 22:37, 22 April 2008 (UTC)

My concern is that real Michaelis-Menten data is never as noisy as in the lower plot. I worked on this type of data in the early 1980s. Unfortunately, all my experimental data has been lost, it's so long ago.

The residuals can be plotted on a scale of say, ${\displaystyle \pm 3\sigma }$, where ${\displaystyle \sigma ={\sqrt {\frac {S}{m-n}}}}$ and should approximate to the error on a measurement, independently of the magnitude of that quantity (noise level).

The answer to the question is yes: with grossly excessive noise parameters become highly correlated and as a correlation coefficient approaches 1 or -1 the normal equations matrix approaches singularity, so GN refinement will eventually fail, even with protection against divergence. If you plotted such data you would not be able to fit it by eye. In practice the noise level mostly determines the standard deviations and correlation coefficients on the parameters. Petergans (talk) 23:28, 22 April 2008 (UTC)
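As a minor aside, the ±3σ residual scale suggested above is cheap to compute; a Python sketch with made-up residuals (m data points, n fitted parameters):

```python
import numpy as np

# Made-up residuals from a hypothetical fit: m = 7 points, n = 2 parameters
res = np.array([0.01, -0.02, 0.015, -0.005, 0.0, 0.02, -0.01])
m, n = res.size, 2

S = res @ res                     # sum of squared residuals
sigma = np.sqrt(S / (m - n))      # estimate of the error on one measurement
limits = (-3 * sigma, 3 * sigma)  # the suggested +/-3 sigma plotting scale
```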

I removed the mention of Michaelis-Menten from the article. I hope this settles some of the argument.
For this choice of data points and model function the algorithm converges in five iterations, which is very reasonable (without any line search or other convergence protection). Granted, if the noise is huge the algorithm may converge slower if at all. Oleg Alexandrov (talk) 03:19, 23 April 2008 (UTC)
This is not a rational solution since the equation is still clearly the Michaelis-Menten equation.
Underlying this controversy is an important point of principle which has not yet been clearly expressed: the purpose of doing a least squares calculation is not just to obtain parameter values, but to obtain estimates of error on those values.
With the data currently in the table I calculate the SD on Vmax to be 25% of its value. In the first place the value has only one significant figure. Secondly, the 95% confidence limits would be about ${\displaystyle \pm }$ 50%, rendering the value all but worthless.
For low error estimates, data of high precision are essential. There is therefore no reason why the fit should not be “visually exact”. In fact, it is normal in data fitting for this to be so. That’s why a plot of residuals is important, if not essential. Petergans (talk) 10:27, 23 April 2008 (UTC)
This is meant to be a very first example for the reader of how to use Gauss-Newton. The fact that the curve fits the data well and the algorithm converges fast should be good enough. There is no argument that the less reliable your data is the less one can be confident in the curve fitting the data. That is the topic of a different discussion, and perhaps beyond the scope of this article. The goals of this example are:
• Show that the Gauss-Newton algorithm is useful for a practical problem
• Show how to get from the practical problem to a least-squares formulation
• Show how to set up the Jacobian and perform several iterations
• Verify visually that the curve does the job of fitting the data.
I think that's enough for a first example. Oleg Alexandrov (talk) 16:43, 23 April 2008 (UTC)

My only concern is that the example shown should be a realistic one, not one based on Mickey Mouse data. The original diagram was sufficient and is also the same data as in the Michaelis-Menten kinetics article. I agree with "verify visually" but would do so with a residual plot in addition to the observed/calculated plot. Petergans (talk) 19:19, 23 April 2008 (UTC)

The goal of this picture is to teach people, in a way as simple as possible, what it means to fit a curve to data. The second picture does a better job, as it shows that the curve can fit the data even when the data is not perfect and even when a high quality model is not available. That is of first and foremost importance. Concerns about whether the measurements in the picture are as accurate as they are in practice are secondary.
And by the way, in my industry (semiconductors), where the unit of measurement is the nanometer, results of measurements (even those obtained with a scanning electron microscope) can be pretty inaccurate, and this is where data fitting shines best. Your insistence on using "perfect data" just does a disservice to the power of data fitting methods.
Your suggestion to add an additional plot showing the residuals to compensate for the first "perfect plot" would just make the explanation more complicated. The current plot (second one above) does a better job of introducing what data fitting is about. Oleg Alexandrov (talk) 22:45, 23 April 2008 (UTC)
I strongly disagree, but have it your own way. There is no point in continuing this discussion. Petergans (talk) 21:07, 24 April 2008 (UTC)
Iteration counter "s" is a horrible choice. We already have the sum of squares S. Using "k" would be much more obvious, since i, j, k are used as counters in many programming languages. —Preceding unsigned comment added by 132.239.1.230 (talk) 20:22, 21 August 2009 (UTC)

## Demo Code

I have implemented the algorithm given in this article in Matlab code. However, I don't know how to add it to the page or format it properly. It may be useful.

 %GaussNewton_Demo
 
 %Data
 x = [0.038 0.1947 0.425 0.626 1.253 2.500 3.740]';
 y = [0.05 0.127 0.094 0.2122 0.2729 0.2665 0.3317]';
 
 %Plot X
 X = [0.01:0.01:4];
 
 %Initial guess
 B1 = 0.9;
 B2 = 0.2;
 
 %Plot the data
 figure(1)
 cla
 hold on
 scatter(x, y, 'rd', 'fill')
 axis auto
 
 %Loop
 for i = 1:5
 
     %Jacobian of the residuals
     J = [-x./(B2+x) (B1*x)./(B2+x).^2];
 
     %Hessian (approx)
     H = J'*J;
 
     %Calculate residuals
     r = y - (B1*x)./(B2+x);
 
     %Sum of squares of residuals
     SSR = sum(r.*r)
 
     %Plot the current fit
     Y = (B1*X)./(B2+X);
     plot(X, Y, 'LineWidth', 1)
 
     %Calculate delta (Gauss-Newton step)
     Delta = pinv(H)*J'*r;
 
     %Apply delta to the parameters
     B1 = B1 - Delta(1);
     B2 = B2 - Delta(2);
 
     pause(1)
 
 end

— Preceding unsigned comment added by MartyBebop (talkcontribs) 11:39, 12 April 2012 (UTC)