{{Regression bar}}
In [[statistics]], a '''studentized residual''' is the quotient resulting from the division of a [[errors and residuals in statistics|residual]] by an [[estimator|estimate]] of its [[standard deviation]]. Typically the standard deviations of residuals in a sample vary greatly from one [[data point]] to another even when the [[errors and residuals in statistics|errors]] all have the same standard deviation, particularly in [[regression analysis]]; thus it does not make sense to compare residuals at different data points without first studentizing. It is a form of [[Student's t-statistic]], with the estimate of error varying between points.

This is an important technique in the detection of [[outlier]]s. It is named in honor of [[William Sealy Gosset]], who wrote under the pseudonym '''''Student''''', and dividing by an ''estimate'' of scale is called '''studentizing''', in analogy with [[standardizing]] and [[normalization (statistics)|normalizing]]: see [[Studentization]].

==Motivation==
{{see also|Errors and residuals in statistics}}

The key reason for studentizing is that, in [[regression analysis]] of a [[multivariate distribution]], the variances of the ''residuals'' at different input variable values may differ, even if the variances of the ''errors'' at these different input variable values are equal. The issue is the difference between [[errors and residuals in statistics]], particularly the behavior of residuals in regressions.
Consider the simple [[linear regression]] model

:<math> Y = \alpha_0 + \alpha_1 X + \varepsilon. \, </math>

Given a random sample (''X''<sub>''i''</sub>, ''Y''<sub>''i''</sub>), ''i'' = 1, ..., ''n'', each pair (''X''<sub>''i''</sub>, ''Y''<sub>''i''</sub>) satisfies

:<math> Y_i = \alpha_0 + \alpha_1 X_i + \varepsilon_i,\,</math>

where the ''errors'' ''ε''<sub>''i''</sub> are [[statistical independence|independent]] and all have the same variance ''σ''<sup>2</sup>. The '''residuals''' are not the true (and unobservable) errors, but rather ''estimates'', based on the observable data, of the errors. When the method of least squares is used to estimate ''α''<sub>0</sub> and ''α''<sub>1</sub>, the residuals <math>\scriptstyle\widehat\varepsilon</math>, unlike the errors <math>\scriptstyle\varepsilon</math>, cannot be independent, since they satisfy the two constraints

:<math>\sum_{i=1}^n \widehat{\varepsilon}_i=0</math>

and

:<math>\sum_{i=1}^n \widehat{\varepsilon}_i x_i=0.</math>

(Here ''ε''<sub>''i''</sub> is the ''i''th error, and <math>\scriptstyle\widehat{\varepsilon}_i</math> is the ''i''th residual.)
Moreover, and most importantly, the residuals, unlike the errors, ''do not all have the same variance'': the variance decreases as the corresponding ''x''-value gets farther from the average ''x''-value. This is a feature of the regression fitting more closely near the ends of the domain, not of the data itself, and is also reflected in the [[Influence function (statistics)|influence functions]] of the various data points on the [[regression coefficient]]s: endpoints have more influence. It can also be seen by noting that the residuals at endpoints depend greatly on the slope of the fitted line, while the residuals at the middle are relatively insensitive to the slope. The fact that ''the variances of the residuals differ'', even though ''the variances of the true errors are all equal'' to each other, is the ''principal reason'' for the need for studentization.

It is not simply a matter of the population parameters (mean and standard deviation) being unknown – it is that ''regressions'' yield ''different residual distributions'' at ''different data points'', unlike ''point [[estimator]]s'' of [[univariate distribution]]s, which share a ''common distribution'' for residuals.
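
This behavior can be checked numerically. The following is a minimal simulation sketch in Python with NumPy (the ''x''-values, true coefficients, and error scale are illustrative choices): the same straight line is fitted to many samples whose errors all have standard deviation 1, and the empirical variance of each residual shrinks as its ''x''-value moves away from the centre.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative simulation: the errors are i.i.d. with sigma = 1 at every x,
# yet the residuals have smaller variance at the extreme x-values.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 11)
X = np.column_stack([np.ones_like(x), x])        # design matrix [1, x_i]
H = X @ np.linalg.inv(X.T @ X) @ X.T             # hat matrix (see next section)
resids = []
for _ in range(20000):
    y = 1.0 + 2.0 * x + rng.normal(size=x.size)  # true line plus N(0, 1) errors
    resids.append(y - H @ y)                     # residuals = (I - H) y
print(np.var(resids, axis=0))  # smallest at the endpoints x = 0 and x = 10
print(1.0 - np.diag(H))        # matches the theoretical sigma^2 (1 - h_ii) below
</syntaxhighlight>
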
==How to studentize==
For this simple model, the [[design matrix]] is

:<math>X=\left[\begin{matrix}1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{matrix}\right]</math>

and the [[hat matrix]] ''H'' is the matrix of the [[orthogonal projection]] onto the column space of the design matrix:

:<math>H=X(X^T X)^{-1}X^T.\,</math>

The "leverage" ''h''<sub>''ii''</sub> is the ''i''th diagonal entry in the hat matrix. The variance of the ''i''th residual is

:<math>\operatorname{var}(\widehat{\varepsilon}_i)=\sigma^2(1-h_{ii}).</math>

In case the design matrix ''X'' has only two columns (as in the example above), this is equal to

:<math> \operatorname{var}(\widehat{\varepsilon}_i)=\sigma^2\left( 1 - \frac1n -\frac{(x_i-\bar x)^2}{\sum_{j=1}^n (x_j - \bar x)^2 } \right). </math>

The corresponding '''studentized residual''' is then

:<math>{\widehat{\varepsilon}_i\over \widehat{\sigma} \sqrt{1-h_{ii}\ }},</math>

where <math>\widehat{\sigma}</math> is an appropriate estimate of ''σ'' (see below).
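
These formulas translate directly into a short computation. The sketch below, in Python with NumPy (the function name is illustrative), builds the design matrix, the hat matrix, and the leverages for the simple linear model, and returns the internally studentized residuals using the "usual" variance estimate defined in the next section.

<syntaxhighlight lang="python">
import numpy as np

def studentized_residuals(x, y):
    """Internally studentized residuals for the model y = a0 + a1*x."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])    # design matrix [1, x_i]
    H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix H = X (X'X)^-1 X'
    h = np.diag(H)                          # leverages h_ii
    resid = y - H @ y                       # least-squares residuals
    m = X.shape[1]                          # number of parameters (2 here)
    sigma_hat = np.sqrt(np.sum(resid**2) / (n - m))  # usual estimate of sigma
    return resid / (sigma_hat * np.sqrt(1.0 - h))
</syntaxhighlight>
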
==Internal and external studentization==
The usual estimate of ''σ''<sup>2</sup> is

:<math>\widehat{\sigma}^2={1 \over n-m}\sum_{j=1}^n \widehat{\varepsilon}_j^{\,2},</math>

where ''m'' is the number of parameters in the model (2 in our example). It is desirable, however, to exclude the ''i''th observation from the process of estimating the variance when one is considering whether the ''i''th case may be an outlier. Consequently, one may use the estimate

:<math>\widehat{\sigma}_{(i)}^2={1 \over n-m-1}\sum_{\begin{smallmatrix}j = 1\\j \ne i\end{smallmatrix}}^n \widehat{\varepsilon}_j^{\,2},</math>

based on all but the ''i''th case. If the latter estimate is used, ''excluding'' the ''i''th case, then the residual is said to be '''''externally studentized'''''; if the former is used, ''including'' the ''i''th case, then it is '''''internally studentized'''''.
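
The external version differs from the earlier sketch only in the variance estimate. A minimal illustration in the same style (Python with NumPy, names illustrative), implementing the leave-out-one sum displayed above:

<syntaxhighlight lang="python">
import numpy as np

def externally_studentized(x, y):
    """Externally studentized residuals: case i is excluded from the
    variance estimate used to scale the i-th residual."""
    n = len(x)
    X = np.column_stack([np.ones(n), x])
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    resid = y - H @ y
    m = X.shape[1]
    out = np.empty(n)
    for i in range(n):
        ss = np.sum(np.delete(resid, i) ** 2)  # sum over j != i
        sigma2_i = ss / (n - m - 1)            # variance estimate excluding case i
        out[i] = resid[i] / np.sqrt(sigma2_i * (1.0 - h[i]))
    return out
</syntaxhighlight>
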
If the errors are independent and [[normal distribution|normally distributed]] with [[expected value]] 0 and variance ''σ''<sup>2</sup>, then the [[probability distribution]] of the ''i''th externally studentized residual is a [[Student's t-distribution]] with ''n'' − ''m'' − 1 [[degrees of freedom (statistics)|degrees of freedom]], and can range from <math>\scriptstyle-\infty</math> to <math>\scriptstyle+\infty</math>.

On the other hand, the internally studentized residuals are in the range <math>\scriptstyle 0 \,\pm\, \sqrt{\mathrm{r.d.f.}}</math>, where r.d.f. is the number of residual degrees of freedom, namely ''n'' − ''m''. If "i.s.r." represents the internally studentized residual, and again assuming that the errors are independent identically distributed Gaussian variables, then

:<math>\mathrm{i.s.r.}^2 = \mathrm{r.d.f.}{t^2 \over t^2+\mathrm{r.d.f.}-1},</math>

where ''t'' is a random variable distributed as [[Student's t-distribution]] with r.d.f. − 1 degrees of freedom. In fact, this implies that i.s.r.<sup>2</sup>/r.d.f. follows the [[beta distribution]] ''B''(1/2, (r.d.f. − 1)/2). When r.d.f. = 3, the internally studentized residuals are [[uniform distribution (continuous)|uniformly distributed]] between <math>\scriptstyle-\sqrt{3}</math> and <math>\scriptstyle+\sqrt{3}</math>.
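
Rearranging the displayed relation for ''t'' (a routine algebraic step, added here for clarity) gives

:<math> t = \mathrm{i.s.r.}\sqrt{\frac{\mathrm{r.d.f.}-1}{\mathrm{r.d.f.}-\mathrm{i.s.r.}^2}}, </math>

which is real-valued precisely because i.s.r.<sup>2</sup> < r.d.f.; as i.s.r.<sup>2</sup> approaches its bound r.d.f., ''t'' diverges, consistent with the externally studentized residual ranging over the whole real line.
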
If there is only one residual degree of freedom, the above formula for the distribution of internally studentized residuals does not apply. In this case, the i.s.r.'s are all either +1 or −1, with a 50% chance for each.

The standard deviation of the ''distribution'' of internally studentized residuals is always 1, but this does not imply that the standard deviation of all the i.s.r.'s of a particular experiment is 1. For instance, the internally studentized residuals obtained when fitting a straight line through (0, 0) to the points (1, 4), (2, −1), (2, −1) are <math>\sqrt{2},\ -\sqrt{5}/5,\ -\sqrt{5}/5</math>, and the standard deviation of these is not 1.
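
This example can be verified with a small sketch in the same illustrative style (Python with NumPy; here the model has a single parameter, the slope, so ''m'' = 1):

<syntaxhighlight lang="python">
import numpy as np

# Check of the example: fit y = b*x (a line through the origin)
# to the points (1, 4), (2, -1), (2, -1).
x = np.array([1.0, 2.0, 2.0])
y = np.array([4.0, -1.0, -1.0])
X = x[:, None]                             # one-column design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
h = np.diag(H)                             # leverages: 1/9, 4/9, 4/9
resid = y - H @ y                          # b-hat = 0, so the residuals equal y
sigma2 = np.sum(resid**2) / (len(x) - 1)   # n - m with m = 1; equals 9
isr = resid / np.sqrt(sigma2 * (1.0 - h))  # [sqrt(2), -sqrt(5)/5, -sqrt(5)/5]
print(isr, np.std(isr))                    # standard deviation is about 0.88, not 1
</syntaxhighlight>
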
==See also==
* [[Normalization (statistics)]]
* [[Samuelson's inequality]]
* [[Standard score]]

==References==
*{{cite book |last1=Cook |first1=R. Dennis |last2=Weisberg |first2=Sanford |title=Residuals and Influence in Regression |year=1982 |publisher=[[Chapman and Hall]] |location=New York |isbn=041224280X |url=http://www.stat.umn.edu/rir/ |edition=Repr. |accessdate=23 February 2013}}
{{DEFAULTSORT:Studentized Residual}}
[[Category:Statistical outliers]]
[[Category:Statistical deviation and dispersion]]
[[Category:Error]]
[[Category:Measurement]]
[[Category:Statistical ratios]]