# Difference between revisions of "Distance correlation"

en>Rjwilmsi m (Citation parameter fixes, using AWB (9382)) |
|||

Line 10: | Line 10: | ||

The classical measure of dependence, the [[Pearson product-moment correlation coefficient|Pearson correlation coefficient]],<ref>Pearson (1895)</ref> is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by [[Gabor J Szekely]] in several lectures to address this deficiency of Pearson’s [[correlation]], namely that it can easily be zero for dependent variables. Correlation = 0 (uncorrelatedness) does not imply independence while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009.<ref name=SR2007>Székely, Rizzo and Bakirov (2007)</ref><ref name=SR2009>Székely & Rizzo (2009)</ref> It was proved that distance covariance is the same as the Brownian covariance.<ref name=SR2009/> These measures are examples of [[energy distance]]s. | The classical measure of dependence, the [[Pearson product-moment correlation coefficient|Pearson correlation coefficient]],<ref>Pearson (1895)</ref> is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by [[Gabor J Szekely]] in several lectures to address this deficiency of Pearson’s [[correlation]], namely that it can easily be zero for dependent variables. Correlation = 0 (uncorrelatedness) does not imply independence while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009.<ref name=SR2007>Székely, Rizzo and Bakirov (2007)</ref><ref name=SR2009>Székely & Rizzo (2009)</ref> It was proved that distance covariance is the same as the Brownian covariance.<ref name=SR2009/> These measures are examples of [[energy distance]]s. | ||

==Definitions== | ==Definitions== | ||

Line 31: | Line 32: | ||

</math> | </math> | ||

where <math>\textstyle \overline{a}_{j.}</math> is the {{math|''j''}}-th row mean, <math>\textstyle \overline{a}_{.k}</math> is the{{math|''k''}}-th column mean, and <math>\textstyle \overline{a}_{..}</math> is the grand mean of the distance matrix of the ''X'' sample. The notation is similar for the ''b'' values. (In the matrices of centered distances (''A''<sub>''j'', ''k''</sub>) and (''B''<sub>''j'',''k''</sub>) all rows and all columns sum to zero.) The squared '''sample distance covariance''' is simply the arithmetic average of the products ''A''<sub>''j'', ''k''</sub>''B''<sub>''j'', ''k''</sub>: | where <math>\textstyle \overline{a}_{j.}</math> is the {{math|''j''}}-th row mean, <math>\textstyle \overline{a}_{.k}</math> is the{{math|''k''}}-th column mean, and <math>\textstyle \overline{a}_{..}</math> is the [[grand mean]] of the distance matrix of the ''X'' sample. The notation is similar for the ''b'' values. (In the matrices of centered distances (''A''<sub>''j'', ''k''</sub>) and (''B''<sub>''j'',''k''</sub>) all rows and all columns sum to zero.) The squared '''sample distance covariance''' is simply the arithmetic average of the products ''A''<sub>''j'', ''k ''</sub>''B''<sub>''j'', ''k''</sub>: | ||

:<math> | :<math> | ||

\operatorname{dCov}^2_n(X,Y) := \frac{1}{n^2}\sum_{j, k = 1}^n A_{j, k}\,B_{j, k}. | \operatorname{dCov}^2_n(X,Y) := \frac{1}{n^2}\sum_{j, k = 1}^n A_{j, k}\,B_{j, k}. | ||

Line 112: | Line 113: | ||

(ii) <math>\operatorname{dCor}(X,Y) = 0</math> if and only if <math>X</math> and <math>Y</math> are independent. | (ii) <math>\operatorname{dCor}(X,Y) = 0</math> if and only if <math>X</math> and <math>Y</math> are independent. | ||

(iii) <math>\operatorname{dCor}_n(X,Y) = 1</math> implies that dimensions of the linear | (iii) <math>\operatorname{dCor}_n(X,Y) = 1</math> implies that dimensions of the linear subspaces spanned by <math>X</math> and <math>Y</math> samples respectively are almost surely equal and if we assume that these subspaces are equal, then in this subspace <math>Y = A + b\,\mathbf{C}X</math> for some vector <math>A</math>, scalar <math>b</math>, and [[orthonormal matrix]] <math>\mathbf{C}</math>. | ||

===Distance covariance=== | ===Distance covariance=== | ||

Line 216: | Line 217: | ||

* [[RV coefficient]] | * [[RV coefficient]] | ||

* For a related third-order statistic, see [[Skewness#Distance skewness|Distance skewness]]. | * For a related third-order statistic, see [[Skewness#Distance skewness|Distance skewness]]. | ||

* | |||

==Notes== | ==Notes== |

## Latest revision as of 13:56, 29 December 2014

In statistics and in probability theory, **distance correlation** is a measure of statistical dependence between two random variables or two random vectors of arbitrary, not necessarily equal dimension. An important property is that this measure of dependence is zero if and only if the random variables are statistically independent. This measure is derived from a number of other quantities that are used in its specification, specifically: **distance variance**, **distance standard deviation** and **distance covariance**. These take the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient.

These distance-based measures can be put into an indirect relationship to the ordinary moments by an alternative formulation (described below) using ideas related to Brownian motion, and this has led to the use of names such as **Brownian covariance** and **Brownian distance covariance**.

## Background

The classical measure of dependence, the Pearson correlation coefficient,^{[1]} is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by Gabor J Szekely in several lectures to address this deficiency of Pearson’s correlation, namely that it can easily be zero for dependent variables. Correlation = 0 (uncorrelatedness) does not imply independence while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009.^{[2]}^{[3]} It was proved that distance covariance is the same as the Brownian covariance.^{[3]} These measures are examples of energy distances.

## Definitions

### Distance covariance

Let us start with the definition of the **sample distance covariance**. Let (*X*_{k}, *Y*_{k}), *k*= 1, 2, ..., *n* be a statistical sample from a pair of real valued or vector valued random variables (*X*, *Y*). First, compute all pairwise distances

where || ⋅ || denotes Euclidean norm. That is, compute the *n* by *n* distance matrices (*a*_{j, k}) and (*b*_{j, k}). Then take all doubly centered distances

where is the *j*-th row mean, is the*k*-th column mean, and is the grand mean of the distance matrix of the *X* sample. The notation is similar for the *b* values. (In the matrices of centered distances (*A*_{j, k}) and (*B*_{j,k}) all rows and all columns sum to zero.) The squared **sample distance covariance** is simply the arithmetic average of the products *A*_{j, k }*B*_{j, k}:

The statistic *T*_{n} = *n* dCov^{2}_{n}(*X*, *Y*) determines a consistent multivariate test of independence of random vectors in arbitrary dimensions. For an implementation see *dcov.test* function in the *energy* package for R.^{[4]}

The population value of **distance covariance** can be defined along the same lines. Let *X* be a random variable that takes values in a *p*-dimensional Euclidean space with probability distribution μ and let *Y* be a random variable that takes values in a *q*-dimensional Euclidean space with probability distribution ν, and suppose that *X* and *Y* have finite expectations. Write

Finally, define the population value of squared distance covariance of *X* and *Y* as

One can show that this is equivalent to the following definition:

where * E* denotes expected value, and and are independent and identically distributed. Distance covariance can be expressed in terms of Pearson’s covariance,

**cov**, as follows:

This identity shows that the distance covariance is not the same as the covariance of distances, cov(||*X*-*X' *||, ||*Y*-*Y' * ||). This can be zero even if *X* and *Y* are not independent.

Alternately, the squared distance covariance can be defined as the weighted *L*_{2} norm of the distance between the joint characteristic function of the random variables and the product of their marginal characteristic functions:^{[5]}

where *ϕ*_{X, Y}(*s*, *t*), *ϕ*_{X}(*s*), and *ϕ*_{Y}(*t*) are the characteristic functions of (*X*, *Y*), *X*, and *Y*, respectively, *p*, *q* denote the Euclidean dimension of *X* and *Y*, and thus of *s* and *t*, and *c*_{p}, *c*_{q} are constants. The weight function is chosen to produce a scale equivariant and rotation invariant measure that doesn't go to zero for dependent variables.^{[5]}^{[6]} One interpretation^{[7]} of the characteristic function definition is that the variables *e ^{isX}* and

*e*are cyclic representations of

^{itY}*X*and

*Y*with different periods given by

*s*and

*t*, and the expression

*ϕ*

_{X, Y}(

*s*,

*t*) -

*ϕ*

_{X}(

*s*)

*ϕ*

_{Y}(

*t*) in the numerator of the characteristic function definition of distance covariance is simply the classical covariance of

*e*and

^{isX}*e*. The characteristic function definition clearly shows that dCov

^{itY}^{2}(

*X*,

*Y*) = 0 if and only if

*X*and

*Y*are independent.

### Distance variance

The *distance variance* is a special case of distance covariance when the two variables are identical.
The population value of distance variance is the square root of

where denotes the expected value, is an independent and identically distributed copy of and is independent of and and has the same distribution as and .

The *sample distance variance* is the square root of

which is a relative of Corrado Gini’s mean difference introduced in 1912 (but Gini did not work with centered distances).

### Distance standard deviation

The *distance standard deviation* is the square root of the *distance variance*.

### Distance correlation

The *distance correlation* ^{[2]}^{[3]} of two random variables is obtained by dividing their *distance covariance* by the product of their *distance standard deviations*. The distance correlation is

and the *sample distance correlation* is defined by substituting the sample distance covariance and distance variances for the population coefficients above.

For easy computation of sample distance correlation see the *dcor* function in the *energy* package for R.^{[4]}

## Properties

### Distance correlation

(ii) if and only if and are independent.

(iii) implies that dimensions of the linear subspaces spanned by and samples respectively are almost surely equal and if we assume that these subspaces are equal, then in this subspace for some vector , scalar , and orthonormal matrix .

### Distance covariance

(ii) for all constant vectors , scalars , and orthonormal matrices .

(iii) If the random vectors and are independent then

Equality holds if and only if and are both constants, or and are both constants, or are mutually independent.

(iv) if and only if and are independent.

This last property is the most important effect of working with centered distances.

The statistic is a biased estimator of . Under independence of X and Y ^{[8]}

### Distance variance

(i) if and only if almost surely.

(ii) if and only if every sample observation is identical.

(iii) for all constant vectors , scalars , and orthonormal matrices .

(iv) If and are independent then .

Equality holds in (iv) if and only if one of the random variables or is a constant.

## Generalization

Distance covariance can be generalized to include powers of Euclidean distance. Define

Then for every , and are independent if and only if . It is important to note that this characterization does not hold for exponent ; in this case for bivariate , is a deterministic function of the Pearson correlation.^{[2]} If and are powers of the corresponding distances, , then sample distance covariance can be defined as the nonnegative number for which

One can extend to metric-space-valued random variables and : If has law in a metric space with metric , then define , , and (provided is finite, i.e., has finite first moment), . Then if has law (in a possibly different metric space with finite first moment), define

This is non-negative for all such iff both metric spaces have negative type.^{[9]}
Here, a metric space has negative type
if is isometric to a subset of a Hilbert space.^{[10]}
If both metric spaces have strong negative type, then iff are independent.^{[9]}

## Alternative formulation: Brownian covariance

Brownian covariance is motivated by generalization of the notion of covariance to stochastic processes. The square of the covariance of random variables X and Y can be written in the following form:

where E denotes the expected value and the prime denotes independent and identically distributed copies. We need the following generalization of this formula. If U(s), V(t) are arbitrary random processes defined for all real s and t then define the U-centered version of X by

whenever the subtracted conditional expected value exists and denote by Y_{V} the V-centered version of Y.^{[3]}^{[11]}^{[12]} The (U,V) covariance of (X,Y) is defined as the nonnegative number whose square is

whenever the right-hand side is nonnegative and finite. The most important example is when U and V are two-sided independent Brownian motions /Wiener processes with expectation zero and covariance
|s| + |t| - |s-t| = 2 min(s,t). (This is twice the covariance of the standard Wiener process; here the factor 2 simplifies the computations.) In this case the (U,V) covariance is called **Brownian covariance** and is denoted by

There is a surprising coincidence: The Brownian covariance is the same as the distance covariance:

and thus **Brownian correlation** is the same as distance correlation.

On the other hand, if we replace the Brownian motion with the deterministic identity function *id* then Cov_{id}(X,Y) is simply the absolute value of the classical Pearson covariance,

## See also

- RV coefficient
- For a related third-order statistic, see Distance skewness.

## Notes

- ↑ Pearson (1895)
- ↑
^{2.0}^{2.1}^{2.2}Székely, Rizzo and Bakirov (2007) Cite error: Invalid`<ref>`

tag; name "SR2007" defined multiple times with different content - ↑
^{3.0}^{3.1}^{3.2}^{3.3}Székely & Rizzo (2009) - ↑
^{4.0}^{4.1}energy package for R - ↑
^{5.0}^{5.1}Székely & Rizzo (2009) Theorem 7, (3.7), p. 1249. - ↑ {{#invoke:Citation/CS1|citation |CitationClass=journal }}
- ↑ Template:Cite web
- ↑ Székely and Rizzo (2009), Rejoinder
- ↑
^{9.0}^{9.1}Lyons, R. (2011) "Distance covariance in metric spaces". Template:ArXiv - ↑ Klebanov, L. B. (2005)
*N-distances and their Applications*, Karolinum Press, Charles University, Prague. - ↑ Bickel & Xu (2009)
- ↑ Kosorok (2009)

## References

- Bickel, P.J. and Xu, Y. (2009) "Discussion of: Brownian distance covariance",
*Annals of Applied Statistics*, 3 (4), 1266–1269. Template:Hide in printTemplate:Only in print Free access to article - Gini, C. (1912). Variabilità e Mutabilità. Bologna: Tipografia di Paolo Cuppini.
- Pearson, K. (1895). "Note on regression and inheritance in the case of two parents",
*Proceedings of the Royal Society*, 58, 240–242 - Pearson, K. (1920). "Notes on the history of correlation",
*Biometrika*, 13, 25–45. - Székely, G. J. Rizzo, M. L. and Bakirov, N. K. (2007). "Measuring and testing independence by correlation of distances",
*Annals of Statistics*, 35/6, 2769–2794. Template:Hide in printTemplate:Only in print Reprint - Székely, G. J. and Rizzo, M. L. (2009). "Brownian distance covariance",
*Annals of Applied Statistics*, 3/4, 1233–1303. Template:Hide in printTemplate:Only in print Reprint - Kosorok, M. R. (2009) "Discussion of: Brownian Distance Covariance",
*Annals of Applied Statistics*, 3/4, 1270–1278. Template:Hide in printTemplate:Only in print Free access to article