# Exponential family


In probability and statistics, an exponential family is a set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, on account of some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The concept of exponential families is credited to[1] E. J. G. Pitman,[2] G. Darmois,[3] and B. O. Koopman[4] in 1935–36. The term exponential class is sometimes used in place of "exponential family".[5]

The exponential families include many of the most common distributions, including the normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart and many others. A number of common distributions are exponential families only when certain parameters are considered fixed and known, e.g. binomial (with fixed number of trials), multinomial (with fixed number of trials), and negative binomial (with fixed number of failures). Examples of common distributions that are not exponential families are Student's t, most mixture distributions, and even the family of uniform distributions with unknown bounds. See the section below on examples for more discussion.

Consideration of exponential-family distributions provides a general framework for selecting a possible alternative parameterisation of the distribution, in terms of natural parameters, and for defining useful sample statistics, called the natural sufficient statistics of the family. See below for more information.

## Definition

The following is a sequence of increasingly general definitions of an exponential family. A casual reader may wish to restrict attention to the first and simplest definition, which corresponds to a single-parameter family of discrete or continuous probability distributions.

### Scalar parameter

A single-parameter exponential family is a set of probability distributions whose probability density function (or probability mass function, for the case of a discrete distribution) can be expressed in the form

${\displaystyle f_{X}(x|\theta )=h(x)\exp \left(\eta (\theta )\cdot T(x)-A(\theta )\right)}$

where T(x), h(x), η(θ), and A(θ) are known functions.

An alternative, equivalent form often given is

${\displaystyle f_{X}(x|\theta )=h(x)g(\theta )\exp \left(\eta (\theta )\cdot T(x)\right)}$

or equivalently

${\displaystyle f_{X}(x|\theta )=\exp \left(\eta (\theta )\cdot T(x)-A(\theta )+B(x)\right)}$

The value θ is called the parameter of the family.

Note that x is often a vector of measurements, in which case T(x) is a function from the space of possible values of x to the real numbers.

If η(θ) = θ, then the exponential family is said to be in canonical form. By defining a transformed parameter η = η(θ), it is always possible to convert an exponential family to canonical form. The canonical form is non-unique, since η(θ) can be multiplied by any nonzero constant, provided that T(x) is multiplied by that constant's reciprocal.

Even when x is a scalar, and there is only a single parameter, the functions η(θ) and T(x) can still be vectors, as described below.

Note also that the function A(θ) or equivalently g(θ) is automatically determined once the other functions have been chosen, and assumes a form that causes the distribution to be normalized (sum or integrate to one over the entire domain). Furthermore, both of these functions can always be written as functions of η, even when η(θ) is not a one-to-one function, i.e. two or more different values of θ map to the same value of η(θ), and hence η(θ) cannot be inverted. In such a case, all values of θ mapping to the same η(θ) will also have the same value for A(θ) and g(θ).
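The claim that A is fixed by normalization can be checked numerically. The following is a minimal sketch, assuming the geometric family on {0, 1, 2, ...} with success probability p as an illustrative choice (it is not worked out in the text above); the helper name `log_partition` and the truncation of the support at 2000 terms are likewise assumptions made for the example.

```python
import math

def log_partition(eta, h, T, support):
    """Return A(eta) = ln( sum over x of h(x) * exp(eta * T(x)) ) on a discrete support."""
    return math.log(sum(h(x) * math.exp(eta * T(x)) for x in support))

# Geometric family on {0, 1, 2, ...}: pmf (1 - p)^x * p,
# so h(x) = 1, T(x) = x and eta = ln(1 - p).
p = 0.2
eta = math.log(1 - p)
support = range(2000)  # truncation of the infinite support; the omitted tail is negligible

A_numeric = log_partition(eta, lambda x: 1.0, lambda x: x, support)
A_closed_form = -math.log(p)  # normalization forces A = -ln(p) = -ln(1 - e^eta)

print(A_numeric, A_closed_form)
```

Once h, T and η are chosen, nothing about A is left to choose: the sum over the support yields it directly, and here it agrees with the closed form −ln p.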

Further down the page is the example of a normal distribution with unknown mean and known variance.

### Factorization of the variables involved

What is important to note, and what characterizes all exponential family variants, is that the parameter(s) and the observation variable(s) must factorize (can be separated into products each of which involves only one type of variable), either directly or within either part (the base or exponent) of an exponentiation operation. Generally, this means that all of the factors constituting the density or mass function must be of one of the following forms:

${\displaystyle f(x),g(\theta ),c^{f(x)},c^{g(\theta )},{[f(x)]}^{c},{[g(\theta )]}^{c},{[f(x)]}^{g(\theta )},{[g(\theta )]}^{f(x)},{[f(x)]}^{h(x)g(\theta )},{\text{ or }}{[g(\theta )]}^{h(x)j(\theta )},}$

where f and h are arbitrary functions of x; g and j are arbitrary functions of θ; and c is an arbitrary "constant" expression (i.e. an expression not involving x or θ).

There are further restrictions on how many such factors can occur. For example, the two expressions:

${\displaystyle {[f(x)g(\theta )]}^{h(x)j(\theta )},\qquad {[f(x)]}^{h(x)j(\theta )}[g(\theta )]^{h(x)j(\theta )},}$

are the same, i.e. a product of two "allowed" factors. However, when rewritten into the factorized form,

${\displaystyle {[f(x)g(\theta )]}^{h(x)j(\theta )}={[f(x)]}^{h(x)j(\theta )}[g(\theta )]^{h(x)j(\theta )}=e^{[h(x)\ln f(x)]j(\theta )+h(x)[j(\theta )\ln g(\theta )]},}$

it can be seen that it cannot be expressed in the required form. (However, a form of this sort is a member of a curved exponential family, which allows multiple factorized terms in the exponent.[citation needed])

To see why an expression of the form

${\displaystyle {[f(x)]}^{g(\theta )}}$

qualifies, note that

${\displaystyle {[f(x)]}^{g(\theta )}=e^{g(\theta )\ln f(x)}}$

and hence factorizes inside of the exponent. Similarly,

${\displaystyle {[f(x)]}^{h(x)g(\theta )}=e^{h(x)g(\theta )\ln f(x)}=e^{[h(x)\ln f(x)]g(\theta )}}$

and again factorizes inside of the exponent.

Note also that a factor consisting of a sum where both types of variables are involved (e.g. a factor of the form ${\displaystyle 1+f(x)g(\theta )}$) cannot be factorized in this fashion (except in some cases where occurring directly in an exponent); this is why, for example, the Cauchy distribution and Student's t distribution are not exponential families.
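As a worked illustration of these rules, consider the Poisson mass function with rate λ (chosen here as an example; it is not derived in the text above). Its factors are λ^x, of the allowed form [g(θ)]^{f(x)}, together with e^{−λ} and 1/x!, each of which involves only one type of variable:

${\displaystyle {\frac {\lambda ^{x}e^{-\lambda }}{x!}}={\frac {1}{x!}}\exp \left(x\ln \lambda -\lambda \right).}$

Comparing with the scalar-parameter definition gives h(x) = 1/x!, T(x) = x, η(λ) = ln λ and A(λ) = λ.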

### Vector parameter

The definition in terms of one real-number parameter can be extended to one real-vector parameter

${\displaystyle {\boldsymbol {\theta }}=\left(\theta _{1},\theta _{2},\cdots ,\theta _{s}\right)^{T}.}$

A family of distributions is said to belong to a vector exponential family if the probability density function (or probability mass function, for discrete distributions) can be written as

${\displaystyle f_{X}(x|{\boldsymbol {\theta }})=h(x)\exp \left(\sum _{i=1}^{s}\eta _{i}({\boldsymbol {\theta }})T_{i}(x)-A({\boldsymbol {\theta }})\right)}$

Or in a more compact form,

${\displaystyle f_{X}(x|{\boldsymbol {\theta }})=h(x)\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x)-A({\boldsymbol {\theta }}){\Big )}}$

This form writes the sum as a dot product of vector-valued functions ${\displaystyle {\boldsymbol {\eta }}({\boldsymbol {\theta }})}$ and ${\displaystyle \mathbf {T} (x)}$.

An alternative, equivalent form often seen is

${\displaystyle f_{X}(x|{\boldsymbol {\theta }})=h(x)g({\boldsymbol {\theta }})\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x){\Big )}}$

As in the scalar valued case, the exponential family is said to be in canonical form if

${\displaystyle \forall i:\quad \eta _{i}({\boldsymbol {\theta }})=\theta _{i}.}$

A vector exponential family is said to be curved if the dimension of

${\displaystyle {\boldsymbol {\theta }}=\left(\theta _{1},\theta _{2},\ldots ,\theta _{d}\right)^{T}}$

is less than the dimension of the vector

${\displaystyle {\boldsymbol {\eta }}({\boldsymbol {\theta }})=\left(\eta _{1}({\boldsymbol {\theta }}),\eta _{2}({\boldsymbol {\theta }}),\ldots ,\eta _{s}({\boldsymbol {\theta }})\right)^{T}.}$

That is, the family is curved if the dimension d of the parameter vector is less than the number s of functions of the parameter vector in the above representation of the probability density function. Most common distributions in the exponential family are not curved, and many algorithms designed to work with any member of the exponential family implicitly or explicitly assume that the distribution is not curved.
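A standard illustration of a curved family, not taken from the text above, is the normal distribution whose standard deviation is constrained to equal its mean, σ = μ > 0. Substituting this constraint into the natural parameters of the normal distribution, (μ/σ², −1/(2σ²)), gives

${\displaystyle {\boldsymbol {\eta }}(\mu )=\left({\frac {1}{\mu }},-{\frac {1}{2\mu ^{2}}}\right)^{T},\qquad \mathbf {T} (x)=\left(x,x^{2}\right)^{T}.}$

Here the parameter has dimension d = 1 while the natural parameter has dimension s = 2, so the admissible values of η trace out a curve in the two-dimensional natural parameter space.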

Note that, as in the above case of a scalar-valued parameter, the function ${\displaystyle A({\boldsymbol {\theta }})}$ or equivalently ${\displaystyle g({\boldsymbol {\theta }})}$ is automatically determined once the other functions have been chosen, so that the entire distribution is normalized. In addition, as above, both of these functions can always be written as functions of ${\displaystyle {\boldsymbol {\eta }}}$, regardless of the form of the transformation that generates ${\displaystyle {\boldsymbol {\eta }}}$ from ${\displaystyle {\boldsymbol {\theta }}}$. Hence an exponential family in its "natural form" (parametrized by its natural parameter) looks like

${\displaystyle f_{X}(x|{\boldsymbol {\eta }})=h(x)\exp {\Big (}{\boldsymbol {\eta }}\cdot \mathbf {T} (x)-A({\boldsymbol {\eta }}){\Big )}}$

or equivalently

${\displaystyle f_{X}(x|{\boldsymbol {\eta }})=h(x)g({\boldsymbol {\eta }})\exp {\Big (}{\boldsymbol {\eta }}\cdot \mathbf {T} (x){\Big )}}$

Note that the above forms may sometimes be seen with ${\displaystyle {\boldsymbol {\eta }}^{T}\mathbf {T} (x)}$ in place of ${\displaystyle {\boldsymbol {\eta }}\cdot \mathbf {T} (x)}$. These are exactly equivalent formulations, merely using different notation for the dot product.

Further down the page is the example of a normal distribution with unknown mean and variance.

### Vector parameter, vector variable

The vector-parameter form over a single scalar-valued random variable can be trivially expanded to cover a joint distribution over a vector of random variables. The resulting distribution is simply the same as the above distribution for a scalar-valued random variable with each occurrence of the scalar x replaced by the vector

${\displaystyle \mathbf {x} =\left(x_{1},x_{2},\cdots ,x_{k}\right).}$

Note that the dimension k of the random variable need not match the dimension d of the parameter vector, nor (in the case of a curved exponential family) the dimension s of the natural parameter ${\displaystyle {\boldsymbol {\eta }}}$ and sufficient statistic T(x).

The distribution in this case is written as

${\displaystyle f_{X}(\mathbf {x} |{\boldsymbol {\theta }})=h(\mathbf {x} )\exp \left(\sum _{i=1}^{s}\eta _{i}({\boldsymbol {\theta }})T_{i}(\mathbf {x} )-A({\boldsymbol {\theta }})\right)}$

Or more compactly as

${\displaystyle f_{X}(\mathbf {x} |{\boldsymbol {\theta }})=h(\mathbf {x} )\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (\mathbf {x} )-A({\boldsymbol {\theta }}){\Big )}}$

Or alternatively as

${\displaystyle f_{X}(\mathbf {x} |{\boldsymbol {\theta }})=h(\mathbf {x} )\ g({\boldsymbol {\theta }})\ \exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (\mathbf {x} ){\Big )}}$

### Measure-theoretic formulation

We use cumulative distribution functions (cdf) in order to encompass both discrete and continuous distributions.

Suppose H is a non-decreasing function of a real variable. Then Lebesgue–Stieltjes integrals with respect to dH(x) are integrals with respect to the "reference measure" of the exponential family generated by H.

Any member of that exponential family has cumulative distribution function

${\displaystyle dF(\mathbf {x} |{\boldsymbol {\eta }})=e^{{\boldsymbol {\eta }}^{\rm {T}}\mathbf {T} (\mathbf {x} )-A({\boldsymbol {\eta }})}dH(\mathbf {x} ).}$

If F is a continuous distribution with a density, one can write dF(x) = f(x) dx.

H(x) is a Lebesgue–Stieltjes integrator for the reference measure. When the reference measure is finite, it can be normalized and H is actually the cumulative distribution function of a probability distribution. If F is absolutely continuous with a density, then so is H, which can then be written dH(x) = h(x) dx. If F is discrete, then H is a step function (with steps on the support of F).

## Interpretation

In the definitions above, the functions T(x), η(θ) and A(η) were apparently arbitrarily defined. However, these functions play a significant role in the resulting probability distribution.

• T(x) is a sufficient statistic of the distribution. For exponential families, the sufficient statistic is a function of the data that fully summarizes how the data x enter the parameter-dependent part of the density function. This means that, for any data sets x and y with T(x) = T(y), the densities depend on the parameter in the same way: their ratio, h(x)/h(y), does not involve θ. This is true even if x and y are quite different, that is, ${\displaystyle d(x,y)>0}$. For a family that is not curved, the dimension of T(x) equals the number of parameters in θ, and T(x) encompasses all of the information in the data related to the parameter θ. The sufficient statistic of a set of independent identically distributed data observations is simply the sum of individual sufficient statistics, and encapsulates all the information needed to describe the posterior distribution of the parameters, given the data (and hence to derive any desired estimate of the parameters). This important property is further discussed below.
• A(η) is the log-partition function: the logarithm of the normalization factor, without which the density would not integrate to one:

${\displaystyle A(\eta )=\ln \left(\int _{x}h(x)\exp(\eta \cdot T(x))\operatorname {d} x\right)}$
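The role of T(x) as a summary of the data can be illustrated with a short sketch. It assumes a Poisson model and two hand-picked samples, both chosen purely for illustration and not taken from the text. The samples differ but share the same size and the same sum, and the code compares the parameter-dependent part of their log-likelihoods through a log-likelihood ratio between two rates.

```python
import math

def poisson_log_likelihood(data, lam):
    """Sum of log Poisson pmf values over an i.i.d. sample."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in data)

def log_likelihood_ratio(data, lam1, lam2):
    """Log-likelihood at lam1 minus log-likelihood at lam2; the h(x) terms cancel."""
    return poisson_log_likelihood(data, lam1) - poisson_log_likelihood(data, lam2)

# Two different samples with the same size and the same sum, hence the same
# sufficient statistic for an i.i.d. Poisson sample.
sample_a = [1, 2, 3]
sample_b = [0, 0, 6]

ratio_a = log_likelihood_ratio(sample_a, 1.5, 4.0)
ratio_b = log_likelihood_ratio(sample_b, 1.5, 4.0)
print(ratio_a, ratio_b)
```

The raw log-likelihoods of the two samples differ, because the factor h(x) = 1/x! differs, but any comparison between parameter values comes out the same, which is all that inference about the rate requires.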

The function A is important in its own right, because the mean, variance and other moments of the sufficient statistic T(x) can be derived simply by differentiating A(η). For example, because ln(x) is one of the components of the sufficient statistic of the gamma distribution, ${\displaystyle \mathbb {E} [\ln x]}$ can be easily determined for this distribution using A(η). Technically, this is true because

${\displaystyle K(u|\eta )=A(\eta +u)-A(\eta ),}$

is the cumulant generating function of the sufficient statistic.
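As a rough numerical check of the statement that moments of T(x) follow from derivatives of A, the sketch below assumes the Bernoulli family, for which A(η) = ln(1 + e^η) and T(x) = x. The finite-difference helpers and the step size are illustrative choices, not part of any particular library.

```python
import math

def A(eta):
    """Log-partition function of the Bernoulli family, A(eta) = ln(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2 * step)

def second_derivative(f, x, step=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + step) - 2 * f(x) + f(x - step)) / step**2

p = 0.3
eta = math.log(p / (1 - p))  # natural parameter (logit of p)

mean_from_A = derivative(A, eta)             # approximates E[T(x)] = p
variance_from_A = second_derivative(A, eta)  # approximates Var[T(x)] = p * (1 - p)
print(mean_from_A, variance_from_A)
```

The first derivative recovers the mean p and the second derivative the variance p(1 − p), up to finite-difference error.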

## Properties

Exponential families have a large number of properties that make them extremely useful for statistical analysis. In many cases it can be shown that, apart from a few exceptions, only exponential families have these properties.[citation needed]

## Examples

It is critical, when considering the examples in this section, to remember the discussion above about what it means to say that a "distribution" is an exponential family, and in particular to keep in mind that the set of parameters that are allowed to vary is critical in determining whether a "distribution" is or is not an exponential family.

The normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, von Mises and von Mises-Fisher distributions are all exponential families.

Some distributions are exponential families only if some of their parameters are held fixed. The family of Pareto distributions with a fixed minimum bound ${\displaystyle x_{m}}$ forms an exponential family. The families of binomial and multinomial distributions with fixed number of trials n but unknown probability parameter(s) are exponential families. The family of negative binomial distributions with fixed number of failures (a.k.a. stopping-time parameter) r is an exponential family. However, when any of the above-mentioned fixed parameters are allowed to vary, the resulting family is not an exponential family.

As mentioned above, as a general rule, the support of an exponential family must remain the same across all parameter settings in the family. This is why the above cases (e.g. binomial with varying number of trials, Pareto with varying minimum bound) are not exponential families: in all of these cases, the parameter in question affects the support (in particular, it changes the minimum or maximum possible value). For similar reasons, neither the discrete uniform distribution nor the continuous uniform distribution is an exponential family, regardless of whether one of the bounds is held fixed. (If both bounds are held fixed, the result is a single distribution, not a family at all.)

The Weibull distribution with fixed shape parameter k is an exponential family. Unlike in the previous examples, the shape parameter does not affect the support; the fact that allowing it to vary makes the Weibull non-exponential is due rather to the particular form of the Weibull's probability density function (k appears in the exponent of an exponent).
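To make this concrete, write the Weibull density with scale parameter λ (a symbol introduced here for illustration) and shape k:

${\displaystyle f(x;\lambda ,k)={\frac {k}{\lambda }}\left({\frac {x}{\lambda }}\right)^{k-1}e^{-(x/\lambda )^{k}}=k\,x^{k-1}\exp \left(-\lambda ^{-k}x^{k}-k\ln \lambda \right),\quad x\geq 0.}$

With k held fixed this matches the scalar-parameter form with h(x) = k x^{k−1}, T(x) = x^k, η(λ) = −λ^{−k} and A(λ) = k ln λ. If k is allowed to vary, the term x^k inside the exponent can no longer be written as a product of a function of x alone and a function of the parameters alone.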

In general, distributions that result from a finite or infinite mixture of other distributions, e.g. mixture model densities and compound probability distributions, are not exponential families. Examples are typical Gaussian mixture models as well as many heavy-tailed distributions that result from compounding (i.e. infinitely mixing) a distribution with a prior distribution over one of its parameters, e.g. the Student's t-distribution (compounding a normal distribution over a gamma-distributed precision prior), and the beta-binomial and Dirichlet-multinomial distributions. Other examples of distributions that are not exponential families are the F-distribution, Cauchy distribution, hypergeometric distribution and logistic distribution.

Following are some detailed examples of the representation of some useful distributions as exponential families.

### Normal distribution: Unknown mean, known variance

As a first example, consider a random variable distributed normally with unknown mean μ and known variance σ². The probability density function is then

${\displaystyle f_{\sigma }(x;\mu )={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}}.}$

This is a single-parameter exponential family, as can be seen by setting

${\displaystyle {\begin{aligned}h_{\sigma }(x)&={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {x^{2}}{2\sigma ^{2}}}}\\T_{\sigma }(x)&={\frac {x}{\sigma }}\\A_{\sigma }(\mu )&={\frac {\mu ^{2}}{2\sigma ^{2}}}\\\eta _{\sigma }(\mu )&={\frac {\mu }{\sigma }}.\end{aligned}}}$

If σ = 1 this is in canonical form, as then η(μ) = μ.
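The decomposition above can be sanity-checked numerically. The following sketch (function names are illustrative) evaluates the density once directly and once as h(x) exp(η T(x) − A), using the four functions exactly as listed.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density written in the usual way."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

def exp_family_pdf(x, mu, sigma):
    """Same density assembled from h, T, eta and A for known sigma."""
    h = math.exp(-x**2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)
    T = x / sigma
    eta = mu / sigma
    A = mu**2 / (2 * sigma**2)
    return h * math.exp(eta * T - A)

for x in (-1.0, 0.0, 0.7, 3.2):
    print(x, normal_pdf(x, 1.5, 2.0), exp_family_pdf(x, 1.5, 2.0))
```

The two columns agree to floating-point precision, since completing the square in the exponent gives −(x − μ)²/(2σ²) = −x²/(2σ²) + xμ/σ² − μ²/(2σ²).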

### Normal distribution: Unknown mean and unknown variance

Next, consider the case of a normal distribution with unknown mean and unknown variance. The probability density function is then

${\displaystyle f(x;\mu ,\sigma )={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}}.}$

This is an exponential family which can be written in canonical form by defining

${\displaystyle {\begin{aligned}{\boldsymbol {\eta }}&=\left({\frac {\mu }{\sigma ^{2}}},-{\frac {1}{2\sigma ^{2}}}\right)^{\rm {T}}\\h(x)&={\frac {1}{\sqrt {2\pi }}}\\T(x)&=\left(x,x^{2}\right)^{\rm {T}}\\A({\boldsymbol {\eta }})&={\frac {\mu ^{2}}{2\sigma ^{2}}}+\ln |\sigma |=-{\frac {\eta _{1}^{2}}{4\eta _{2}}}+{\frac {1}{2}}\ln \left|{\frac {1}{2\eta _{2}}}\right|\end{aligned}}}$
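A short sketch can confirm that the natural-parameter form reproduces the usual density. The function names are illustrative; the formulas for η, h, T and A are those given above.

```python
import math

def natural_parameters(mu, sigma):
    """Map (mu, sigma) to the natural parameters (eta1, eta2)."""
    return mu / sigma**2, -1.0 / (2 * sigma**2)

def log_partition(eta1, eta2):
    """A(eta) expressed purely in terms of the natural parameters."""
    return -eta1**2 / (4 * eta2) + 0.5 * math.log(abs(1.0 / (2 * eta2)))

def exp_family_pdf(x, eta1, eta2):
    """h(x) * exp(eta . T(x) - A(eta)) with T(x) = (x, x^2)."""
    h = 1.0 / math.sqrt(2 * math.pi)
    return h * math.exp(eta1 * x + eta2 * x**2 - log_partition(eta1, eta2))

def normal_pdf(x, mu, sigma):
    """Normal density written in the usual way, for comparison."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

mu, sigma = 1.0, 2.0
eta1, eta2 = natural_parameters(mu, sigma)
for x in (-2.0, 0.0, 1.0, 4.5):
    print(x, normal_pdf(x, mu, sigma), exp_family_pdf(x, eta1, eta2))
```

Note that η₂ must be negative for the density to be normalizable, which is why the absolute value inside the logarithm is harmless.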

### Binomial distribution

As an example of a discrete exponential family, consider the binomial distribution with known number of trials n. The probability mass function for this distribution is

${\displaystyle f(x)={n \choose x}p^{x}(1-p)^{n-x},\quad x\in \{0,1,2,\ldots ,n\}.}$

This can equivalently be written as

${\displaystyle f(x)={n \choose x}\exp \left(x\log \left({\frac {p}{1-p}}\right)+n\log(1-p)\right),}$

which shows that the binomial distribution is an exponential family, whose natural parameter is

${\displaystyle \eta =\log {\frac {p}{1-p}}.}$

This function of p is known as logit.
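The same kind of check works for the binomial case. In the sketch below (names are illustrative), the log-partition function is written in terms of the natural parameter as A(η) = n ln(1 + e^η), which equals −n ln(1 − p) and therefore matches the term n log(1 − p) in the exponent above.

```python
import math

def logit(p):
    """Natural parameter of the binomial family with fixed n."""
    return math.log(p / (1 - p))

def binomial_pmf(x, n, p):
    """Binomial mass function written in the usual way."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def exp_family_pmf(x, n, eta):
    """h(x) * exp(eta * x - A(eta)) with h(x) = C(n, x)."""
    A = n * math.log1p(math.exp(eta))  # equals -n * ln(1 - p)
    return math.comb(n, x) * math.exp(eta * x - A)

n, p = 10, 0.3
eta = logit(p)
for x in range(n + 1):
    print(x, binomial_pmf(x, n, p), exp_family_pmf(x, n, eta))
```

The number of trials n enters only through h(x) and A, which is consistent with the requirement that n be held fixed for the family to be exponential.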

## Table of distributions

The following table shows how to rewrite a number of common distributions as exponential-family distributions with natural parameters. Refer to the flashcards[6] for main exponential families.

For a scalar variable and scalar parameter, the form is as follows:

${\displaystyle f_{X}(x|\theta )=h(x)\exp {\Big (}\eta (\theta )T(x)-A(\eta ){\Big )}}$

For a scalar variable and vector parameter:

${\displaystyle f_{X}(x|{\boldsymbol {\theta }})=h(x)\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x)-A({\boldsymbol {\theta }}){\Big )}}$

${\displaystyle f_{X}(x|{\boldsymbol {\theta }})=h(x)g({\boldsymbol {\theta }})\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (x){\Big )}}$

For a vector variable and vector parameter:

${\displaystyle f_{X}(\mathbf {x} |{\boldsymbol {\theta }})=h(\mathbf {x} )\exp {\Big (}{\boldsymbol {\eta }}({\boldsymbol {\theta }})\cdot \mathbf {T} (\mathbf {x} )-A({\boldsymbol {\eta }}){\Big )}}$

The above formulas choose the functional form of the exponential-family with a log-partition function ${\displaystyle A({\boldsymbol {\eta }})}$. The reason for this is so that the moments of the sufficient statistics can be calculated easily, simply by differentiating this function. Alternative forms involve either parameterizing this function in terms of the normal parameter ${\displaystyle {\boldsymbol {\theta }}}$ instead of the natural parameter, and/or using a factor ${\displaystyle g({\boldsymbol {\eta }})}$ outside of the exponential. The relation between the latter and the former is:

${\displaystyle A({\boldsymbol {\eta }})=-\ln g({\boldsymbol {\eta }})}$

${\displaystyle g({\boldsymbol {\eta }})=e^{-A({\boldsymbol {\eta }})}}$

To convert between the representations involving the two types of parameter, use the formulas below for writing one type of parameter in terms of the other.