Huber loss function

In statistical theory, the Huber loss function is a loss function used in robust estimation. It allows construction of an estimate in which the effect of outliers is reduced, while non-outliers are treated in a more standard way.

Definition

The Huber loss function describes the penalty incurred by an estimation procedure. Huber (1964[1]) defines the loss function piecewise by

${\displaystyle L_{\delta }(a)=(1/2){a^{2}}\qquad \qquad {\text{ for }}|a|\leq \delta ,}$
${\displaystyle L_{\delta }(a)=\delta (|a|-\delta /2),\qquad {\text{otherwise}}.}$

This function is quadratic for small values of a, and linear for large values, with equal values and slopes of the different sections at the two points where |a| = δ. In use, the variable ${\displaystyle a}$ often refers to the residuals, that is to the difference between the observed and predicted values, i.e. ${\displaystyle a=y-{\hat {y}}}$.
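The piecewise definition can be written out directly. The following is a minimal sketch in Python; the function names `huber_loss` and `huber_grad` and the default `delta=1.0` are illustrative choices, not part of Huber's definition.

```python
def huber_loss(a, delta=1.0):
    """Huber loss L_delta(a) for a single residual a (delta > 0)."""
    if abs(a) <= delta:
        return 0.5 * a * a                 # quadratic branch, |a| <= delta
    return delta * (abs(a) - 0.5 * delta)  # linear branch, |a| > delta


def huber_grad(a, delta=1.0):
    """Derivative of the Huber loss: the residual clipped to [-delta, delta]."""
    return max(-delta, min(delta, a))
```

At |a| = δ both branches give the value δ²/2 and the slope ±δ, which is the matching of values and slopes described above.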

Motivation

For estimating parameters, it is desirable for a loss function to have the following properties (for all values of ${\displaystyle a}$ of the parameter space):

1. It is greater than or equal to the 0-1 loss function (which is defined as ${\displaystyle L(a)=0}$ if ${\displaystyle a=0}$ and ${\displaystyle L(a)=1}$ otherwise).
2. It is continuous (or lower semicontinuous).

Two very commonly used loss functions are the squared loss, ${\displaystyle L(a)=a^{2}}$, and the absolute loss, ${\displaystyle L(a)=|a|}$. The absolute loss is not differentiable at exactly one point, ${\displaystyle a=0}$, where it is subdifferentiable with its convex subdifferential equal to the interval ${\displaystyle [-1,+1]}$; nevertheless, the absolute-value loss function results in a median-unbiased estimator, which can be evaluated for particular data sets by linear programming. The squared loss has the disadvantage that it tends to be dominated by outliers: when summing over a set of ${\displaystyle a}$'s (as in ${\displaystyle \sum _{i=1}^{n}L(a_{i})}$), the sample mean is influenced too much by a few particularly large ${\displaystyle a}$-values when the distribution is heavy tailed. In terms of estimation theory, the asymptotic relative efficiency of the mean is poor for heavy-tailed distributions.
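The difference in outlier sensitivity between the two losses can be illustrated with a small sketch. The data below are made up for illustration; the sample mean minimizes the summed squared loss and the sample median minimizes the summed absolute loss.

```python
import statistics

# Hypothetical data: the second sample replaces one value with a gross outlier.
clean = [1.0, 2.0, 3.0, 4.0, 5.0]
contaminated = [1.0, 2.0, 3.0, 4.0, 500.0]

# Minimizer of the summed squared loss (sample mean).
mean_clean = statistics.mean(clean)                  # 3.0
mean_contaminated = statistics.mean(contaminated)    # 102.0

# Minimizer of the summed absolute loss (sample median).
median_clean = statistics.median(clean)                # 3.0
median_contaminated = statistics.median(contaminated)  # 3.0
```

A single outlier moves the mean from 3.0 to 102.0, while the median is unchanged.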

As defined above, the Huber loss function is convex in a uniform neighborhood of its minimum ${\displaystyle a=0}$; at the boundary of this uniform neighborhood, the Huber loss function has a differentiable extension to an affine function at points ${\displaystyle a=-\delta }$ and ${\displaystyle a=\delta }$. These properties allow it to combine much of the sensitivity of the mean-unbiased, minimum-variance estimator of the mean (using the quadratic loss function) and the robustness of the median-unbiased estimator (using the absolute value function).
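As an illustration of this compromise, the sketch below estimates a location parameter by minimizing the summed Huber loss of the residuals. It uses iteratively reweighted averaging, which is one common way to solve this problem and is an assumption of this example rather than something prescribed by the text; the function name `huber_location`, the starting point, the tolerances, and the data are likewise illustrative.

```python
import statistics


def huber_location(values, delta=1.0, tol=1e-9, max_iter=100):
    """Location estimate minimizing the summed Huber loss of (v - mu)."""
    mu = statistics.median(values)  # robust starting point (a choice of this sketch)
    for _ in range(max_iter):
        # Weight 1 for residuals inside [-delta, delta], delta/|r| outside,
        # so that weight * residual equals the clipped residual.
        weights = [
            1.0 if abs(v - mu) <= delta else delta / abs(v - mu)
            for v in values
        ]
        new_mu = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu


data = [0.0, 1.0, 2.0, 4.0, 100.0]  # hypothetical sample with one outlier
estimate = huber_location(data, delta=1.5)
```

For this sample the mean is 21.4 and the median is 2.0; the Huber estimate with δ = 1.5 is about 2.25, close to the median but still responsive to the non-outlying values.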

The log cosh loss function, defined as ${\displaystyle L(a)=\log(\cosh(a))}$, behaves similarly to the Huber loss function.
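The similarity can be checked numerically: log(cosh(a)) is close to a²/2 for small a and close to |a| − log 2 for large |a|, so it is roughly quadratic near zero and linear in the tails. The sketch below uses an algebraically equivalent form that avoids overflow in cosh for large arguments; the function name `log_cosh_loss` is an illustrative choice.

```python
import math


def log_cosh_loss(a):
    """log(cosh(a)), computed as |a| + log(1 + exp(-2|a|)) - log(2)."""
    x = abs(a)
    return x + math.log1p(math.exp(-2.0 * x)) - math.log(2.0)


small = log_cosh_loss(0.01)    # close to 0.5 * 0.01**2
large = log_cosh_loss(50.0)    # close to 50.0 - log(2)
```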

Pseudo-Huber loss function

The Pseudo-Huber loss function can be used as a smooth approximation of the Huber loss function, and ensures that derivatives are continuous for all degrees. It is defined as

${\displaystyle L_{\delta }(a)=\delta ^{2}({\sqrt {1+(a/\delta )^{2}}}-1).}$

As such, this function approximates ${\displaystyle a^{2}/2}$ for small values of ${\displaystyle a}$, and approximates a straight line with slope ${\displaystyle \delta }$ for large values of ${\displaystyle a}$.
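A minimal Python sketch of the Pseudo-Huber loss and its derivative follows; the function names and the default `delta=1.0` are illustrative. The derivative is obtained by differentiating the formula above and tends to ±δ as |a| grows.

```python
import math


def pseudo_huber_loss(a, delta=1.0):
    """Pseudo-Huber loss: delta^2 * (sqrt(1 + (a/delta)^2) - 1)."""
    return delta ** 2 * (math.sqrt(1.0 + (a / delta) ** 2) - 1.0)


def pseudo_huber_grad(a, delta=1.0):
    """Derivative with respect to a: a / sqrt(1 + (a/delta)^2)."""
    return a / math.sqrt(1.0 + (a / delta) ** 2)


near_zero = pseudo_huber_loss(0.01)          # close to 0.5 * 0.01**2
tail_slope = pseudo_huber_grad(1e6, 2.0)     # close to delta = 2.0
```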

Applications

The Huber loss function is used in robust statistics, M-estimation and additive modelling.[2]