Leverage (statistics)
In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations.
High-leverage points are those observations, if any, made at extreme or outlying values of the independent variables such that the lack of neighboring observations means that the fitted regression model will pass close to that particular observation.[1]
Modern computer packages for statistical analysis include, as part of their facilities for regression analysis, various quantitative measures for identifying influential observations: among these measures is partial leverage, a measure of how a variable contributes to the leverage of a datum.
Linear regression model
Definition
In the linear regression model, the leverage score for the i-th data unit is defined as
\[ h_{ii} = \left[\mathbf{H}\right]_{ii}, \]
the i-th diagonal element of the projection matrix \( \mathbf{H} = \mathbf{X}\left(\mathbf{X}^{\mathsf{T}}\mathbf{X}\right)^{-1}\mathbf{X}^{\mathsf{T}} \), where \( \mathbf{X} \) is the design matrix (whose rows correspond to the observations and whose columns correspond to the independent variables). The leverage score is also known as the observation self-sensitivity or self-influence,[2] as shown by
\[ h_{ii} = \frac{\partial \hat{y}_i}{\partial y_i}, \]
where \( \hat{y}_i \) and \( y_i \) are the fitted and measured observation, respectively.
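As a concrete illustration (not part of the definition above), the leverage scores can be computed directly from the design matrix by forming the hat matrix and reading off its diagonal. The following is a minimal NumPy sketch; the helper name leverage_scores and the toy data are illustrative.

```python
import numpy as np

def leverage_scores(X):
    """Return the diagonal of the hat matrix H = X (X^T X)^{-1} X^T."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    return np.diag(H)

# Toy design matrix: an intercept column plus one explanatory variable
# whose last value (x = 10) lies far from the others.
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 10.0])])
print(leverage_scores(X))  # the outlying x-value receives by far the largest leverage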
Bounds on leverage
\[ 0 \leq h_{ii} \leq 1. \]
Proof
First, note that \( \mathbf{H} \) is an idempotent matrix: \( \mathbf{H}^2 = \mathbf{H} \). Also, observe that \( \mathbf{H} \) is symmetric, i.e. \( \mathbf{H}^{\mathsf{T}} = \mathbf{H} \). So equating the ii element of \( \mathbf{H} \) to that of \( \mathbf{H}^2 \), we have
\[ h_{ii} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \geq 0 \]
and
\[ h_{ii} \geq h_{ii}^2 \implies h_{ii} \leq 1. \]
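The bound is easy to check numerically. The sketch below is illustrative only, assuming NumPy and a randomly generated design matrix; it verifies that every diagonal entry of the hat matrix lies in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # random 50 x 3 design matrix (illustrative)
H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat (projection) matrix
h = np.diag(H)                        # leverage scores

# Idempotence and symmetry of H force every leverage into [0, 1].
assert np.all(h >= -1e-12) and np.all(h <= 1 + 1e-12)
print(h.min(), h.max())
```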
Effect on residual variance
If we are in an ordinary least squares setting with fixed design matrix \( \mathbf{X} \) and homoscedastic regression errors, so that
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, \qquad \operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}, \]
then
\[ \operatorname{Var}(e_i) = (1 - h_{ii})\,\sigma^2, \]
where \( e_i = y_i - \hat{y}_i \) is the i-th regression residual.
In other words, if the model errors are homoscedastic, an observation's leverage score determines the degree of noise in the model's misprediction of that observation.
Proof
First, note that \( \mathbf{I} - \mathbf{H} \) is idempotent and symmetric, and that \( \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = (\mathbf{I} - \mathbf{H})\mathbf{y} \). This gives
\[ \operatorname{Var}(\mathbf{e}) = \operatorname{Var}\big((\mathbf{I} - \mathbf{H})\mathbf{y}\big) = (\mathbf{I} - \mathbf{H})\operatorname{Var}(\mathbf{y})(\mathbf{I} - \mathbf{H})^{\mathsf{T}} = \sigma^2 (\mathbf{I} - \mathbf{H})^2 = \sigma^2 (\mathbf{I} - \mathbf{H}). \]
Thus
\[ \operatorname{Var}(e_i) = (1 - h_{ii})\,\sigma^2. \]
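This identity can also be checked by simulation. The sketch below is illustrative only, assuming NumPy, a small fixed design, and arbitrarily chosen coefficients and noise level; it compares the Monte Carlo variance of each residual with the theoretical value \( (1 - h_{ii})\sigma^2 \).

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 8, 2.0
X = np.column_stack([np.ones(n), np.arange(n, dtype=float)])  # fixed design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T
beta = np.array([1.0, 0.5])                                   # arbitrary true coefficients

# Monte Carlo: redraw homoscedastic errors many times and collect the residual vectors.
residuals = np.empty((20000, n))
for k in range(residuals.shape[0]):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    residuals[k] = y - H @ y                                   # e = (I - H) y

print(np.round(residuals.var(axis=0), 2))                      # empirical Var(e_i)
print(np.round((1 - np.diag(H)) * sigma**2, 2))                # theoretical (1 - h_ii) sigma^2
```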
Studentized residuals
The corresponding studentized residual, that is, the residual adjusted for its observation-specific estimated residual variance, is then
\[ t_i = \frac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}, \]
where \( \widehat{\sigma} \) is an appropriate estimate of \( \sigma \).
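A minimal sketch of this calculation, assuming NumPy and estimating \( \sigma \) from the residual sum of squares with \( n - p \) degrees of freedom (the internally studentized version); the helper name studentized_residuals and the toy data are illustrative.

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally studentized residuals t_i = e_i / (sigma_hat * sqrt(1 - h_ii))."""
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                              # ordinary residuals
    h = np.diag(H)                             # leverage scores
    sigma_hat = np.sqrt(e @ e / (n - p))       # one common estimate of sigma
    return e / (sigma_hat * np.sqrt(1 - h))

X = np.column_stack([np.ones(6), np.array([0.0, 1.0, 2.0, 3.0, 4.0, 12.0])])
y = np.array([0.1, 1.2, 1.9, 3.2, 4.1, 11.5])
print(studentized_residuals(X, y))
```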
See also
- Projection matrix – whose main diagonal entries are the leverages of the observations
- Mahalanobis distance – a measure of leverage of a datum
- Cook's distance – a measure of changes in regression coefficients when an observation is deleted
- DFFITS
- Outliers – observations with extreme Y values
References
1. Everitt, B. S. (2002). Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 0-521-81099-X.
2. Cardinali, C. (June 2013). "Data Assimilation: Observation influence diagnostic of a data assimilation system" (PDF).