Lecture 10

Author

Jiaye Xu

Published

May 2, 2022

Lecture 10 Overview

  • Outliers, High-leverage Observations and Influential Points

    • Illustration: Three cases of isolated departure
  • Diagnostics for Individual Points: Measures of Consistency, Leverage and Influence

  • Detect Outliers

  • Detect High-Leverage Points

  • Identify Influential Points

  • Implementation in R


Outliers, High-Leverage Points, and Influential Points

  • An outlier is a data point whose response \(y\) does not follow the general trend of the rest of the data.

Note: We consider a point to be an outlier only if it is extreme with respect to the other \(y\) values, not the \(x\) values.

  • A data point has high leverage if it has “extreme” predictor \(x\) value(s).

Comment: With multiple predictors, extreme \(x\) values may be particularly high or low for one or more predictors.

  • A data point is influential if it unduly influences any part of a regression analysis, e.g., the predicted responses, estimated coefficients.

Note: Outliers and high-leverage data points have the potential to be influential, but we generally have to investigate further to determine whether or not they are actually influential.

Illustration: Three cases of isolated departure

An isolated departure loosely means a point that is well apart from the main cluster of the data. For illustration, consider the following three configurations:

[Figure: three cases of departure]

Interpretation:

  1. The point marked with “\(\star\)” is not consistent with the remainder of the data, so it is an outlier. However, it has only a small influence on the fit since its leverage is small.

  2. The point marked with “\(\star\)” is consistent with the remainder (not an outlier), so it has a small influence on the fit even though its leverage is large.

  3. The point marked with “\(\star\)” is not consistent with the remainder and has a large leverage, so it has a large influence on the fit.

Diagnostics for Individual Points: Measures of Consistency, Leverage and Influence

  • Consistency: an inconsistent point, i.e., one with a large residual, is called an outlier.

  • Leverage: diagonal elements of the hat matrix \(\mathbf H\).

  • Influence: Cook’s distance measure based on single-case deletion, denoted as \(D_i\), is defined as:

\[ D_{i}=\frac{\sum_{j=1}^{n}\left[\hat{y}_{j}-\hat{y}_{j(i)}\right]^{2}}{p \times M S E} \]

where \(\hat{y}_{j(i)}\) is the predicted value for the \(j\)th observation when the \(i\)th observation is excluded from the fit.

Comments:

  • \(D_i\) directly summarizes how much all of the fitted values change when the \(i\)th observation is deleted.

  • A data point having a large \(D_i\) indicates that the data point strongly influences the fitted values.
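The definition can be checked numerically with a brute-force, leave-one-out computation. Below is a minimal sketch on simulated toy data (not a dataset from this lecture): it refits the model with each observation deleted and compares the result with R's built-in cooks.distance(). The toy objects x, y, and fit defined here are reused in later sketches.

# Minimal sketch: Cook's distance via single-case deletion (toy data).
set.seed(1)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)

n <- length(y)
p <- length(coef(fit))           # number of regression coefficients
mse <- summary(fit)$sigma^2      # MSE from the full fit
yhat <- fitted(fit)

D <- numeric(n)
for (i in 1:n) {
  fit_i <- lm(y ~ x, subset = -i)                        # refit without observation i
  yhat_i <- predict(fit_i, newdata = data.frame(x = x))  # predictions for all n points
  D[i] <- sum((yhat - yhat_i)^2) / (p * mse)
}

all.equal(D, unname(cooks.distance(fit)))  # should be TRUE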

Detect Outliers

We have seen that the residuals \(e_i = y_i - \hat y_i\) for \(i = 1, \ldots, n\) help with identifying outliers.

The major problem with ordinary residuals is that their magnitude depends on the units of measurement, thereby making it difficult to use the residuals as a way of detecting unusual \(y\) values.

We can eliminate the units of measurement by dividing the residuals by an estimate of their standard deviation, thereby obtaining what are known as standardized residuals.

Some books refer to standardized residuals as studentized residuals.

Standardized Residuals

Standardized residuals (also known as studentized residuals) are defined as an ordinary residual divided by an estimate of its standard deviation:

\[ r_{i}=\frac{e_{i}}{s\left(e_{i}\right)}=\frac{e_{i}}{\hat \sigma\sqrt{\left(1-h_{i i}\right)}} \]

(Recall \(\mathbf e = (\mathbf I - \mathbf H) \mathbf y\) and \(\operatorname{Var}(\mathbf y)= \sigma^2 \mathbf I\))

Comments:

  • Here, we see that the standardized residual for a given data point depends not only on the ordinary residual, but also on the size of \(\hat \sigma\) (equivalently, the MSE) and the leverage \(h_{i i}\).

  • The good thing about standardized residuals is that they quantify how large the residuals are in standard deviation units, and therefore can be easily used to identify outliers.

  • As a rule of thumb, an observation with a standardized residual larger than \(3\) in absolute value is generally deemed an outlier.
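As a quick check of the formula above, the standardized residuals can be computed by hand from the ordinary residuals, \(\hat\sigma\), and the leverages, and compared with R's rstandard(); a minimal sketch, reusing the toy fit from the earlier Cook's distance sketch:

e   <- residuals(fit)          # ordinary residuals
h   <- hatvalues(fit)          # leverages h_ii
sig <- summary(fit)$sigma      # hat(sigma) = sqrt(MSE)

r <- e / (sig * sqrt(1 - h))   # standardized residuals computed by hand
all.equal(r, rstandard(fit))   # should be TRUE

which(abs(r) > 3)              # rule-of-thumb outlier flags (likely none in this toy data)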

Detect High-Leverage Points

Understanding Hat Matrix and Leverage

The hat matrix \(\mathbf H = \mathbf X(\mathbf X^\prime \mathbf X)^{-1}\mathbf X^\prime\) contains the “leverages” that help us identify extreme \(x\) values.

If we actually perform the matrix multiplication on the right-hand side of the equation \(\hat{\mathbf y} = \mathbf H\mathbf y\), we can see that the predicted response for observation \(i\) can be written as a linear combination of the \(n\) observed responses:

\[ \hat{y}_{i}=h_{i 1} y_{1}+h_{i 2} y_{2}+\ldots+h_{i i} y_{i}+\ldots+h_{i n} y_{n}, \quad i=1, \ldots, n \]

where the weights \(h_{i 1}, h_{i 2} , \ldots, h_{i n}\) depend only on the predictor values.

Comments:

  • The leverage, \(h_{i i}\), quantifies the influence that the observed response \(y_{i}\) has on the predicted value \(\hat{y}_{i}\).

    • If \(h_{i i}\) is small, then the observed response \(y_{i}\) plays only a small role in the value of the predicted response \(\hat{y}_{i}\).

    • If \(h_{i i}\) is large, then the observed response \(y_{i}\) plays a large role in the value of the predicted response \(\hat{y}_{i}\).

    • It is for this reason that the \(h_{i i}\) are called the “leverages”.

Properties of Leverages:

  1. The leverage \(h_{i i}\) is a measure of the distance between the \(x\) value for the \(i\)th data point and the mean of the \(x\) values for all \(n\) data points.

That is, \(h_{i i}\) quantifies how far away the \(i\)th \(x\) value is from the rest of the \(x\) values. If the \(i\)th \(x\) value is far away, the leverage \(h_{i i}\) will be large; and otherwise not.

  2. The leverage \(h_{i i}\) is between \(0\) and \(1\), inclusive.

  3. The sum of the \(h_{i i}\) equals \(p\), the number of parameters (regression coefficients including the intercept).
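These properties can be verified directly by forming the hat matrix from the design matrix; a minimal sketch, reusing the toy fit from the earlier sketches:

X <- model.matrix(fit)                  # design matrix (intercept and x)
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix H = X (X'X)^{-1} X'

all.equal(diag(H), hatvalues(fit), check.attributes = FALSE)  # leverages are the diagonal of H
range(diag(H))                          # each h_ii lies in [0, 1]
sum(diag(H))                            # equals p = 2 (intercept plus slope)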

Identify Points Whose x Values are Extreme

A common empirical rule is to flag any observation whose leverage value, \(h_{i i}\), is more than \(3\) times the mean leverage value, \(\bar{h}=\frac{\sum_{i=1}^{n} h_{i i}}{n}=\frac{p}{n}\); that is, if

\[ h_{i i}>3\left(\frac{p}{n}\right), \]

the data point is considered an extreme high-leverage point.

Comments:

  • As with many statistical “rules of thumb,” not everyone agrees about this \(3\left(\frac{p}{n}\right)\) cut-off and you may see \(2\left(\frac{p}{n}\right)\) used as a cut-off instead.

  • A refined rule-of-thumb that uses both cut-offs is to identify any observations with a leverage greater than \(3\left(\frac{p}{n}\right)\) or, failing this, any observations with a leverage that is greater than \(2\left(\frac{p}{n}\right)\) and very isolated.
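Both cut-offs are easy to apply programmatically; a minimal sketch, again reusing the toy fit from the earlier sketches:

hv <- hatvalues(fit)
p  <- length(coef(fit))
n  <- length(hv)

which(hv > 3 * p / n)   # extreme high-leverage points
which(hv > 2 * p / n)   # milder cut-off; inspect any flagged points in a scatterplot as well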

Identify Influential Points

Besides the definition formula, Cook’s Distance measure can be further expressed using the leverage \(h_{ii}\) and the residual \(e_i\):

\[ D_{i}=\frac{e_{i}^{2}}{p \times M S E}\left[\frac{h_{i i}}{\left(1-h_{i i}\right)^{2}}\right] \]

Comments:

  • The main thing to recognize is that Cook’s \(D_i\) depends on both the residual, \(e_i\), and the leverage, \(h_{ii}\).

  • Both the \(x\) value and the \(y\) value of the data point play a role in the calculation of Cook’s distance. The influence is the combination of residual effect and leverage.
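The algebraic form above can be checked against the definition by comparing it with cooks.distance(); a minimal sketch, reusing the toy fit from the earlier sketches:

e   <- residuals(fit)
h   <- hatvalues(fit)
p   <- length(coef(fit))
mse <- summary(fit)$sigma^2

D <- (e^2 / (p * mse)) * (h / (1 - h)^2)   # leverage/residual form of Cook's distance
all.equal(D, cooks.distance(fit))          # should be TRUE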

Empirical Guidelines for Identifying Influential Points

  • If \(D_i\) is greater than \(0.5\), then the \(i\)th data point is worthy of further investigation as it may be influential.

  • If \(D_i\) is greater than \(1\), then the \(i\)th data point is quite likely to be influential.

  • Or, if \(D_i\) is of a different magnitude than all of the other \(D_i\) values, it is almost certainly influential.
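These guidelines translate into one-line checks in R; a minimal sketch, reusing the toy fit from the earlier sketches:

D <- cooks.distance(fit)
which(D > 1)                      # very likely influential
which(D > 0.5)                    # worth further investigation
sort(D, decreasing = TRUE)[1:3]   # also scan for values of unusual magnitude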

Implementation in R

High-Leverage Points

# Data simulation, high-leverage only.
set.seed(1234)
x <- seq(1, 10, length.out = 20)
x <- c(x, 14)                           # add a point with an extreme x value
y <- x + rnorm(length(x))
dat <- data.frame(cbind(y, x))

fit1 <- lm(y ~ x, data = dat)           # fit using all 21 points
fit2 <- lm(y ~ x, data = dat[-21, ])    # fit excluding the added 21st point
plot(x, y, main = 'Scatterplot of Y vs x')
points(x[21], y[21], col = 2, pch = 20) # highlight the added point in red
lines(x, fitted(fit1), col = "red")
lines(x[-21], fitted(fit2), col = 4, lty = 2)

# find leverages
hatvalues(fit1) # in the range [0,1]
         1          2          3          4          5          6          7 
0.15796069 0.13767705 0.11945172 0.10328471 0.08917601 0.07712562 0.06713355 
         8          9         10         11         12         13         14 
0.05919979 0.05332434 0.04950720 0.04774838 0.04804787 0.05040567 0.05482179 
        15         16         17         18         19         20         21 
0.06129622 0.06982896 0.08042001 0.09306938 0.10777706 0.12454305 0.34820094 
# find the point with the largest leverage
hv = hatvalues(fit1)
which(hv == max(hv)) # 21st point is the extreme point with high-leverage in the simulated data
21 
21 
# sum of hat values
sum(hv) # equals p = 2, the number of regression coefficients
[1] 2
# criteria
p = 2
3*p/length(y) # max(hv) is greater than the criterion
[1] 0.2857143

Comments:

  • The leverage of the 21st (red) data point, \(0.3482\), is greater than the cut-off \(0.2857\). Therefore, that data point should be flagged as having high leverage.
# Cook's Distance
cooks.distance(fit1)
           1            2            3            4            5            6 
5.020969e-02 5.785598e-02 1.902841e-01 2.330145e-01 3.917266e-02 3.677885e-02 
           7            8            9           10           11           12 
1.774721e-03 1.560405e-03 2.068248e-03 1.055200e-02 1.460032e-03 1.650129e-02 
          13           14           15           16           17           18 
9.911403e-03 1.478707e-03 4.125882e-02 9.773923e-06 9.542605e-03 4.379723e-02 
          19           20           21 
4.722279e-02 4.694689e-01 8.060536e-03 

Comment: the Cook’s distance measure for the red data point (\(0.0081\)) is well below \(0.5\). Therefore, we would not classify the red data point as influential.

Outliers

# Data simulation, one outlier.
x <- seq(1, 10, length.out = 20)
y <- 5 * x + rnorm(length(x))
dat <- data.frame(cbind(y, x))
dat <- rbind(dat, c(40, 4))             # add an outlier: unusual y at a typical x

fit1 <- lm(y ~ x, data = dat)
fit2 <- lm(y ~ x, data = dat[-21, ])    # fit excluding the added 21st point
x <- dat$x
y <- dat$y
plot(x, y, main = 'Scatterplot of Y vs x')
points(x[21], y[21], col = 2, pch = 20) # highlight the added point in red
lines(x, fitted(fit1), col = "red")
lines(x[-21], fitted(fit2), col = 4, lty = 2)

rstandard(fit1)
           1            2            3            4            5            6 
-0.454921728 -0.412129772 -0.175214673 -0.412924258 -0.556277621 -0.075446854 
           7            8            9           10           11           12 
-0.408281670 -0.160013753 -0.340530667  0.132760880 -0.193157025 -0.222570294 
          13           14           15           16           17           18 
-0.154324623 -0.383927000 -0.259774136 -0.468100438 -0.257389628  0.005704693 
          19           20           21 
-0.010446406  0.471349384  4.276381556 
# find the point with the largest standardized/studentized residual
rs = rstandard(fit1)
# abs() shows the absolute value
which(abs(rs) == max(abs(rs)))
21 
21 
# Cook's Distance
round(cooks.distance(fit1),5)
      1       2       3       4       5       6       7       8       9      10 
0.02228 0.01510 0.00225 0.01026 0.01531 0.00023 0.00572 0.00076 0.00307 0.00044 
     11      12      13      14      15      16      17      18      19      20 
0.00095 0.00135 0.00073 0.00532 0.00292 0.01149 0.00423 0.00000 0.00001 0.02533 
     21 
0.59507 

Comments:

  • The Cook’s distance measure for the red data point (\(0.5951\)) stands out compared to the other Cook’s distance measures: it is greater than \(0.5\) but less than \(1\).

  • Therefore, based on the Cook’s distance measure alone, we would investigate further but not necessarily classify the red data point as influential. Note, however, that its standardized residual (\(4.28\)) exceeds \(3\), so it is clearly flagged as an outlier.
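For completeness, the rules of thumb from the earlier sections can be applied directly to fit1 from the chunk above (a small sketch):

p <- length(coef(fit1))
which(abs(rstandard(fit1)) > 3)             # flags observation 21 (its |r| = 4.28 exceeds 3)
which(hatvalues(fit1) > 3 * p / nrow(dat))  # x = 4 is not extreme, so no high-leverage flag is expected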

Influential Points

# Data simulation, influential point.
x <- seq(1, 10, length.out = 20)
y <- 5 * x + rnorm(length(x))
dat <- data.frame(cbind(y, x))
dat <- rbind(dat, c(13, 15))            # add a point with an extreme x and an inconsistent y

fit1 <- lm(y ~ x, data = dat)
fit2 <- lm(y ~ x, data = dat[-21, ])    # fit excluding the added 21st point
x <- dat$x
y <- dat$y
plot(x, y, main = 'Scatterplot of Y vs x')
points(x[21], y[21], col = 2, pch = 20) # highlight the added point in red
lines(x, fitted(fit1), col = "red")
lines(x[-21], fitted(fit2), col = 4, lty = 2)

# Cook's Distance
round(cooks.distance(fit1),5)
      1       2       3       4       5       6       7       8       9      10 
0.06481 0.03964 0.02034 0.01508 0.00820 0.00436 0.00204 0.00017 0.00002 0.00000 
     11      12      13      14      15      16      17      18      19      20 
0.00114 0.00173 0.00356 0.00842 0.01597 0.02992 0.02605 0.06042 0.05291 0.10313 
     21 
6.19008 

Comments:

  • In this case, the Cook’s distance for the red data point (\(6.19\)) far exceeds \(1\), and the plot above confirms that the red data point does indeed strongly influence the estimated regression function.

  • For reporting purposes in practice, it would be advisable to analyze the data twice, once with and once without the red data point, and to report the results of both analyses, as in the sketch below.
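One simple way to carry out this twice-over analysis, assuming fit1 and fit2 from the chunk above are still in the workspace, is to put the two sets of estimates side by side:

# Compare the fits with and without the influential (red) data point.
rbind(with_point    = coef(fit1),
      without_point = coef(fit2))
summary(fit1)$r.squared   # fit including the red point
summary(fit2)$r.squared   # fit excluding the red point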