Diagnostics for SLR
Graphical Tools for Residuals Checking
Diagnosis with Residual Plots
Transformations
Some Criteria for a Final Model
Log-Transforming Only the Predictor
Log-Transforming the Response (and Predictor)
Box-Cox Transformations
Additional Transformations
Using the Fitted Model
We emphasized that, in order to conduct a linear regression analysis, the four “LINE” assumptions must be satisfied. Now the question is:
How can we know whether or not the four “LINE” conditions hold?
Recall the LINE assumptions for the simple linear regression (SLR) model.
Therefore, an alternative way to describe all four assumptions is that the errors \(\epsilon_i\) are (in order of importance):
zero mean, i.e., \(\operatorname{E}(\epsilon_i) = 0\),
independent,
with equal variances \(\sigma^2\),
normally distributed.
In order to check the assumptions on the errors \(\epsilon_i\), we use the observed residuals \(e_i\) as estimates of the actual, unknown “true” error terms.
Since the \(e_i\) should reflect the properties assumed for the \(\epsilon_i\), we analyze the residuals to see whether they support the LINE assumptions for SLR.
“Residuals vs fitted values” plot. Check for the equal variance condition, linearity and outliers.
“Residuals vs predictor” plot. Check for the inadequacy of the model.
e.g., a systematic trend may indicate missing predictors (e.g., missing higher-order terms).
“Residuals vs order” plot. This is a way of detecting a particular form of non-independence of the error terms, namely serial correlation.
A “residuals vs fitted values” plot is a scatter plot of residuals on the \(y\) axis and fitted values (estimated responses) on the \(x\) axis. The plot is used to detect non-linearity, unequal error variances, and outliers.
Recall the example of the skin cancer mortality rates (\(Y\)) and latitude (\(x\)), and check the residual \(e_i = y_i - \hat y_i\) against the fitted value \(\hat y_i\). The plot is “well-behaved”.
plot(fit.skcan$fitted.values, fit.skcan$residuals, col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted")
abline(h = 0, col = "darkorange", lwd = 2)
Characteristics of a “Well-Behaved” Residual vs Fit Plot:
The residuals “bounce randomly” around the horizontal \(0\) line. This suggests that the assumption of a linear relationship is reasonable. (Note: linearity implies that the random errors have zero mean; by the contrapositive, divergence of the “band of residuals” from the zero line indicates non-linearity.)
The residuals roughly form a “horizontal band” around the \(0\) line. This suggests that the variances of the error terms are equal.
Comments:
In the plot, no single residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers. Outlier detection is also a valuable part of diagnostics, but it is outside the scope of checking the LINE assumptions.
Alternatively, check “Residuals vs predictor” plot. It is a scatter plot of residuals on the \(y\) axis and the predictor (\(x\)) values on the \(x\) axis.
plot(skcan$Lat, fit.skcan$residuals, col = "grey", pch = 20,
xlab = "Predictors", ylab = "Residuals", main = "Residuals versus Predictors")
abline(h = 0, col = "darkorange", lwd = 2)
Comments: the equivalence of the two types of plots in SLR
For an SLR model, the “residuals vs predictor” plot offers no new information beyond what is already learned from the “residuals vs fitted values” plot.
A well-behaved “residuals vs predictor” plot should have the same characteristics as a well-behaved “residuals vs fitted values” plot.
Non-Linearity
What does a non-linear regression function look like on a residual vs fit plot?
Here are some examples of “misbehaving” plots.
Comments:
The residuals are positive for small \(x\) values, negative for medium \(x\) values, and positive again for large \(x\) values. Clearly, a non-linear model, e.g., a quadratic model, would better describe the relationship between the two variables; a simulated illustration follows.
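Because the “misbehaving” plots themselves are not reproduced here, the following is a minimal illustrative sketch with simulated (not course) data: fitting a straight line to a quadratic relationship produces exactly this positive, negative, positive residual pattern.
# Illustrative simulation: a quadratic relationship fit with a straight line
set.seed(1)
x.sim <- runif(100, 0, 10)
y.sim <- 1 - 2 * x.sim + 0.3 * x.sim^2 + rnorm(100, sd = 0.5)
fit.sim <- lm(y.sim ~ x.sim)
plot(fitted(fit.sim), residuals(fit.sim), col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted (simulated)")
abline(h = 0, col = "darkorange", lwd = 2)
# The residuals form a U shape: positive for small x, negative for medium x, positive for large x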
Non-Constant Variance of Error
What does non-constant error variance look like on a residual vs fitted values plot?
Non-constant error variance shows up on a residuals vs fit (or, for SLR, residuals vs predictor) plot in any of the following ways (a simulated sketch follows this list):
The plot has a “fanning” effect. That is, the residuals are close to \(0\) for small \(x\) values and are more spread out for large \(x\) values.
The plot has a “funneling” effect. That is, the residuals are spread out for small \(x\) values and close to \(0\) for large \(x\) values.
Or, the spread of the residuals in the residuals vs fit plot varies in some complex fashion.
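As an illustrative sketch of the “fanning” effect (again with simulated data, not the course examples): the error standard deviation below grows with \(x\), so the residuals spread out as the fitted values increase.
# Illustrative simulation: error variance increasing with x ("fanning")
set.seed(2)
x.sim <- runif(100, 1, 10)
y.sim <- 1 + 2 * x.sim + rnorm(100, sd = 0.4 * x.sim)  # error SD proportional to x
fit.sim <- lm(y.sim ~ x.sim)
plot(fitted(fit.sim), residuals(fit.sim), col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted (simulated fanning)")
abline(h = 0, col = "darkorange", lwd = 2)
# Residuals are tight around 0 for small fitted values and spread out for large fitted values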
A “residuals vs order plot” is a way of detecting a particular form of non-independence of the error terms, namely serial correlation.
The plot is only appropriate if you know the order in which the data were collected. (For example, time series.)
It is a scatter plot with residuals on the \(y\) axis and the order in which the data were collected on the \(x\) axis.
If the errors were uncorrelated, we would expect a random scatter of points above and below the residual \(= 0\) line.
plot(fit.skcan$residuals, col = "grey", pch = 20, xlab = "Order", ylab = "Residuals", main = "Residuals versus Order")
abline(h = 0, col = "darkorange", lwd = 2)
An alternative approach to checking for serial correlation is to plot successive pairs of residuals:
n <- length(residuals(fit.skcan))
# x-axis: e_1, ..., e_{n-1}; y-axis: e_2, ..., e_n
plot(head(residuals(fit.skcan), n - 1), tail(residuals(fit.skcan), n - 1),
     xlab = expression(e[i]), ylab = expression(e[i+1]))
abline(h = 0, v = 0, col = "grey")
We do not see any sign of serial correlation.
What does serial correlation look like on a “residuals vs order” plot?
The plot suggests that the assumption of independent error terms is violated. Possible models will be covered in a time-series class.
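Because the corresponding figure is not reproduced here, the sketch below simulates serially correlated (AR(1)) errors for illustration only; long runs of points on one side of the zero line are the typical signature of positive serial correlation.
# Illustrative simulation: AR(1) errors plotted against their order
set.seed(3)
e.ar <- as.numeric(arima.sim(model = list(ar = 0.8), n = 100))  # strong positive serial correlation
plot(e.ar, col = "grey", pch = 20, xlab = "Order", ylab = "Residuals",
     main = "Residuals versus Order (simulated serial correlation)")
abline(h = 0, col = "darkorange", lwd = 2)
# Long runs above or below the zero line, rather than a random scatter, indicate serial correlation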
The basic idea behind any normal Q-Q probability plot:
If the data follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\), then a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles should be approximately linear.
Since we are concerned about the normality of the error terms, we create a normal Q-Q probability plot of the residuals.
If the resulting plot is approximately linear, we proceed assuming that the error terms are normally distributed.
qqnorm(residuals(fit.skcan))
qqline(residuals(fit.skcan), col = "blue")
The normal Q-Q plot of the residuals is approximately linear, supporting the condition that the error terms are normally distributed.
Comments:
We can still proceed under the assumption that the error terms are normally distributed even in the presence of one or two exceptional outliers.
What does non-normality look like on a Q-Q plot?
A heavy tail means there is a larger probability of obtaining very extreme values, so heavy-tailed distributions typically represent wild, as opposed to mild, randomness.
There are too many extreme positive and negative residuals; we say the distribution is “heavy-tailed.” An illustrative sketch follows.
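As a minimal sketch with simulated data, a sample from a heavy-tailed \(t\) distribution produces a normal Q-Q plot whose ends bend away from the reference line:
# Illustrative simulation: heavy-tailed sample on a normal Q-Q plot
set.seed(4)
z <- rt(100, df = 2)  # t distribution with 2 degrees of freedom has heavy tails
qqnorm(z, main = "Normal Q-Q Plot (simulated heavy-tailed sample)")
qqline(z, col = "blue")
# Points at both ends fall far from the line: too many extreme positive and negative values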
Transforming response and/or predictor variables has the potential to remedy a number of model problems.
Data transformation requires a “trial and error” approach. We continue this process until we have built a model that is appropriate and that we can use.
The process of model building includes model formulation, model estimation, and model evaluation.
We don’t leave the model building process until we’ve convinced ourselves that the model meets the four conditions (“LINE”) of the linear regression model.
There is often more than one viable model; what is important is that the model
is not overly complicated (We will talk more about the balance of simplicity and goodness-of-fit when we come to model comparison.)
meets the “LINE” conditions of the linear regression model, and
allows you to answer your research question of interest.
Transforming the \(x\) values is appropriate when non-linearity is the only problem.
However, it may be necessary to correct the non-linearity before you can assess the normality and equal variance assumptions.
Using transformations is part of an iterative process where all the linear regression assumptions are re-checked after each iteration.
Example:
Let’s consider the data from a memory retention experiment in which 13 subjects were asked to memorize a list of disconnected items. The subjects were then asked to recall the items at various times up to a week later.
Response \(y\) : the proportion of items correctly recalled.
Predictor \(x\): times (in minutes).
# read in the memory retention data
dat = read.table('wordrecall.txt', header = T)
head(dat)
#   time prop
# 1    1 0.84
# 2    5 0.71
# 3   15 0.61
# 4   30 0.56
# 5   60 0.54
# 6  120 0.47
names(dat)
# [1] "time" "prop"
x = dat$time  # predictor: time in minutes
y = dat$prop  # response: proportion of items recalled
fit = lm(y ~ x)
yhat = fitted(fit)
e = y - yhat
plot(x, y, xlab = 'Time', ylab = 'Proportion', ylim = c(-.05, .86), main = 'Proportion of Items vs Time', cex.lab = 1.3, cex.main = 1.5)
lines(x, fitted(fit), col = 2)
plot(yhat, e, xlab = 'Fitted Values', ylab = 'Residual', main = 'Residual vs Fit')
abline(h = 0, lty = 2)
qqnorm(e)
qqline(e)
Interpretation:
The lack of linearity dominates the plot and we have to fix the non-linearity problem before we can assess the assumption of equal variances.
In the normal Q-Q plot, the relationship is approximately linear. The condition that the error terms are normally distributed is met.
Log-Transforming: a Remedy for Non-Linearity
The natural logarithm is the most common logarithmic scale used in scientific work. The effects of the natural logarithm transformation are:
Small values that are close together are spread further out.
Large values that are spread out are brought closer together.
Let’s use the natural logarithm to transform the x values in the memory retention experiment data.
#############
#x vs log(x)#
#############
#x = seq(1, 1000, length = 500)
#plot(x, log(x), type='l')
#transform x by logarithm
lnx = log(x)
fit = lm(y ~ lnx)
yhat = fitted(fit)
e = y - yhat
plot(lnx, y, xlab = 'Logarithm of Time', ylab = 'Proportion', ylim = c(-.05, .86), main = 'Proportion of Items vs Logarithm of Time', cex.lab = 1.3, cex.main = 1.5)
lines(lnx, fitted(fit), col = 2)
plot(x, y, xlab = 'Time', ylab = 'Proportion', ylim = c(-.05, .86), main = 'Proportion of Items vs Time', cex.lab = 1.3, cex.main = 1.5)
lines(x, fitted(fit), col = 2)
plot(yhat, e, xlab = 'Fitted Values', ylab = 'Residual', main = 'Residual vs Fit')
abline(h = 0, lty = 2)
qqnorm(e, ylim = c(-.042,.05))
qqline(e)
Interpretations:
The new residual vs fit plot shows a significant improvement over the one based on the untransformed data.
The normal Q-Q probability plot of the residuals shows that transforming the x values had no effect on the normality of the error terms.
Using the Model:
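The fitted log-time model can then be used for estimation and prediction exactly as in the previous lecture. A minimal sketch (the value of 1000 minutes is chosen purely for illustration):
# predict the proportion recalled at an illustrative time of 1000 minutes
new.time <- data.frame(lnx = log(1000))
predict(fit, newdata = new.time, interval = "confidence")  # CI for the mean proportion recalled
predict(fit, newdata = new.time, interval = "prediction")  # PI for a new observation of the proportion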
Log-Transforming the Response (and Predictor)
In this case, we transform the \(y\) values; this may also help to “straighten out” a curved relationship.
For example, consider a data set on the typical birth weight and length of gestation for different kinds of mammals. Birth weight is the predictor (\(x\), in kg) and length of gestation is the response (\(y\), in days until birth).
Now let’s transform \(y\) by taking the natural logarithm of the lengths of gestation; a sketch in R follows.
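A sketch of fitting this model in R, assuming the data are stored in a file mammgest.txt with columns Birthwgt and Gestation (the file and variable names here are assumptions, not given in the notes):
# hypothetical file and variable names; adjust to match the actual data set
mammgest <- read.table('mammgest.txt', header = TRUE)
fit.ges <- lm(log(Gestation) ~ Birthwgt, data = mammgest)  # natural log of gestation length as response
plot(fit.ges$fitted.values, fit.ges$residuals, col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted")
abline(h = 0, col = "darkorange", lwd = 2)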
Comments:
Q: What if “everything” seems wrong?
A: Transform both the response \(y\) and the predictor \(x\) values.
To correct non-linearity, we transform the \(x\) values.
To correct non-normality and/or unequal variances, we transform the \(y\) values, which may also help with non-linearity.
It is often difficult to determine which transformation of \(y\) to use. The Box-Cox transformations are a family of power transformations of \(y\) (assuming \(y > 0\)):
\[g_{\lambda}(y)=\begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0, \\ \log(y), & \lambda=0, \end{cases}\]
where the parameter \(\lambda\) is determined by numerically maximizing a suitable log-likelihood function. (In case you are interested: this is a profile log-likelihood, denoted \(l_{p}(\lambda)\), and the MLE of \(\lambda\) is its maximizer. The derivation and form of \(l_{p}(\lambda)\) are beyond the scope of this course.) We use the built-in boxcox() function from the MASS package in R to find the optimal \(\lambda\), denoted \(\hat{\lambda}\).
Comments:
If some \(y_i < 0\), add a constant to all \(y\) such that they are all positive.
In practice, round off \(\hat{\lambda}\) to a more interpretable value.
An SLR model with a Box-Cox transformed response is \[g_\lambda(y_i) = \beta_0 + \beta_1 x_i + \epsilon_i.\]
Implementation in R:
library(MASS)
fit1 <- lm(dist ~ speed, data=cars)
fit1.bc = boxcox(fit1)
# 1. locate the best lambda
fit1.bc$x[which.max(fit1.bc$y)]
# [1] 0.4242424
# 2. fit new lm model with Box-Cox transformed y
fit2 <- lm(dist^0.5 ~ speed, data=cars)
# NOTE: in practice, we round lambda to a more interpretable value; here 0.42 is rounded to 0.5.
# 3. diagnostics
plot(fit2$fitted.values, fit2$residuals, xlab = "Fitted", ylab = "Residuals")
abline(h=0,col="blue")
# Note: serial correlation checking is not necessary for these data; this is just for demonstration
n = length(fit2$residuals)
plot(head(fit2$residuals, n-1), tail(fit2$residuals, n-1), xlab = expression(e[i]), ylab = expression(e[i+1]))
abline(h=0,v=0,col="grey")
qqnorm(residuals(fit2))
qqline(residuals(fit2), col = "blue")
One way to try to account for a nonlinear relationship is through a polynomial regression model. Generally, such a model for a single predictor, \(x\), is:
\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_h x_i^h + \epsilon_i\] where \(h\) is the degree of the polynomial.
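As a brief sketch using the cars data from the Box-Cox example above, a quadratic (\(h = 2\)) polynomial model can be fit by adding an I(speed^2) term while keeping the lower-order speed term (see the hierarchy principle in the note below):
# quadratic polynomial regression (h = 2) on the cars data
fit.quad <- lm(dist ~ speed + I(speed^2), data = cars)
summary(fit.quad)  # the first-order term is kept regardless of its significance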
Note:
Hierarchy principle of polynomial regression: if your model includes \(x^h\) and \(x^h\) is shown to be a statistically significant predictor of \(y\) , then your model should also include each \(x^j\) for all \(j < h\), whether or not the coefficients for these lower-order terms are significant.
Other common transformations, e.g., \(\exp(x)\) and \(1/x\) (the reciprocal transformation), can also be used; a brief syntax sketch follows this list.
Problems that cannot be solved by transforming \(y\) may be resolved by generalized linear models (GLMs), which are beyond the scope of this course.
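A syntax sketch for the reciprocal transformation mentioned above, again using the cars data purely for illustration (no claim is made that this is an appropriate model for these data):
# reciprocal transformation of the predictor, shown only to illustrate the syntax
fit.recip <- lm(dist ~ I(1/speed), data = cars)
summary(fit.recip)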
Using transformations is part of an iterative process where all four “LINE” assumptions are re-checked after each iteration.
After the transformations of \(y\) and/or \(x\), the model meets the four “LINE” conditions. Therefore, we can use the model to answer our research questions of interest, e.g., the relationship between the response and the predictor, and prediction of the mean response or of a new observation.
To answer the scientific questions and the corresponding statistical questions, we need the topics covered in the previous lecture, i.e., estimation and inference for SLR; a short sketch using the Box-Cox example above follows.
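For instance, with fit2 (the square-root-transformed cars model from the Box-Cox example), estimation and prediction proceed as in the previous lecture; a minimal sketch (speed = 15 is an illustrative value):
new.speed <- data.frame(speed = 15)
predict(fit2, newdata = new.speed, interval = "confidence")  # CI for the mean of sqrt(dist) at speed 15
predict(fit2, newdata = new.speed, interval = "prediction")  # PI for a new observation of sqrt(dist)
# note: both intervals are on the transformed (square-root) scale of the response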
Q: Why use a normal Q-Q plot of the residuals rather than a histogram or boxplot?
A: Histograms and boxplots are not suitable (or adequate) for checking normality.
In a normal Q-Q plot, we plot the sorted residuals against \(\Phi^{-1}\left(\frac{i}{n+1}\right)\) for \(i = 1, \ldots, n\).
qqline() adds a line joining the first and third quartiles; because it is based on quartiles, it is not strongly influenced by outliers. The residuals of a normal sample should follow this line approximately.
# data from a memory retention experiment
dat = read.table('wordrecall.txt', header = T)
#x: the predictor
#y: the response
x = dat$time
y = dat$prop
fit = lm(y ~ x)
Replicate the Normal Q-Q Plot Manually in R
n <- length(fit$residuals)
# 1. Theoretical and sample quantiles
TheoQ <- qnorm((1:n)/(n+1), 0, 1)
SampQ <- sort(fit$residuals)
# 2. First and third quartiles
TheoQ1Q3 <- quantile(TheoQ, probs = c(0.25, 0.75))
SampQ1Q3 <- quantile(SampQ, probs = c(0.25, 0.75))
b0 = diff(SampQ1Q3)/diff(TheoQ1Q3) # slope
a0 = SampQ1Q3[1]-b0*TheoQ1Q3[1] # intercept
# Q-Q plot with Q-Q line
plot(TheoQ, SampQ, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", main = "Normal Q-Q Plot")
abline(a = a0, b= b0, col="blue")
Compare with the plot produced by the built-in R functions:
qqnorm(residuals(fit))
qqline(residuals(fit), col = "blue")