Diagnostics for SLR
Graphical Tools for Residuals Checking
Diagnosis with Residual Plots
Transformations
Some Criteria for a Final Model
Log-Transforming Only the Predictor
Log-Transforming the Response (and Predictor)
Box-Cox Transformations
Additional Transformations
Using the Fitted Model
We emphasized that, in order to conduct a linear regression analysis, the four “LINE” assumptions must be satisfied. Now the question is:
How can we know whether or not the four “LINE” conditions hold?
Recall the LINE assumptions for the simple linear regression (SLR) model.
Therefore, an alternative way to describe all four assumptions is that the errors \(\epsilon_i\) are (in order of importance):
zero mean, i.e., \(\operatorname{E}(\epsilon_i) = 0\),
independent,
with equal variances \(\sigma^2\),
normally distributed.
In order to check the assumptions on the errors \(\epsilon_i\), we use the observed residuals \(e_i\) as estimates of the actual, unknown “true” error terms.
Since the \(e_i\) should reflect the properties assumed for the \(\epsilon_i\), we analyze the residuals to see whether they support the LINE assumptions for SLR.
“Residuals vs fitted values” plot. Check for the equal variance condition, linearity and outliers.
“Residuals vs predictor” plot. Check for the inadequacy of the model.
e.g., a systematic trend may indicate missing predictors (e.g., missing higher-order terms).
“Residuals vs order” plot. This is a way of detecting a particular form of non-independence of the error terms, namely serial correlation.
A “residuals vs fitted values” plot is a scatter plot of residuals on the \(y\) axis and fitted values (estimated responses) on the \(x\) axis. The plot is used to detect non-linearity, unequal error variances, and outliers.
Recall the example of the skin cancer mortality rates (\(Y\)) and latitude (\(x\)), and check the residual \(e_i = y_i - \hat y_i\) against the fitted value \(\hat y_i\). The plot is “well-behaved”.
plot(fit.skcan$fitted.values, fit.skcan$residuals, col = "grey", pch = 20,
xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted")
abline(h = 0, col = "darkorange", lwd = 2)
Characteristics of a “Well-Behaved” Residual vs Fit Plot:
The residuals “bounce randomly” around the horizontal \(0\) line. This suggests that the assumption of a linear relationship is reasonable. (Note: linearity implies that the random errors have zero mean; by the contrapositive, divergence of the “band of residuals” from the zero line indicates non-linearity.)
The residuals roughly form a “horizontal band” around the \(0\) line. This suggests that the variances of the error terms are equal.
Comments:
In the plot, no single residual “stands out” from the basic random pattern of residuals. This suggests that there are no outliers. Outlier detection is also a valuable part of diagnostics, but it is outside the scope of checking the LINE assumptions.
Alternatively, check “Residuals vs predictor” plot. It is a scatter plot of residuals on the \(y\) axis and the predictor (\(x\)) values on the \(x\) axis.
plot(skcan$Lat, fit.skcan$residuals, col = "grey", pch = 20,
xlab = "Predictors", ylab = "Residuals", main = "Residuals versus Predictors")
abline(h = 0, col = "darkorange", lwd = 2)
Comments: the equivalence of the two types of plots in SLR
For an SLR model, the “residuals vs predictor” plot offers no new information beyond what is already learned from the “residuals vs fitted values” plot.
A well-behaved “residuals vs predictor” plot should have the same characteristics as a well-behaved “residuals vs fitted values” plot.
Non-Linearity
What does a non-linear regression function look like on a residual vs fit plot?
Here are some examples of “misbehaving” plots.
Comments:
The residuals are positive for small \(x\) values, negative for medium \(x\) values, and positive again for large \(x\) values. Clearly, a non-linear model, e.g., a quadratic model, would better describe the relationship between the two variables; a simulated illustration follows.
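Because the “misbehaving” plots themselves are not reproduced here, the following is a minimal illustrative sketch with simulated (not course) data: fitting a straight line to a quadratic relationship produces exactly this positive, negative, positive residual pattern.
# Illustrative simulation: a quadratic relationship fit with a straight line
set.seed(1)
x.sim <- runif(100, 0, 10)
y.sim <- 1 - 2 * x.sim + 0.3 * x.sim^2 + rnorm(100, sd = 0.5)
fit.sim <- lm(y.sim ~ x.sim)
plot(fitted(fit.sim), residuals(fit.sim), col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted (simulated)")
abline(h = 0, col = "darkorange", lwd = 2)
# The residuals form a U shape: positive for small x, negative for medium x, positive for large x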
Non-Constant Variance of Error
What does non-constant error variance look like on a residual vs fitted values plot?
Non-constant error variance shows up on a residuals vs fit (or, for SLR, residuals vs predictor) plot in any of the following ways (a simulated sketch follows this list):
The plot has a “fanning” effect. That is, the residuals are close to \(0\) for small \(x\) values and are more spread out for large \(x\) values.
The plot has a “funneling” effect. That is, the residuals are spread out for small \(x\) values and close to \(0\) for large \(x\) values.
Or, the spread of the residuals in the residuals vs fit plot varies in some complex fashion.
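As an illustrative sketch of the “fanning” effect (again with simulated data, not the course examples): the error standard deviation below grows with \(x\), so the residuals spread out as the fitted values increase.
# Illustrative simulation: error variance increasing with x ("fanning")
set.seed(2)
x.sim <- runif(100, 1, 10)
y.sim <- 1 + 2 * x.sim + rnorm(100, sd = 0.4 * x.sim)  # error SD proportional to x
fit.sim <- lm(y.sim ~ x.sim)
plot(fitted(fit.sim), residuals(fit.sim), col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted (simulated fanning)")
abline(h = 0, col = "darkorange", lwd = 2)
# Residuals are tight around 0 for small fitted values and spread out for large fitted values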
A “residuals vs order plot” is a way of detecting a particular form of non-independence of the error terms, namely serial correlation.
The plot is only appropriate if you know the order in which the data were collected. (For example, time series.)
It is a scatter plot with residuals on the \(y\) axis and the order in which the data were collected on the \(x\) axis.
If the errors were uncorrelated, we would expect a random scatter of points above and below the residual \(= 0\) line.
plot(fit.skcan$residuals, col = "grey", pch = 20, xlab = "Order", ylab = "Residuals", main = "Residuals versus Order")
abline(h = 0, col = "darkorange", lwd = 2)
An alternative approach to checking for serial correlation is to plot successive pairs of residuals:
n <- length(residuals(fit.skcan))
# x-axis: e_1, ..., e_{n-1}; y-axis: e_2, ..., e_n
plot(head(residuals(fit.skcan), n - 1), tail(residuals(fit.skcan), n - 1),
     xlab = expression(e[i]), ylab = expression(e[i+1]))
abline(h = 0, v = 0, col = "grey")
We do not see any sign of serial correlation.
What does serial correlation look like on a “residuals vs order” plot?
The plot suggests that the assumption of independent error terms is violated. Possible models will be covered in a time-series class.
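Because the corresponding figure is not reproduced here, the sketch below simulates serially correlated (AR(1)) errors for illustration only; long runs of points on one side of the zero line are the typical signature of positive serial correlation.
# Illustrative simulation: AR(1) errors plotted against their order
set.seed(3)
e.ar <- as.numeric(arima.sim(model = list(ar = 0.8), n = 100))  # strong positive serial correlation
plot(e.ar, col = "grey", pch = 20, xlab = "Order", ylab = "Residuals",
     main = "Residuals versus Order (simulated serial correlation)")
abline(h = 0, col = "darkorange", lwd = 2)
# Long runs above or below the zero line, rather than a random scatter, indicate serial correlation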
The basic idea behind any normal Q-Q probability plot:
If the data follow a normal distribution with mean \(\mu\) and variance \(\sigma^2\), then a plot of the theoretical percentiles of the normal distribution versus the observed sample percentiles should be approximately linear.
Since we are concerned about the normality of the error terms, we create a normal Q-Q probability plot of the residuals.
If the resulting plot is approximately linear, we proceed assuming that the error terms are normally distributed.
qqnorm(residuals(fit.skcan))
qqline(residuals(fit.skcan), col = "blue")
The normal Q-Q plot of the residuals is approximately linear, supporting the condition that the error terms are normally distributed.
Comments:
We can still proceed under the assumption that the error terms are normally distributed even in the presence of one or two exceptional outliers.
What does non-normality look like on a Q-Q plot?
A heavy tail means there is a larger probability of obtaining very extreme values, so heavy-tailed distributions typically represent wild, as opposed to mild, randomness.
There are too many extreme positive and negative residuals; we say the distribution is “heavy-tailed.” An illustrative sketch follows.
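As a minimal sketch with simulated data, a sample from a heavy-tailed \(t\) distribution produces a normal Q-Q plot whose ends bend away from the reference line:
# Illustrative simulation: heavy-tailed sample on a normal Q-Q plot
set.seed(4)
z <- rt(100, df = 2)  # t distribution with 2 degrees of freedom has heavy tails
qqnorm(z, main = "Normal Q-Q Plot (simulated heavy-tailed sample)")
qqline(z, col = "blue")
# Points at both ends fall far from the line: too many extreme positive and negative values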
Transforming response and/or predictor variables has the potential to remedy a number of model problems.
Data transformation requires a “trial and error” approach. We continue this process until we have built a model that is appropriate and that we can use.
The process of model building includes model formulation, model estimation, and model evaluation.
We don’t leave the model building process until we’ve convinced ourselves that the model meets the four conditions (“LINE”) of the linear regression model.
There is often more than one viable model; what is important is that the model
is not overly complicated (We will talk more about the balance of simplicity and goodness-of-fit when we come to model comparison.)
meets the “LINE” conditions of the linear regression model, and
allows you to answer your research question of interest.
Transforming the \(x\) values is appropriate when non-linearity is the only problem.
However, it may be necessary to correct the non-linearity before you can assess the normality and equal variance assumptions.
Using transformations is part of an iterative process where all the linear regression assumptions are re-checked after each iteration.
Example:
Let’s consider the data from a memory retention experiment in which 13 subjects were asked to memorize a list of disconnected items. The subjects were then asked to recall the items at various times up to a week later.
Response \(y\) : the proportion of items correctly recalled.
Predictor \(x\): times (in minutes).
# read in the memory retention data
dat = read.table('wordrecall.txt', header = T)
head(dat)
#   time prop
# 1    1 0.84
# 2    5 0.71
# 3   15 0.61
# 4   30 0.56
# 5   60 0.54
# 6  120 0.47
names(dat)
# [1] "time" "prop"
x = dat$time  # predictor: time in minutes
y = dat$prop  # response: proportion of items recalled
fit = lm(y ~ x)
yhat = fitted(fit)
e = y - yhat
plot(x, y, xlab = 'Time', ylab = 'Proportion', ylim = c(-.05, .86), main = 'Proportion of Items vs Time', cex.lab = 1.3, cex.main = 1.5)
lines(x, fitted(fit), col = 2)
plot(yhat, e, xlab = 'Fitted Values', ylab = 'Residual', main = 'Residual vs Fit')
abline(h = 0, lty = 2)
qqnorm(e)
qqline(e)
Interpretation:
The lack of linearity dominates the plot and we have to fix the non-linearity problem before we can assess the assumption of equal variances.
In the normal Q-Q plot, the relationship is approximately linear. The condition that the error terms are normally distributed is met.
Log-Transforming: a Remedy for Non-Linearity
The natural logarithm is the most common logarithmic scale used in scientific work. The effects of the natural logarithm transformation are:
Small values that are close together are spread further out.
Large values that are spread out are brought closer together.
Let’s use the natural logarithm to transform the x values in the memory retention experiment data.
#############
#x vs log(x)#
#############
#x = seq(1, 1000, length = 500)
#plot(x, log(x), type='l')
#transform x by logarithm
lnx = log(x)
fit = lm(y ~ lnx)
yhat = fitted(fit)
e = y - yhat
plot(lnx, y, xlab = 'Logarithm of Time', ylab = 'Proportion', ylim = c(-.05, .86), main = 'Proportion of Items vs Logarithm of Time', cex.lab = 1.3, cex.main = 1.5)
lines(lnx, fitted(fit), col = 2)
plot(x, y, xlab = 'Time', ylab = 'Proportion', ylim = c(-.05, .86), main = 'Proportion of Items vs Time', cex.lab = 1.3, cex.main = 1.5)
lines(x, fitted(fit), col = 2)
plot(yhat, e, xlab = 'Fitted Values', ylab = 'Residual', main = 'Residual vs Fit')
abline(h = 0, lty = 2)
qqnorm(e, ylim = c(-.042,.05))
qqline(e)
Interpretations:
The new residual vs fit plot shows a significant improvement over the one based on the untransformed data.
The normal Q-Q probability plot of the residuals shows that transforming the x values had no effect on the normality of the error terms.
Using the Model:
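The fitted log-time model can then be used for estimation and prediction exactly as in the previous lecture. A minimal sketch (the value of 1000 minutes is chosen purely for illustration):
# predict the proportion recalled at an illustrative time of 1000 minutes
new.time <- data.frame(lnx = log(1000))
predict(fit, newdata = new.time, interval = "confidence")  # CI for the mean proportion recalled
predict(fit, newdata = new.time, interval = "prediction")  # PI for a new observation of the proportion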
Log-Transforming the Response (and Predictor)
In this case, we transform the \(y\) values; this may also help to “straighten out” a curved relationship.
For example, consider a data set on the typical birth weight and length of gestation for different kinds of mammals. Birth weight is the predictor (\(x\), in kg) and length of gestation is the response (\(y\), in days until birth).
Now let’s transform \(y\) by taking the natural logarithm of the lengths of gestation; a sketch in R follows.
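A sketch of fitting this model in R, assuming the data are stored in a file mammgest.txt with columns Birthwgt and Gestation (the file and variable names here are assumptions, not given in the notes):
# hypothetical file and variable names; adjust to match the actual data set
mammgest <- read.table('mammgest.txt', header = TRUE)
fit.ges <- lm(log(Gestation) ~ Birthwgt, data = mammgest)  # natural log of gestation length as response
plot(fit.ges$fitted.values, fit.ges$residuals, col = "grey", pch = 20,
     xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted")
abline(h = 0, col = "darkorange", lwd = 2)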
Comments:
Q: What if “everything” seems wrong?
A: Transform both the response \(y\) and the predictor \(x\) values.
To correct non-linearity, we transform the \(x\) values.
To correct non-normality and/or unequal variances, we transform the \(y\) values, which may also help with non-linearity.
It is often difficult to determine which transformation of \(y\) to use. The Box-Cox transformations are a family of power transformations of \(y\) (assuming \(y > 0\)):
\[g_{\lambda}(y)=\begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0, \\ \log(y), & \lambda=0, \end{cases}\]
where the parameter \(\lambda\) is determined by numerically maximizing a suitable log-likelihood function. (In case you are interested: this is a profile log-likelihood, denoted \(l_{p}(\lambda)\), and the MLE of \(\lambda\) is its maximizer. The derivation and form of \(l_{p}(\lambda)\) are beyond the scope of this course.) We use the built-in boxcox() function from the MASS package in R to find the optimal \(\lambda\), denoted \(\hat{\lambda}\).
Comments:
If some \(y_i < 0\), add a constant to all \(y\) such that they are all positive.
In practice, round off \(\hat{\lambda}\) to a more interpretable value.
An SLR model with a Box-Cox transformed response is \[g_\lambda(y_i) = \beta_0 + \beta_1 x_i + \epsilon_i.\]
Implementation in R:
library(MASS)
fit1 <- lm(dist ~ speed, data=cars)
fit1.bc = boxcox(fit1)
# 1. locate the best lambda
fit1.bc$x[which.max(fit1.bc$y)]
# [1] 0.4242424
# 2. fit new lm model with Box-Cox transformed y
fit2 <- lm(dist^0.5 ~ speed, data=cars)
# NOTE: in practice, we round lambda to a more interpretable value; here 0.42 is rounded to 0.5.
# 3. diagnostics
plot(fit2$fitted.values, fit2$residuals, xlab = "Fitted", ylab = "Residuals")
abline(h=0,col="blue")
# Note: serial correlation checking is not necessary for these data; this is just for demonstration
n = length(fit2$residuals)
plot(head(fit2$residuals, n-1), tail(fit2$residuals, n-1), xlab = expression(e[i]), ylab = expression(e[i+1]))
abline(h=0,v=0,col="grey")
qqnorm(residuals(fit2))
qqline(residuals(fit2), col = "blue")
One way to try to account for a nonlinear relationship is through a polynomial regression model. Generally, such a model for a single predictor, \(x\), is:
\[y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \ldots + \beta_h x_i^h + \epsilon_i\] where \(h\) is the degree of the polynomial.
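As a brief sketch using the cars data from the Box-Cox example above, a quadratic (\(h = 2\)) polynomial model can be fit by adding an I(speed^2) term while keeping the lower-order speed term (see the hierarchy principle in the note below):
# quadratic polynomial regression (h = 2) on the cars data
fit.quad <- lm(dist ~ speed + I(speed^2), data = cars)
summary(fit.quad)  # the first-order term is kept regardless of its significance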
Note:
Hierarchy principle of polynomial regression: if your model includes \(x^h\) and \(x^h\) is shown to be a statistically significant predictor of \(y\) , then your model should also include each \(x^j\) for all \(j < h\), whether or not the coefficients for these lower-order terms are significant.
Other common transformations, e.g., \(\exp(x)\) and \(1/x\) (the reciprocal transformation), can also be used; a brief syntax sketch follows this list.
Problems that cannot be solved by transforming \(y\) may be resolved by generalized linear models (GLMs), which are beyond the scope of this course.
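A syntax sketch for the reciprocal transformation mentioned above, again using the cars data purely for illustration (no claim is made that this is an appropriate model for these data):
# reciprocal transformation of the predictor, shown only to illustrate the syntax
fit.recip <- lm(dist ~ I(1/speed), data = cars)
summary(fit.recip)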
Using transformations is part of an iterative process where all four “LINE” assumptions are re-checked after each iteration.
After the transformations of \(y\) and/or \(x\), the model meets the four “LINE” conditions. Therefore, we can use the model to answer our research questions of interest, e.g., the relationship between the response and the predictor, and prediction of the mean response or of a new observation.
To answer the scientific questions and the corresponding statistical questions, we need the topics covered in the previous lecture, i.e., estimation and inference for SLR; a short sketch using the Box-Cox example above follows.
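For instance, with fit2 (the square-root-transformed cars model from the Box-Cox example), estimation and prediction proceed as in the previous lecture; a minimal sketch (speed = 15 is an illustrative value):
new.speed <- data.frame(speed = 15)
predict(fit2, newdata = new.speed, interval = "confidence")  # CI for the mean of sqrt(dist) at speed 15
predict(fit2, newdata = new.speed, interval = "prediction")  # PI for a new observation of sqrt(dist)
# note: both intervals are on the transformed (square-root) scale of the response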
Q: Why use a normal Q-Q plot of the residuals rather than a histogram or boxplot?
A: Histograms and boxplots are not suitable (or adequate) for checking normality.
In a normal Q-Q plot, we plot the sorted residuals against \(\Phi^{-1}\left(\frac{i}{n+1}\right)\) for \(i = 1, \ldots, n\).
qqline() adds a line joining the first and third quartiles; because it is based on quartiles, it is not strongly influenced by outliers. The residuals of a normal sample should follow this line approximately.
# data from a memory retention experiment
dat = read.table('wordrecall.txt', header = T)
#x: the predictor
#y: the response
x = dat$time
y = dat$prop
fit = lm(y ~ x)
Replicate the Normal Q-Q Plot Manually in R
n <- length(fit$residuals)
# 1. Theoretical and sample quantiles
TheoQ <- qnorm((1:n)/(n+1), 0, 1)
SampQ <- sort(fit$residuals)
# 2. First and third quartiles
TheoQ1Q3 <- quantile(TheoQ, probs = c(0.25, 0.75))
SampQ1Q3 <- quantile(SampQ, probs = c(0.25, 0.75))
b0 = diff(SampQ1Q3)/diff(TheoQ1Q3) # slope
a0 = SampQ1Q3[1]-b0*TheoQ1Q3[1] # intercept
# Q-Q plot with Q-Q line
plot(TheoQ, SampQ, xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", main = "Normal Q-Q Plot")
abline(a = a0, b= b0, col="blue")
Compare with the plot produced by the built-in R functions:
qqnorm(residuals(fit))
qqline(residuals(fit), col = "blue")