Lecture 7

Author

Jiaye Xu

Published

March 16, 2022

Lecture 7 Overview

  • Prediction

  • Diagnostics and Remedies

    • Diagnostic Tools and Transformation

  • Problems with Predictors

    • Collinearity

    • Detect Collinearity

  • Problems with Errors

    • Generalized Least Squares

    • Weighted Least Squares


Prediction

Suppose after diagnostics and model selection, you have chosen a linear model (LM) and want to make a prediction for a new set of predictor values (a.k.a. covariates) \(\mathbf x_0\). A natural prediction is the fitted value

\[\hat y_0 =\mathbf x_0^T \hat{\boldsymbol \beta}\]

To compute the standard error and construct a confidence interval for the prediction, one needs to distinguish between prediction of the future mean response and prediction of a future observation:

  • a \(100\times(1-\alpha)\%\) confidence interval for the mean response at \(\mathbf x_0\)

\[\hat y_0 \pm t_{n-p}(1-\alpha/2) \hat\sigma \sqrt{\mathbf x_0^T(\mathbf X^T \mathbf X)^{-1}\mathbf x_0}\]

  • a \(100\times(1-\alpha)\%\) confidence interval for a future response at \(\mathbf x_0\)

\[\hat y_0 \pm t_{n-p}(1-\alpha/2) \hat\sigma \sqrt{1+\mathbf x_0^T(\mathbf X^T \mathbf X)^{-1}\mathbf x_0}\]

Implementation in R

Example: fit a MLR with blood pressure (BP) as response, Age and Weight as predictors.

  Pt  BP Age Weight  BSA Dur Pulse Stress
1  1 105  47   85.4 1.75 5.1    63     33
2  2 115  49   94.2 2.10 3.8    70     14
3  3 116  49   95.3 1.98 8.2    72     10
4  4 117  50   94.7 2.01 5.8    73     99
5  5 112  51   89.4 1.89 7.0    72     95
6  6 121  48   99.5 2.25 9.3    71     10
# BP, Age, Weight are taken from the blood pressure data frame printed above (assumed attached or in the workspace)
fit1 = lm(BP ~ Age + Weight)
summary(fit1)

Call:
lm(formula = BP ~ Age + Weight)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.89968 -0.35242  0.06979  0.35528  0.82781 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -16.57937    3.00746  -5.513 3.80e-05 ***
Age           0.70825    0.05351  13.235 2.22e-10 ***
Weight        1.03296    0.03116  33.154  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5327 on 17 degrees of freedom
Multiple R-squared:  0.9914,    Adjusted R-squared:  0.9904 
F-statistic: 978.2 on 2 and 17 DF,  p-value: < 2.2e-16
# 95% CI for the mean response with Age=50, Weight=90
new = data.frame(Age=50, Weight=90)

ans1 = predict(fit1, new, se.fit = TRUE, interval = "confidence", level = 0.95, type = "response") 
ans1$fit
       fit      lwr      upr
1 111.7997 111.4053 112.1941
# 95% PI for a new observation with Age=50, Weight=90
ans2 = predict(fit1, new, se.fit = TRUE, interval = "prediction", level = 0.95, type = "response")
ans2$fit
       fit      lwr      upr
1 111.7997 110.6086 112.9908
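
To connect the predict() output with the two interval formulas above, the standard error terms can also be computed by hand from the fitted model. A minimal sketch, reusing fit1 and the same new point (Age = 50, Weight = 90):

x0 <- c(1, 50, 90)                        # includes the intercept
beta.hat <- coef(fit1)
sigma.hat <- summary(fit1)$sigma          # residual standard error
XtX.inv <- summary(fit1)$cov.unscaled     # (X^T X)^{-1}
y0.hat <- sum(x0 * beta.hat)              # point prediction x0^T beta.hat
tval <- qt(0.975, df = fit1$df.residual)  # t_{n-p}(1 - alpha/2)
q <- drop(t(x0) %*% XtX.inv %*% x0)       # x0^T (X^T X)^{-1} x0
y0.hat + c(-1, 1) * tval * sigma.hat * sqrt(q)      # CI for the mean response (matches ans1$fit)
y0.hat + c(-1, 1) * tval * sigma.hat * sqrt(1 + q)  # PI for a new observation (matches ans2$fit)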

Diagnostics and Remedies

The four conditions (“LINE”) underlying the multiple linear regression (MLR) model generalize the SLR conditions to account for the fact that we now have multiple predictors:

  • The mean of the response, \(\operatorname{E}(y_i)\), at each set of values of the predictors, \((x_{i1}, x_{i2},\ldots)\), is a Linear function.

  • The errors, \(\epsilon_i\), are Independent.

  • The errors, \(\epsilon_i\), at each set of values of the predictors, \((x_{i1}, x_{i2},\ldots)\), are Normally distributed.

  • The errors, \(\epsilon_i\), at each set of values of the predictors, \((x_{i1}, x_{i2},\ldots)\), have Equal variances (denoted \(\sigma^2\)).

Similarly to SLR, an alternative way to describe all four assumptions is that the errors, \(\epsilon_i\), are independent normal random variables with mean zero and constant variance, \(\sigma^2\).

As in SLR, we can assess whether these conditions seem to hold for an MLR model applied to a particular sample data set by looking at the residuals, \(e_i\)’s.

Diagnostic Tools for Assessing the Model Assumptions

We can use all the diagnostic tools we’ve learned to assess the MLR model assumptions; an R sketch producing these plots for the blood-pressure fit from the prediction example is given after the list below.

  • “Residuals vs fitted values” plot

    • the (vertical) average of the residuals remains close to \(0\) as we scan the plot from left to right (this affirms the “L” condition);

    • the (vertical) spread of the residuals remains approximately constant as we scan the plot from left to right (this affirms the “E” condition);

    • there are no excessively outlying points.

    • violation of any of these three may necessitate remedial action such as transforming one or more predictors and/or the response variable.

  • “Residuals vs order” Plot

    • If the observations were collected over time (or space), create a scatterplot with the residuals on the vertical axis and the time (or space) sequence on the horizontal axis, and visually check that there is no systematic non-random pattern.

    • This affirms the “I” condition.

    • Violation may suggest the need for a time-series model, which is beyond the scope of the course.

  • “Residuals vs predictor” Plot

    • the (vertical) average of the residuals remains close to \(0\) as we scan the plot from left to right (this affirms the “L” condition);

    • the (vertical) spread of the residuals remains approximately constant as we scan the plot from left to right (this affirms the “E” condition);

    • violation of either of these for at least one residual plot may suggest the need for transformations of one or more predictors and/or the response variable.

  • Normal Q-Q (Quantile-Quantile) Plot

    • check for approximate normality (the “N” condition).

  • “Residuals vs omitted predictor” plots: create a series of scatterplots with the residuals, \(e_i\), on the vertical axis and each available predictor that has been omitted from the model on the horizontal axis, and visually assess whether:

    • there are no strong linear or simple nonlinear trends in the plot;

    • violation (i.e., a strong linear or simple nonlinear trend in the plot) may indicate that the predictor in question, or a transformation of it, might be usefully added to the model;

      • it can sometimes be helpful to plot functions of the predictor variables on the horizontal axis of such a residual plot, for example an interaction term formed by multiplying one quantitative predictor by another.
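
For concreteness, here is a minimal R sketch producing the main diagnostic plots for the blood-pressure fit fit1 above (the residuals-vs-order plot is only meaningful if the rows are in collection order, which we assume here for illustration):

mf <- model.frame(fit1)                    # the data actually used in the fit
par(mfrow = c(2, 2))
plot(fitted(fit1), resid(fit1), xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")         # checks "L" and "E"
abline(h = 0, lty = 2)
plot(resid(fit1), type = "b", xlab = "Observation order", ylab = "Residuals",
     main = "Residuals vs Order")          # checks "I"
abline(h = 0, lty = 2)
plot(mf$Age, resid(fit1), xlab = "Age", ylab = "Residuals",
     main = "Residuals vs Predictor (Age)")  # checks "L" and "E" per predictor
abline(h = 0, lty = 2)
qqnorm(resid(fit1)); qqline(resid(fit1))   # checks "N"
par(mfrow = c(1, 1))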

Transformations as Remedies

Transformations of the response and/or predictors can improve the fit and correct violations of model assumptions such as non-constant error variance. We may also consider adding additional predictors that are functions of the existing predictors like quadratic or cross-product terms.

(Please review the details in lecture 4.)
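
As a minimal illustration (reusing the blood-pressure fit fit1; the particular transformations below are chosen only for demonstration, not because this data set needs them), transformations and added terms can be written directly in the model formula:

bp <- model.frame(fit1)                                       # data frame with columns BP, Age, Weight
fit_log  <- lm(log(BP) ~ Age + Weight, data = bp)             # transform the response
fit_quad <- lm(BP ~ Age + Weight + I(Weight^2), data = bp)    # add a quadratic term
fit_int  <- lm(BP ~ Age + Weight + Age:Weight, data = bp)     # add a cross-product (interaction) term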

Problems with the Predictors

Collinearity

When some predictors (a.k.a. covariates) are linear combinations of the others, then \(\mathbf X^T \mathbf X\) is singular and there is no unique LS estimate of \(\boldsymbol \beta\).

Collinearity refers to the case when \(\mathbf X^T \mathbf X\) is close to singular; it leads to imprecise and unstable estimates of \(\boldsymbol \beta\). To solve this problem:

  • Eliminate highly correlated covariates

  • If you must keep the information from all the predictors/covariates, you may need one of the shrinkage methods.

Comments:

  • Problems can arise when you have too many predictors, for example, collinearity. We will talk more about the details when we come to the topic of variable selection.

  • In the presence of collinearity, if you must keep all the predictors, you may need one of the shrinkage methods, e.g., principal components regression, ridge regression, or the lasso (these involve penalized least squares and are beyond the scope of this course).

Detect Collinearity

  • Checking the correlation matrix of the covariates may reveal pairwise collinearities.

  • Regress each covariate on all the other covariates; an \(R^2\) close to one indicates a problem. Look at the variance inflation factor (VIF) \(=1/(1-R^2)\). Rule of thumb: \(VIF>5\) is considered large, indicating a collinearity problem.

  • Define the condition number as

\[\kappa = \sqrt{\lambda_1/\lambda_p}\] where \(\lambda_1\) and \(\lambda_p\) are the largest and smallest eigenvalues of \(\mathbf X^T \mathbf X\) (\(\lambda_1 \geq \ldots \geq\lambda_p\geq0\)). Zero eigenvalues indicate exact collinearity, while the presence of some small eigenvalues indicates (multi-)collinearity. Rule of thumb: \(\kappa > 30\) is considered large, indicating a collinearity problem.

Implementation in R using the airquality data set:

air <- na.omit(airquality)  # drop rows with missing values
fit3 <- lm(Ozone ~ Solar.R + Wind + Temp, data=air)
# 1. correlation matrix
round(cor(air[,1:4]),3) 
         Ozone Solar.R   Wind   Temp
Ozone    1.000   0.348 -0.612  0.699
Solar.R  0.348   1.000 -0.127  0.294
Wind    -0.612  -0.127  1.000 -0.497
Temp     0.699   0.294 -0.497  1.000
pairs(air[,1:4]) # scatterplot matrices

# 2. VIF
library(car)
vif(fit3)
 Solar.R     Wind     Temp 
1.095253 1.329070 1.431367 
# 3. condition number kappa
x <- model.matrix(fit3)[,-1] # excluding the intercept in X
e <- eigen(t(x)%*%x) 
sqrt(e$val[1]/e$val)
[1]  1.000000  6.945275 51.639091
# The small VIFs show no sign of collinearity; kappa is inflated here mainly because
# the predictors are on very different scales (X was not centered or standardized)
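
To connect vif() with the formula VIF \(=1/(1-R^2)\), here is a minimal manual check for one covariate, together with the condition number recomputed on standardized columns of \(\mathbf X\) (which removes the effect of the predictors being on very different scales); both reuse objects defined above:

# VIF for Solar.R: regress it on the other covariates
r2 <- summary(lm(Solar.R ~ Wind + Temp, data = air))$r.squared
1/(1 - r2)                    # matches vif(fit3)["Solar.R"], about 1.10

# condition number after centering and scaling the columns of X
xs <- scale(x)                # x = model.matrix(fit3)[,-1] from above
es <- eigen(t(xs) %*% xs)$values
sqrt(es[1]/es[length(es)])    # much closer to 1 than the unscaled version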

Problems with the Errors

We have assumed that the errors \(\epsilon_i\) are independent and identically distributed (i.i.d.). Furthermore, we have also assumed that the errors are normally distributed in order to carry out the usual statistical inference. We have seen that these assumptions can often be violated and we must then consider alternatives:

  • When the errors are dependent, we can use generalized least squares (GLS).

  • When the errors are independent, but not identically distributed, we can use weighted least squares (WLS), which is a special case of GLS.

Generalized Least Squares

Ideally we have assumed that \(\operatorname{Var} (\boldsymbol \epsilon) = \sigma^2 \mathbf I\). Now let’s suppose in reality \[\operatorname{Var} (\boldsymbol \epsilon) = \sigma^2 \mathbf V\]

which allows for the following violations of the assumptions on the dispersion matrix \(\operatorname{Var} (\boldsymbol \epsilon)\):

  • correlation

  • non-constant variance (aka heteroskedasticity).

Sometimes the difficulties with correlated errors cannot be removed by changing the structural part of the model (e.g., by a logarithmic transformation of the response and predictors). In that case we must build the correlation directly into the model; that is, we transform the model equation into a new model whose errors are uncorrelated and have equal variances.

This can be achieved using the method of generalized least squares.

  • If \(\mathbf V\) is known, we can transform the correlated model into an independent model. Since \(\mathbf V\) is positive definite, we can write \(\mathbf V = \mathbf S\mathbf S^T\) (by Cholesky decomposition: assuming \(\mathbf A\) is a symmetric positive definite matrix, \(\mathbf A= \mathbf U^T \mathbf U = \mathbf L \mathbf L^T\) where \(\mathbf U\) is a unique upper triangular matrix and \(\mathbf L\) is a lower triangular matrix) and make the following transformations:

\[ \begin{align} \mathbf y^\prime & =\mathbf S^{-1}\mathbf y\\ \mathbf X^\prime & =\mathbf S^{-1}\mathbf X\\ \boldsymbol \epsilon^\prime & =\mathbf S^{-1}\boldsymbol \epsilon \end{align} \]

Multiplying both sides by \(\mathbf S^{-1}\), the original regression model is transformed into

\[\mathbf y^\prime =\mathbf X^\prime \boldsymbol \beta+ \boldsymbol \epsilon^\prime\]

where \(\mathbf X^\prime\) is \(n \times p\) of rank \(p\), \(\operatorname{E}(\boldsymbol \epsilon^\prime)=\mathbf 0\), and \(\operatorname{Var}(\boldsymbol \epsilon^\prime)=\sigma^2\mathbf S^{-1} \mathbf V \mathbf S^{-T}=\sigma^2 \mathbf I\) (uncorrelated and constant variance).

Minimizing \((\boldsymbol \epsilon^\prime)^T\boldsymbol \epsilon^\prime\) with respect to \(\boldsymbol \beta\), the least squares estimate (LSE) of \(\boldsymbol \beta\) for the transformed model is

\[ \begin{align}\hat{\boldsymbol{\beta}}^\star &= \left((\mathbf X^\prime)^T \mathbf X^\prime\right)^{-1}(\mathbf X^\prime)^T \mathbf y^\prime\\ &=\left(\mathbf X^T (\mathbf S \mathbf S^T)^{-1}\mathbf X\right)^{-1}\mathbf X^T (\mathbf S \mathbf S^T)^{-1}\mathbf y\\ &= (\mathbf X^T \mathbf V^{-1} \mathbf X)^{-1}\mathbf X^T \mathbf V^{-1} \mathbf y\end{align} \]

with expected value

\[\operatorname{E}(\hat{\boldsymbol \beta}^\star)= (\mathbf X^T \mathbf V^{-1} \mathbf X)^{-1}\mathbf X^T \mathbf V^{-1} \mathbf X \boldsymbol \beta =\boldsymbol \beta\]

and dispersion matrix \[\operatorname{Var}(\hat{\boldsymbol \beta}^\star)=\sigma^2\left((\mathbf X^\prime)^T \mathbf X^\prime\right)^{-1}=\sigma^2(\mathbf X^T \mathbf V^{-1} \mathbf X)^{-1}\]

(A small simulated-data sketch of this known-\(\mathbf V\) transformation is given after this bullet list.)

  • Often \(\mathbf V\) is unknown. The gls function in the nlme package can be used to fit a LM using GLS with an estimate of \(\mathbf V\) through restricted maximum likelihood estimation (REML).

  • Special case of GLS: when \(\mathbf V\) is a diagonal matrix, \(\hat{\boldsymbol \beta}^\star\) is called the weighted least squares (WLS) estimate.
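
Before turning to the real-data example, here is a minimal simulated-data sketch of the known-\(\mathbf V\) case: we generate correlated errors, compute the GLS estimate from the formula above, and check that it agrees with OLS applied to the transformed variables \(\mathbf S^{-1}\mathbf y\) and \(\mathbf S^{-1}\mathbf X\). The AR(1)-type \(\mathbf V\) and the true coefficients (2, 3) are arbitrary choices for illustration:

set.seed(1)
n <- 50
X <- cbind(1, runif(n))                     # design matrix with an intercept column
V <- 0.7^abs(outer(1:n, 1:n, "-"))          # a known positive definite V (AR(1)-type)
S <- t(chol(V))                             # V = S S^T (chol() returns the upper factor)
y <- X %*% c(2, 3) + S %*% rnorm(n)         # errors with Var = sigma^2 V, sigma = 1

# GLS estimate from the formula (X^T V^{-1} X)^{-1} X^T V^{-1} y
beta.gls <- solve(t(X) %*% solve(V) %*% X, t(X) %*% solve(V) %*% y)

# OLS on the transformed model y' = S^{-1} y, X' = S^{-1} X
Si <- solve(S)
yt <- drop(Si %*% y)
Xt <- Si %*% X
beta.trans <- coef(lm(yt ~ Xt - 1))         # "- 1": Xt already contains its own (transformed) intercept column
cbind(beta.gls, beta.trans)                 # the two columns agree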

Implementation in R:

Data set: globwarm. Evidence of serial correlation can be found by looking at successive residuals (shown below). We demonstrate how to model this using GLS.

globwarm data set: Northern Hemisphere temperatures and climate proxies over the last millennium.

Description: Average Northern Hemisphere Temperature from 1856-2000 and eight climate proxies from 1000-2000AD. Data can be used to predict temperatures prior to 1856.

Variables:

nhtemp: Northern hemisphere average temperature (Celsius, C).

wusa: Tree ring proxy information from the Western USA.

jasper: Tree ring proxy information from Canada.

westgreen: Ice core proxy information from west Greenland.

chesapeake: Sea shell proxy information from Chesapeake Bay.

tornetrask: Tree ring proxy information from Sweden.

urals: Tree ring proxy information from the Urals.

mongolia: Tree ring proxy information from Mongolia.

tasman: Tree ring proxy information from Tasmania.

year: Year 1000-2000AD.

data(globwarm,package="faraway")

# install.packages("nlme") # Nonlinear Mixed-Effects Models
library(nlme) # nlme::gls fits a LM using GLS; the errors are allowed to be correlated and/or have unequal variances

# correlation: an optional 'corStruct' object describing the correlation structure
# corAR1: autoregressive process of order 1
# form = ~year: a one-sided formula specifying the time covariate
glmod <- gls(nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask +
               urals + mongolia + tasman,
             correlation = corAR1(form = ~year), data = na.omit(globwarm))
summary(glmod)
Generalized least squares fit by REML
  Model: nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask +      urals + mongolia + tasman 
  Data: na.omit(globwarm) 
        AIC       BIC   logLik
  -108.2074 -76.16822 65.10371

Correlation Structure: AR(1)
 Formula: ~year 
 Parameter estimate(s):
      Phi 
0.7109922 

Coefficients:
                  Value  Std.Error   t-value p-value
(Intercept) -0.23010624 0.06702406 -3.433188  0.0008
wusa         0.06673819 0.09877211  0.675678  0.5004
jasper      -0.20244335 0.18802773 -1.076668  0.2835
westgreen   -0.00440299 0.08985321 -0.049002  0.9610
chesapeake  -0.00735289 0.07349791 -0.100042  0.9205
tornetrask   0.03835169 0.09482515  0.404446  0.6865
urals        0.24142199 0.22871028  1.055580  0.2930
mongolia     0.05694978 0.10489786  0.542907  0.5881
tasman       0.12034918 0.07456983  1.613913  0.1089

 Correlation: 
           (Intr) wusa   jasper wstgrn chespk trntrs urals  mongol
wusa       -0.517                                                 
jasper     -0.058 -0.299                                          
westgreen   0.330 -0.533  0.121                                   
chesapeake  0.090 -0.314  0.230  0.147                            
tornetrask -0.430  0.499 -0.197 -0.328 -0.441                     
urals      -0.110 -0.142 -0.265  0.075 -0.064 -0.346              
mongolia    0.459 -0.437 -0.205  0.217  0.449 -0.343 -0.371       
tasman      0.037 -0.322  0.065  0.134  0.116 -0.434  0.416 -0.017

Standardized residuals:
        Min          Q1         Med          Q3         Max 
-2.31122523 -0.53484054  0.02342908  0.50015642  2.97224724 

Residual standard error: 0.204572 
Degrees of freedom: 145 total; 136 residual

Comments:

  • The lm function drops cases with missing values (which is what we want), but gls complains, so we have removed these cases using the na.omit function.

  • In data collected over time, successive errors may be correlated. By computing the correlation between the residual vector with its first element dropped and the residual vector with its last element dropped (i.e., the correlation between successive residuals), we see a correlation of about \(0.58\); the same pattern is visible in the plot of successive pairs of residuals below.

lmod <- lm(nhtemp ~ wusa + jasper + westgreen + chesapeake + tornetrask + urals + mongolia + tasman, globwarm)
summary(lmod)

Call:
lm(formula = nhtemp ~ wusa + jasper + westgreen + chesapeake + 
    tornetrask + urals + mongolia + tasman, data = globwarm)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43668 -0.11170  0.00698  0.10176  0.65352 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.242555   0.027011  -8.980 1.97e-15 ***
wusa         0.077384   0.042927   1.803 0.073647 .  
jasper      -0.228795   0.078107  -2.929 0.003986 ** 
westgreen    0.009584   0.041840   0.229 0.819168    
chesapeake  -0.032112   0.034052  -0.943 0.347346    
tornetrask   0.092668   0.045053   2.057 0.041611 *  
urals        0.185369   0.091428   2.027 0.044567 *  
mongolia     0.041973   0.045794   0.917 0.360996    
tasman       0.115453   0.030111   3.834 0.000192 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1758 on 136 degrees of freedom
  (856 observations deleted due to missingness)
Multiple R-squared:  0.4764,    Adjusted R-squared:  0.4456 
F-statistic: 15.47 on 8 and 136 DF,  p-value: 5.028e-16
cor(residuals(lmod)[-1],residuals(lmod)[-length(residuals(lmod))])
[1] 0.583339
# visual checking successive pairs of residuals:
n <- length(residuals(lmod))
plot(head(residuals(lmod),n-1), tail(residuals(lmod),n-1), xlab=expression(e[i]),ylab=expression(e[i+1])) # x-axis: e_1,...,e_{n-1}. y-axis: e_2,...,e_n.
abline(h=0,v=0,col="grey")

  • The simplest way to model this is the autoregressive form:

\[\epsilon_{i+1} =\phi\epsilon_i+\delta_i\]

where \(\delta_i\sim \mathcal{N}(0, \tau^2)\).

  • We see that the estimated value of \(\phi\) is \(0.71\) in the output of summary(glmod). It is not surprising to see strong autocorrelation in this example because the proxies can only partially predict the temperature and we would naturally expect some carry-over effect from one year to the next. Also, we can check the confidence intervals for this term:
intervals(glmod,which="var-cov") # "var-cov" for the variance-covariance parameters, "coef" for the linear model coefficients 
Approximate 95% confidence intervals

 Correlation structure:
       lower      est.     upper
Phi 0.509979 0.7109922 0.8383734

 Residual standard error:
    lower      est.     upper 
0.1540717 0.2045720 0.2716249 

We see from the interval, \((0.51,0.84)\), that this term is clearly different from zero and that there is significant positive correlation. For this example, we might investigate whether a more sophisticated model should be applied to the errors, perhaps using an ARMA model (beyond the scope of this course), which can be implemented using the corARMA function.

  • Issues with the predictors: We notice that the residual standard error (RSE) \(\hat\sigma\) is a little larger in the GLS case, as might be expected. The standard errors of \(\hat\beta\) are much larger, and none of the predictors is statistically significant in the GLS output. However, there is substantial collinearity between the predictors, so this should not be interpreted as “no predictor effect” (the standard errors from the two fits are compared directly in the sketch after this list).

  • The correlation between the predictors and the correlation between the errors are different phenomena, and there is no necessary link between the two.
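
To see this directly, the estimated coefficient standard errors from the two fits can be put side by side (a minimal check; vcov() has methods for both lm and gls objects):

# standard errors of the coefficient estimates under OLS and GLS
cbind(OLS = sqrt(diag(vcov(lmod))), GLS = sqrt(diag(vcov(glmod))))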

Weighted Least Squares

  • If the problem is solely one of non-constant error variance (i.e., with no suggestion of other violations such as non-linearity or dependent errors), then the use of weighted least squares (WLS) may be appropriate.

  • When observations are independent with unequal variances, \(\mathbf V\) is diagonal, \(\mathbf V= \operatorname{diag}(\sigma^2_1, \ldots, \sigma^2_n)\).

  • Define the reciprocal of each variance, \(\sigma^2_i\), as the weight, \(w_i =1/\sigma^2_i\), and let the weight matrix be \(\mathbf W= \operatorname{diag}(w_1, \ldots, w_n)\). Then \(\mathbf V\) can be rewritten as

\[\mathbf V= \operatorname{diag}(1/w_1, \ldots, 1/w_n)\]

That is, in the transformed model we regress \(\sqrt{w_i}y_i\) on \(\sqrt{w_i}x_i\) (with the column of ones in the \(\mathbf X\)-matrix replaced by \(\sqrt{w_i}\)). Note that when weights are used, the residuals are modified to \(\sqrt{w_i}\hat\epsilon_i\).

Then the GLS criterion can be written as

\[(\mathbf y-\mathbf X \boldsymbol \beta)^T \mathbf V^{-1} (\mathbf y-\mathbf X \boldsymbol \beta) = \sum_{i=1}^n w_i(y_i-\sum_{j=0}^{p-1}\beta_j x_{ij})^2\]

which is called weighted least squares (WLS) with weights \(w_i\); a short sketch verifying this equivalence in R is given below.
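
Here is a minimal simulated-data sketch verifying the equivalence: lm() with a weights argument gives the same coefficient estimates as OLS applied to the \(\sqrt{w_i}\)-transformed response and predictors (the data-generating choices below are arbitrary and for illustration only):

set.seed(2)
n <- 40
x.sim <- runif(n, 1, 10)
y.sim <- 1 + 2*x.sim + rnorm(n, sd = sqrt(x.sim))   # Var(eps_i) proportional to x_i
w.sim <- 1/x.sim                                    # weights w_i = 1/sigma_i^2 (up to a constant)

fit.wls   <- lm(y.sim ~ x.sim, weights = w.sim)
# equivalent transformed OLS: sqrt(w)*y regressed on sqrt(w)*1 and sqrt(w)*x, with no intercept
fit.trans <- lm(I(sqrt(w.sim)*y.sim) ~ 0 + sqrt(w.sim) + I(sqrt(w.sim)*x.sim))
cbind(coef(fit.wls), coef(fit.trans))               # identical coefficient estimates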

Choosing suitable values for the weights:

  • Observations with large variance get small weights and those with small variance get large weights (remember \(w_i =1/\sigma^2_i\)). Here are some examples:

    • Error variance proportional to a predictor: \(\operatorname{var}(\epsilon_i) \propto x_i\) (thus \(\operatorname{var}(y_i) \propto x_i\)) suggests \(w_i = x^{-1}_i\). One might choose this option after observing a positive relationship in a plot of \(|e_i|\) against \(x_i\).

    • When the \(y_i\)’s are averages of \(n_i\) observations, the variance of the response is inversely proportional to the group size, that is, \(\operatorname{var}(y_i)=\frac{1}{n_i}\sigma^2\), which suggests the weights \(w_i= n_i\). (Beyond the scope of this course.)

  • In practice, \(\mathbf W\) is usually unknown:

    • Perform an ordinary least squares (OLS) regression first. Provided the regression function is appropriate, the \(i\)th squared residual \(e_i^2\) from the OLS fit is an estimate of \(\sigma_i^2\).

    • However, the residuals are much too variable to be used directly as the estimates for the weights, so instead, we use \(e_i^2\) to estimate a variance function, and then use this variance function to estimate the weights. Some possible variance function estimates include:

      • If a “residual vs predictor” plot exhibits a fan (or funnel) shape, then regress the absolute values of the residuals \(|e_i|\)’s against that predictor. The resulting fitted values of this regression are estimates of \(\sigma_i\). (And remember \(w_i = 1/\sigma_i^2\)).

      • If a “residual vs fitted value” plot exhibits a fan (or funnel) shape, then regress the absolute values of the residuals \(|e_i|\)’s against the fitted values. The resulting fitted values of this regression are estimates of \(\sigma_i\).

Comments:

  • Weights can be specified via the weights argument of the lm function.

Implementation in R with the mtcars data set

Please note: these two examples are for educational purposes only, although you may need them as building blocks in real data analyses. In fact, OLS does its job for this data set; we do not need WLS here.

Example 1: use the reciprocal of a predictor as weights.

# mtcars variables: mpg = miles per gallon, wt = weight of cars, hp = horse power
w_model2 <- lm(mpg ~ wt + hp, data = mtcars, weights = 1/wt)

Example 2: when weights are unknown.

w_model <- lm(mpg ~ wt + hp, data = mtcars)
# 1. If "residual vs fitted value" plot exhibits a fan (or funnel) shape.
## (1) plot
plot(w_model$fitted.values, w_model$residual, col = "grey", pch = 20, xlab = "Fitted", ylab = "Residuals", main = "Residuals versus Fitted")
abline(h = 0, col = "darkorange", lwd = 2)

## (2) regress absolute value of residuals against fitted values, and use the resulting fitted values of the regression as the estimate of sigma
wts <- 1/fitted(lm(abs(resid(w_model)) ~ fitted(w_model)))^2  # wts = 1/sigma_i^2
w_model3 <- lm(mpg ~ wt + hp, data = mtcars, weights = wts)   # weights w_i = 1/sigma_i^2 = wts
# 2. If "residual vs predictor" plot exhibits a fan (or funnel) shape
## (1) plot
plot(mtcars$wt, w_model$residual, col = "grey", pch = 20, xlab = "Predictor wt", ylab = "Residuals", main = "Residuals versus Predictor")
abline(h = 0, col = "darkorange", lwd = 2)

## (2) regress absolute value of residuals against the predictor, and use the resulting fitted values of the regression as the estimate of sigma
wts <- 1/fitted(lm(abs(resid(w_model)) ~ mtcars$wt))^2        # wts = 1/sigma_i^2
w_model3 <- lm(mpg ~ wt + hp, data = mtcars, weights = wts)   # weights w_i = 1/sigma_i^2 = wts
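
As noted above, when weights are used the residuals should be examined on the weighted scale, \(\sqrt{w_i}\,e_i\). A minimal follow-up check on the fan shape, using the last w_model3 and its weights from Example 2:

# residuals on the weighted scale: sqrt(w_i) * e_i
plot(fitted(w_model3), sqrt(weights(w_model3)) * resid(w_model3),
     col = "grey", pch = 20, xlab = "Fitted", ylab = "Weighted residuals",
     main = "Weighted residuals versus Fitted")
abline(h = 0, col = "darkorange", lwd = 2)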