install.packages("alr4")
library(alr4)
# you may also need to install (or update) package "Rcpp" to load "alr4" properlyLecture 1
Lecture 1 Overview
- Introduction to R
- Review of Principles of Simple Linear Regression (SLR)
  - Uses of Regression
  - Goal of Regression
  - Relationship Types
  - Data Types
  - SLR Model
  - Estimation
    - Math Formulae of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
    - Interpretation of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
    - Understanding the Least Squares Estimate (LSE)
    - Properties of LSE (Gauss-Markov Theorem)
R package “alr4” and lm()
To install the R package alr4, we can use the following code
install.packages("alr4")
library(alr4)
# you may also need to install (or update) package "Rcpp" to load "alr4" properly
Built-in R function: lm()
As the most commonly used R function in this course, lm() is used to fit linear models. To check the help page of this function, we can use the code help(lm) or ?lm.
?lm
# equivalently
help(lm)
Relationship Types
Deterministic (or functional) relationships, e.g., the relationship between degrees Fahrenheit and degrees Celsius is known to be:
\[\text{Fahrenheit} = \frac{9}{5} \text{Celsius} + 32\]
plot(x = 1:10, y = (1:10)*9/5 + 32, type = "o", xlab = "Celsius", ylab = "Fahrenheit")
In this course, we are not interested in such perfect, deterministic relationships.
Statistical relationships
skcan = read.table("skincancer.txt", header = TRUE)
plot(x = skcan$Lat, y = skcan$Mort, xlab = "Latitude (at center of state)", ylab = "Mortality (Deaths per 10 million)", main = "Skin cancer mortality vs. Latitude")
abline(lm(Mort~ Lat, data = skcan))
Uses of Regression
There are many areas of human endeavor in which we would like to learn and model, from relevant but noisy data, an unknown functional relationship between a variable \(X\) (or variables) and a variable \(Y\), the values of which we think of as dependent, in some sense, on those of \(X\).
The study of how best to do this, including which mathematical and statistical methods and algorithms to use, is the subject of Regression.
Explanation and insight:
Modeling the relationship between an input/inputs and an outcome, given observed, sampled data, in order to gain deeper understanding into that relationship.
For example: What is the functional relationship between the stopping distance of a car (that is, the safe stopping distance, without the driver’s loss of control) and the car’s speed?
Prediction:
Given a new input value, not previously sampled, estimate the corresponding outcome/output value using the trained regression model.
For example, given one’s CET-6 score, can TOEFL (or IELTS) scores be predicted?
History of Regression
The mathematicians Legendre (1805) and Gauss (1809) were the first known to use statistical regression, that is, the method of least squares, to find the best linear fit to a finite set of data points; they applied the method to analyze and predict planetary motion. Gauss also developed the formula for the normal (or Gaussian) distribution, which he used to describe the behavior of the errors and which still plays a central role in modeling errors in (linear) regression.
Techniques for Linear Regression can rightly be viewed as Artificial Intelligence/Machine Learning methods and indeed as, historically speaking, perhaps the original versions of the types of Machine Learning algorithms so widely used today.
Goal of Regression Analysis
- Estimation: model the relationship between a predictor/predictors (\(x\)) and a response (\(y\)) using an observed data set.
- Prediction: predict new outcomes for a new set of inputs using the fitted model.
Data Types
- numeric (quantitative):
- discrete: number of students, count data, number of pregnancies
- continuous: age, height
- categorical (qualitative):
- nominal: hair color, smoking status, registration status, pregnancy
- ordinal: degree of illness, ranks in a musical instrument performance
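As a quick illustration, here is a minimal R sketch of how these data types are typically represented (all values below are made up for illustration only):
n_kids = c(0L, 2L, 1L)                            # numeric, discrete (count data)
age = c(34.5, 41.2, 28.9)                         # numeric, continuous
hair = factor(c("black", "brown", "black"))       # categorical, nominal
illness = factor(c("mild", "severe", "moderate"),
                 levels = c("mild", "moderate", "severe"),
                 ordered = TRUE)                  # categorical, ordinal (levels carry an order)
illness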
Simple Linear Regression
Simple linear regression (SLR) is a statistical method that allows us to summarize and study relationships between two variables:
One variable, denoted \(x\), is regarded as the predictor, explanatory, or independent variable, which can be of any type.
The other variable, denoted \(Y\), is regarded as the response, outcome, or dependent variable, which is continuous.
Simple linear regression is “simple”, because it concerns the study of only one predictor variable.
The SLR model
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,\]
where
\(\epsilon_i\)’s are independent random errors with \(\operatorname{E}(\epsilon_i) = 0, \operatorname{Var}(\epsilon_i) =\sigma^2\).
\((x_i, y_i)\) are observed in data.
\(\beta_0, \beta_1\) and \(\sigma^2\) are unknown parameters.
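To make the model concrete, here is a minimal simulation sketch: we pick arbitrary, purely illustrative values for \(\beta_0\), \(\beta_1\), and \(\sigma\), generate independent errors, and form the responses according to the model:
set.seed(1)                           # for reproducibility
n = 50
beta0 = 2; beta1 = 0.5; sigma = 1     # illustrative "true" parameter values
x = runif(n, 0, 10)                   # predictor values
eps = rnorm(n, mean = 0, sd = sigma)  # independent errors with E = 0, Var = sigma^2
y = beta0 + beta1*x + eps             # responses generated by the SLR model
plot(x, y)
abline(a = beta0, b = beta1)          # population regression line E(y) = beta0 + beta1*x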
Population Regression Line
\[\operatorname{E}(y_i) = \beta_0 + \beta_1 x_i\]
Sample Regression Line
If, as is realistic in practice, we only have the information in a sample, then we can estimate the “population regression line” by a “sample regression line”.
\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\] where
\(y_i\) denotes the observed response for experimental unit \(i\)
\(x_i\) denotes the predictor value for experimental unit \(i\)
\(\hat{y}_i\) is the predicted response (or fitted value) for experimental unit \(i\)
An experimental unit is an object or person on which the measurement is made
Find the “Best” Fitting Line
students = read.table("student_height_weight.txt", header = TRUE)
head(students)
  ht  wt
1 63 127
2 64 121
3 66 142
4 69 157
5 69 162
6 71 156
names(students)
[1] "ht" "wt"
weight = students$wt
height = students$ht
yweight1 = -331.2 + 7.1*height # fit1
yweight2 = -266.5 + 6.13*height # fit2
plot(x = height, y = weight)
abline(a = -331.2, b = 7.1, col = "red") # fit1
abline(a = -266.5, b = 6.13, col = "blue") # fit2
# # equivalently
# plot(x = height, y = weight)
# lines(height, yweight1, col = "red")
# lines(height, yweight2, col = "blue")
Q: Which line is the best fitting line?
A: It is hard to decide between them by visual inspection alone; we need statistical tools to make a principled decision.
Prediction Error and Least Squares Criterion
In general, when we use \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\) to predict the actual response \(y_i\), we make a prediction error (or residual) of size
\[e_i = y_i - \hat{y_i}\]
\(e_i\) is called the prediction error for data point \(i\).
A line that fits the data “best” will be one for which the \(n\) prediction errors are as small as possible in some overall sense.
One way to achieve this goal is to invoke the “least squares criterion,” which says to “minimize the sum of the squared prediction errors (or residuals).” That is, we need to find the values \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize
\[Q = \sum_{i =1}^{n}(y_i - \hat{y}_i)^2= \sum_{i =1}^{n}e_i^2\]
to find the best line of all possible lines.
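Before deriving the closed-form solution below, note that \(Q\) can also be minimized numerically. Here is a minimal sketch using R’s built-in optim() on the height-weight data above; the starting values are arbitrary, and the numerical minimizer should agree (up to convergence tolerance) with the formulae derived next:
# Q as a function of a candidate coefficient vector b = (b0, b1)
Q = function(b) sum((weight - (b[1] + b[2]*height))^2)
optim(c(0, 0), Q, method = "BFGS")$par  # numerical minimizer of Q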
Math Formulae of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
By taking derivatives of the least squares criterion \[Q = \sum_{i =1}^{n}\left[y_i - (\beta_0 + \beta_1 x_i)\right]^2\] with respect to \(\beta_0\) and \(\beta_1\) respectively, and setting each derivative to zero, we obtain
\[\sum_{i =1}^{n}(y_i -\beta_0 - \beta_1 x_i)=0\]
\[\sum_{i =1}^{n}x_i(y_i -\beta_0 - \beta_1 x_i)=0\] Solving the equations above, we obtain
\[\hat{\beta}_1 = \frac{\sum_{i =1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i =1}^{n}(x_i-\bar{x})^2} = \frac{S_{xy}}{S_{xx}}\]
\[\hat{\beta}_0 = \bar{y}-\hat{\beta}_1\bar{x}\]
Implement the Formulae in R
xbar = mean(height)
ybar = mean(weight)
Sxx = sum((height - xbar)^2)
Sxy = sum((height - xbar)*(weight - ybar))
b1 = Sxy/Sxx
b1 # beta1_hat
[1] 6.137581
## intercept estimate, beta0_hat
b0 = ybar - b1*xbar
b0
[1] -266.5344
We can also use the built-in function lm() to find the best fitting line and coefficient estimates.
fit = lm(weight~height, data = students)
coef(fit)
(Intercept)      height
-266.534395    6.137581
Now let’s look back at the two fitted lines: “fit2” (the blue line) is the best fitting line. Convince yourself by comparing the \(\hat{\beta}_0\) and \(\hat{\beta}_1\) calculated in the R chunk above with the coefficients of the two given lines. If we calculate \(Q\) for each line,
sum((weight-yweight1)^2) # fit1
[1] 766.51
sum((weight-yweight2)^2) # fit2
[1] 599.8047
we see that “fit2” attains the much smaller sum of squared prediction errors.
Interpretation of \(\hat{\beta}_0\) and \(\hat{\beta}_1\)
- \(\hat{\beta}_0\) is the predicted response value when \(x_i = 0\).
- In the example of 10 students’ height and weight, \(\hat{\beta}_0\) tells us that a person who is 0 inches tall is predicted to weigh -267 pounds, which is not meaningful.
- This happened because we “extrapolated” beyond the “scope of the model” (the range of the \(x\) values).
- \(\hat{\beta}_1\) is the estimate of the change in mean response value \(\operatorname{E}(y)\) for every additional one-unit increase in the predictor \(x\).
- In the example of 10 students’ height and weight, \(\hat{\beta}_1\) tells us that we predict the mean weight to increase by 6.14 pounds for every additional one-inch increase in height.
- In general, we estimate that the mean response changes by \(\hat{\beta}_1\) units for every one-unit increase in the predictor \(x\); see the numerical check below.
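As a quick numerical check of the slope interpretation, using the fit object from the lm() call above, the difference between predicted weights at heights one inch apart equals \(\hat{\beta}_1\):
# predicted weights at heights 66 and 67 inches (within the scope of the model)
p = predict(fit, newdata = data.frame(height = c(66, 67)))
diff(p) # equals beta1_hat, about 6.14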
Understanding the Least Squares Estimate (LSE)
- Because the formulas for \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are derived using the least squares criterion, the resulting equation
\[\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\]
is often referred to as the least squares regression line, or simply the least squares line.
Note that in deriving the above formulas, we made no distributional assumption about the data other than that they follow some sort of linear trend.
The least squares line always passes through the point \((\bar{x}, \bar{y})\); see the check below.
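This property is easy to verify numerically with the estimates computed above:
b0 + b1*xbar # fitted value at x = xbar ...
ybar         # ... equals ybar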
Properties of LSE (Gauss-Markov Theorem)
Under the assumptions of the SLR model (errors with mean zero, common variance \(\sigma^2\), and uncorrelated with one another), the LS estimate (in SLR, \(\hat{\beta}_0\) and \(\hat{\beta}_1\)) is the unique best linear unbiased estimator (BLUE), in the sense that it has the minimum variance in the class of linear unbiased estimators.
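Here is a small simulation sketch that illustrates (but does not prove) the theorem. We compare the LS slope with another linear unbiased estimator of \(\beta_1\), the “endpoint” estimator \((y_n - y_1)/(x_n - x_1)\); the parameter values below are arbitrary. Both estimators average out to \(\beta_1\) (unbiased), but the LS slope has the smaller variance:
set.seed(2)
n = 30
x = seq(1, 10, length.out = n)   # fixed design points
beta0 = 1; beta1 = 2; sigma = 1  # arbitrary "true" values
R = 5000                         # number of simulated data sets
ls_slope = endpoint = numeric(R)
for (r in 1:R) {
  y = beta0 + beta1*x + rnorm(n, 0, sigma)
  ls_slope[r] = sum((x - mean(x))*(y - mean(y))) / sum((x - mean(x))^2) # LSE of beta1
  endpoint[r] = (y[n] - y[1]) / (x[n] - x[1])  # another linear unbiased estimator
}
c(mean(ls_slope), mean(endpoint)) # both close to beta1 = 2 (unbiased)
c(var(ls_slope), var(endpoint))   # LS slope has the smaller variance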