In Lecture 1, we introduced two data types: numeric and categorical. Most of the data we have encountered in this course so far have been numeric (e.g., people’s blood pressure, car speed data).
Categorical variables are those that are non-numeric or, if numeric, take on values that are essentially arbitrary. For example, a variable taking on the values “T” or “F” for “True” or “False” is categorical, and it would remain categorical – not numeric – even if it took on the values “1” and “0” in place of “T” and “F”.
A categorical variable that has been given numerical values is said to be coded as a “dummy variable”. In R, the “factor” variable type handles categorical variables as dummy variables.
Let’s look at a scientific question that involves a categorical variable as a predictor in MLR:
Q: Is a baby’s birth weight related to the mother’s smoking during pregnancy?
To answer this question, a data set was collected on a random sample of \(n=32\) births:
Potential predictor (\(x_1\)): Smoking status of mother (yes or no)
Potential predictor (\(x_2\)): length of gestation (Gest) in weeks
Response (\(y\)): birth weight (Weight) in grams of baby
Note that smoking is a qualitative predictor (categorical variable). It is a “binary variable” with only two values (yes or no). The other predictor variable (Gest) is quantitative (numeric).
dat = read.table('birthsmokers.txt', header = T)
head(dat)
colnames(dat)[1] = 'Weight'
dat$Smoke = as.factor(dat$Smoke)  # convert character to factor (categorical variable)
pairs(dat)  # factor columns are converted to numeric
Interpretation:
The scatter plot matrix suggests that there is a positive linear relationship between length of gestation and birth weight.
It is hard to see if any kind of relationship exists between birth weight and smoking status, or between length of gestation and smoking status.
We can also plot birth weight against length of gestation, using different colors for the smoking status:
plot(dat$Gest, dat$Weight, type = 'n', xlab = 'Length of Gestation', ylab = 'Birth Weight', main = 'Birth Weight vs Length of Gestation')  # type = "n" for no plotting
points(dat$Gest[dat$Smoke == 'yes'], dat$Weight[dat$Smoke == 'yes'], col = 'red')
points(dat$Gest[dat$Smoke == 'no'], dat$Weight[dat$Smoke == 'no'], col = 'blue')
legend('topleft', bty = 'n', col = c('blue', 'red'), legend = c('Non-Smoker', 'Smoker'), pch = 1)  # bty = "n", no box around legend
Recall that the scientific question remains – after taking into account length of gestation, is there a significant difference in the average birth weights of babies born to smoking and non-smoking mothers?
A first-order model with one binary predictor and one quantitative predictor (a parallel model) that helps us answer the question is:
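With \(x_{i1}\) denoting the length of gestation and \(x_{i2}\) the smoking indicator defined below, the model is:

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i\]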
Here \(x_{i2}\) is a binary variable coded as \(1\) if the baby’s mother smoked during pregnancy and \(0\) if she did not, and the independent error terms \(\epsilon_i\) follow a normal distribution with mean \(0\) and equal variance \(\sigma^2\).
Notice that in order to include a categorical variable (as a qualitative predictor) in a regression model, we have to “code” the variable, that is, assign a unique number to each of the possible categories.
A common coding scheme is “zero-one-indicator variable.” Using such a variable here, we code the binary predictor smoking as:
\(x_{i2} = 1\), if mother \(i\) smokes
\(x_{i2} = 0\), if mother \(i\) does not smoke
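Under this coding, the model describes two parallel mean-response lines, one per group:

\[\operatorname{E}(y_i) = \beta_0 + \beta_1 x_{i1} \ \ (\text{non-smokers}), \qquad \operatorname{E}(y_i) = (\beta_0 + \beta_2) + \beta_1 x_{i1} \ \ (\text{smokers})\]

so \(\beta_2\) is the vertical distance between the two lines.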
Comments:
In doing so, we use the tradition of assigning the value of \(1\) to those having the characteristic of interest and \(0\) to those not having the characteristic.
Incidentally, other terms sometimes used instead of “zero-one-indicator variable” are “dummy variable” or “binary variable”.
fit = lm(Weight ~ Gest + Smoke, data = dat)
betas = coef(fit)
yhat1 = betas[1] + betas[2]*(dat$Gest[dat$Smoke == 'yes']) + betas[3]
yhat2 = betas[1] + betas[2]*(dat$Gest[dat$Smoke == 'no'])
plot(dat$Gest, dat$Weight, type = 'n', ylim = c(2370, 3621), xlab = 'Length of Gestation', ylab = 'Birth Weight', main = 'Birth Weight vs Length of Gestation')
points(dat$Gest[dat$Smoke == 'yes'], dat$Weight[dat$Smoke == 'yes'], col = 'red')
points(dat$Gest[dat$Smoke == 'no'], dat$Weight[dat$Smoke == 'no'], col = 'blue')
lines(sort(dat$Gest[dat$Smoke == 'yes']), yhat1[order(dat$Gest[dat$Smoke == 'yes'])], col = 'red')
lines(sort(dat$Gest[dat$Smoke == 'no']), yhat2[order(dat$Gest[dat$Smoke == 'no'])], col = 'blue')
legend('topleft', bty = 'n', col = c('blue', 'red'), c('Non-Smoker', 'Smoker'), lty = c(1, 1))
Interpretations:
The blue circles represent the data on non-smoking mothers (\(x_2 = 0\)), while the red circles represent the data on smoking mothers (\(x_2 = 1\)).
The blue line represents the estimated linear relationship between length of gestation and birth weight for non-smoking mothers, while the red line represents the estimated linear relationship for smoking mothers.
summary(fit)
Call:
lm(formula = Weight ~ Gest + Smoke, data = dat)
Residuals:
Min 1Q Median 3Q Max
-223.693 -92.063 -9.365 79.663 197.507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2389.573 349.206 -6.843 1.63e-07 ***
Gest 143.100 9.128 15.677 1.07e-15 ***
Smokeyes -244.544 41.982 -5.825 2.58e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 115.5 on 29 degrees of freedom
Multiple R-squared: 0.8964, Adjusted R-squared: 0.8892
F-statistic: 125.4 on 2 and 29 DF, p-value: 5.289e-15
The estimated regression equation is \[\text{Weight} = -2389.6 + 143.10 \text{Gest} - 244.5 \text{Smoke}\]
Therefore, the estimated regression equation for non-smoking mothers (smoke = 0) is: \[\text{Weight} = -2389.6 + 143.10 \text{Gest} \] and the estimated regression equation for smoking mothers (smoke = 1) is: \[\text{Weight} = -2634.1 + 143.10 \text{Gest}\] That is, we obtain two different parallel estimated lines. The difference between the two lines, \(-244.5\) (grams), represents the difference in the average birth weights for a fixed gestation length for smoking and non-smoking mothers in the sample.
Comments:
The general interpretation of the regression coefficients in a regression model with one \((0, 1)\) binary indicator variable and one quantitative predictor is:
\(\beta_1\) represents the change in the mean response for each additional unit increase in the quantitative predictor \(x_1\) for both groups.
\(\beta_2\) represents how much higher (or lower) the mean response function of the second group is than that of the first group for any value of \(x_1\).
Now, let’s use our model and analysis to answer the research question:
Q: Is baby’s birth weight related to smoking during pregnancy, after taking into account length of gestation?
A: We can answer our research question by testing the null hypothesis \(H_0 : \beta_2 = 0\) vs \(H_1 : \beta_2\neq 0\).
In conclusion:
There is sufficient evidence to conclude that there is a statistically significant difference in the mean birth weight of all babies of smoking mothers and the mean birth weight of babies of all non-smoking mothers, after taking into account length of gestation.
A 95% confidence interval for \(\beta_2\) tells us the magnitude of the difference.
We can be 95% confident that the mean birth weight of smoking mothers is between \(158.7\) and \(330.4\) grams less than the mean birth weight of non-smoking mothers, regardless of the length of gestation.
Similarly, we can answer the question:
Q: How is birth weight related to gestation, after taking into account a mother’s smoking status?
A: We can answer this question by testing the null hypothesis \(H_0 : \beta_1 = 0\) vs \(H_1 : \beta_1\neq 0\).
In conclusion:
There is sufficient evidence to conclude that there is a statistically significant linear relationship between birth weight and gestation, after taking into account a mother’s smoking status.
A 95% confidence interval for \(\beta_1\) tells us the magnitude of this effect.
We can be 95% confident that the mean birth weight is expected to increase between \(124.4\) and \(161.8\) grams given each unit increase in the length of gestation, regardless of the smoking status of the mother.
MLR with multiple categorical variables
In general, a categorical variable defining \(K\) groups should be represented by \(K - 1\) indicator variables. For example:
If your categorical variable defines 2 groups, then you need 1 indicator variable. (We have looked at an example of this case.)
If your categorical variable defines 3 groups, then you need 2 indicator variables.
Choose one group or category to be the “reference” group:
Often it will be clear from the application which group should be the reference group, such as a control group in a medical experiment (e.g., the “non-smoking” group as the reference group in the birth weight example), but, if not, then the group with the most observations is often a good choice.
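For example, with three groups and group 3 chosen as the reference, the \(K - 1 = 2\) indicator variables can be coded as:

\[x_{i2} = \begin{cases} 1 & \text{if observation } i \text{ is in group 1}\\ 0 & \text{otherwise} \end{cases} \qquad x_{i3} = \begin{cases} 1 & \text{if observation } i \text{ is in group 2}\\ 0 & \text{otherwise} \end{cases}\]

The reference group is then identified by \(x_{i2} = x_{i3} = 0\).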
What is Implied by the Parallelness of the Two Lines?
In light of the plot, let’s investigate those questions again:
Does the effect of the gestation length on mean birth weight depend on the mother’s smoking status?
The answer is NO. Regardless of whether or not the mother is a smoker, for each additional week of gestation, the mean birth weight is predicted to increase by 143 grams.
This lack of interaction between the two predictors is exhibited by the parallelness of the two lines.
Does the effect of smoking on mean birth weight depend on the length of gestation?
The answer is NO. For a fixed length of gestation, the mean birth weight of babies born to smoking mothers is predicted to be 245 grams lower than the mean birth weight of babies born to non-smoking mothers.
Again, this lack of interaction between the two predictors is exhibited by the parallelness of the two lines.
MLR with Interaction Terms
Loosely speaking, interaction effects occur in regression when the influence of one predictor on the response variable depends on the value of another predictor. (When two predictors do not interact, we say that each predictor has an “additive effect” on the response.)
Now let’s take a look at an example where including “interaction terms” is appropriate.
Interaction between Categorical and Numerical Variables
Example: Some researchers were interested in comparing the effectiveness of three treatments for severe depression.
For the sake of simplicity, we denote the three treatments A, B, and C. The researchers collected the following data on a random sample of \(n = 36\) severely depressed individuals:
response: \(y_i\), measure of the effectiveness of the treatment for individual \(i\)
quantitative predictor: \(x_{i1}\), age (in years) of individual \(i\)
qualitative predictor: \(x_{i2} = 1\) if individual \(i\) received treatment A, and 0 if not
\(x_{i3} = 1\) if individual \(i\) received treatment B, and 0 if not
A scatter plot of the data with treatment effectiveness on the y-axis and age on the x-axis looks like:
dat = read.table('depression.txt', header = T)
head(dat)
y age x2 x3 TRT
1 56 21 1 0 A
2 41 23 0 1 B
3 40 30 0 1 B
4 28 19 0 0 C
5 55 28 1 0 A
6 25 23 0 0 C
names(dat)
[1] "y" "age" "x2" "x3" "TRT"
y = dat$y
age = dat$age
x2 = dat$x2
x3 = dat$x3
plot(age, y, type = 'n', xlab = 'Age', ylab = 'Treatment Effectiveness', main = 'Treatment Effectiveness vs Age')
points(age[x2 == 1], y[x2 == 1], col = 'blue')
points(age[x3 == 1], y[x3 == 1], col = 'green')
points(age[x3 == 0 & x2 == 0], y[x3 == 0 & x2 == 0], col = 'red')
legend('bottomright', bty = 'n', col = c('blue', 'green', 'red'), c('A', 'B', 'C'), pch = c(1, 1, 1))
The blue circles represent the data for individuals receiving treatment A, the green circles treatment B, and the red circles treatment C.
Fit a (second-order) multiple regression model with interaction terms
That the predictors have an “interaction effect” on the mean response implies that:
the effect of the individual’s age (\(x_{i1}\)) on the treatment’s mean effectiveness (\(\operatorname{E}(y_i)\)) depends on the treatment (\(x_{i2}\) and \(x_{i3}\))
the effect of treatment (\(x_{i2}\) and \(x_{i3}\)) on the treatment’s mean effectiveness (\(\operatorname{E}(y_i)\)) depends on the individual’s age (\(x_{i1}\)).
In general, for two predictors “to interact” means:
Two predictors interact if the effect on the response variable of one predictor depends on the value of the other.
A slope parameter can no longer be interpreted as the change in the mean response for each unit increase in the predictor, while the other predictors are held constant.
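Concretely, writing \(\beta_{12}\) and \(\beta_{13}\) for the interaction coefficients, the second-order model we fit is:

\[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_{12} x_{i1} x_{i2} + \beta_{13} x_{i1} x_{i3} + \epsilon_i\]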
fit = lm(y ~ age + x2 + x3 + age*x2 + age*x3)
summary(fit)
Call:
lm(formula = y ~ age + x2 + x3 + age * x2 + age * x3)
Residuals:
Min 1Q Median 3Q Max
-6.4366 -2.7637 0.1887 2.9075 6.5634
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.21138 3.34964 1.854 0.073545 .
age 1.03339 0.07233 14.288 6.34e-15 ***
x2 41.30421 5.08453 8.124 4.56e-09 ***
x3 22.70682 5.09097 4.460 0.000106 ***
age:x2 -0.70288 0.10896 -6.451 3.98e-07 ***
age:x3 -0.50971 0.11039 -4.617 6.85e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.925 on 30 degrees of freedom
Multiple R-squared: 0.9143, Adjusted R-squared: 0.9001
F-statistic: 64.04 on 5 and 30 DF, p-value: 4.264e-15
That is, we obtain three “best fitting” lines, one for each treatment (A, B, and C):
If the patient receives treatment A (\(x_{i2} = 1\), \(x_{i3} = 0\)), then
\[\hat y_i = 47.5 + 0.33x_{i1}\]
If the patient receives treatment B (\(x_{i2} = 0\), \(x_{i3} = 1\)), then
\[\hat y_i = 28.9 + 0.52x_{i1}\]
If the patient receives treatment C (\(x_{i2} = 0\), \(x_{i3} = 0\)), then
\[\hat y_i = 6.21 + 1.03x_{i1}\]
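These intercepts and slopes are obtained by combining the estimated coefficients. For treatment A, for example:

\[\hat\beta_0 + \hat\beta_2 = 6.21 + 41.30 \approx 47.5, \qquad \hat\beta_1 + \hat\beta_{12} = 1.033 - 0.703 \approx 0.33\]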
Interpretations:
The effect of age on the predicted treatment effectiveness depends on the treatment given. That is, age appears to interact with treatment in its impact on treatment effectiveness. Specifically,
For patients in this study receiving treatment A, the effectiveness of the treatment is predicted to increase 0.33 units for every additional year in age.
For patients in this study receiving treatment B, the effectiveness of the treatment is predicted to increase 0.52 units for every additional year in age.
For patients in this study receiving treatment C, the effectiveness of the treatment is predicted to increase 1.03 units for every additional year in age.
The interaction is exhibited graphically by the “non-parallelness” of the lines.
bs = coef(fit)
yhat1 = bs[1] + bs[3] + (bs[2] + bs[5])*(age[x2 == 1])
yhat2 = bs[1] + bs[4] + (bs[2] + bs[6])*(age[x3 == 1])
yhat3 = bs[1] + bs[2]*(age[x3 == 0 & x2 == 0])
plot(age, y, type = 'n', xlab = 'Age', ylab = 'Treatment Effectiveness', main = 'Treatment Effectiveness vs Age')
points(age[x2 == 1], y[x2 == 1], col = 'blue')
points(age[x3 == 1], y[x3 == 1], col = 'green')
points(age[x3 == 0 & x2 == 0], y[x3 == 0 & x2 == 0], col = 'red')
lines(age[x2 == 1], yhat1, col = 'blue')
lines(age[x3 == 1], yhat2, col = 'green')
lines(age[x3 == 0 & x2 == 0], yhat3, col = 'red')
legend('bottomright', bty = 'n', col = c('blue', 'green', 'red'), c('A', 'B', 'C'), lty = c(1, 1, 1), pch = c(1, 1, 1))
Comments:
As usual, we first should evaluate the model (via diagnostics) before we use our estimated model to draw conclusions about the larger population of depressed individuals.
Answer scientific research questions with hypothesis testing tools. For example,
Q: For every age, is there a difference in the mean effectiveness of the three treatments? That is, we test the significance of the treatment effect:
\[H_0 : \beta_2 =\beta_3 = \beta_{12} = \beta_{13} = 0 \text{ vs } H_1 : \text{ at least one of these parameters is not } 0\]
Q: Does the effect of age on the treatment’s effectiveness depend on the treatment? That is, we test for interaction between age and treatment:
\[H_0 : \beta_{12} = \beta_{13} = 0 \text{ vs } H_1 : \text{ at least one of these parameters is not } 0\]
Implement the F-tests for nested model comparison in R:

# 1. F-test for the significance of the treatment effect
mod.reduced = lm(y ~ age)
mod.full = lm(y ~ age + x2 + x3 + age*x2 + age*x3)
anova(mod.reduced, mod.full)
Analysis of Variance Table
Model 1: y ~ age
Model 2: y ~ age + x2 + x3 + age * x2 + age * x3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 34 1970.57
2 30 462.15 4 1508.4 24.48 4.458e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 2. F-test for interaction between age and treatment effect
mod.reduced = lm(y ~ age + x2 + x3)
mod.full = lm(y ~ age + x2 + x3 + age*x2 + age*x3)
anova(mod.reduced, mod.full)
Analysis of Variance Table
Model 1: y ~ age + x2 + x3
Model 2: y ~ age + x2 + x3 + age * x2 + age * x3
Res.Df RSS Df Sum of Sq F Pr(>F)
1 32 1165.57
2 30 462.15 2 703.43 22.831 9.41e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can conclude that there is a significant difference in the mean effectiveness for the three treatments, and the effect of age on the treatment’s effectiveness depends on the treatment.
A Supplementary Example: Interaction between Categorical and Numeric Variables
In the data set UN11, the places are divided into three groups. Note that group is a categorical variable.
Dataset Description
National health, welfare, and education statistics for 210 places, mostly UN members, but also other areas like Hong Kong that are not independent countries.
group: a factor with levels oecd, for countries that are members of the OECD (the Organization for Economic Co-operation and Development) as of May 2012; africa, for countries on the African continent; and other, for all other countries. No OECD countries are located in Africa.
fertility: number of children per woman
ppgdp: Per capita gross domestic product in US dollars
lifeExpF: Female life expectancy, years
pctUrban: Percent Urban
Research Goal: We want to investigate how lifeExpF depends on the categorical variable group. (In contrast, recall the research goal when fitting a one-way ANOVA model.)
In R, categorical variables are represented as factors, so we should first check whether group is a factor before we fit the model. Also recall that the function to convert a numerical variable to a factor is as.factor().
# check if the class of object group is factor
is.factor(group)
[1] TRUE
# different categories of nations
levels(group)
[1] "oecd" "other" "africa"
factor.lm = lm(lifeExpF ~ group, data = UN11)
summary(factor.lm)
Call:
lm(formula = lifeExpF ~ group, data = UN11)
Residuals:
Min 1Q Median 3Q Max
-25.8367 -3.3045 0.3635 2.7183 18.2277
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.446 1.128 73.095 < 2e-16 ***
groupother -7.120 1.271 -5.602 7.1e-08 ***
groupafrica -22.674 1.420 -15.968 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.28 on 196 degrees of freedom
Multiple R-squared: 0.6191, Adjusted R-squared: 0.6152
F-statistic: 159.3 on 2 and 196 DF, p-value: < 2.2e-16
Notice that the fit does not contain an estimated slope for the group oecd. This is because R represents factor variables by numeric dummy variables, which are \(1\) if an observation comes from a group and \(0\) if not. In this case, R creates two dummy variables: \(x_{i1}=1\) if country \(i\) is in group other and \(x_{i1}=0\) otherwise; \(x_{i2}=1\) if country \(i\) is in group africa and \(x_{i2}=0\) otherwise. So, if a country has \(x_{i1}=x_{i2}=0\), it is in group oecd.
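With this coding, the estimated coefficients translate directly into estimated group means for lifeExpF:

\[\hat\mu_{\text{oecd}} = 82.446, \qquad \hat\mu_{\text{other}} = 82.446 - 7.120 = 75.326, \qquad \hat\mu_{\text{africa}} = 82.446 - 22.674 = 59.772\]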
Now, we consider adding the numerical predictor \(\log(\text{ppgdp})\) to the model. Before fitting, let’s do an exploratory plot: a scatterplot of lifeExpF against \(\log(\text{ppgdp})\). Because we also have categorical information in the variable group, we can use a different plotting symbol for each group of nations to visualize the differences.
As we know, a multiple linear regression model (MLR) involving a categorical predictor and a continuous predictor essentially combines the separate simple linear regression models for the three groups into one. So, in order to model separate slopes for each group, we should add an interaction between the numerical and categorical variables.
Let \(Y_i\) and \(x_{i3}\) be the values of lifeExpF and \(\log(\text{ppgdp})\) for the \(i\)th country, and let \(x_{i1}\) and \(x_{i2}\) be the dummy variables for the categories of group. We fit a non-parallel model:
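Using the notation just defined, with \(\beta_{13}\) and \(\beta_{23}\) as interaction coefficients, the non-parallel model is:

\[Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_{13} x_{i1} x_{i3} + \beta_{23} x_{i2} x_{i3} + \epsilon_i\]

Here \(\beta_{13}\) and \(\beta_{23}\) allow the slope in \(\log(\text{ppgdp})\) to differ by group.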
# fit non-parallel MLR
np.lm = lm(lifeExpF ~ group + log(ppgdp) + group*log(ppgdp), data = UN11)
# alternatively, use an equivalent way to fit the same model:
# np.lm = lm(lifeExpF ~ group*log(ppgdp), data = UN11)
summary(np.lm)
Call:
lm(formula = lifeExpF ~ group + log(ppgdp) + group * log(ppgdp),
data = UN11)
Residuals:
Min 1Q Median 3Q Max
-18.634 -2.089 0.301 2.255 14.489
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.2137 15.2203 3.890 0.000138 ***
groupother -11.1731 15.5948 -0.716 0.474572
groupafrica -22.9848 15.7838 -1.456 0.146954
log(ppgdp) 2.2425 1.4664 1.529 0.127844
groupother:log(ppgdp) 0.9294 1.5177 0.612 0.540986
groupafrica:log(ppgdp) 1.0950 1.5785 0.694 0.488703
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.129 on 193 degrees of freedom
Multiple R-squared: 0.7498, Adjusted R-squared: 0.7433
F-statistic: 115.7 on 5 and 193 DF, p-value: < 2.2e-16