The Core Toolkit of Data Science

Timeless Math & R for the Age of AI

Author

Jiaye Xu

Published

March 16, 2026

Course Introduction: Why This Matters Now (More Than Ever)

Welcome. It’s an exciting time to be learning data science. We’re surrounded by headlines about AI, large language models (LLMs), and automated machine learning. You might be wondering: “Why do I need to learn the math and R basics when an AI can write code for me?”

This is the most important question you could ask, and the answer is the core motivation for this course.

The Analogy: Are You the Driver or the Passenger?

Think of modern AI tools like ChatGPT or automated analysis platforms as a GPS navigation app on your phone.

A GPS is incredible. It can suggest the fastest route, warn you about traffic, and even reroute you around accidents. Millions of people use it every day to get where they’re going without getting lost.

But here’s the critical question: If you hand your phone to a friend and just follow the GPS voice blindly, are you the driver—or just a passenger following directions?

  • The GPS doesn’t know that the “shortcut” it’s suggesting goes through a neighborhood with a closed road.
  • It doesn’t know that you’re driving a moving truck that can’t fit under a low bridge.
  • It doesn’t know that the “faster route” avoids the scenic coastline you wanted to show your family.
  • And if the GPS loses signal, you’re completely lost.

Math is the map. It’s the understanding of how roads connect, why certain routes are shorter, and what those contour lines on a topographic map actually mean. When the GPS gives you a direction, the math we’ll learn is the knowledge that lets you look at that suggestion and think, “That doesn’t make sense—there’s a river in that direction,” or “Yes, that aligns with what I know about this city.”

R is the steering wheel, pedals, and dashboard. It’s the hands-on control you have over the vehicle. It lets you accelerate, brake, check your speed, and see your fuel level. While the GPS can suggest a route, you are the one who must navigate the unexpected pothole, yield to a pedestrian, or decide to take that unplanned detour to a lookout point.

So What Are You Becoming? Driver, Mechanic, or Something Else?

This is where the analogy gets interesting, and where we need to be precise.

If you only use AI to generate code and blindly run it, you’re a passenger. You’ll get places, but you won’t know how you got there, and you’ll be helpless when things go wrong.

If you learn just enough R to run basic analyses but don’t understand the math, you’re a driver who can’t read the map. You can operate the car, but you’re completely dependent on your GPS (AI) for direction, and you can’t tell if it’s leading you astray.

If you learn the math but never touch R, you’re a cartographer who can’t drive. You understand the territory beautifully on paper, but you can’t actually traverse it.

A data scientist is both a skilled driver AND someone who understands the map. You need to:

  • Read the terrain (understand the data context and business question)

  • Interpret the map (apply statistical thinking and mathematical principles)

  • Operate the vehicle (write R code to execute the analysis)

  • Use the GPS wisely (leverage AI tools to speed up coding, but always with a critical eye)

  • Know when to ignore the GPS (recognize when automated suggestions don’t fit your specific problem)

And yes—sometimes you also need to be the mechanic. When your usual tools don’t work, when the data is in a strange format, when you need to build a custom function that no existing package provides—that’s when your R skills let you pop the hood and tinker.

Your job as a data scientist is to be the driver, the navigator, and when needed, the mechanic. AI is an incredibly powerful co-pilot—but it’s not the pilot. This course gives you the skills to sit in the captain’s seat.

The Bottom Line

In the age of AI, foundational knowledge isn’t obsolete—it’s more valuable than ever. The easier it becomes to generate code and results, the more important it is to have professionals who can:

1. Ask the right questions before any code is written.

2. Evaluate the outputs with critical statistical thinking.

3. Recognize when the AI is wrong (and it will be, often).

4. Communicate findings in a way that drives real decisions.

AI handles the how. You need to know the why, the when, and the whether. That’s what this course builds.

Welcome to the driver’s seat.


Lecture 1: Foundations – Intuition, Tools, and First Steps

Lecture Goals:

  1. To understand the three pillars of data science math: Linear Algebra, Calculus, and Probability/Statistics.

  2. To reframe data not as rows and columns, but as points in a mathematical space.

  3. To build an intuitive understanding of the mathematical ideas behind data science.

  4. To implement a simple model in R and see these concepts come together in practice.

The Big Picture: The Data Science Workflow

Before diving in, let’s place our math in a workflow. A data scientist’s job is to turn raw data into actionable insights. This involves:

  1. Ask a Question: What business problem are we solving?
  2. Get & Clean Data: Gather the raw material.
  3. Explore & Visualize: Understand the structure and patterns in the data. (This is where Linear Algebra helps us think about structure and distance).
  4. Model the Data: Use a mathematical function to represent the patterns or make predictions. (This is where Calculus helps us find the “best” function).
  5. Validate & Interpret: Quantify our certainty and communicate results. (This is where Probability & Statistics help us understand uncertainty and risk).

Today, we’re focusing on the conceptual tools for steps 3, 4, and 5.

Part 1: The Checklist - A Data Scientist’s Mathematical Toolkit

What’s in the toolbox of a successful data scientist? It’s not about memorizing formulas, but about having concepts at your fingertips.

  • Linear Algebra. Key concepts: Vectors, Matrices, Matrix Multiplication, Vector Spaces, Projections. Why it’s in the box: to work with multi-dimensional data. A dataset isn’t just a table; it’s a collection of points in a high-dimensional space. Linear algebra is the language of that space.

  • Calculus. Key concepts: Derivatives, Gradients, Optimization, Chain Rule. Why it’s in the box: to learn from data. Almost all machine learning is an optimization problem: find the model parameters that minimize the error. Calculus tells us how to “walk downhill” to the best solution.

  • Probability & Statistics. Key concepts: Distributions, Mean/Variance, Likelihood, p-values, Confidence Intervals. Why it’s in the box: to quantify uncertainty. Data is messy and incomplete. Statistics gives us the tools to make informed guesses and state how confident we are in them.

Part 2: From Vectors to Predictions - The Core Idea

Let’s build a simple predictive model from scratch to see how these tools fit together. We’ll imagine we want to predict a student’s final exam score (\(y\)) based on the number of hours they studied (\(x\)).

2.1. Data as Vectors (Linear Algebra)

Our data isn’t just two lists of numbers. It’s a collection of points in a 2D space.

  • Each student is a point, represented by a vector: \(\vec{x_i} = \begin{bmatrix} \text{hours}_i \end{bmatrix}\) and their score \(y_i\).
  • The entire dataset is a collection of these points. We want to find a line that best represents their relationship.

2.2. The Model as a Function (Calculus & Linear Algebra)

We propose a linear model: \[ \hat{y} = \beta_0 + \beta_1 x \] This is our hypothesis. As George Box put it, “All models are wrong, but some are useful.”

\(\beta_0\) (intercept) and \(\beta_1\) (slope) are the parameters of our model. We can think of \(\hat{y}\) as a prediction.

2.3. The “Best” Model as an Optimization Problem (Calculus)

We need to define what “best” means. We’ll use a common measure: the sum of squared errors (SSE). For each actual data point \((x_i, y_i)\) and our prediction \(\hat{y}_i\), the error is \(y_i - \hat{y}_i\). We want to minimize the sum of these errors squared.

Our loss function is: \[ L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - (\beta_0 + \beta_1 x_i))^2 \]

Finding the best \(\beta_0\) and \(\beta_1\) means finding the values that make this function as small as possible. This is where calculus comes in. We take the derivative of \(L\) with respect to \(\beta_0\) and \(\beta_1\), set them to zero, and solve. This is the famous ordinary least squares (OLS) solution.
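Before doing that calculus by hand, it can help to see the optimization idea numerically. The sketch below (the hours/score numbers are invented for illustration; the lecture’s worked dataset comes later) lets R’s general-purpose optimizer walk downhill on the SSE surface and compares the result with the closed-form solution:

```r
# Hypothetical toy data: hours studied and exam scores (invented numbers)
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 58, 65, 70, 74, 81)

# The loss function L(beta_0, beta_1): sum of squared errors
sse <- function(beta) sum((score - (beta[1] + beta[2] * hours))^2)

# Start from an arbitrary guess and let optim() minimize the SSE numerically
fit <- optim(par = c(0, 0), fn = sse)
fit$par   # numerically found (beta_0, beta_1)

# Closed-form OLS solution for comparison: S_xy / S_xx and y_bar - b1 * x_bar
b1 <- sum((hours - mean(hours)) * (score - mean(score))) /
      sum((hours - mean(hours))^2)
b0 <- mean(score) - b1 * mean(hours)
c(b0, b1)
```

The two answers agree (up to the optimizer’s tolerance), which is the whole point: calculus gives us the exact destination that numerical “downhill walking” approaches step by step.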

Part 3: Hands-On - Predicting Customer Spending

Scenario: An e-commerce company wants to understand the relationship between the time a user spends on their website (time_spent) and the amount they spend in a session (spend). They have collected data from 10 recent sessions.

Your Task as a Data Scientist in Training: Without using R’s built-in lm() function, you will manually calculate the coefficients for a simple linear regression to model this relationship. This will force you to implement the math we just discussed.

Data: time_spent (minutes): [5, 10, 15, 20, 25, 30, 35, 40, 45, 50] spend (dollars): [25, 40, 58, 70, 85, 95, 112, 130, 140, 160]

Step-by-step R Implementation:

We’ll use R as a powerful calculator to implement the Ordinary Least Squares (OLS) formulas. We’ll need the means, variances, and covariance.

# --- Step 1: Enter the data as vectors ---
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)

# --- Step 2: Calculate the means (x_bar, y_bar) ---
x_bar <- mean(time_spent)
y_bar <- mean(spend)

cat("Mean time spent:", x_bar, "\n")
Mean time spent: 27.5 
cat("Mean spend:", y_bar, "\n\n")
Mean spend: 91.5 
# --- Step 3: Calculate the components for beta_1 (the slope) ---
# beta_1 = sum( (x_i - x_bar)*(y_i - y_bar) ) / sum( (x_i - x_bar)^2 )

# Calculate deviations from the mean for x and y
x_dev <- time_spent - x_bar
y_dev <- spend - y_bar

# Calculate the numerator (cross-product of deviations)
# This is related to the covariance
numerator <- sum(x_dev * y_dev)

# Calculate the denominator (sum of squared deviations for x)
# This is the variance of x times (n-1)
denominator <- sum(x_dev^2)

beta_1 <- numerator / denominator

# --- Step 4: Calculate beta_0 (the intercept) ---
# beta_0 = y_bar - beta_1 * x_bar
beta_0 <- y_bar - beta_1 * x_bar

cat("Our manually calculated model:\n")
Our manually calculated model:
cat("spend =", beta_0, "+", beta_1, "* time_spent\n\n")
spend = 11.13333 + 2.922424 * time_spent
# --- Step 5: Make a prediction for a new value ---
# What would we predict for a user who spends 22 minutes on the site?
new_time <- 22
predicted_spend <- beta_0 + beta_1 * new_time
cat("Predicted spend for", new_time, "minutes: $", predicted_spend, "\n")
Predicted spend for 22 minutes: $ 75.42667 
# --- Step 6: Visualize our model ---
plot(time_spent, spend, 
     main = "Spend vs. Time Spent: Our First Model",
     xlab = "Time Spent (minutes)", 
     ylab = "Spend (dollars)",
     pch = 16, col = "blue")
# Add the regression line
abline(a = beta_0, b = beta_1, col = "red", lwd = 2)
# Add the prediction point
points(new_time, predicted_spend, col = "darkgreen", pch = 18, cex = 2)
legend("topleft", legend = c("Data", "Our Model", "Prediction (22 min)"),
       col = c("blue", "red", "darkgreen"), pch = c(16, NA, 18), 
       lty = c(NA, 1, NA), lwd = 2, cex=0.6)

Discussion:

  • What does the slope (\(\beta_1\)) tell us in practical terms? (For every extra minute on the site, spend increases by about $2.92).

  • What does the intercept (\(\beta_0\)) mean in this context? (A user who spends 0 minutes is predicted to spend about \(\$11.13\). Does that make sense? It might represent a baseline purchase or just a mathematical artifact of the line).

  • How would we use statistics to tell if this relationship is “significant”? (We’d need to calculate standard errors and a p-value, which R’s lm() does for us, but now we know what’s happening under the hood).


Lecture 2: Deep Dive – The Math That Powers the Models

Lecture Goals:

  1. To derive the ordinary least squares (OLS) estimators rigorously and verify that they give a minimum,

  2. To interpret the statistical output and understand statistical inference,

  3. To explore the beautiful geometry of regression. (Optional)

Part 4: Deriving the Least Squares Line

Let’s continue with our simple linear regression model: we want to predict a student’s final exam score (\(y\)) based on hours studied (\(x\)). We have \(n\) data points \((x_i, y_i)\). We propose the model:

\[ \hat{y}_i = \beta_0 + \beta_1 x_i \]

We need to find \(\beta_0\) and \(\beta_1\) that minimize the sum of squared errors (SSE):

\[ L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 \]

4.1. Minimizing the Loss Function Using Calculus

This is a classic optimization problem in two variables. To find the minimum, we take partial derivatives and set them to zero.

First-order conditions:

\[ \frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \]

\[ \frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \]

Dividing both equations by \(-2\) and rearranging gives the normal equations:

\[ \begin{cases} \sum y_i = n \beta_0 + \beta_1 \sum x_i \\[4pt] \sum x_i y_i = \beta_0 \sum x_i + \beta_1 \sum x_i^2 \end{cases} \]

From the first equation, we get:

\[ \beta_0 = \bar{y} - \beta_1 \bar{x}, \quad \text{where } \bar{y} = \frac{1}{n}\sum y_i,\; \bar{x} = \frac{1}{n}\sum x_i. \]

Substitute this expression for \(\beta_0\) into the second normal equation:

\[ \sum x_i y_i = (\bar{y} - \beta_1 \bar{x}) \sum x_i + \beta_1 \sum x_i^2 \]

\[ \sum x_i y_i = \bar{y} \sum x_i - \beta_1 \bar{x} \sum x_i + \beta_1 \sum x_i^2 \]

Now isolate \(\beta_1\):

\[ \sum x_i y_i - \bar{y} \sum x_i = \beta_1 \left( \sum x_i^2 - \bar{x} \sum x_i \right) \]

Notice that \(\bar{y} \sum x_i = n \bar{x} \bar{y}\) and \(\bar{x} \sum x_i = n \bar{x}^2\). Also, we can rewrite the left side as \(\sum (x_i y_i - \bar{x} \bar{y})\) and the right side’s bracket as \(\sum (x_i^2 - \bar{x}^2)\). More conveniently, we use the deviations from the mean formulation.

Let \(S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})\) and \(S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2\). Then:

\[ \hat\beta_1 = \frac{S_{xy}}{S_{xx}} \]

And \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\).

These are the ordinary least squares (OLS) estimators.
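The same estimators also fall out of the matrix form of the normal equations, \(\hat{\boldsymbol\beta} = (X'X)^{-1}X'\mathbf{y}\), which we will meet again when discussing standard errors. A minimal sketch (the hours/score numbers are invented for illustration):

```r
# Hypothetical toy data (invented numbers)
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 58, 65, 70, 74, 81)

# Design matrix: a column of ones (intercept) next to the predictor
X <- cbind(1, hours)

# Solve the normal equations (X'X) beta = X'y directly
beta_hat <- solve(t(X) %*% X, t(X) %*% score)
beta_hat   # first entry = intercept, second = slope

# Deviation-form formulas derived above, for comparison
b1 <- sum((hours - mean(hours)) * (score - mean(score))) /
      sum((hours - mean(hours))^2)
b0 <- mean(score) - b1 * mean(hours)
c(b0, b1)
```

Note that `solve(A, b)` solves the linear system rather than explicitly inverting \(X'X\), which is both faster and numerically safer.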

4.2. Verifying It’s a Minimum: The Second Derivative Test

To ensure we have found a minimum (and not a maximum or saddle point), we examine the second-order partial derivatives. Compute:

\[ \frac{\partial^2 L}{\partial \beta_0^2} = 2n \] \[ \frac{\partial^2 L}{\partial \beta_1^2} = 2 \sum x_i^2 \] \[ \frac{\partial^2 L}{\partial \beta_0 \partial \beta_1} = 2 \sum x_i \]

The Hessian matrix \(H\) is:

\[ H = \begin{bmatrix} 2n & 2\sum x_i \\ 2\sum x_i & 2\sum x_i^2 \end{bmatrix} \]

For a minimum, \(H\) must be positive definite. A symmetric matrix is positive definite if all of its leading principal minors (the determinants of its top-left submatrices) are positive. The first leading principal minor is \(2n > 0\), and the second is the determinant of \(H\):

\[ \det(H) = (2n)(2\sum x_i^2) - (2\sum x_i)^2 = 4 \left( n \sum x_i^2 - (\sum x_i)^2 \right) \]

But note that \(n \sum x_i^2 - (\sum x_i)^2 = n \sum (x_i - \bar{x})^2 = n S_{xx}\). Since \(S_{xx} > 0\) unless all \(x_i\) are identical (a degenerate case we exclude), the determinant is positive. Hence \(H\) is positive definite, confirming that the critical point is indeed a global minimum. This ties directly to your multivariable calculus knowledge: in Theorem C (the Second Partials Test) of Section 12.8, we define a quantity \(D\) and check whether \(D>0\) and \(f_{xx}>0\) for a local minimum.

So the unique solution minimizes the SSE.
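As a numerical sanity check, we can build this Hessian in R for an arbitrary (non-constant) predictor vector and confirm both positive-definiteness criteria, via eigenvalues and via leading principal minors:

```r
x <- c(1, 2, 3, 4, 5, 6)   # any non-constant predictor works
n <- length(x)

# Hessian of the SSE loss, exactly as derived above
H <- matrix(c(2 * n,      2 * sum(x),
              2 * sum(x), 2 * sum(x^2)),
            nrow = 2, byrow = TRUE)

# Check 1: a symmetric matrix is positive definite iff all eigenvalues > 0
eigen(H)$values

# Check 2: leading principal minors (first diagonal entry and determinant)
c(H[1, 1], det(H))   # both should be positive
```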

Part 5: Evaluating the Model – Uncertainty Quantification and Goodness-of-Fit

Now that we have our coefficient estimates, we need to answer two crucial questions:

  1. How reliable are these estimates? (uncertainty)

  2. How well does the model explain the data? (goodness‑of‑fit)

This part brings together all the statistical concepts you need.

5.1. Quantifying Uncertainty: Standard Errors, t-statistics, and p-values

Now that we have our line, we need to ask: how reliable are these estimates?

If we collected a different sample of students, we’d get slightly different \(\hat{\beta}_0\) and \(\hat{\beta}_1\). This sampling variability is quantified by standard errors.

First, we need an estimate of the variance of the errors \(\epsilon_i = y_i - (\beta_0 + \beta_1 x_i)\). The residual variance (mean squared error) is:

\[ \hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

We divide by \(n-2\) because we’ve estimated two parameters (\(\beta_0\) and \(\beta_1\)), losing two degrees of freedom.

The standard errors of the coefficients are:

\[ \text{se}(\hat{\beta}_1) = \sqrt{ \frac{\hat{\sigma}^2}{S_{xx}} } \] \[ \text{se}(\hat{\beta}_0) = \sqrt{ \hat{\sigma}^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) } \]

These formulas come from the variance-covariance matrix of the OLS estimators, which is \(\hat{\sigma}^2 (X^TX)^{-1}\) in matrix notation.
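A short sketch verifying the matrix formula \(\hat{\sigma}^2 (X^TX)^{-1}\) against the scalar standard-error formulas, using the spend data from Part 3:

```r
# Spend data from the Part 3 example
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)
n <- length(spend)

X <- cbind(1, time_spent)                        # design matrix
beta_hat <- solve(t(X) %*% X, t(X) %*% spend)    # OLS in matrix form
resid <- spend - as.vector(X %*% beta_hat)
sigma2_hat <- sum(resid^2) / (n - 2)             # residual variance

# Variance-covariance matrix of the OLS estimators
V <- sigma2_hat * solve(t(X) %*% X)

# The square roots of the diagonal are se(beta_0) and se(beta_1)
sqrt(diag(V))
```

These match the values produced by the scalar formulas (and by `summary(lm(...))`) later in this lecture.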

Hypothesis Testing: Is the slope significantly different from zero?

We often want to test the null hypothesis \(H_0: \beta_1 = 0\) (no linear relationship) against the alternative \(H_1: \beta_1 \neq 0\). Under \(H_0\), the test statistic:

\[ t = \frac{\hat{\beta}_1}{\text{se}(\hat{\beta}_1)} \]

follows a t-distribution with \(n-2\) degrees of freedom (assuming the errors are normally distributed, or approximately so for large samples). The p-value is the probability of observing a t-statistic as extreme as, or more extreme than, the one computed, if \(H_0\) were true.

  • A small p-value (typically < 0.05) indicates that such an extreme result is unlikely under the null, so we reject \(H_0\) and conclude there is evidence of a linear relationship.
  • A large p-value means the data are consistent with no relationship; we fail to reject \(H_0\).

Confidence Intervals: A 95% confidence interval for \(\beta_1\) is:

\[ \hat{\beta}_1 \pm t_{n-2, 0.025} \cdot \text{se}(\hat{\beta}_1) \]

where \(t_{n-2, 0.025}\) is the 2.5% critical value from the t-distribution. This interval gives a range of plausible values for the true slope.
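A short sketch computing this 95% interval for the slope in the spend example from Part 3 (the `qt()` call supplies the critical t value):

```r
# Spend data from the Part 3 example
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)
n <- length(spend)

x_bar <- mean(time_spent); y_bar <- mean(spend)
Sxx <- sum((time_spent - x_bar)^2)
beta_1 <- sum((time_spent - x_bar) * (spend - y_bar)) / Sxx
beta_0 <- y_bar - beta_1 * x_bar

resid <- spend - (beta_0 + beta_1 * time_spent)
se_beta1 <- sqrt(sum(resid^2) / (n - 2) / Sxx)

# 95% CI: estimate +/- (critical t value with n-2 df) * standard error
t_crit <- qt(0.975, df = n - 2)
ci <- beta_1 + c(-1, 1) * t_crit * se_beta1
ci
```

Because 0 lies far outside this interval, the CI tells the same story as the p-value: the data are inconsistent with a zero slope.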

5.2. Goodness of Fit: \(R^2\) and the Decomposition of Variance

How well does our model explain the variation in \(y\)? \(R^2\) (the coefficient of determination) answers this. It is the proportion of the total variance in \(y\) that is explained by the regression.

Consider the following sums of squares:

  • Total Sum of Squares (SST): measures the total variability in \(y\) around its mean. \[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

  • Regression Sum of Squares (SSR): measures the variability explained by the model (the spread of predictions around the mean). \[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \]

  • Error Sum of Squares (SSE): measures the unexplained variability (residuals around the regression line). \[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

A fundamental identity (which can be derived using the orthogonality of residuals and fitted values) is:

\[ SST = SSR + SSE \]

In linear algebra terms, this is like the Pythagorean theorem: the vector of observations (centered) is the sum of two orthogonal vectors—the projection onto the model space and the residual vector. This geometric picture also connects to Lagrange’s identity (which relates the product of two vectors’ squared lengths, their inner product, and the area of the parallelogram they span; we develop this in the supplementary reading below). Here, we have:

\[ \| \mathbf{y} - \bar{y}\mathbf{1} \|^2 = \| \hat{\mathbf{y}} - \bar{y}\mathbf{1} \|^2 + \| \mathbf{y} - \hat{\mathbf{y}} \|^2 \]

where \(\|\cdot\|^2\) denotes the squared Euclidean norm.

Then \(R^2\) is defined as:

\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \]

  • \(R^2\) ranges from 0 to 1. A value close to 1 indicates that most of the variability in \(y\) is explained by the model.
  • In simple linear regression, \(R^2\) is also the square of the correlation coefficient \(r\) between \(x\) and \(y\): \(R^2 = r^2\).

Interpretation: If \(R^2 = 0.85\), we say “85% of the variance in exam scores is explained by hours studied.” The remaining 15% is due to other factors or random variation.
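Both the decomposition \(R^2 = SSR/SST\) and the identity \(R^2 = r^2\) are easy to check numerically with the spend data from Part 3:

```r
# Spend data from the Part 3 example
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)

# Fit the line with the deviation-form OLS formulas
y_bar <- mean(spend)
beta_1 <- sum((time_spent - mean(time_spent)) * (spend - y_bar)) /
          sum((time_spent - mean(time_spent))^2)
beta_0 <- y_bar - beta_1 * mean(time_spent)
y_hat <- beta_0 + beta_1 * time_spent

# R^2 from the variance decomposition: SSR / SST
R_sq <- sum((y_hat - y_bar)^2) / sum((spend - y_bar)^2)

# In simple linear regression this equals the squared correlation
c(R_sq, cor(time_spent, spend)^2)
```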

Supplementary Readings: Geometry of \(\mathbf{R^2}\)

A deeper understanding of how linear algebra and calculus underpin regression.

5.2.1 The Vector Geometry of Least Squares

In simple linear regression, we have \(n\) observations. Let’s stack them into vectors:

  • \(\mathbf{y} = (y_1, y_2, \dots, y_n)'\) – the observed responses.
  • \(\mathbf{1} = (1, 1, \dots, 1)'\) – a vector of ones (for the intercept).
  • \(\mathbf{x} = (x_1, x_2, \dots, x_n)'\) – the predictor.
  • The design matrix \(\mathbf{X} = [\mathbf{1} \; \mathbf{x}]\) (an \(n \times 2\) matrix).

The fitted values are \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\), where \(\hat{\boldsymbol{\beta}} = (\hat{\beta}_0, \hat{\beta}_1)'\) is the OLS estimator. The residuals are \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\).

A key property from the normal equations is that the residuals are orthogonal to every column of \(\mathbf{X}\): \[ \mathbf{X}'\mathbf{e} = \mathbf{0}. \] In particular:

  • \(\mathbf{1}'\mathbf{e} = \sum e_i = 0\) (residuals sum to zero).

  • \(\mathbf{x}'\mathbf{e} = \sum x_i e_i = 0\).

Because \(\hat{\mathbf{y}}\) is a linear combination of the columns of \(\mathbf{X}\), it follows that \(\hat{\mathbf{y}}\) is also orthogonal to \(\mathbf{e}\): \[ \hat{\mathbf{y}}'\mathbf{e} = (\mathbf{X}\hat{\boldsymbol{\beta}})'\mathbf{e} = \hat{\boldsymbol{\beta}}'\mathbf{X}'\mathbf{e} = 0. \] Thus, in \(\mathbb{R}^n\), the vectors \(\hat{\mathbf{y}}\) and \(\mathbf{e}\) are perpendicular.
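These orthogonality properties can be verified numerically with the spend data from earlier; every entry of \(\mathbf{X}'\mathbf{e}\) should be zero up to floating-point error:

```r
# Spend data from the Part 3 example
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)

X <- cbind(1, time_spent)                       # design matrix [1  x]
beta_hat <- solve(t(X) %*% X, t(X) %*% spend)   # OLS
e <- spend - as.vector(X %*% beta_hat)          # residual vector

# X'e: residuals sum to zero AND are orthogonal to the predictor
t(X) %*% e

# Fitted values are therefore orthogonal to the residuals as well
sum(as.vector(X %*% beta_hat) * e)
```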

5.2.2 Pythagorean Decomposition of \(\mathbf{y}\)

Since \(\mathbf{y} = \hat{\mathbf{y}} + \mathbf{e}\) and \(\hat{\mathbf{y}} \perp \mathbf{e}\), the Pythagorean theorem gives: \[ \|\mathbf{y}\|^2 = \|\hat{\mathbf{y}}\|^2 + \|\mathbf{e}\|^2. \] Here \(\|\mathbf{v}\|^2 = \sum_{i=1}^n v_i^2\) is the squared Euclidean norm. This is a decomposition of the raw (uncorrected) sum of squares:

  • \(\|\mathbf{y}\|^2 = \sum y_i^2\),

  • \(\|\hat{\mathbf{y}}\|^2 = \sum \hat{y}_i^2\),

  • \(\|\mathbf{e}\|^2 = \sum e_i^2 = SSE\).

However, in regression we usually work with centered sums of squares, i.e., deviations from the mean. Let’s define the centered response vector: \[ \tilde{\mathbf{y}} = \mathbf{y} - \bar{y}\mathbf{1}, \] where \(\bar{y} = \frac{1}{n}\sum y_i\). Similarly, the centered fitted values are \(\tilde{\hat{\mathbf{y}}} = \hat{\mathbf{y}} - \bar{y}\mathbf{1}\) (because, for a model with an intercept, the average of the fitted values equals \(\bar{y}\)). The residuals remain unchanged because they already sum to zero: \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \tilde{\mathbf{y}} - \tilde{\hat{\mathbf{y}}}\).

Now observe that \(\tilde{\hat{\mathbf{y}}}\) lies in the column space of \(\mathbf{X}\): since \(\hat{\mathbf{y}} = \hat{\beta}_0\mathbf{1} + \hat{\beta}_1\mathbf{x}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\), centering gives \(\tilde{\hat{\mathbf{y}}} = \hat{\beta}_1(\mathbf{x} - \bar{x}\mathbf{1})\), so \(\tilde{\hat{\mathbf{y}}}\) lies in the span of \(\mathbf{x} - \bar{x}\mathbf{1}\). Moreover, \(\mathbf{e}\) is orthogonal to that column space. Consequently, \(\tilde{\hat{\mathbf{y}}}\) and \(\mathbf{e}\) are also orthogonal: \[ \tilde{\hat{\mathbf{y}}}'\mathbf{e} = 0. \] And \(\tilde{\mathbf{y}} = \tilde{\hat{\mathbf{y}}} + \mathbf{e}\). Applying the Pythagorean theorem again: \[ \|\tilde{\mathbf{y}}\|^2 = \|\tilde{\hat{\mathbf{y}}}\|^2 + \|\mathbf{e}\|^2. \] But \(\|\tilde{\mathbf{y}}\|^2 = \sum (y_i - \bar{y})^2 = SST\) (total sum of squares), \(\|\tilde{\hat{\mathbf{y}}}\|^2 = \sum (\hat{y}_i - \bar{y})^2 = SSR\) (regression sum of squares), and \(\|\mathbf{e}\|^2 = SSE\). Thus we obtain the fundamental identity: \[ SST = SSR + SSE. \]

Geometric insight: The vector \(\tilde{\mathbf{y}}\) (centered observations) is decomposed into two perpendicular components: its projection onto the subspace spanned by the centered predictor (which gives \(\tilde{\hat{\mathbf{y}}}\)) and the residual vector \(\mathbf{e}\). The squared lengths of these vectors add up exactly as in a right triangle.

5.2.3 Connection to Lagrange’s Identity

For those who enjoy a deeper mathematical connection, consider the centered predictor vector \(\tilde{\mathbf{x}} = \mathbf{x} - \bar{x}\mathbf{1}\). Then \(\tilde{\hat{\mathbf{y}}} = \hat{\beta}_1 \tilde{\mathbf{x}}\), where \(\hat{\beta}_1 = \frac{\tilde{\mathbf{x}}'\tilde{\mathbf{y}}}{\|\tilde{\mathbf{x}}\|^2}\).

The residual sum of squares can be written as: \[ SSE = \|\tilde{\mathbf{y}} - \hat{\beta}_1 \tilde{\mathbf{x}}\|^2 = \|\tilde{\mathbf{y}}\|^2 - \frac{(\tilde{\mathbf{x}}'\tilde{\mathbf{y}})^2}{\|\tilde{\mathbf{x}}\|^2}. \] Multiply both sides by \(\|\tilde{\mathbf{x}}\|^2\): \[ SSE \cdot \|\tilde{\mathbf{x}}\|^2 = \|\tilde{\mathbf{x}}\|^2 \|\tilde{\mathbf{y}}\|^2 - (\tilde{\mathbf{x}}'\tilde{\mathbf{y}})^2. \]

The right‑hand side is exactly Lagrange’s identity for the two vectors \(\tilde{\mathbf{x}}\) and \(\tilde{\mathbf{y}}\) in \(\mathbb{R}^n\): \[ \|\tilde{\mathbf{x}}\|^2 \|\tilde{\mathbf{y}}\|^2 - (\tilde{\mathbf{x}}'\tilde{\mathbf{y}})^2 = \sum_{1 \le i < j \le n} (\tilde{x}_i \tilde{y}_j - \tilde{x}_j \tilde{y}_i)^2. \] The left‑hand side of this identity is the squared area of the parallelogram spanned by \(\tilde{\mathbf{x}}\) and \(\tilde{\mathbf{y}}\); the right‑hand side expresses that same area as the sum of squared areas of its projections onto the coordinate planes. So:

  • When \(\tilde{\mathbf{x}}\) and \(\tilde{\mathbf{y}}\) are perfectly collinear (i.e., \(r^2 = 1\)), the right‑hand side is zero, so \(SSE = 0\) – all points lie exactly on a line.

  • When they are orthogonal (no linear relationship), the right‑hand side equals \(\|\tilde{\mathbf{x}}\|^2 \|\tilde{\mathbf{y}}\|^2\), and \(SSE = \|\tilde{\mathbf{y}}\|^2 = SST\) – the model explains nothing.

Thus, \(SSE\) (scaled by \(\|\tilde{\mathbf{x}}\|^2\)) measures the squared area (or “non‑collinearity”) between \(\tilde{\mathbf{x}}\) and \(\tilde{\mathbf{y}}\). This is a beautiful geometric interpretation that ties the algebraic Lagrange identity directly to the concept of unexplained variation in regression.

5.2.4 R Illustration: Verifying the Decompositions

Using the spend vs. time_spent data, we can compute these quantities numerically.

# Using the same data as before
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)

n <- length(time_spent)
x_bar <- mean(time_spent)
y_bar <- mean(spend)

# OLS from earlier
beta_1 <- sum((time_spent - x_bar)*(spend - y_bar)) / sum((time_spent - x_bar)^2)
beta_0 <- y_bar - beta_1 * x_bar
y_hat <- beta_0 + beta_1 * time_spent
e <- spend - y_hat

# --- Raw (uncentered) decomposition ---
raw_y <- sum(spend^2)
raw_yhat <- sum(y_hat^2)
raw_e <- sum(e^2)
cat("Raw sums of squares:\n")
Raw sums of squares:
cat("  ||y||^2   =", raw_y, "\n")
  ||y||^2   = 101383 
cat("  ||ŷ||^2   =", raw_yhat, "\n")
  ||ŷ||^2   = 101337.4 
cat("  ||e||^2   =", raw_e, "\n")
  ||e||^2   = 45.58788 
cat("  ||ŷ||^2 + ||e||^2 =", raw_yhat + raw_e, " equals ||y||^2? ", 
    all.equal(raw_y, raw_yhat + raw_e), "\n\n")
  ||ŷ||^2 + ||e||^2 = 101383  equals ||y||^2?  TRUE 
# --- Centered decomposition (SST, SSR, SSE) ---
y_centered <- spend - y_bar
yhat_centered <- y_hat - y_bar
SST <- sum(y_centered^2)
SSR <- sum(yhat_centered^2)
SSE <- sum(e^2)

cat("Centered sums of squares:\n")
Centered sums of squares:
cat("  SST =", SST, "\n")
  SST = 17660.5 
cat("  SSR =", SSR, "\n")
  SSR = 17614.91 
cat("  SSE =", SSE, "\n")
  SSE = 45.58788 
cat("  SSR + SSE =", SSR + SSE, " equals SST? ", 
    all.equal(SST, SSR + SSE), "\n\n")
  SSR + SSE = 17660.5  equals SST?  TRUE 
# --- Lagrange identity check ---
x_centered <- time_spent - x_bar
lhs <- sum(x_centered^2) * sum(y_centered^2) - (sum(x_centered * y_centered))^2
# Compute the right-hand side sum over i<j of (x_i*y_j - x_j*y_i)^2
rhs <- 0
for (i in 1:(n-1)) {
  for (j in (i+1):n) {
    rhs <- rhs + (x_centered[i]*y_centered[j] - x_centered[j]*y_centered[i])^2
  }
}
cat("Lagrange identity check:\n")
Lagrange identity check:
cat("  LHS =", lhs, "\n")
  LHS = 94025 
cat("  RHS =", rhs, "\n")
  RHS = 94025 
cat("  Difference =", lhs - rhs, "\n")
  Difference = 0 
cat("  SSE * ||x̃||^2 =", SSE * sum(x_centered^2), "\n")
  SSE * ||x̃||^2 = 94025 

The output confirms that \(SST = SSR + SSE\) and that Lagrange’s identity holds exactly. Moreover, \(SSE \cdot \|\tilde{\mathbf{x}}\|^2\) equals the “area” term, reinforcing the geometric interpretation.

5.2.5 Summary
  • The decomposition \(SST = SSR + SSE\) is a direct consequence of the orthogonality between fitted values and residuals, visualized as a right triangle in \(\mathbb{R}^n\).
  • Lagrange’s identity shows that the unexplained variation \(SSE\) (scaled by \(S_{xx}\)) is exactly the squared area of the parallelogram spanned by the centered predictor and response vectors.
  • This connects the statistical concept of “variance explained” to fundamental geometric and algebraic ideas, illustrating the deep unity between linear algebra, calculus, and data science.

This enriched section should satisfy curious students who want to see the math behind the numbers and appreciate the elegance of the subject.

5.3. Summary: Quantifying Our Confidence (Statistics)

We have our line, \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\). But is it a good line? Statistics provides the tools to answer that.

  • Hypothesis Testing: We can ask, “Is the relationship we found real, or could it have happened by chance?” We can test the null hypothesis that \(\beta_1 = 0\) (no relationship). The p-value tells us the probability of seeing our data if the null hypothesis were true.
  • Confidence Intervals: Instead of just saying the slope is 2.5, we can say, “We are 95% confident that the true slope is between 2.1 and 2.9.” This captures our uncertainty due to having only a sample of data.
  • R-squared: This tells us the proportion of the variance in the exam scores that is explained by our model (hours studied). It’s a measure of model fit.

Part 6: Complete Analysis in R - Revisiting Customer Spending with Full Statistical Analysis

Let’s return to our e‑commerce data and now compute not just the coefficients, but also standard errors, t‑statistics, p‑values, and \(R^2\). We’ll do this both manually (to see the machinery) and using R’s built‑in functions.

Data: time_spent (minutes): [5, 10, 15, 20, 25, 30, 35, 40, 45, 50] spend (dollars): [25, 40, 58, 70, 85, 95, 112, 130, 140, 160]

Step-by-step R Implementation with Full Statistics

# --- Step 1: Data ---
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)
n <- length(time_spent)

# --- Step 2: Means and deviations ---
x_bar <- mean(time_spent)
y_bar <- mean(spend)

x_dev <- time_spent - x_bar
y_dev <- spend - y_bar

# --- Step 3: Compute OLS coefficients ---
Sxx <- sum(x_dev^2)
Sxy <- sum(x_dev * y_dev)

beta_1 <- Sxy / Sxx
beta_0 <- y_bar - beta_1 * x_bar

cat("Coefficients:\n")
Coefficients:
cat("  beta_0 =", beta_0, "\n")
  beta_0 = 11.13333 
cat("  beta_1 =", beta_1, "\n\n")
  beta_1 = 2.922424 
# --- Step 4: Fitted values and residuals ---
y_hat <- beta_0 + beta_1 * time_spent
residuals <- spend - y_hat

# --- Step 5: Estimate error variance (sigma^2) ---
SSE <- sum(residuals^2)
sigma2_hat <- SSE / (n - 2)   # unbiased estimate

# --- Step 6: Standard errors ---
se_beta1 <- sqrt(sigma2_hat / Sxx)
se_beta0 <- sqrt(sigma2_hat * (1/n + x_bar^2 / Sxx))

cat("Standard errors:\n")
Standard errors:
cat("  se(beta_0) =", se_beta0, "\n")
  se(beta_0) = 1.630734 
cat("  se(beta_1) =", se_beta1, "\n\n")
  se(beta_1) = 0.05256331 
# --- Step 7: t-statistics and p-values ---
t_beta1 <- beta_1 / se_beta1
t_beta0 <- beta_0 / se_beta0

# Two-tailed p-value from t-distribution with n-2 df
df <- n - 2
p_beta1 <- 2 * pt(abs(t_beta1), df = df, lower.tail = FALSE)
p_beta0 <- 2 * pt(abs(t_beta0), df = df, lower.tail = FALSE)

cat("t-statistics and p-values:\n")
t-statistics and p-values:
cat("  t for beta_0 =", t_beta0, ", p =", p_beta0, "\n")
  t for beta_0 = 6.827192 , p = 0.000134059 
cat("  t for beta_1 =", t_beta1, ", p =", p_beta1, "\n\n")
  t for beta_1 = 55.59818 , p = 1.215325e-11 
# --- Step 8: R-squared ---
SST <- sum((spend - y_bar)^2)
SSR <- sum((y_hat - y_bar)^2)   # could also compute as SST - SSE
R_sq <- SSR / SST

cat("R-squared =", R_sq, "\n")
R-squared = 0.9974187 
cat("  (SST =", SST, ", SSR =", SSR, ", SSE =", SSE, ")\n")
  (SST = 17660.5 , SSR = 17614.91 , SSE = 45.58788 )
cat("  Check: SSR + SSE =", SSR + SSE, " equals SST? ", 
    all.equal(SSR+SSE, SST), "\n\n")
  Check: SSR + SSE = 17660.5  equals SST?  TRUE 
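
One quantity from the concept list that we haven't computed yet is the confidence interval for the slope. Here is a short sketch (re-declaring the data so the chunk runs on its own) that builds the 95% interval by hand and checks it against R's built-in confint():

```r
# 95% CI for the slope: estimate +/- t_crit * se, with n - 2 degrees of freedom
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)
n <- length(time_spent)

fit <- lm(spend ~ time_spent)
beta_1   <- unname(coef(fit)["time_spent"])
se_beta1 <- coef(summary(fit))["time_spent", "Std. Error"]

t_crit <- qt(0.975, df = n - 2)              # two-tailed critical value, 8 df
ci_manual <- beta_1 + c(-1, 1) * t_crit * se_beta1

ci_manual                                    # hand-built interval
confint(fit, "time_spent", level = 0.95)     # built-in; should match
```

With these data the interval comes out to roughly (2.80, 3.04), comfortably excluding zero. That is the interval-based way of seeing the same evidence the tiny p-value for the slope conveys.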

Compare with lm() Output

Now let’s see how R’s built‑in function summarizes all this for us.

model <- lm(spend ~ time_spent)
summary(model)

Call:
lm(formula = spend ~ time_spent)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8061 -1.2500  0.0303  1.6788  3.0303 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 11.13333    1.63073   6.827 0.000134 ***
time_spent   2.92242    0.05256  55.598 1.22e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.387 on 8 degrees of freedom
Multiple R-squared:  0.9974,    Adjusted R-squared:  0.9971 
F-statistic:  3091 on 1 and 8 DF,  p-value: 1.215e-11

Notice that the coefficients, standard errors, t‑values, p‑values, and \(R^2\) match our manual calculations. The summary() also provides the residual standard error (which is \(\sqrt{\hat{\sigma}^2}\)) and an F‑statistic (which for simple linear regression is just the square of the t‑statistic for the slope).
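
If you need these numbers programmatically rather than reading them off the printout, they can all be extracted from the summary object. A sketch on the same data (re-declared so the chunk stands alone):

```r
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)

s <- summary(lm(spend ~ time_spent))

coef(s)        # matrix of estimates, std. errors, t values, p-values
s$sigma        # residual standard error (sqrt of sigma^2-hat)
s$r.squared    # R-squared
```

This is handy when you want to reuse a standard error or p-value downstream instead of copying it by hand.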

Interpreting the Output

  • Intercept (\(\beta_0 = 11.13\)): When time spent = 0, predicted spend is $11.13. This might not be practically meaningful (maybe baseline purchases), but mathematically it’s the y‑intercept.
  • Slope (\(\beta_1 = 2.92\)): For each additional minute on the site, spend increases by about $2.92 on average.
  • p‑value for slope (\(1.22 \times 10^{-11}\)): Extremely small, so we reject the null hypothesis of no relationship. There is strong evidence that time spent is associated with spend.
  • \(R^2 = 0.997\): About 99.7% of the variation in spend is explained by time spent. This is an exceptionally high \(R^2\), suggesting a very strong linear relationship (in real data it’s rarely this clean).
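
To turn the fitted equation into an actual prediction, predict() does the plug-in arithmetic for us. A sketch for a customer who spends 30 minutes on the site (data re-declared so the chunk is self-contained):

```r
time_spent <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 50)
spend <- c(25, 40, 58, 70, 85, 95, 112, 130, 140, 160)
model <- lm(spend ~ time_spent)

# Point prediction: beta_0 + beta_1 * 30
predict(model, newdata = data.frame(time_spent = 30))

# A 95% prediction interval for a single new customer; wider than the
# confidence interval for the mean because it includes the noise term
predict(model, newdata = data.frame(time_spent = 30), interval = "prediction")
```

The point prediction works out to about $98.81, i.e. the intercept plus 30 times the slope.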

Visualizing the Model with Residuals

We can also check the residuals (the differences between actual and predicted spend) to see if our model assumptions hold.

# Residual plot
plot(time_spent, residuals,
     main = "Residuals vs. Time Spent",
     xlab = "Time Spent (minutes)",
     ylab = "Residuals")
abline(h = 0, col = "red", lty = 2)  # reference line at zero

Ideally, residuals should be randomly scattered around zero with no clear pattern. Here, they appear random, supporting the linear model.


Summary of Key Statistical Concepts

  • Standard error measures the precision of an estimate; smaller se means more precise.
  • t‑statistic = estimate / se; it measures how many standard errors the estimate is from zero (under the null).
  • p‑value is the probability of observing a t‑statistic as extreme as we did, if the null were true. A low p‑value (<0.05) is evidence against the null.
  • \(R^2\) is the proportion of variance explained by the model. It ranges from 0 to 1, with higher values indicating a better fit (but beware of overfitting).

All of these concepts rest on the foundation of calculus (optimization, derivatives) and linear algebra (orthogonality, projections). You now see how the math you’re learning directly powers the tools used by data scientists every day.


Lecture 3: The R Toolkit - From Spreadsheet to Reproducible Workflow

Lecture Goals:

1. To master the core data structures of R for data science.

2. To learn the “split-apply-combine” strategy for data analysis using dplyr.

3. To build a complete, reproducible analysis pipeline from data import to a first model.

The Big Picture: The Tidyverse Philosophy

Last time, we used R as a fancy calculator. Today, we’ll use it as a true data analysis environment. We’ll be using the Tidyverse, a collection of R packages designed for data science. Its core philosophy is:

  1. Tidy Data: Every variable is a column, every observation is a row, every value is a cell. This consistent structure is the foundation.
  2. Piping (%>% or |>): You can chain operations together, reading from left to right. It transforms code from nested function calls (f(g(x))) to a readable pipeline (x %>% g() %>% f()).
  3. Verbs for Data Analysis: Common tasks (like filtering, grouping, summarizing) are intuitive “verbs.”

Part 1: The Checklist - A Data Scientist’s R Toolbox

This is your practical, daily-driver toolkit.

  • Data Import (readr::read_csv(), readxl::read_excel()): gets data from the outside world into R.
  • Data Wrangling (dplyr: filter(), select(), mutate(), group_by(), summarise(), arrange()): cleans, transforms, and prepares data for analysis. This is 80% of the work.
  • Data Visualization (ggplot2: ggplot(), geom_point(), geom_histogram(), etc.): explores data visually and communicates findings.
  • Modeling & Reporting (broom::tidy(), tidyr::pivot_*()): gets model results out of R and into a format you can use in reports or further analysis.
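
As a tiny taste of the import tools: readr can parse literal text wrapped in I(), which is handy for experimenting without a file on disk. The column names below are made up purely for illustration:

```r
library(readr)

# I() marks the string as literal data rather than a file path
df <- read_csv(I("name,score\nAda,95\nGrace,90\n"), show_col_types = FALSE)
df
```

In real work you would pass a file path instead of the I() wrapper; everything else stays the same.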

Part 2: Theoretical Deep Dive - The Grammar of Data Analysis

2.1. The Pipe: %>% (or |>)

The pipe takes the output of the function on its left and passes it as the first argument to the function on its right. This is revolutionary for readability.

Without Pipe:

result <- summarise(group_by(filter(data, condition), group_var), mean_col = mean(col))

With Pipe:

result <- data %>%
  filter(condition) %>%
  group_by(group_var) %>%
  summarise(mean_col = mean(col))

You read it as a story: “Start with the data, then filter it, then group it, then summarize it.”
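
You can verify the equivalence in two lines; the piped form computes exactly the same value as the nested call:

```r
x <- c(4, 9, 16)

sqrt(sum(x))          # nested:  f(g(x))
x |> sum() |> sqrt()  # piped:   x, then g, then f; same result
```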

2.2. Tidy Data and Split-Apply-Combine

The most powerful pattern in data analysis is Split-Apply-Combine.

1. Split: Your data is split into groups based on one or more variables (e.g., by department, by gender, by treatment).

2. Apply: You apply a function to each group independently (e.g., calculate the mean, fit a model).

3. Combine: The results from each group are combined back into a new, tidy data frame.

In dplyr, this is done in one step:

data %>%
  group_by(group_var) %>%   # SPLIT
  summarise(avg = mean(target_var)) # APPLY & COMBINE
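
Here is the same pattern run on a built-in dataset (mtcars, which ships with R) so you can see real output: average fuel economy split by cylinder count.

```r
library(dplyr)

cyl_summary <- mtcars |>
  group_by(cyl) |>                   # SPLIT by number of cylinders
  summarise(avg_mpg = mean(mpg),     # APPLY mean() to each group
            n_cars  = n()) |>        # ...and count the rows per group
  arrange(cyl)                       # COMBINE happens automatically

cyl_summary
```

Three groups (4, 6, and 8 cylinders) come back as three rows of a tidy data frame, ready for the next pipeline step.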

Part 3: Practical Problem - Analyzing the diamonds Dataset

Scenario: You are a junior data analyst at a jewelry retailer. You’ve been asked to investigate the factors that determine the price of diamonds. You have a dataset of nearly 54,000 diamonds with their prices and characteristics (carat, cut, color, clarity, etc.). Your goal is to produce a short report with key insights.

Your Task: Use the Tidyverse to explore the data, answer specific business questions, and build a simple linear model to quantify the relationship between a diamond’s weight (carat) and its price.

Step-by-step R Implementation:

First, we load the core packages.

library(tidyverse)  # This loads dplyr, ggplot2, and others
library(broom)      # For converting model outputs to tidy data frames

# The diamonds dataset comes with ggplot2
data("diamonds")

Phase 1: Exploratory Data Analysis (EDA)

Let’s get to know our data.

# --- 1. Examine the data ---
# Look at the first few rows
head(diamonds)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
# Get a summary of all columns
glimpse(diamonds)  # A tidyverse version of str()
Rows: 53,940
Columns: 10
$ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
$ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
$ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
$ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
$ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
$ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
$ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
$ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
$ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
$ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
# Summary statistics for numerical columns
summary(diamonds)
     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
 Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
                                                                  
       z         
 Min.   : 0.000  
 1st Qu.: 2.910  
 Median : 3.530  
 Mean   : 3.539  
 3rd Qu.: 4.040  
 Max.   :31.800  
                 
# --- 2. Ask and answer questions with the pipe ---

# Q1: What is the average price of diamonds, by cut?
diamonds |>
  group_by(cut) |>
  summarise(avg_price = mean(price), count = n()) |>
  arrange(desc(avg_price))
# A tibble: 5 × 3
  cut       avg_price count
  <ord>         <dbl> <int>
1 Premium       4584. 13791
2 Fair          4359.  1610
3 Very Good     3982. 12082
4 Good          3929.  4906
5 Ideal         3458. 21551
# Insight: "Ideal" cut diamonds have the LOWEST average price, while "Fair" is
# near the top? That seems counterintuitive. We'll need to investigate further
# (confounding with carat: better-cut diamonds tend to be smaller!).
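
The confounding suspicion is easy to check directly: compare average carat weight across cuts. A quick sketch (loading the packages again so it runs standalone):

```r
library(dplyr)
library(ggplot2)   # provides the diamonds dataset
data("diamonds")

carat_by_cut <- diamonds |>
  group_by(cut) |>
  summarise(avg_carat = mean(carat), avg_price = mean(price)) |>
  arrange(desc(avg_carat))

carat_by_cut
```

Fair-cut stones are, on average, the heaviest, which is why their raw average price looks deceptively high: weight, not cut quality, is doing the work.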

# Q2: Create a new column for price per carat, then find the average price per carat by color.
diamonds |>
  mutate(price_per_carat = price / carat) |>
  group_by(color) |>
  summarise(avg_ppc = mean(price_per_carat)) |>
  arrange(desc(avg_ppc))
# A tibble: 7 × 2
  color avg_ppc
  <ord>   <dbl>
1 G       4163.
2 F       4135.
3 H       4008.
4 I       3996.
5 D       3953.
6 J       3826.
7 E       3805.
# Q3: Filter for only "Ideal" cut diamonds with a carat > 1.0, and show the top 5 most expensive.
diamonds |>
  filter(cut == "Ideal", carat > 1.0) |>
  select(carat, color, clarity, price) |>
  arrange(desc(price)) |>
  head(5)
# A tibble: 5 × 4
  carat color clarity price
  <dbl> <ord> <ord>   <int>
1  1.51 G     IF      18806
2  2.07 G     SI2     18804
3  2.15 G     SI2     18791
4  2.05 G     SI1     18787
5  1.6  F     VS1     18780

Phase 2: Visualization with ggplot2

Now, let’s see the data. ggplot2 builds plots layer by layer.

# --- 3. Visualize relationships ---

# Plot 1: The relationship we suspect is strongest: Price vs. Carat
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +  # Alpha makes points transparent to see density
  geom_smooth(method = "lm", se = FALSE, color = "red") + # Add a linear model line
  labs(title = "Diamond Price vs. Carat Weight",
       subtitle = "There is a strong, non-linear relationship.",
       x = "Carat", y = "Price (USD)")
`geom_smooth()` using formula = 'y ~ x'

# Plot 2: Price distribution by Cut (using boxplots)
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 10000)) + # Zoom in to see the boxes clearly
  labs(title = "Price Distribution by Cut Quality",
       x = "Cut", y = "Price (USD)")

# Insight: The boxplots echo the group averages: "Fair" cuts have a surprisingly
# high median price, while "Ideal" cuts have the lowest. Cut alone clearly
# doesn't drive price; carat is the lurking variable.

Phase 3: Modeling and Reporting

Finally, let’s build a simple model and extract the results in a tidy format. We’ll use lm() again, but now we’ll use broom::tidy() to get the coefficients as a nice data frame.

# --- 4. Build a linear model ---
# Model price as a function of carat
model <- lm(price ~ carat, data = diamonds)

# --- 5. Tidy the model output ---
model_coefficients <- tidy(model)
print(model_coefficients)
# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)   -2256.      13.1     -173.       0
2 carat          7756.      14.1      551.       0
# Insight: The model suggests price = -2256 + 7756 * carat.
# For every 1-carat increase, the predicted price increases by about $7756.
# The p-value for carat is essentially zero (reported as 0 above),
# indicating a statistically significant relationship.

# We can also get model-level statistics like R-squared
glance(model)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df   logLik     AIC     BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>    <dbl>   <dbl>   <dbl>
1     0.849         0.849 1549.   304051.       0     1 -472730. 945467. 945493.
# ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# The R-squared is about 0.849, meaning roughly 85% of the variance in price
# can be explained by carat weight alone.
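
broom also offers augment(), which attaches per-observation quantities (fitted values, residuals) to the original data, pairing naturally with the tidy() and glance() calls above. A brief sketch:

```r
library(broom)
library(ggplot2)   # provides the diamonds dataset
data("diamonds")

model <- lm(price ~ carat, data = diamonds)

# One row per diamond, with .fitted and .resid columns added
aug <- augment(model)
head(aug, 3)
```

This is the tidy route to residual diagnostics: the output of augment() can be piped straight into a ggplot of .resid against carat.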

Phase 4: Your Challenge

Problem: The CEO suspects that the relationship between carat and price might be different for different colors of diamonds. They want you to investigate.

Task: Using the diamonds data and the |> pipe, write a code chunk that:

  1. Groups the data by color.
  2. For each color group, fits a linear model (price ~ carat).
  3. Uses broom::tidy() to get the coefficients for each model.
  4. Filters for only the carat coefficient (the slope) and the color.
  5. Visualizes the result with a bar chart showing how the slope (price increase per carat) changes with diamond color.

This is a real-world, “split-apply-combine” modeling task.

# Hint: group_by() + do() is deprecated. Two modern options:
#   (a) the nest-map-unnest pattern: tidyr::nest(), purrr::map(), tidyr::unnest()
#   (b) fit lm() inside summarise(), storing each model in a list-column

# --- Your solution here ---
# diamonds |>
#   ... your code ...
# Solution:
diamonds |>
  group_by(color) |>
  summarise(model = list(lm(price ~ carat, data = pick(everything())))) |>
  mutate(tidied = map(model, tidy)) |>
  unnest(tidied) |>
  filter(term == "carat") |>
  select(color, estimate, std.error, p.value) |>
  ggplot(aes(x = color, y = estimate)) +
  geom_col(fill = "steelblue") +
  labs(title = "Price per Carat (Slope) by Diamond Color",
       x = "Color (D is best, J is worst)", 
       y = "Estimated Price Increase per Carat ($)") +
  theme_minimal()

# Insight: The price premium for an extra carat is much higher for 
# top-color (D, E) diamonds than for lower-color (I, J) diamonds.
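
For reference, here is the nest-map-unnest version mentioned in the hint. It produces the same per-color slopes; which style you prefer is largely a matter of taste (this sketch assumes tidyr and purrr, both loaded by the tidyverse):

```r
library(dplyr); library(tidyr); library(purrr)
library(broom); library(ggplot2)
data("diamonds")

slopes <- diamonds |>
  nest(data = -color) |>                                 # one row per color
  mutate(model  = map(data, \(d) lm(price ~ carat, data = d)),
         tidied = map(model, tidy)) |>
  unnest(tidied) |>
  filter(term == "carat") |>
  select(color, estimate, std.error, p.value)

slopes
```

The nest-map-unnest pattern scales better when you want to keep the fitted model objects around for further work (predictions, diagnostics) rather than only their coefficients.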

Conclusion: Your Journey Forward

Over these two lectures, you’ve built the core of a data scientist’s toolkit. You’ve learned that:

  1. Math is the map. It tells you where you can go and helps you navigate. Now you have seen how vectors, calculus, and statistics are not abstract classroom concepts but the very terrain you’ll navigate as a data scientist.

  2. R is the steering wheel, pedals and dashboard. It’s the hands-on control you have over your analysis. From manually calculating coefficients to piping data through dplyr verbs, from visualizing with ggplot2 to fitting models, you’ve learned to actually drive—not just follow instructions.

The age of AI doesn’t make this knowledge obsolete; it makes it more valuable.

You now have your driver’s license. But unlike a real license, this one doesn’t expire when technology changes. The math is timeless. The R skills are transferable. The mindset—being the driver, not the passenger—will serve you whether you’re building models in 2026 or 2036.

Keep practicing, stay curious, and always be the one behind the wheel.

Enjoy the journey!