**Introduction**

In this (optional) lab at home, you will learn how to plot a linear regression with confidence and prediction intervals, and various tools to assess model fit: calculating the MSE, making train-test splits, and writing a function for cross validation. Solutions to this lab will be posted on Monday May 17th on the course website.

We will use the `Boston`

dataset, which is in the `MASS`

package that comes with `R`

.

```
library(ISLR)
library(MASS)
library(tidyverse)
```

**Inspect the Boston dataset using the**`View()`

function

`view(Boston)`

The `Boston`

dataset contains the housing values and other information about Boston suburbs. We will use the dataset to predict housing value (the variable `medv`

, here the outcome/dependent variable) by socio-economic status (the variable `lstat`

, here the predictor / independent variable).

Let’s explore socio-economic status and housing value in the dataset using visualization.

**Create a scatter plot from the**`Boston`

dataset with`lstat`

mapped to the x position and`medv`

mapped to the y position. Store the plot in an object called`p_scatter`

.

```
p_scatter <-
Boston %>%
ggplot(aes(x = lstat, y = medv)) +
geom_point() +
theme_minimal()
p_scatter
```

**Plotting linear regression with confidence intervals**

We’ll start with making and visualizing the linear model. As you know, a linear model is fitted in `R`

using the function `lm()`

, which then returns a `lm`

object. We are going to walk through the construction of a plot with a fit line and prediction / confidence intervals from an `lm`

object.

First, we will create the linear model. This model will be used to predict outcomes for the current data set, and - further along in this lab - to create new data.

**Create a linear model object called**`lm_ses`

using the formula`medv ~ lstat`

and the`Boston`

dataset.

`lm_ses <- lm(formula = medv ~ lstat, data = Boston)`

You have now trained a regression model with `medv`

(housing value) as the outcome/dependent variable and `lstat`

(socio-economic status) as the predictor / independent variable.

Remember that a regression estimates β (the intercept) and β_{1} (the slope) in the following equation:

y = β_{0} + β_{1} • x_{1} + ε

**Use the function**`coef()`

to extract the intercept and slope from the`lm_ses`

object. Interpret the slope coefficient.

`coef(lm_ses)`

```
## (Intercept) lstat
## 34.5538409 -0.9500494
```

`# for each point increase in lstat, the median housing value drops by 0.95`

**Use**`summary()`

to get a summary of the`lm_ses`

object. What do you see? You can use the help file`?summary.lm`

.

`summary(lm_ses)`

```
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
```

We now have a model object `lm_ses`

that represents the formula

medv_{i} = 34.55 - 0.95 • lstat_{i} + ε_{i}

With this object, we can predict a new `medv`

value by inputting its `lstat`

value. The `predict()`

method enables us to do this for the `lstat`

values in the original dataset.

**Save the predicted y values to a variable called**`y_pred`

`y_pred <- predict(lm_ses)`

**Create a scatter plot with**`y_pred`

mapped to the x position and the true y value (`Boston$medv`

) mapped to the y value. What do you see? What would this plot look like if the fit were perfect?

```
tibble(pred = y_pred,
obs = Boston$medv) %>%
ggplot(aes(x = pred, y = obs)) +
geom_point() +
theme_minimal() +
geom_abline(slope = 1)
```

```
# I've added an ideal line where all the points would lie on if the
# fit were perfect.
```

We can also generate predictions from new data using the `newdat`

argument in the `predict()`

method. For that, we need to prepare a data frame with new values for the original predictors.

One method of number generation, is through using the function `seq()`

, this function from `base R`

generates a sequence of number using a standardized method. Typically length of the requested sequence divided by the range between `from`

to `to`

. For more information call `?seq`

.

**Use the**`seq()`

function to generate a sequence of 1000 equally spaced values from 0 to 40. Store this vector in a data frame with (`data.frame()`

or`tibble()`

) as its column name`lstat`

. Name the data frame`pred_dat`

.

`pred_dat <- tibble(lstat = seq(0, 40, length.out = 1000))`

**Use the newly created data frame, from Question 8, as the**`newdata`

argument to a`predict()`

call for`lm_ses`

. Store it in a variable named`y_pred_new`

.

`y_pred_new <- predict(lm_ses, newdata = pred_dat)`

Now, we’ll continue with the plotting part by adding a prediction line to the plot we previously constructed.

**Add the vector**`y_pred_new`

to the`pred_dat`

data frame with the name`medv`

.

```
# this can be done in several ways. Here are two possibilities:
# pred_dat$medv <- y_pred_new
pred_dat <- pred_dat %>% mutate(medv = y_pred_new)
```

**Add a geom_line() to**`p_scatter`

from Question 2, with`pred_dat`

as the`data`

argument. What does this line represent?

`p_scatter + geom_line(data = pred_dat)`

`# This line represents predicted values of medv for the values of lstat `

**The**`interval`

argument can be used to generate confidence or prediction intervals. Create a new object called`y_pred_95`

using`predict()`

(again with the`pred_dat`

data) with the`interval`

argument set to “confidence”. What is in this object?

```
y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "confidence")
head(y_pred_95)
```

```
## fit lwr upr
## 1 34.55384 33.44846 35.65922
## 2 34.51580 33.41307 35.61853
## 3 34.47776 33.37768 35.57784
## 4 34.43972 33.34229 35.53715
## 5 34.40168 33.30690 35.49646
## 6 34.36364 33.27150 35.45578
```

`# it's a matrix with an estimate and a lower and an upper confidence interval.`

**Using the data from Question 11, and the sequence created in Question 8; create a data frame with 4 columns:**`medv`

,`lstat`

,`lower`

, and`upper`

.

```
gg_pred <- tibble(
lstat = pred_dat$lstat,
medv = y_pred_95[, 1],
lower = y_pred_95[, 2],
upper = y_pred_95[, 3]
)
gg_pred
```

```
## # A tibble: 1,000 x 4
## lstat medv lower upper
## <dbl> <dbl> <dbl> <dbl>
## 1 0 34.6 33.4 35.7
## 2 0.0400 34.5 33.4 35.6
## 3 0.0801 34.5 33.4 35.6
## 4 0.120 34.4 33.3 35.5
## 5 0.160 34.4 33.3 35.5
## 6 0.200 34.4 33.3 35.5
## 7 0.240 34.3 33.2 35.4
## 8 0.280 34.3 33.2 35.4
## 9 0.320 34.2 33.2 35.3
## 10 0.360 34.2 33.1 35.3
## # … with 990 more rows
```

**Add a**`geom_ribbon()`

to the plot with the data frame you just made. The ribbon geom requires three aesthetics:`x`

(`lstat`

, already mapped),`ymin`

(`lower`

), and`ymax`

(`upper`

). Add the ribbon below the`geom_line()`

and the`geom_points()`

of before to make sure those remain visible. Give it a nice colour and clean up the plot, too!

```
# Create the plot
Boston %>%
ggplot(aes(x = lstat, y = medv)) +
geom_ribbon(aes(ymin = lower, ymax = upper), data = gg_pred, fill = "#00008b44") +
geom_point(colour = "#883321") +
geom_line(data = pred_dat, colour = "#00008b", size = 1) +
theme_minimal() +
labs(x = "Proportion of low SES households",
y = "Median house value",
title = "Boston house prices")
```

**Explain in your own words what the ribbon represents.**

```
# The ribbon represents the 95% confidence interval of the fit line.
# The uncertainty in the estimates of the coefficients are taken into
# account with this ribbon.
# You can think of it as:
# upon repeated sampling of data from the same population, at least 95% of
# the ribbons will contain the true fit line.
```

**Do the same thing, but now with the prediction interval instead of the confidence interval.**

```
# pred with pred interval
y_pred_95 <- predict(lm_ses, newdata = pred_dat, interval = "prediction")
# create the df
gg_pred <- tibble(
lstat = pred_dat$lstat,
medv = y_pred_95[, 1],
l95 = y_pred_95[, 2],
u95 = y_pred_95[, 3]
)
# Create the plot
Boston %>%
ggplot(aes(x = lstat, y = medv)) +
geom_ribbon(aes(ymin = l95, ymax = u95), data = gg_pred, fill = "#00008b44") +
geom_point(colour = "#883321") +
geom_line(data = pred_dat, colour = "#00008b", size = 1) +
theme_minimal() +
labs(x = "Proportion of low SES households",
y = "Median house value",
title = "Boston house prices")
```

`# You can look at ISLR p.81-82 for a discussion of prediction intervals`

**Model fit using the mean square error**

Next, we will write a function to assess the model fit using the mean square error: the square of how much our predictions on average differ from the observed values.

**Write a function called**`mse()`

that takes in two vectors: true y values and predicted y values, and which outputs the mean square error.

Start like so:

```
mse <- function(y_true, y_pred) {
# your function here
}
```

Wikipedia may help for the formula.

```
# there are many ways of doing this.
mse <- function(y_true, y_pred) {
mean((y_true - y_pred)^2)
}
```

**Make sure your**`mse()`

function works correctly by running the following code.

`mse(1:10, 10:1)`

`## [1] 33`

In the code, we state that our observed values correspond to \(1, 2, ..., 9, 10\), while our predicted values correspond to \(10, 9, ..., 2, 1\). This is graphed below, where the blue dots correspond to the observed values, and the yellow dots correspond to the predicted values. Using your function, you have now calculated the mean squared length of the dashed lines depicted in the graph below. If your function works correctly, the value returned should equal 33.

**Calculate the mean square error of the**`lm_ses`

model. Use the`medv`

column as`y_true`

and use the`predict()`

method to generate`y_pred`

.

`mse(Boston$medv, predict(lm_ses))`

`## [1] 38.48297`

You have calculated the mean squared length of the dashed lines in the plot below. As the MSE is computed using the data that was used to fit the model, we actually obtained the training MSE. Below we continue with splitting our data in a training, test and validation set such that we can calculate the out-of sample prediction error during model building using the validation set, and estimate the true out-of-sample MSE using the test set.

Note that you can also easily obtain how much the predictions on average differ from the observed values in the original scale of the outcome variable. To obtain this, you take the root of the mean square error. This is called the Root Mean Square Error, abbreviated as RMSE.

**Obtaining train-validation-test splits**

Now we will use the `sample()`

function to randomly select observations from the `Boston`

dataset to go into a training, test, and validation set. The training set will be used to fit our model, the validation set will be used to calculate the out-of sample prediction error during model building, and the test set will be used to estimate the true out-of-sample MSE.

**The**`Boston`

dataset has 506 observations. Use`c()`

and`rep()`

to create a vector with 253 times the word “train”, 152 times the word “validation”, and 101 times the word “test”. Call this vector`splits`

.

`splits <- c(rep("train", 253), rep("validation", 152), rep("test", 101))`

**Use the function**`sample()`

to randomly order this vector and add it to the`Boston`

dataset using`mutate()`

. Assign the newly created dataset to a variable called`boston_master`

.

`boston_master <- Boston %>% mutate(splits = sample(splits))`

**Now use**`filter()`

to create a training, validation, and test set from the`boston_master`

data. Call these datasets`boston_train`

,`boston_valid`

, and`boston_test`

.

```
boston_train <- boston_master %>% filter(splits == "train")
boston_valid <- boston_master %>% filter(splits == "validation")
boston_test <- boston_master %>% filter(splits == "test")
```

We will set aside the `boston_test`

dataset for now.

**Train a linear regression model called**`model_1`

using the training dataset. Use the formula`medv ~ lstat`

like in the first`lm()`

exercise. Use`summary()`

to check that this object is as you expect.

```
model_1 <- lm(medv ~ lstat, data = boston_train)
summary(model_1)
```

```
##
## Call:
## lm(formula = medv ~ lstat, data = boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.520 -4.210 -1.620 2.465 23.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.85048 0.82439 42.27 <2e-16 ***
## lstat -0.94302 0.05684 -16.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.459 on 251 degrees of freedom
## Multiple R-squared: 0.5231, Adjusted R-squared: 0.5212
## F-statistic: 275.3 on 1 and 251 DF, p-value: < 2.2e-16
```

**Calculate the MSE with this object. Save this value as**`model_1_mse_train`

.

`model_1_mse_train <- mse(y_true = boston_train$medv, y_pred = predict(model_1))`

**Now calculate the MSE on the validation set and assign it to variable**`model_1_mse_valid`

. Hint: use the`newdata`

argument in`predict()`

.

```
model_1_mse_valid <- mse(y_true = boston_valid$medv,
y_pred = predict(model_1, newdata = boston_valid))
```

This is the estimated out-of-sample mean squared error.

**Create a second model**`model_2`

for the train data which includes`age`

and`tax`

as predictors. Calculate the train and validation MSE.

```
model_2 <- lm(medv ~ lstat + age + tax, data = boston_train)
model_2_mse_train <- mse(y_true = boston_train$medv, y_pred = predict(model_2))
model_2_mse_train
```

`## [1] 39.15776`

```
model_2_mse_valid <- mse(y_true = boston_valid$medv,
y_pred = predict(model_2, newdata = boston_valid))
model_2_mse_valid
```

`## [1] 31.58578`

**Compare model 1 and model 2 in terms of their training and validation MSE. Which would you choose and why?**

```
# If you are interested in out-of-sample prediction, the
# answer may depend on the random sampling of the rows in the
# dataset splitting: everyond has a different split. However, it
# is likely that model_2 has both lower training and validation MSE.
```

In choosing the best model, you should base your answer on the validation MSE. Using the out of sample mean square error, we have made a model decision (which parameters to include, only `lstat`

, or using `age`

and `tax`

in addition to `lstat`

to predict housing value). Now we have selected a final model.

**For your final model, retrain the model one more time using both the training***and*the validation set. Then, calculate the test MSE based on the (retrained) final model. What does this number tell you?

```
model_2b <- lm(medv ~ lstat + age + tax, data = boston_master %>% filter(splits != "test"))
model_2_mse_test <- mse(y_true = boston_test$medv,
y_pred = predict(model_2b, newdata = boston_test))
model_2_mse_test
```

`## [1] 40.20577`

```
# The estimate for the expected amount of error when predicting
# the median value of a not previously seen town in Boston when
# using this model is (the root mean square error):
sqrt(model_2_mse_test)
```

`## [1] 6.340802`

As you will see during the remainder of the course, usually we set apart the test set at the beginning and on the remaining data perform the train-validation split multiple times. Performing the train-validation split multiple times is what we for example do in cross validation (see below). The validation sets are used for making model decisions, such as selecting predictors or tuning model parameters, so building the model. As the validation set is used to base model decisions on, we can not use this set to obtain a true out-of-sample MSE. That’s where the test set comes in, it can be used to obtain the MSE of the final model that we choose when all model decisions have been made. As all model decisions have been made, we can use all data except for the test set to retrain our model one last time using as much data as possible to estimate the parameters for the final model.

**Cross-validation (advanced)**

This is an advanced exercise. Some components we have seen before in this lab, but some things will be completely new. Try to complete it by yourself, but don’t worry if you get stuck. If you don’t know about `for loops`

in `R`

, read up on those before you start the exercise, for example by reading the Basics: For Loops tab on the course website.

Use help in this order:

- R help files
- Internet search & stack exchange
- Your peers
- The answer, which shows one solution

You may also just read the answer when they have been made available and try to understand what happens in each step.

**Create a function that performs k-fold cross-validation for linear models.**

Inputs:

`formula`

: a formula just as in the`lm()`

function`dataset`

: a data frame`k`

: the number of folds for cross validation- any other arguments you need necessary

Outputs:

- Mean square error averaged over folds

```
# Just for reference, here is the mse() function once more
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)
cv_lm <- function(formula, dataset, k) {
# We can do some error checking before starting the function
stopifnot(is_formula(formula)) # formula must be a formula
stopifnot(is.data.frame(dataset)) # dataset must be data frame
stopifnot(is.integer(as.integer(k))) # k must be convertible to int
# first, add a selection column to the dataset as before
n_samples <- nrow(dataset)
select_vec <- rep(1:k, length.out = n_samples)
data_split <- dataset %>% mutate(folds = sample(select_vec))
# initialise an output vector of k mse values, which we
# will fill by using a _for loop_ going over each fold
mses <- rep(0, k)
# start the for loop
for (i in 1:k) {
# split the data in train and validation set
data_train <- data_split %>% filter(folds != i)
data_valid <- data_split %>% filter(folds == i)
# calculate the model on this data
model_i <- lm(formula = formula, data = data_train)
# Extract the y column name from the formula
y_column_name <- as.character(formula)[2]
# calculate the mean square error and assign it to mses
mses[i] <- mse(y_true = data_valid[[y_column_name]],
y_pred = predict(model_i, newdata = data_valid))
}
# now we have a vector of k mse values. All we need is to
# return the mean mse!
mean(mses)
}
```

**Use your function to perform 9-fold cross validation with a linear model with as its formula**`medv ~ lstat + age + tax`

. Compare it to a model with as formula`medv ~ lstat + I(lstat^2) + age + tax`

.

`cv_lm(formula = medv ~ lstat + age + tax, dataset = Boston, k = 9)`

`## [1] 37.77446`

`cv_lm(formula = medv ~ lstat + I(lstat^2) + age + tax, dataset = Boston, k = 9)`

`## [1] 28.14269`