obtain_cost <- function(y, x, b) {
  # predicted values from the design matrix and candidate coefficients
  y_pred <- x %*% b
  # residuals (errors)
  e <- y - y_pred
  # squared errors
  se <- e^2
  # mean squared error (MSE)
  mse <- mean(se)
  return(mse)
}
June 10, 2019
June 17, 2019
Already familiar with OLS regression and want to learn about gradient descent? This blog post provides a brief tutorial on solving OLS with gradient descent in R.
Remember that the equation for OLS is:
\[ \tag{1} Y = X\beta + \epsilon \]
where \(Y\) is a column vector of DVs, \(X\) is a matrix of IVs, \(\beta\) is a column vector of regression coefficients, and \(\epsilon\) is a column vector of errors.
\[ \tag{2} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{1, 1} & x_{1, 2} & \cdots & x_{1, k} \\ 1 & x_{2, 1} & x_{2, 2} & \cdots & x_{2, k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n, 1} & x_{n, 2} & \cdots & x_{n, k} \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_k \end{bmatrix} + \begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{bmatrix} \]
For a review of matrix multiplication, check out my previous blog post.
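As a quick hypothetical illustration of equation 2 in R (toy numbers, not the salary data used below):

# toy example of equation 2: Y = Xb + e
X <- matrix(c(1, 1, 1,   # column of 1s for the intercept
              2, 5, 9),  # one predictor
            nrow = 3)
b <- c(10, 3)            # b_0 (intercept) and b_1 (slope)
e <- c(0.5, -1, 0.2)     # errors
Y <- X %*% b + e         # 16.5, 24.0, 37.2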
OLS tries to minimize error using the mean squared error (MSE) formula:
\[ \tag{3} MSE = \frac{\Sigma (Y-\hat{Y})^2}{N} = \frac{\Sigma e_i^2}{N} = mean(e_i^2) \]
Note: MSE is just one of many cost functions (or formulas) we could choose from. For example, another cost function is mean absolute error (MAE):
\[ \tag{4} MAE = \frac{\Sigma |Y-\hat{Y}|}{N} = \frac{\Sigma|e_i|}{N} = mean(abs(e_i)) \]
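As a quick illustration with toy residuals (hypothetical numbers, not from the data used below), both cost functions are one-liners in R:

e <- c(2, -1, 3)     # hypothetical residuals (y - y_hat)
mse <- mean(e^2)     # mean squared error, equation 3 (= 4.67)
mae <- mean(abs(e))  # mean absolute error, equation 4 (= 2)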
Let’s say that we are interested in understanding whether the number of years a professor has had their Ph.D. is associated with a higher 9-month academic salary.
First, let’s load some packages.
packages <- c("tidyverse", # data manipulation and visualization
"carData", # data set to obtain professor's 9-month salary
"broom", # nice table format of cofficients from lm()
"ggrepel", # ggplot extension to repel text and extra features
"glue", # concatenate strings and alias
"RColorBrewer", # load color palettes
"plotly" # plot 3D figures
)
xfun::pkg_attach2(packages, message = F)
n_round <- 3 # number of digits to round to
Then, let’s define our independent variable (IV: yrs.since.phd) as x and our dependent variable (DV: salary) as y, using the Salaries dataset in the carData package.
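One way to build x and y (a hypothetical sketch; the mean-centering of yrs.since.phd is inferred from the output below, which appears rounded for display) is:

# hypothetical reconstruction of x and y (mean-centering inferred from the output below)
data("Salaries", package = "carData")

x <- Salaries %>%
  transmute(
    `(Intercept)` = 1,
    yrs.since.phd = yrs.since.phd - mean(yrs.since.phd)
  ) %>%
  as.matrix()

y <- Salaries %>%
  select(salary) %>%
  as.matrix()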
[1] 397   2
  (Intercept) yrs.since.phd
1           1         -3.31
2           1         -2.31
3           1        -18.31
4           1         22.69
5           1         17.69
6           1        -16.31
[1] 397   1
     salary
[1,] 139750
[2,] 173200
[3,]  79750
[4,] 115000
[5,] 141500
[6,]  97000
What is gradient descent?
Starting at any position (e.g., an intercept and slope combination), gradient descent takes the partial derivative of the cost function (MSE) with respect to each coefficient (\(\beta\)) and moves (or descends) in the direction that continues to decrease the cost.
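To make the descent step concrete (this derivation is not spelled out here, but it follows from equations 1 and 3), the gradient of the MSE with respect to \(\beta\) and the update rule with learning rate \(\alpha\) are:

\[ \frac{\partial \, MSE}{\partial \beta} = -\frac{2}{N} X^\top (Y - X\beta), \qquad \beta_{new} = \beta_{old} - \alpha \, \frac{\partial \, MSE}{\partial \beta} \]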
# random sample of possible intercepts and slopes
# then calculate cost function (mse) for each intercept and slope combination
n_sample <- 200000

df_gd <- tibble(
  int   = sample(seq(0, 200000), n_sample, replace = TRUE),
  slope = sample(seq(-10000, 10000), n_sample, replace = TRUE)
) %>%
  rowwise() %>%
  mutate(cost = obtain_cost(y, x, b = c(int, slope)))
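With df_gd in hand, a natural follow-up (a hypothetical check, not part of the code above) is to pull out the sampled intercept and slope combination with the lowest cost:

# lowest-cost intercept/slope combination among the random samples
df_gd %>%
  ungroup() %>%
  arrange(cost) %>%
  slice(1)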