Which value should I pick?
I remember the first time I stumbled upon regression analysis a few (or actually many) years ago. I was sitting in a huge lecture hall with 100+ other students, trying to figure out what all those mathematical notations our teacher had written on the blackboard actually meant. While other students were scratching their heads, I sat there totally amazed at how the lecturer was able to draw that red line running perfectly through all those points, time after time. I mean… there is an almost infinite number of possible slope coefficients he could choose from, so how could he know which one to pick? If only I had known back then that his “magic trick” was simply a matter of mathematical optimization.
Mathematical optimization is, in its essence, the problem of selecting the best element from a set of elements with respect to some criterion (the objective function). In its simplest form, an optimization problem consists of maximizing or minimizing that criterion by systematically choosing elements from an allowed set. In Statistics and Machine Learning, the objective function is often called the loss function. The function itself can be defined in many ways (usually according to the problem at hand), but a common choice, especially in linear regression, is the mean squared error (MSE) loss function, which is written as:
\[ \text{MSE} = \dfrac{1}{n}\sum_{i=1}^{n} (\hat{Y}_i - Y_i)^2 \]
where \(\hat{Y}_i\) is the predicted value and \(Y_i\) is the observed value. A simple way of illustrating the “magic” behind linear regression is to use the optimization function optim() in R. In this example, I’ll use a simple linear regression with only one coefficient and no intercept. Let us start by generating data and implementing the functions we need:
nrObservation <- 50
x <- runif(n = nrObservation, min = 0, max = 10)
Beta <- 2  # the true coefficient we will try to recover
y <- Beta*x + rnorm(n = nrObservation, mean = 0, sd = 1) # adding white noise
plot(x = x, y = y, pch = 16)
prediction <- function(coeff, data){
  # Predicted values for a no-intercept linear model: y_hat = coeff * x
  return(coeff*data)
}
lossMSE <- function(coefficient, data, trueTarget){
  # Mean squared error between predictions and observed targets
  predicted <- prediction(coeff = coefficient, data = data)
  diffSquare <- (predicted - trueTarget)^2
  return(mean(diffSquare))
}
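As a quick sanity check of the implementation (with toy numbers of my own, not the generated data), we can evaluate the loss by hand: with coefficient 1 the predictions for (1, 2, 3) are (1, 2, 3), and against targets (1, 2, 5) the squared errors are (0, 0, 4), giving an MSE of 4/3.

# Toy sanity check (illustrative values): expected MSE = (0 + 0 + 4) / 3
lossMSE(coefficient = 1, data = c(1, 2, 3), trueTarget = c(1, 2, 5))
# [1] 1.333333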
Let us now define a range of candidate values for the (unknown) beta coefficient, for the sake of this simple example, and visualize our loss function over this grid. As we can see from the plot, the loss function seems to reach its minimum somewhere between 0 and 5.
betaGrid <- seq(-5, 10, length.out = 200)
lossValues <- vapply(betaGrid,
                     FUN = function(beta){lossMSE(coefficient = beta,
                                                  data = x,
                                                  trueTarget = y)},
                     FUN.VALUE = numeric(1))
plot(x = betaGrid, y = lossValues,
     type = "l", ylab = "MSE", xlab = expression(beta))
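Before bringing in a real optimizer, we can already read a rough answer straight off the grid by picking the candidate with the smallest loss (your exact value will depend on the random noise in your data):

# Crude grid-search estimate: the beta value with the lowest MSE on the grid
betaGrid[which.min(lossValues)]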
Let us now use the optim() function in R to get a better solution than our rough read from the grid. In this case, we’ll use an unconstrained optimization approach with “BFGS” (a quasi-Newton method) as the method argument. For those of you unfamiliar with optimization in R, optim() is a general-purpose optimizer based on the Nelder–Mead, quasi-Newton and conjugate-gradient algorithms, and it also includes an option for box-constrained optimization and simulated annealing.
The function requires you to specify an initial value (par) and the objective function itself (fn), as well as the method to use. You can also supply the gradient of the objective function (gr), but it isn’t necessary for this particular case (I sketch what that could look like right after the call).
optimRes <- optim(par = -5,          # starting value for beta
                  fn = lossMSE,      # the objective (loss) function
                  gr = NULL,         # no analytic gradient supplied
                  data = x,          # passed on to lossMSE
                  trueTarget = y,    # passed on to lossMSE
                  method = "BFGS")
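# (My own addition, not part of the original example.) optim() returns a list:
# the estimated coefficient sits in $par, the loss at that value in $value,
# and $convergence is 0 when the optimizer reports success.
optimRes$par          # estimated beta, should be close to the true value of 2
optimRes$convergence  # 0 indicates successful convergence

# As mentioned above, we could also pass an analytic gradient via gr.
# For this loss, dMSE/dbeta = (2/n) * sum(x_i * (beta * x_i - y_i)).
# gradMSE and optimResGrad are my own (hypothetical) names.
gradMSE <- function(coefficient, data, trueTarget){
  mean(2 * data * (coefficient*data - trueTarget))
}
optimResGrad <- optim(par = -5,
                      fn = lossMSE,
                      gr = gradMSE,
                      data = x,
                      trueTarget = y,
                      method = "BFGS")
optimResGrad$par      # essentially the same estimate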
lmEstimation <- coefficients(lm(y ~ x - 1, data = data.frame(x, y)))  # no-intercept fit for comparison
plot(x = x, y = y, pch = 16)
lines(x = x, y = prediction(coeff = optimRes$par, data = x),
      lty = 2, col = "red")
lines(x = x, y = prediction(coeff = lmEstimation, data = x),
      lty = 3, col = "blue")
lines(x = x, y = prediction(coeff = Beta, data = x),
      lty = 2, col = "orange")
legend("bottomright", c("True", "Optim()", "lm"),
       col = c("orange", "red", "blue"), lty = c(2, 2, 3),
       bty = "n")
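To read off exact numbers rather than eyeballing the lines, we can also print the three estimates side by side (a small addition of mine; your exact values will differ depending on the random noise):

# Compare the three beta estimates numerically
round(c(True = Beta, `Optim()` = optimRes$par, lm = unname(lmEstimation)), 6)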
By plotting the line based on the beta coefficient from the optimization step, we can observe that our estimate is more or less equal to the true value of 2 (I also added an estimate from R’s own lm() function for comparison). It might be hard to tell the three lines apart in the graph, so I added a table below for a more precise comparison. Recall that I added some white noise when generating the data, in case you are wondering why the lm() and optim() values aren’t exactly the same as the true beta.
| Method  | Beta     |
|---------|----------|
| True    | 2.000000 |
| Optim() | 2.008421 |
| lm      | 2.008421 |
So there you have it, optimization in a nutshell! It’s worth pointing out that I’m not an expert in optimization, but I hope this post results in fewer head scratches for you guys.