Gradient Descent & SGD

Module 5 — Machine Learning for Time Series Forecasting
Created by Dr. Pedram Jahangiry | Enhanced with Claude

Batch Gradient Descent

We have a simple linear model: ŷ = w·x + b. The goal is to find the slope w and intercept b that minimize the Mean Squared Error (MSE) over all training data. Gradient descent starts at a random point and iteratively updates both parameters by moving in the direction of the negative gradient — the steepest descent on the loss surface.

Each step uses the entire dataset to compute the gradient. The path is smooth and direct, but can be slow for large datasets.

θj := θj − α · ∂J(θ) / ∂θj
θj = parameter (w or b)  |  α = learning rate  |  gradient = direction of steepest ascent
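The update rule above can be sketched in a few lines of NumPy. The synthetic data, true parameters, and iteration count below are illustrative assumptions, not the module's actual demo; the learning rate matches the α = 0.05 used here.

```python
import numpy as np

# Batch gradient descent for y_hat = w*x + b under MSE loss.
# Synthetic data (illustrative): true w = 3.0, true b = 2.0, Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)

w, b = 0.0, 0.0       # starting point
alpha = 0.05          # learning rate

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of MSE = mean((y_hat - y)^2), computed over the ENTIRE dataset
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the direction of the NEGATIVE gradient (steepest descent)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")   # should land near the true values
```

Each iteration touches all N points, which is what makes the trajectory smooth and deterministic.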
[Interactive demo: learning rate slider (α = 0.05), iteration counter, and live readouts for Slope (w), Intercept (b), and MSE Loss. Panels: Loss Surface — MSE(w, b); Data & Current Fit; Loss Over Iterations; Parameter Trajectories.]

Stochastic Gradient Descent (SGD)

Instead of computing the gradient over all N data points each step, SGD picks a single random sample (or a small mini-batch) and uses only that to estimate the gradient. This makes each step much faster, but the path is noisy and zigzags toward the optimum instead of following a smooth curve.

The noise can actually be helpful — in complex loss landscapes it can knock the iterate out of shallow local minima. For our simple linear model, the MSE surface is convex with a single minimum, so the noise is just noise, but the speed advantage is dramatic on large datasets.

θj := θj − α · ∂Ji(θ) / ∂θj
Same update rule, but Ji = loss on a single random sample instead of the full dataset
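The single-sample update can be sketched the same way. As above, the data and step count are illustrative assumptions; only the learning rate α = 0.05 is taken from the demo.

```python
import numpy as np

# Stochastic gradient descent for y_hat = w*x + b under MSE loss.
# Each update uses ONE randomly drawn sample, so a step costs O(1), not O(N).
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 2, n)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, n)   # illustrative: true w = 3, b = 2

w, b = 0.0, 0.0
alpha = 0.05

for _ in range(20000):
    i = rng.integers(n)                  # pick one random sample
    error = (w * x[i] + b) - y[i]
    # Gradient of the single-sample loss J_i = (y_hat_i - y_i)^2
    w -= alpha * 2 * error * x[i]
    b -= alpha * 2 * error

mse = np.mean((w * x + b - y) ** 2)
print(f"w ≈ {w:.2f}, b ≈ {b:.2f}, MSE ≈ {mse:.3f}")
```

With a constant learning rate the iterate never settles exactly on the optimum — it keeps jittering around it, which is the zigzag path described above. Decaying α over time is the usual fix when exact convergence matters.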
[Interactive demo: same controls and readouts as above. Panels: Loss Surface — SGD Path (noisy!); Data & Current Fit; Loss Over Iterations; Parameter Trajectories.]