We have a simple linear model: ŷ = w·x + b. The goal is to find the slope w and intercept b that minimize the Mean Squared Error (MSE) over the training data. Gradient descent starts from an initial (often random) guess and iteratively updates both parameters by moving along the negative gradient, the direction of steepest descent on the loss surface.
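Writing the MSE and its partial derivatives out explicitly makes the update rule concrete (η denotes the learning rate):

```latex
\mathrm{MSE}(w, b) = \frac{1}{N}\sum_{i=1}^{N}\bigl(w x_i + b - y_i\bigr)^2
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial w} = \frac{2}{N}\sum_{i=1}^{N}\bigl(w x_i + b - y_i\bigr)\,x_i
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial b} = \frac{2}{N}\sum_{i=1}^{N}\bigl(w x_i + b - y_i\bigr)
```

Each iteration then applies w ← w − η·∂MSE/∂w and b ← b − η·∂MSE/∂b.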
Each step of this full-batch variant uses the entire dataset to compute the gradient. The path is smooth and direct, but each step costs a full pass over the data, which can be slow for large datasets.
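A minimal NumPy sketch of full-batch gradient descent on this model; the synthetic data (true slope 3, intercept 2), learning rate, and iteration count are assumptions chosen for illustration:

```python
import numpy as np

# Synthetic training data: y = 3x + 2 plus a little noise (assumed for illustration).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

w, b = 0.0, 0.0   # initial guess
lr = 0.1          # learning rate (η)

for _ in range(500):
    err = w * x + b - y               # residuals ŷ − y over the WHOLE dataset
    grad_w = 2.0 * np.mean(err * x)   # ∂MSE/∂w
    grad_b = 2.0 * np.mean(err)       # ∂MSE/∂b
    w -= lr * grad_w                  # step against the gradient
    b -= lr * grad_b

print(w, b)  # should land close to the true slope 3 and intercept 2
```

Because every update averages over all 200 points, consecutive steps point in nearly the same direction and the loss decreases monotonically for a suitably small learning rate.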
Instead of computing the gradient over all N data points each step, stochastic gradient descent (SGD) picks a single random sample (or a small mini-batch) and uses only that to estimate the gradient. This makes each step much cheaper, but the path is noisy and zigzags toward the optimum instead of following a smooth curve.
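The same problem solved with mini-batch SGD, sketched under the same assumptions (synthetic data, and a batch size and learning rate chosen for illustration). Only the inner loop changes: each step samples a random batch rather than touching all N points:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=10_000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=10_000)

w, b = 0.0, 0.0
lr = 0.05
batch_size = 16   # assumed mini-batch size

for _ in range(2_000):
    # Estimate the gradient from a small random mini-batch, not all 10,000 points.
    idx = rng.integers(0, len(x), size=batch_size)
    xb, yb = x[idx], y[idx]
    err = w * xb + b - yb
    w -= lr * 2.0 * np.mean(err * xb)   # noisy estimate of ∂MSE/∂w
    b -= lr * 2.0 * np.mean(err)        # noisy estimate of ∂MSE/∂b

print(w, b)  # hovers near the true values (3, 2), jittering around the optimum
```

Each step here reads 16 points instead of 10,000, so per-step cost is independent of dataset size; the price is that w and b jitter around the minimum rather than settling exactly on it.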
The noise can actually be helpful: it can bounce the parameters out of shallow local minima in complex loss landscapes. For our simple linear MSE the loss surface is convex with a single global minimum, so the noise is just noise, but the speed advantage is dramatic on large datasets.