Gradient Descent & SGD

Module 5 — Machine Learning for Time Series Forecasting
Created by Dr. Pedram Jahangiry | Enhanced with Claude

Batch Gradient Descent

We have a simple linear model: ŷ = w·x + b. The goal is to find the slope w and intercept b that minimize the Mean Squared Error (MSE) over all training data. Gradient descent starts at a random point and iteratively updates both parameters by moving in the direction of the negative gradient — the steepest descent on the loss surface.

Each step uses the entire dataset to compute the gradient. The path is smooth and direct, but can be slow for large datasets.

θj := θj − α · ∂J(θ) / ∂θj
θj = parameter (w or b)  |  α = learning rate  |  gradient = direction of steepest ascent
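The update rule above can be sketched in a few lines of NumPy. The synthetic data, true parameters, and iteration count below are illustrative assumptions, not the module's actual demo; the learning rate matches the α = 0.05 used here.

```python
import numpy as np

# Batch gradient descent for y_hat = w*x + b under MSE loss.
# Synthetic data (illustrative): true w = 3.0, true b = 2.0, Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)

w, b = 0.0, 0.0       # starting point
alpha = 0.05          # learning rate

for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of MSE = mean((y_hat - y)^2), computed over the ENTIRE dataset
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Step in the direction of the NEGATIVE gradient (steepest descent)
    w -= alpha * grad_w
    b -= alpha * grad_b

print(f"w ≈ {w:.2f}, b ≈ {b:.2f}")   # should land near the true values
```

Each iteration touches all N points, which is what makes the trajectory smooth and deterministic.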
[Interactive demo: learning rate slider (α = 0.05), iteration counter, and live readouts for Slope (w), Intercept (b), and MSE Loss. Panels: Loss Surface — MSE(w, b); Data & Current Fit; Loss Over Iterations; Parameter Trajectories.]

Stochastic Gradient Descent (SGD)

Instead of computing the gradient over all N data points each step, SGD picks a single random sample (or a small mini-batch) and uses only that to estimate the gradient. This makes each step much faster, but the path is noisy and zigzags toward the optimum instead of following a smooth curve.

The noise can actually be helpful — in complex loss landscapes it can knock the iterate out of shallow local minima. For our simple linear model, the MSE surface is convex with a single minimum, so the noise is just noise, but the speed advantage is dramatic on large datasets.

θj := θj − α · ∂Ji(θ) / ∂θj
Same update rule, but Ji = loss on a single random sample instead of the full dataset
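The single-sample update can be sketched the same way. As above, the data and step count are illustrative assumptions; only the learning rate α = 0.05 is taken from the demo.

```python
import numpy as np

# Stochastic gradient descent for y_hat = w*x + b under MSE loss.
# Each update uses ONE randomly drawn sample, so a step costs O(1), not O(N).
rng = np.random.default_rng(1)
n = 500
x = rng.uniform(0, 2, n)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, n)   # illustrative: true w = 3, b = 2

w, b = 0.0, 0.0
alpha = 0.05

for _ in range(20000):
    i = rng.integers(n)                  # pick one random sample
    error = (w * x[i] + b) - y[i]
    # Gradient of the single-sample loss J_i = (y_hat_i - y_i)^2
    w -= alpha * 2 * error * x[i]
    b -= alpha * 2 * error

mse = np.mean((w * x + b - y) ** 2)
print(f"w ≈ {w:.2f}, b ≈ {b:.2f}, MSE ≈ {mse:.3f}")
```

With a constant learning rate the iterate never settles exactly on the optimum — it keeps jittering around it, which is the zigzag path described above. Decaying α over time is the usual fix when exact convergence matters.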
[Interactive demo: same controls and readouts as above. Panels: Loss Surface — SGD Path (noisy!); Data & Current Fit; Loss Over Iterations; Parameter Trajectories.]