## Momentum

To keep making rapid progress towards the optimum, a simple improvement to vanilla gradient descent is this: let the velocity from the previous iteration inform the current one.
A scalar multiplier \( \beta \) controls how much information from the previous iteration carries over into the current one.

Thus, the velocity of the \( k+1 \)-th iteration becomes

$$ \vv_{k+1} = \beta \vv_k + \alpha_{k+1} \nabla f(\vx_k) $$

If we consider \( \beta \) as the *mass* and \( \vv_k \) as the velocity, then the term \( \beta \vv_k \) is \( \text{mass} \times \text{velocity} \), a well-studied quantity in physics known as **momentum**.
And the physical intuition carries over too! If you are already moving fast, momentum keeps you moving fast. If you abruptly try to reverse direction, momentum resists the turn and slows you down.

Owing to momentum, gradient descent is able to rapidly make progress towards the optimum.
If it overshoots and goes past the minimum, the new gradient points against the accumulated velocity, which automatically shrinks the next step. This acts as an informed damping factor and reduces oscillations.
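The update rule above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the position update \( \vx_{k+1} = \vx_k - \vv_{k+1} \) is assumed (the text only gives the velocity update), the step size is held fixed rather than varying with \( k \), and the function names and test objective are made up for the example.

```python
def momentum_descent(grad, x0, alpha=0.1, beta=0.9, steps=500):
    """Gradient descent with momentum on a scalar objective.

    Velocity update (from the text):  v_{k+1} = beta * v_k + alpha * grad(x_k)
    Position update (assumed):        x_{k+1} = x_k - v_{k+1}
    """
    x, v = x0, 0.0
    for _ in range(steps):
        v = beta * v + alpha * grad(x)  # previous velocity carries over, scaled by beta
        x -= v                          # step against the accumulated gradient direction
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_star = momentum_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

With `beta = 0`, this reduces to vanilla gradient descent; larger `beta` retains more of the previous velocity, which is what produces both the acceleration and the overshoot-then-damp behaviour described above.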