# Choosing a good learning rate

##### Optimization

Gradient descent, stochastic gradient descent, and quasi-Newton methods such as BFGS all rely on a learning rate, or step length, to choose the next point in their iterations towards the optimal solution. We observed in the interactive demos for each of these methods that a good learning rate is crucial to their success.

In this article, we will study some strategies for choosing a good learning rate.

## Prerequisites

Follow the above links to first get acquainted with the corresponding concepts.

A simple strategy that is quite effective in practice is to gradually decrease the learning rate.

Such gradual decrease in the learning rate can be achieved with a multiplicative dampening factor. Suppose $\gamma \in (0,1)$ is the dampening factor. Then, the learning rate $\alpha$ at iteration $k + 1$ is

$$\alpha_{k+1} = \gamma \alpha_k = \gamma^{k+1} \alpha_0$$

where $\alpha_0$ is the initial learning rate.

Since $0 < \gamma < 1$, the learning rate decreases with every iteration.
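The schedule above can be sketched in a few lines of code. The specific values of $\alpha_0$ and $\gamma$ here are illustrative choices, not prescriptions.

```python
# Exponential decay of the learning rate with a dampening factor gamma.
# alpha_0 and gamma below are illustrative values.
alpha_0 = 0.1   # initial learning rate
gamma = 0.95    # dampening factor, 0 < gamma < 1

def learning_rate(k, alpha_0=alpha_0, gamma=gamma):
    """Learning rate at iteration k: alpha_k = gamma**k * alpha_0."""
    return gamma**k * alpha_0

print(learning_rate(0))   # 0.1
print(learning_rate(10))  # ~0.0599, shrinks geometrically with k
```

Because the decay is geometric, the learning rate never reaches zero, but it becomes negligible after enough iterations; this is why too small a $\gamma$ can stall progress before the minimum is reached.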

Check out the effect of such a dampening factor in the next few demos.

In the next interactive demo, try to notice the effect of a gradually decreasing learning rate. With no dampening ($\gamma = 1$), there can be significant oscillations. A low $\gamma < 0.5$ leads to very slow progress, which may not even reach the minimum. We need some dampening, but not too much; usually $\gamma > 0.9$ works in most cases.

## Dampening demo: $\min x^4 - 8x^2$

Let's see gradient descent in action on a nonconvex function with multiple minima and a single maximum.

$$f(x) = x^4 - 8x^2$$

We are particularly interested in observing the effect of the dampening factor $\gamma$ and the initial learning rate $\alpha_0$. A good combination of $\gamma$ and $\alpha_0$ ensures that gradient descent is able to find the minima.
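A minimal sketch of this setup follows; the starting point, $\alpha_0$, and $\gamma$ are illustrative choices you can vary just as in the demo.

```python
# Gradient descent with a dampened learning rate on f(x) = x**4 - 8*x**2,
# which has minima at x = -2 and x = 2 and a maximum at x = 0.
# The starting point, alpha_0, and gamma are illustrative choices.
def f(x):
    return x**4 - 8 * x**2

def df(x):
    return 4 * x**3 - 16 * x   # derivative of f

x = 0.5          # starting point, to the right of the maximum
alpha = 0.05     # initial learning rate alpha_0
gamma = 0.95     # dampening factor

for k in range(100):
    x = x - alpha * df(x)   # gradient step
    alpha *= gamma          # dampen the learning rate

print(x)  # settles near the minimum at x = 2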

## Dampening demo: Rosenbrock's function

We introduced Rosenbrock's function in our article on multivariate functions in calculus. Rosenbrock is a very hard problem for many optimization algorithms.

Notice in this next demo that gradient descent can quickly find the valley in the center, but the convergence towards the optimal (green point) is extremely slow, as the iterates oscillate on the two sides of the valley. This can be guided better by choosing appropriate learning rate and dampening factor.
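A rough sketch of the same experiment in code is below. The starting point, $\alpha_0$, $\gamma$, and the iteration budget are illustrative choices; the slow crawl along the valley is the point to observe.

```python
import numpy as np

# Gradient descent with dampening on Rosenbrock's function
# f(x, y) = (1 - x)**2 + 100 * (y - x**2)**2, with minimum at (1, 1).
# The starting point, alpha, and gamma below are illustrative choices.
def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def grad(p):
    x, y = p
    return np.array([
        -2 * (1 - x) - 400 * x * (y - x**2),  # df/dx
        200 * (y - x**2),                      # df/dy
    ])

p = np.array([-1.0, 1.0])   # start on the far side of the valley
alpha, gamma = 1e-3, 0.999

for k in range(5000):
    p = p - alpha * grad(p)
    alpha *= gamma

print(p)  # the function value drops quickly, but progress toward (1, 1) is slow
```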

## Intelligent learning rate

Given the importance of the learning rate, is there a good learning rate? Can we find it analytically? Let's try to design some conditions for describing a good learning rate.

First and foremost, given that we are performing minimization, we definitely want the function to decrease sufficiently at each iteration. Sufficient here means that the decrease should be at least proportional to the step taken: it should not be the case that we take a big step and the decrease in function value is minuscule.

So, for some constant $c_1$, such that $0 < c_1 < 1$, we wish that

\begin{equation} f(\vx + \alpha \vp) \le f(\vx) + c_1 \alpha \nabla f(\vx)^T \vp \label{eqn:armijo-condition} \end{equation}

This first condition, which ensures sufficient decrease in the function value at each step, is known as the Armijo condition.
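The Armijo condition is easy to check numerically. The function, direction, and constants below are illustrative: $f(\vx) = \|\vx\|^2$ with the steepest-descent direction, and a small $c_1$ as is common in practice.

```python
import numpy as np

# Checking the Armijo (sufficient-decrease) condition for a candidate step.
# f, grad_f, x, p, and c1 below are illustrative; p is a descent direction.
def armijo_holds(f, grad_f, x, p, alpha, c1=1e-4):
    """True if f(x + alpha*p) <= f(x) + c1 * alpha * grad_f(x).T @ p."""
    return f(x + alpha * p) <= f(x) + c1 * alpha * np.dot(grad_f(x), p)

# Example on f(x) = ||x||^2 with the steepest-descent direction.
f = lambda x: np.dot(x, x)
grad_f = lambda x: 2 * x
x = np.array([1.0, 1.0])
p = -grad_f(x)

print(armijo_holds(f, grad_f, x, p, alpha=0.1))  # True: modest step, real decrease
print(armijo_holds(f, grad_f, x, p, alpha=2.0))  # False: overshoots, f increases
```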

Now, let's consider another criterion. The slope of the function along the search direction at the new point should be larger (less negative) than at the current point, so that we do not stop where the function is still decreasing steeply. This means, for some constant $c_2$ such that $0 < c_2 < 1$, we wish that

\begin{equation} \nabla f(\vx + \alpha \vp)^T \vp \ge c_2 \nabla f(\vx)^T \vp \label{eqn:curvature-condition} \end{equation}

This second condition, which ensures that the slope has flattened out sufficiently at the new point, is known as the curvature condition.

These two conditions on a desirable step length are together known as the Wolfe conditions. It is usually the case that $c_1 < c_2$.

Wolfe conditions can be a good strategy to identify a suitable learning rate at any point. At each iteration, we select a value of the learning rate that satisfies both conditions. If no value satisfies both, we fall back to a default learning rate, move to the next point, and try to find a satisfactory learning rate at the next iteration.
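This selection strategy can be sketched as follows. Scanning a geometric grid of candidates and the particular values of $c_1$, $c_2$, and the fallback rate are illustrative choices, not a standard recipe (practical line searches use more refined bracketing schemes).

```python
import numpy as np

# Sketch: pick a learning rate satisfying both Wolfe conditions by scanning
# a grid of candidates from large to small. c1, c2, the fallback rate, and
# the candidate grid are illustrative choices.
def wolfe_step(f, grad_f, x, p, c1=1e-4, c2=0.9, fallback=1e-3):
    g0 = np.dot(grad_f(x), p)                  # directional derivative at x
    for alpha in np.geomspace(1.0, 1e-4, 40):  # candidate learning rates
        armijo = f(x + alpha * p) <= f(x) + c1 * alpha * g0
        curvature = np.dot(grad_f(x + alpha * p), p) >= c2 * g0
        if armijo and curvature:
            return alpha
    return fallback   # no candidate satisfied both: use a default rate

# One gradient-descent step on f(x) = ||x||^2 with a Wolfe-satisfying rate.
f = lambda x: np.dot(x, x)
grad_f = lambda x: 2 * x
x = np.array([1.0, 1.0])
p = -grad_f(x)
alpha = wolfe_step(f, grad_f, x, p)
x_new = x + alpha * p   # the step decreases f
```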

We use this strategy in the next two demos applying gradient descent to multivariate problems.

## Wolfe conditions demo: Rosenbrock's function

Let's try gradient descent on Rosenbrock's function again, this time with dynamically chosen learning rate that satisfies Wolfe conditions.

## Wolfe conditions demo: Himmelblau's function

We introduced Himmelblau's function in our article on multivariate functions in calculus. It has 4 local minima (highlighted in green) and 1 maximum (highlighted in blue).

It turns out that Himmelblau's function, with an intelligently chosen learning rate at each iteration, is actually quite easy for gradient descent, in spite of the maximum in the middle. Unlike Newton's method, the trajectory does not go over the bump; it quickly seeks out the nearest minimum.

## Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.