How do we find a good \( \mB \)?
The answer lies in Calculus.
Suppose the optimization routine is at iteration \( k \).
Let us denote the current iterate as \( \vx_k \).
Let's denote our second-order approximation of \( f(\vx_k + \vp) \) as \( m_k(\vp) \), a function of \( \vp \):
\begin{equation}
m_k(\vp) = f(\vx_k) + \nabla_{\vx_k}^T \vp + \frac{1}{2} \vp^T \mB_k \vp
\end{equation}
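To make the model concrete, here is a minimal NumPy sketch. The toy function, its gradient, and the choice of \( \mB_k \) below are illustrative assumptions, not anything prescribed above:

```python
import numpy as np

def quadratic_model(f_xk, grad_k, B_k, p):
    # m_k(p) = f(x_k) + grad_k^T p + (1/2) p^T B_k p
    return f_xk + grad_k @ p + 0.5 * p @ B_k @ p

# Toy function (assumed for illustration): f(x) = x_0^2 + 2 x_1^2
x_k = np.array([1.0, 1.0])
f_xk = x_k[0]**2 + 2 * x_k[1]**2             # f(x_k) = 3
grad_k = np.array([2 * x_k[0], 4 * x_k[1]])  # gradient of f at x_k
B_k = np.diag([2.0, 4.0])                    # here: the exact Hessian of f

p = np.array([0.1, -0.2])
print(quadratic_model(f_xk, grad_k, B_k, p))  # 2.49, equal to f(x_k + p)
# (equal only because f itself is quadratic; in general m_k approximates f)
```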
Suppose the iterates were \( \vx_k \) and \( \vx_{k+1}\) across two adjacent iterations \( k \) and \( k + 1 \), respectively.
Note that \( \vx_{k+1} = \vx_k + \alpha_k \vp_k \), where \( \alpha_k \) is the learning rate and \( \vp_k \) is the step direction.
A good second-order approximation should lead to consistent gradients across two consecutive iterations. Why? Because the function we are modeling is not erratic! Hopefully, at least.
This means the gradient of \( m_k \) should equal the gradient of \( m_{k+1} \) at the same point \( \vx_k \). Note that \( \vx_k = \vx_{k+1} - \alpha_k \vp_k \).
Thus, the gradients of \( m_k(0) \) and \( m_{k+1}(-\alpha_k \vp_k) \) should match.
Matching the gradients across the two adjacent iterations, we get
\begin{equation}
\begin{aligned}
& \frac{\partial m_k(0)}{\partial \vp} = \frac{\partial m_{k+1}(-\alpha_k \vp_k)}{\partial \vp} \\
& \implies \nabla_k = \nabla_{k+1} - \alpha_k \mB_{k+1} \vp_k \\
& \implies \mB_{k+1} \alpha_k \vp_k = \nabla_{k+1} - \nabla_k \\
& \implies \mB_{k+1} (\vx_{k+1} - \vx_k) = \nabla_{k+1} - \nabla_k
\end{aligned}
\end{equation}
(Note: to avoid notational clutter, we write the gradient \( \nabla_{\vx_k} \) simply as \( \nabla_k \). The meaning should be clear from context.)
This beautiful equation is known as the secant equation.
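As a quick sanity check, here is a hedged numerical sketch, reusing the toy quadratic from above with an arbitrarily chosen next iterate, showing that the exact Hessian of a quadratic satisfies the secant equation:

```python
import numpy as np

# Toy quadratic f(x) = x_0^2 + 2 x_1^2 with constant Hessian H (an assumption).
H = np.diag([2.0, 4.0])

def grad(x):
    # gradient of f: for a quadratic with Hessian H, grad(x) = H x
    return H @ x

x_k  = np.array([1.0, 1.0])
x_k1 = np.array([0.7, 0.4])      # an arbitrary next iterate x_{k+1}

s_k = x_k1 - x_k                 # x_{k+1} - x_k
y_k = grad(x_k1) - grad(x_k)     # grad_{k+1} - grad_k

# Secant equation: B_{k+1} (x_{k+1} - x_k) = grad_{k+1} - grad_k
print(np.allclose(H @ s_k, y_k))  # True: H satisfies the secant equation
```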
The secant equation is analogous to computing the second derivative as the rate of change of the first derivative, i.e., as a finite-difference approximation.
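In one dimension this analogy becomes exact: with scalar iterates, the secant equation reduces to
\begin{equation}
B_{k+1} = \frac{f'(x_{k+1}) - f'(x_k)}{x_{k+1} - x_k},
\end{equation}
which is precisely the finite-difference estimate of the second derivative between the two iterates.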
It was discovered over 3000 years before Newton's method, originally for the task of finding roots of an equation!