This learning module has many interactive demos. It is easier to work with them on a larger screen. Bookmark and revisit if you are currently on a small screen device.

\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} \newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? }}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)

        BFGS method
        Optimization
      

The BFGS method is an approach for unconstrained optimization. For reasons that will be clear shortly, it is a type of quasi-Newton method to optimization.

In this article, we will motivate the formulation of quasi-Newton methods in general. Then, we will focus on the BFGS approach, and provide interactive demos over multiple univariate and multivariate functions to show it in action.

Prerequisites

To understand this article, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem statement

In this article, we will consider an objective function $ f : \real^\ndim \to \real $ that we wish to minimize. We express the unconstrained optimization problem as,

\begin{aligned} \min_{\vx} \text{ }& f(\vx) \\ \end{aligned}

BFGS method is an iterative approach. Starting from an arbitrary point, we will progressively discover points with lower function values till we converge to a point with minimum function value.

The next few sections will outline the strategy used by BFGS method to discover the next point $ \va + \vp $, from the current point $ \va $ on its journey to the optimal solution.

Taylor's theorem and optimization: A recap

Owing to the Taylor's Theorem, for a twice differentiable function $ f: \real^\ndim \to \ndim $, the value of the function at a point $\va+\vp $ can be written in terms of the function value at a point $ \va $ as follows.

\begin{equation} f(\va+\vp) = f(\va) + \nabla_{\va}^T \vp + \frac{1}{2} \vp^T \mH_{\va} \vp \label{eqn:second-order-taylor-approx} \end{equation}

Here, $ \nabla_{\va} $ and $ \mH_{\va} $ denote the gradient and Hessian of the function, evaluated at the point $ \va $.

To choose the next point $ \va + \vp $, we need to discover a suitable value of $ \vp $. Since $ \va $ is already known (the current point), this is a function in $ \vp $. At its local minimizer, $ \star{\vp} $, the gradient with respect to $ \vp $ will be zero.

\begin{aligned} \doh{f(\va + \star{\vp})}{\vp} = 0 \\\\ \label{eqn:first-order-condition-on-p} \end{aligned}

Thus, we now have a system of equations to solve for the value of $ \vp $ to choose our next point $ \va + \vp $.

Quasi-Newton methods

Newton's method works with the second-order Taylor approximation. Gradient descent works with the first order terms and completely ignores the Hessian term. Can there be a middle ground?

What if we could model the second-order terms without explicitly calculating the Hessian? Then the method can pretend to be Newton's method, without actually being one. Such methods are called quasi-Newton methods, which literally means seemingly and apparently but not really, Newton.

In quasi-Newton methods, instead of the true Hessian $ \mH $, we work with an alternative matrix $ \mB \in \real^{\ndim \times \ndim}$ that models the second-order terms in the vicinity of the current iterate $ \va $. So, essentially, we discover $ \vp $ such that

\begin{equation} \frac{\partial}{\partial \vp} \left(f(\va) + \nabla{\va}^T \vp + \frac{1}{2} \vp^T \mB_{\va} \vp \right) = 0 \label{eqn:quasi-newton-taylor} \end{equation}

If we can find a good way to estimate $ \mB $, then a quasi-Newton method is just a re-statement of the Newton's method.

Start from a suitable point $ \vx $
Apply the following update to $ \vx $ till a local minimizer has been reached: $ \vx \leftarrow \vx - \inv{\mB_{\vx}} \nabla_{\vx} $.

How do we find a good $ \mB $?

The answer lies in Calculus.

Suppose the optimization routine is at iteration $ k $. Let us denote the current iterate as $ \vx_k $.

Let's denote our second-order approximation $ f(\vx_k + \vp) $ as $ m_k(\vp) $ — a function of $ \vp $.

\begin{equation} m_k(\vp) = f(\vx_k) + \nabla_{\vx_k}^T \vp + \frac{1}{2} \vp^T B_k \vp \end{equation}

Suppose the iterates were $ \vx_k $ and $ \vx_{k+1}$ across two adjacent iterations $ k $ and $ k + 1 $, respectively. Note that \vx_{k+1} = \vx_k + \alpha_k \vp_k \), where $ \alpha_k $ is the learning rate.

A good second-order approximation should lead to consistent gradients across two consecutive iterations. Why? Because the function we are modeling is not erratic! Hopefully, at least.

This means, the gradient of $ m_k $ should equal the gradient of $ m_{k+1} $ for the same point $ x_k $. Note that $ x_k = \vx_{k+1} - \alpha \vp $. Thus, the gradients for $ m_k(0) $ and $ m_{k+1}(-\alpha \vp) $ should match. Matching the gradients across the two adjacent iterations, we get

\begin{aligned} & \frac{\partial m_k(0)}{\partial \vp} = \frac{\partial m_{k+1}(-\alpha \vp_k)}{\partial \vp} \\ & \implies \nabla_k = \nabla_{k+1} - \alpha_k\mB_{k+1}\vp_k \\ & \implies \mB_{k+1} \alpha_k \vp_k = \nabla_{k+1} - \nabla_k \\ & \implies \mB_{k+1} (\vx_{k+1} - \vx_k) = \nabla_{k+1} - \nabla_k \\ \end{aligned}

(Note: To avoid notation clutter, we have written $ \nabla_{\vx_k} $ as $ \nabla_k $. The meaning should be clear from context).

This beautiful equation is known as the secant equation. It is analogous to calculating the second-order derivative as a rate of change of the first-order derivative; the finite-difference approximation. It was discovered over 3000 years before the Newton's method for the task of finding roots of an equation!

Constraining $\mB $

Note that the equation $ \mB_{k+1} (\vx_{k+1} - \vx_k) = \nabla_{k+1} - \nabla_k $ has $ n^2 $ unknowns and only $ n $ equations. Many matrices will satisfy such relationship. We can afford to put constraints to select a more desirable $ \mB_{k+1} $.

We will impose two desirable conditions on $ \mB_{k+1} $

It should be close to our current approximation $ \mB_k $, so that the function is not changing drastically with each iteration.
It should be symmetric, $ \mB_{k+1} = \mB_{k+1}^T $, just like a true Hessian of a twice differentiable function.

Solving for $ \mB $

Given the desirability constraints in the previous section and the secant equation, we can find a suitable $ \mB_{k+1} $ by solving the following intermediate minimization problem

\begin{aligned} \min_{\mB} & ||\mB - \mB_k|| \\ \text{subject to } & \\ & \mB = \mB^T, \\ & \mB(x_{k+1} - x_k) = \nabla_{k+1} - \nabla_k \\ \end{aligned}

The objective function ensures proximity with current iterate, while the two constraints solve for symmetry and the secant equation.

The various quasi-Newton methods differ in kind of norms they use for $ ||\mB - \mB_k|| $. The original algorithm, proposed by Davidon in 1959 uses the Frobenius norm.

The BFGS method

Here's the update rule for quasi-Newton methods: $ \vx_{k+1} \leftarrow \vx_k - \inv{\mB_k} \nabla_{\vx_k} $. Note that what we really need is the inverse $ \inv{\mB_k} $. We do not have any explicit use for $ \mB $. So why not directly solve for the inverse?

That's the rationale of the popular BFGS algorithm, named after its founders: Broyden, Fletcher, Goldfarb, and Shanno. The BFGS approach also uses the Frobenius norm, but directly works with the inverse of $ \mB $ since that is what we need for the update rule anyways. Re-stating the above equation in terms of the inverse $ \mC_k = \inv{\mB_k} $

\begin{aligned} \min_{\mC} & ||\mC - \mC_k|| \\ \text{subject to } & \\ & \mC = \mC^T, \\ & \mC(\nabla_{k+1} - \nabla_k) = x_{k+1} - x_k \\ \end{aligned}

This simplifies the update rule for BFGS to $ \vx_{k+1} = \vx_k -\mC_k \nabla_k $. No inverse needed, ever!

There are approaches to compute a closed form solution to these intermediate minimization problems, for example, using the Sherman-Morrison-Woodbury formula, but that is out of scope for this article.

Quasi-Newton demo: $ \min x^2 $

Let's see quasi-Newton methods in action with a simple univariate function $ f(x) = x^2 $, where $ x \in \real $.

Note that the function has a global minimum at $ x = 0 $. The goal of the optimization approach method is to discover this point of least function value, starting at any arbitrary point.

In this accompanying demo, change the starting point by dragging the orange circle and see the trajectory of Newton's method towards the global minimum.

quasi-Newton demo analysis: $ \min x^2 $

Although, we can choose the learning rate $ \alpha $ dynamically based on Wolfe conditions, we have purposely set it to 1. This will enable us to see how quickly the quasi-Newton method models the true second-order model. We have set the initial value of $ \mB $ to the identity matrix.

Notice in the demo that no matter where you start, the algorithm converges in 2 steps.

The first step makes a straight jump to the opposite side due to the identity matrix and unit learning rate. However, within that step, the matrix $ \mB $ approximates the second-order model of $ x^2 $ quite well.

Let's do the math, which is significantly simplified due to the univariate problem. If you start from point $ x_0 $, then the next point is $ x_1 = x_0 - \alpha \frac{\nabla_0}{b_0} = \frac{2x_0}{1} $, where $ b $ denotes the univariate version of $ \mB $ which is set to identity (1 for univariate) initially.

In the next step, we will be solving for $ b_1 $ as follows:

\begin{aligned} b_1(x_1 - x_0) & = \nabla_1 - \nabla_0 \\ & \Rightarrow b_1 = \frac{2x_1 - 2x_0}{x_1 - x_0} \\ & \Rightarrow b_1 = 2 \\ \end{aligned}

Thus, with the first step, the BFGS method has figured out the same multiplier as the Newton's method and converged in exactly one step thereafter.

quasi-Newton method demo: $ \min x^4 - 8x^2 $

Let's see quasi-Newton method on a nonconvex function with multiple minima and a single maximum.

$$ f(x) = x^4 - 8x^2 $$

quasi-Newton demo analysis: $ \min x^4 - 8x^2 $

For this nonconvex objective function, again the trajectory of the BFGS algorithm is quite similar to that of the Newton's method. Depending on how close you start to a minimum, you either land up at the minimum or the maximum.

As in the previous example, the first jump with the quasi-Newton method is rather erratic, far from the vicinity of the solution. That is expected, given that we have started from an arbitrary second-order approximation of the function. But as soon as the quasi-Newton approach starts closely approximating the true Hessian, the iterations starts behaving like the Newton's method.

BFGS: Rosenbrock's function

Now let's try out BFGS on a multivariate problem. We introduced Rosenbrock's function in our article on multivariate functions.

Just like our demo of Newton's method on the Rosenbrock function, this function continues to be hard for BFGS. Once in the valley, the convergence towards the minimum is really slow.

BFGS: Himmelblaus function

The Rosenbrock function had a single minimum. Such functions, albeit easy to understand, are rare in machine learning. Let us study a function with multiple minima. We have introduced Himmelblau's function in the module on Calculus.

$$ f(x,y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2 $$

Himmelblau's function is nonconvex with 4 minima (marked in green) and 1 local maximum (marked in blue).

The performance of BFGS on the Himmeblaus function is similar to that we saw in the demo of the Newton's method. As long as the iterations start close enough to an optimum, the convergence is rather quick. Otherwise, the trajectory is rather erratic, as the BFGS method approximates this bumpy function's Hessian.

Where to next?

Expand your knowledge of optimization approaches with our detailed interactive articles on this subject.

To brush up on calculus, explore our articles on topics in calculus and multivariate calculus. Or start with strong mathematical foundations.

Already an optimization expert? Check out comprehensive courses on machine learning or deep learning.

Please support us

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Please donate

Let's connect

Please share your comments, questions, encouragement, and feedback.

BFGS method

Optimization

Prerequisites

Problem statement

Taylor's theorem and optimization: A recap

Quasi-Newton methods

How do we find a good \( \mB \)?

Constraining \(\mB \)

Solving for \( \mB \)

The BFGS method

Quasi-Newton demo: \( \min x^2 \)

quasi-Newton demo analysis: \( \min x^2 \)

quasi-Newton method demo: \( \min x^4 - 8x^2 \)

quasi-Newton demo analysis: \( \min x^4 - 8x^2 \)

BFGS: Rosenbrock's function

BFGS: Himmelblaus function

Where to next?

Please support us

Let's connect