Binary cross entropy
The negative log-likelihood is differentiable, but it requires probabilistic outputs for estimating \( P(\yhat=1|\mTheta) \) and \( P(\yhat=0|\mTheta) \).
An alternative with a similar formulation, suitable for non-probabilistic models, is the binary cross-entropy (BCE) loss, which works directly with the score of the positive class \( f(\vx,\mTheta) \), where \( \mTheta \) is the set of parameters of the model.
The per-instance BCE Loss is
\begin{aligned}
\loss_{\text{BCE}}(y_\nlabeledsmall, \vx_\nlabeledsmall, \mTheta) = - y_\nlabeledsmall \log f(\vx_\nlabeledsmall,\mTheta) - \left(1 - y_\nlabeledsmall \right) \log \left(1 - f(\vx_\nlabeledsmall,\mTheta)\right)
\end{aligned}
Just like other losses, the per-instance BCE loss is aggregated as a sum over all the examples in the training set (or minibatch, if optimizing using minibatch stochastic gradient descent).
Note that the formulation is similar to the negative log-likelihood loss, with the probability \( P(\yhat|\mTheta) \) now replaced by the score of the positive class \( f(\vx_\nlabeledsmall,\mTheta) \).
For a training instance \( \vx_\nlabeledsmall \), only one of the two terms is active depending on the value of \( y_\nlabeledsmall \) and the other becomes zero.
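To make this concrete, here is a minimal NumPy sketch of the per-instance BCE loss summed over a minibatch; the function name `bce_loss` and the toy scores are illustrative only, and the scores are assumed to lie strictly inside \( (0, 1) \) so that the logarithms are defined (the boundary cases are discussed at the end of this section).

```python
import numpy as np

def bce_loss(y, scores):
    """Sum of the per-instance BCE losses over a batch.

    y      : binary labels in {0, 1}
    scores : positive-class scores f(x, theta), assumed strictly in (0, 1)
    """
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # -y*log f(x, theta) - (1 - y)*log(1 - f(x, theta)) for each instance
    per_instance = -y * np.log(scores) - (1.0 - y) * np.log(1.0 - scores)
    return per_instance.sum()

# Two instances: a positive one scored 0.9 and a negative one scored 0.2.
print(bce_loss([1, 0], [0.9, 0.2]))  # -log(0.9) - log(0.8) ~= 0.3285
```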
Suppose \( f_{y}(\vx,\mTheta) \) denotes the score of the model for the class \( y \), so that \( \sum_{y \in \{0,1\}} f_y(\vx,\mTheta) = 1 \). In this case, we can rewrite the BCE loss as
\begin{aligned}
\loss_{\text{BCE}}(y_\nlabeledsmall, \vx_\nlabeledsmall, \mTheta) = - \log f_{y_\nlabeledsmall}(\vx_\nlabeledsmall,\mTheta)
\label{eqn:bce-alternative}
\end{aligned}
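Implemented this way, the loss is just an indexing operation: pick the score of the true class and take its negative logarithm. A sketch under the same assumptions as above, where `f_scores` is a hypothetical \( n \times 2 \) array whose column \( y \) holds \( f_y(\vx,\mTheta) \):

```python
import numpy as np

def bce_loss_indexed(y, f_scores):
    """BCE via the alternative form: -log f_{y_i}(x_i, theta), summed over the batch.

    y        : labels in {0, 1}
    f_scores : array of shape (n, 2); the two columns sum to 1 per row
    """
    y = np.asarray(y, dtype=int)
    f_scores = np.asarray(f_scores, dtype=float)
    # Select the score of the true class for every instance.
    true_class_scores = f_scores[np.arange(len(y)), y]
    return -np.log(true_class_scores).sum()

# Same two instances as before: positive-class scores 0.9 and 0.2.
print(bce_loss_indexed([1, 0], [[0.1, 0.9], [0.8, 0.2]]))  # ~= 0.3285
```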
It is called binary cross-entropy because its formulation is similar to the cross-entropy between two discrete distributions \( p \) and \( q \), calculated as \( \entropy{p,q} = -\sum_{x \in \mathcal{X}} p(x) \log q(x) \).
In our case, the cross-entropy is between the label distribution implied by \( y_\nlabeledsmall \), which places all its mass on the true class, and the distribution over the two classes given by the model scores \( f_y(\vx_\nlabeledsmall,\mTheta) \).
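The correspondence can be checked with a small, purely illustrative example: put all of \( p \)'s mass on the true class and let \( q \) be the model's scores, and the cross-entropy reduces to the BCE term \( -\log f_{y}(\vx,\mTheta) \).

```python
import numpy as np

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = -sum_x p(x) log q(x) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -(p * np.log(q)).sum()

# p puts all mass on the positive class (y = 1); q holds the model scores.
print(cross_entropy([0.0, 1.0], [0.1, 0.9]))  # -log(0.9) ~= 0.1054
```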
The BCE loss is differentiable, but has a numerical challenge.
If \( f(\vx,\mTheta) = 0 \) or \( f(\vx, \mTheta) = 1 \), then one of the terms becomes \( \log 0 \), which is mathematically undefined.
Some packages, like PyTorch, get around this issue by letting \( \log 0 = -\infty \) and then clamping the log terms to be greater than or equal to \( -100 \), an arbitrary choice that keeps the loss finite and works just fine in practice.
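A sketch of this clamping trick (not PyTorch's actual implementation): each log term is clamped from below at \( -100 \), so scores of exactly 0 or 1 yield a large but finite loss.

```python
import numpy as np

def bce_loss_clamped(y, scores, min_log=-100.0):
    """BCE with each log term clamped from below at min_log, so that
    f(x, theta) = 0 or 1 gives a finite loss instead of an undefined one."""
    y = np.asarray(y, dtype=float)
    scores = np.asarray(scores, dtype=float)
    with np.errstate(divide="ignore"):  # silence the log(0) warning
        log_f = np.maximum(np.log(scores), min_log)
        log_1mf = np.maximum(np.log(1.0 - scores), min_log)
    per_instance = -y * log_f - (1.0 - y) * log_1mf
    return per_instance.sum()

# A positive instance scored exactly 0: the clamp caps its loss at 100.
print(bce_loss_clamped([1.0], [0.0]))  # 100.0
```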