Addressing the challenge of nonlinearity
The XOR problem is a simple example of a problem that is not linearly separable. A linear model, such as the perceptron, cannot classify such data correctly.
Most problems of practical importance are also not linearly separable. We need a principled strategy to address this challenge.
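As a quick empirical illustration, here is a minimal sketch (assuming scikit-learn is available; the hyperparameter choices are illustrative) that trains a perceptron on the four XOR points. No linear boundary can classify more than three of the four points, so the accuracy never reaches 1.0:

```python
import numpy as np
from sklearn.linear_model import Perceptron

# The XOR truth table: the label is 1 iff exactly one input is 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# tol=None forces the full max_iter epochs instead of early stopping.
clf = Perceptron(max_iter=1000, tol=None, random_state=0)
clf.fit(X, y)

# At most 0.75: no linear boundary fits all four XOR points.
print(clf.score(X, y))
```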
Attempt 1: Scalar multiplication
Consider a perceptron model with parameters \( \vw \) and \( b \). The perceptron output, \( \vw^T \vx + b \), is a linear function of the inputs \( \vx \).
Multiplying it by a scalar \( c \) only scales the output: \( c\,(\vw^T \vx + b) = (c\,\vw)^T \vx + c\,b \), which is just another perceptron, with parameters \( c\,\vw \) and \( c\,b \). The output retains its linear nature. We need something more than a multiplicative scalar.
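A quick numerical check of this identity, sketched in NumPy (the particular numbers are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative perceptron on a 2-D input.
w, b = rng.standard_normal(2), 0.3
x = rng.standard_normal(2)
c = 2.5

# Scaling the output equals scaling the parameters:
# the model stays a perceptron, i.e. stays linear in x.
assert np.isclose(c * (w @ x + b), (c * w) @ x + c * b)
```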
Attempt 2: Adding two perceptrons
Instead of one, let's use two perceptrons, say \( \sP = \set{\vw_p, b_p} \) and \( \sQ = \set{\vw_q, b_q} \).
Denote their outputs by \( h_p = \vw_p^T \vx + b_p \) and \( h_q = \vw_q^T \vx + b_q \).
Again, both \( h_p \) and \( h_q \) are linear functions of the input \( \vx \).
What if we add the outputs of these two perceptrons? Do we get a nonlinear output? Let's find out.
\begin{aligned}
o &= h_p + h_q \\\\
&= (\vw_p^T\vx + b_p) + (\vw_q^T\vx + b_q) \\\\
&= \left[\vw_p + \vw_q \right]^T\vx + (b_p + b_q) \\\\
\end{aligned}
Thus, our final output \( o \) is still a linear function of the input. We need something more than additive perceptrons.
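Before moving on, we can confirm this numerically. A minimal NumPy sketch (the parameter values are arbitrary) checks that the summed output coincides with a single perceptron with parameters \( \vw_p + \vw_q \) and \( b_p + b_q \):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative perceptrons P and Q on a 3-D input.
w_p, b_p = rng.standard_normal(3), 0.5
w_q, b_q = rng.standard_normal(3), -1.0
x = rng.standard_normal(3)

# Adding the two outputs ...
o = (w_p @ x + b_p) + (w_q @ x + b_q)

# ... yields a single perceptron with parameters w_p + w_q and b_p + b_q.
assert np.isclose(o, (w_p + w_q) @ x + (b_p + b_q))
```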
Attempt 3: Stacking perceptrons
Let's treat the outputs of our two perceptrons, \( h_p \) and \( h_q \), as a two-dimensional vector \( \vh = [h_p, h_q]^T \).
If we apply another perceptron, say \( \sO = \set{\vw_o, b_o} \) to the vector \( \vh \), we get \( o = \vw_o^T \vh + b_o \).
Is \( o \) a linear function of the original inputs \( \vx \)? Let's find out.
\begin{aligned}
o &= \vw_o^T \vh + b_o \\\\
&= \vw_o^T [h_p,h_q]^T + b_o \\\\
&= w_{o1}h_p + w_{o2}h_q + b_o \\\\
&= w_{o1}(\vw_p^T\vx + b_p) + w_{o2}(\vw_q^T\vx + b_q) + b_o \\\\
&= \left[w_{o1}\vw_p + w_{o2}\vw_q \right]^T\vx + \left(w_{o1}b_p + w_{o2}b_q + b_o\right) \\\\
\end{aligned}
where we have explicitly written the elements of the vector \( \vw_o = [w_{o1}, w_{o2}]^T \).
Thus, \( o \) is still a linear function of the inputs \( \vx \).
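Again, a minimal NumPy sketch (names and values are illustrative) confirms that the stacked model collapses to a single perceptron with weights \( w_{o1}\vw_p + w_{o2}\vw_q \) and bias \( w_{o1}b_p + w_{o2}b_q + b_o \):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer 1: perceptrons P and Q on a 3-D input x.
w_p, b_p = rng.standard_normal(3), 0.2
w_q, b_q = rng.standard_normal(3), -0.7

# Layer 2: perceptron O on the stacked vector h = [h_p, h_q].
w_o, b_o = rng.standard_normal(2), 0.1

x = rng.standard_normal(3)
h = np.array([w_p @ x + b_p, w_q @ x + b_q])
o = w_o @ h + b_o

# The stacked model is one perceptron with the combined parameters.
w_eff = w_o[0] * w_p + w_o[1] * w_q
b_eff = w_o[0] * b_p + w_o[1] * b_q + b_o
assert np.isclose(o, w_eff @ x + b_eff)
```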
Merely scaling, adding, or stacking perceptrons does not address the challenge of nonlinearity.
What can we do to introduce nonlinearity into this model?