## Addressing the challenge of nonlinearity

The XOR problem is a simple example of a nonlinearly separable problem: no linear model, such as a perceptron, can classify its inputs correctly.
Since most problems of practical importance are nonlinearly separable, we need a principled strategy to address this challenge.
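To make the XOR claim concrete, here is a small sketch (the grid resolution and range are illustrative choices, not part of the original text) that brute-forces a grid of candidate weights and biases and confirms that no thresholded linear function classifies all four XOR points:

```python
import numpy as np

# XOR truth table: the two classes cannot be separated by a single line.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Brute-force search for a perceptron sign(w.x + b) that gets all four
# points right. The grid is coarse, but no resolution would succeed.
grid = np.linspace(-2.0, 2.0, 21)
found = False
for w1 in grid:
    for w2 in grid:
        for b in grid:
            pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
            if np.array_equal(pred, y):
                found = True

print(found)  # False: no linear separator exists for XOR
```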

#### Attempt 1: Scalar multiplication

Consider a perceptron model with parameters \( \vw \) and \( b \). The perceptron output, \( \vw^T \vx + b \), is a linear function of the inputs \( \vx \).
Multiplying it by a scalar only rescales the output, which therefore remains a linear function of \( \vx \). We need something more than a multiplicative scalar.

#### Attempt 2: Adding two perceptrons

Instead of one, let's use two perceptrons, say \( \sP = \set{\vw_p, b_p} \) and \( \sQ = \set{\vw_q, b_q} \).
Suppose these outputs are \( h_p = \vw_p^T \vx + b_p \) and \( h_q = \vw_q^T \vx + b_q \).
Again, both \( h_p \) and \( h_q \) are linear functions of the input \( \vx \).

What if we add the outputs of these two perceptrons? Do we get a nonlinear output? Let's find out.

\begin{aligned}
o &= h_p + h_q \\\\
&= (\vw_p^T\vx + b_p) + (\vw_q^T\vx + b_q) \\\\
&= \left[\vw_p + \vw_q \right]^T\vx + (b_p + b_q) \\\\
\end{aligned}

Thus, our final output \( o \) is still a linear function of the input. We need something more than additive perceptrons.
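The collapse of the sum into a single perceptron can be checked directly. This sketch (with arbitrary random parameters) confirms that \( h_p + h_q \) equals a perceptron with weights \( \vw_p + \vw_q \) and bias \( b_p + b_q \):

```python
import numpy as np

rng = np.random.default_rng(0)
w_p, b_p = rng.normal(size=3), rng.normal()  # perceptron P
w_q, b_q = rng.normal(size=3), rng.normal()  # perceptron Q
x = rng.normal(size=3)                       # a sample input

h_p = w_p @ x + b_p
h_q = w_q @ x + b_q

# The sum collapses to one perceptron: w = w_p + w_q, b = b_p + b_q.
summed = h_p + h_q
collapsed = (w_p + w_q) @ x + (b_p + b_q)
print(np.isclose(summed, collapsed))  # True
```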

#### Attempt 3: Stacking perceptrons

Let's treat the outputs of our two perceptrons, \( h_p \) and \( h_q \), as a two-dimensional vector \( \vh = [h_p, h_q] \).
If we apply another perceptron, say \( \sO = \set{\vw_o, b_o} \) to the vector \( \vh \), we get \( o = \vw_o^T \vh + b_o \).

Is \( o \) a linear function of the original inputs \( \vx \)? Let's find out.

\begin{aligned}
o &= \vw_o^T \vh + b_o \\\\
&= \vw_o^T [h_p,h_q]^T + b_o \\\\
&= w_{o1}h_p + w_{o2}h_q + b_o \\\\
&= w_{o1}(\vw_p^T\vx + b_p) + w_{o2}(\vw_q^T\vx + b_q) + b_o \\\\
&= \left[w_{o1}\vw_p + w_{o2}\vw_q \right]^T\vx + w_{o1}b_p + w_{o2}b_q + b_o \\\\
\end{aligned}

where we have explicitly written out the elements of the vector \( \vw_o = [w_{o1}, w_{o2}] \).

Thus, \( o \) is still a linear function of the inputs \( \vx \).
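The derivation above can also be checked numerically. In this sketch (again with arbitrary random parameters), the two-layer stack produces exactly the same output as the single collapsed perceptron with weights \( w_{o1}\vw_p + w_{o2}\vw_q \):

```python
import numpy as np

rng = np.random.default_rng(0)
w_p, b_p = rng.normal(size=3), rng.normal()  # first-layer perceptron P
w_q, b_q = rng.normal(size=3), rng.normal()  # first-layer perceptron Q
w_o, b_o = rng.normal(size=2), rng.normal()  # second-layer perceptron O
x = rng.normal(size=3)                       # a sample input

# Second layer applied to the hidden vector h = [h_p, h_q].
h = np.array([w_p @ x + b_p, w_q @ x + b_q])
o = w_o @ h + b_o

# The stack collapses to one perceptron over x.
w_eff = w_o[0] * w_p + w_o[1] * w_q
b_eff = w_o[0] * b_p + w_o[1] * b_q + b_o
print(np.isclose(o, w_eff @ x + b_eff))  # True
```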

Merely scaling, adding, or stacking perceptrons does not address the challenge of nonlinearity.
What can we do to introduce nonlinearity into this model?