4.2. Softmax Activation Function
Previously, we approached the XOR-gate learning task as a classification problem with a single output variable. Now we will tackle it as a binary (two-category) classification problem, using the softmax function as the output activation function.
The softmax function is defined as:
$$ y_{i} = \frac{ \exp(h_{i}) }{ \sum^{n-1}_{k=0} \exp(h_{k}) } \tag{4.1} $$where:
- $n \in \mathbb{N}$ is the number of categories.
- $y_{i}$ is the $i$-th element of the output vector $y \in \mathbb{R}^{n}$. It satisfies $\sum_{i=0}^{n-1} y_{i} = 1$ and $ y_{i} \in [0, 1]$.
- $h_{i}$ is the $i$-th element of the input vector $h \in \mathbb{R}^{n}$.
The details are explained in Section 3.2.2.
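As a quick illustration of expression $(4.1)$, the short NumPy sketch below (standalone, independent of the accompanying code) evaluates the softmax of a small example vector and confirms that the outputs sum to 1:
import numpy as np
h = np.array([1.0, 2.0, 0.5])          # example input vector
y = np.exp(h) / np.sum(np.exp(h))      # softmax, eq. (4.1)
print(y)                               # approximately [0.2312 0.6285 0.1402]
print(np.sum(y))                       # 1.0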
4.2.1. Computing the Gradients for Backpropagation
To derive the gradients propagated through the softmax function, we will use the cross-entropy (CE) as the loss function. The CE is defined as follows:
$$ L = - \sum^{n-1}_{i=0} Y_{i} \log(y_{i}) \tag{4.2} $$where:
- $n$ is the number of categories.
- $Y_{i} \in \{0, 1\}$ is the ground-truth label for the $i$-th category. If $Y_{i} = 1$ (the true category), then $Y_{j} = 0$ for all $j \neq i$.
- $y_{i} \in [0,1]$ is the predicted probability for the $i$-th category.
By definition, $\sum_{i=0}^{n-1} Y_{i} = 1$, because exactly one of the ground-truth labels is set to $1$ (one-hot encoding).
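Because the label vector is one-hot, the sum in expression $(4.2)$ reduces to $L = -\log(y_{t})$, where $t$ is the index of the true category. The following standalone NumPy sketch (not part of the accompanying code) illustrates this:
import numpy as np
Y = np.array([0.0, 1.0])        # one-hot ground truth: category 1 is correct
y = np.array([0.2, 0.8])        # predicted probabilities from a softmax
L = -np.sum(Y * np.log(y))      # cross-entropy, eq. (4.2)
print(L)                        # 0.2231..., i.e. -log(0.8)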
First, we calculate $dL_{i} \stackrel{\mathrm{def}}{=} \dfrac{\partial L}{\partial y_i}$, which is the gradient propagated to the softmax function.
$$ \begin{align} dL_{i} = \frac{\partial L}{\partial y_{i}} = \dfrac{\partial}{\partial y_i} \left[ -\sum_{j=0}^{n-1} Y_j \log(y_j) \right] = -\sum_{j=0}^{n-1} Y_j \dfrac{\partial \log(y_j)}{\partial y_i} = -\dfrac{Y_i}{y_i} \end{align} \tag{4.3} $$Next, we calculate $ \partial L / \partial h_{k} $, which is the gradient propagated from the softmax function to the dense layer.
$$ \begin{align} \frac{\partial L}{\partial h_{k}} &= - \sum^{n-1}_{j=0} Y_{j} \frac {\partial \log(y_{j}) }{\partial h_{k}} = - \sum^{n-1}_{j=0} Y_{j} \frac {\partial \log(y_{j}) }{\partial y_{j}} \frac {\partial y_{j} }{\partial h_{k}} = - \sum^{n-1}_{j=0} \frac{Y_{j}} {y_{j}} \frac {\partial y_{j} }{\partial h_{k}} \end{align} \tag{4.4} $$Here we can calculate $\partial y_{j} / \partial h_{k}$ using the quotient rule: $\dfrac{d}{dx}\left( \dfrac{f(x)}{g(x)} \right) = \dfrac{f'(x)g(x)-f(x)g'(x)}{g(x)^{2}}$.
$$ \begin{align} \frac{\partial y_{j}}{\partial h_{k}} &= \frac{\partial}{\partial h_{k}} \left[ \frac{\exp(h_{j})}{\sum_{i=0}^{n-1} \exp(h_{i})} \right] \\ &= \frac{1}{\left( \sum_{i=0}^{n-1} \exp(h_{i})\right)^{2}} \left[ \frac{\partial \exp(h_{j})}{\partial h_{k}} \sum_{i=0}^{n-1} \exp(h_{i}) - \exp(h_{j}) \frac{\partial}{\partial h_{k}} \left( \sum_{i=0}^{n-1} \exp(h_{i}) \right) \right] \\ &= \frac{1}{\left( \sum_{i=0}^{n-1} \exp(h_{i}) \right)^{2}} \left[ \delta_{j,k} \exp(h_{j}) \sum_{i=0}^{n-1} \exp(h_{i}) - \exp(h_{j}) \exp(h_{k}) \right] \\ &= \delta_{j,k} \frac{\exp(h_{j})}{\sum_{i=0}^{n-1} \exp(h_{i})} - \frac{\exp(h_{j})}{\sum_{i=0}^{n-1} \exp(h_{i})} \frac{\exp(h_{k})}{\sum_{i=0}^{n-1} \exp(h_{i})} \\ &= \delta_{j,k} y_{j} - y_{j} y_{k} \end{align} \tag{4.5} $$where $\delta_{j,k}$ is the Kronecker delta, which is equal to $1$ if $j=k$ and $0$ otherwise.
Using $(4.3),(4.4)$ and $(4.5)$, we finally obtain the following result:
$$ \begin{align} \frac{\partial L}{\partial h_{k}} &= - \sum^{n-1}_{j=0} \frac{Y_{j}} {y_{j}} (\delta_{j,k} y_{k} - y_{j}y_{k}) = \sum^{n-1}_{j=0} \left[ Y_{j} y_{k} - \delta_{j,k} \frac{Y_{j}}{y_{j}} y_{k} \right] \\ &= y_{k} \left[ \sum^{n-1}_{j=0} Y_{j} + \sum^{n-1}_{j=0} \delta_{j,k} \left( - \frac{Y_{j}}{y_{j}} \right) \right] = y_{k} \left[ 1 + \left( - \frac{Y_{k}}{y_{k}} \right) \right] \\ &= y_{k} (1 + dL_{k}) \end{align} \tag{4.6} $$Note that $y_{k}(1 + dL_{k}) = y_{k} \left( 1 - \frac{Y_{k}}{y_{k}} \right) = y_{k} - Y_{k}$, the familiar combined gradient of the softmax function and the cross-entropy loss.
4.2.2. Softmax Class
Complete Python code is available at: ActivationFunctions.py
The Python class below encapsulates the softmax activation function:
#
# Softmax
#
# Assume that the loss function is the Cross-Entropy (CE).
#
class Softmax:
    def activate_func(self, x):
        # Forward pass: softmax as defined in eq. (4.1).
        # A small epsilon is added to the denominator for numerical safety.
        exp = np.exp(x)
        self.y = exp / (np.sum(exp, axis=0) + 1e-8)
        return self.y
    # dL should be (- Label / output) because of CE; see eq. (4.3).
    def deriv_activate_func(self, _, dL):
        # Backward pass: dL/dh_k = y_k * (1 + dL_k), eq. (4.6).
        return self.y * (1 + dL)
This activation function assumes the model employs the CE loss function.
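As a sanity check on expression $(4.6)$, which simplifies to $\partial L / \partial h_{k} = y_{k} - Y_{k}$, the standalone sketch below (independent of the `Softmax` class above; it only assumes NumPy) compares the analytic gradient with a central finite-difference estimate:
import numpy as np
def softmax(h):
    return np.exp(h) / np.sum(np.exp(h))
def cross_entropy(Y, y):
    return -np.sum(Y * np.log(y))
h = np.array([0.3, -1.2, 0.8])      # arbitrary pre-activation values
Y = np.array([0.0, 0.0, 1.0])       # one-hot ground truth
# Analytic gradient from eq. (4.6): y_k * (1 + dL_k) = y_k - Y_k
analytic = softmax(h) - Y
# Central finite-difference estimate of dL/dh_k
eps = 1e-6
numeric = np.zeros_like(h)
for k in range(len(h)):
    h_plus, h_minus = h.copy(), h.copy()
    h_plus[k] += eps
    h_minus[k] -= eps
    numeric[k] = (cross_entropy(Y, softmax(h_plus)) -
                  cross_entropy(Y, softmax(h_minus))) / (2 * eps)
print(np.allclose(analytic, numeric, atol=1e-5))    # True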
4.2.3. Implementation
This subsection showcases two applications using the softmax activation function:
- XOR-gate
- Binary-to-Decimal conversion
4.2.3.1. Implementing XOR-gate
We implement an XOR-gate model as a binary classification problem using the softmax function.
Complete Python code is available at: XOR-gate-softmax.py
[1] Prepare the inputs and the ground-truth labels.
To handle the task as a classification problem, which requires categorical label data, we need to convert the ground-truth data $Y$ into one-hot vectors.
# ========================================
# Create datasets
# ========================================
# Inputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
# The ground-truth labels
Y = np.array([[0], [1], [1], [0]])
# Convert into One-Hot vector
Y = keras.utils.to_categorical(Y, 2)
"""
Y= [[1. 0.]
[0. 1.]
[0. 1.]
[1. 0.]]
"""
# Convert row vectors into column vectors.
X = X.reshape(4, 2, 1)
Y = Y.reshape(4, 2, 1)
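For reference, the same one-hot encoding can be produced with plain NumPy, without depending on Keras. The sketch below uses a hypothetical helper `to_one_hot` that is not part of the accompanying code:
import numpy as np
def to_one_hot(labels, num_classes):
    # Hypothetical helper: labels is an array of integer class indices.
    labels = np.asarray(labels).reshape(-1)
    one_hot = np.zeros((labels.size, num_classes))
    one_hot[np.arange(labels.size), labels] = 1.0
    return one_hot
Y = to_one_hot([0, 1, 1, 0], 2)      # same values as keras.utils.to_categorical above
Y = Y.reshape(4, 2, 1)               # column vectors, as above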
[2] Create the model.
We create a hidden dense layer with the sigmoid activation and a dense output layer with the softmax activation.
# ========================================
# Create Model
# ========================================
input_nodes = 2
hidden_nodes = 3
output_nodes = 2
dense = Layers.Dense(input_nodes, hidden_nodes, sigmoid, deriv_sigmoid)
softmax = Layers.Dense(hidden_nodes, output_nodes, activate_class=Softmax())
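For reference, assuming each dense layer stores a full weight matrix plus a bias vector, these sizes account for the parameter counts reported in the model summary below: the hidden layer has $2 \times 3 + 3 = 9$ parameters and the softmax layer has $3 \times 2 + 2 = 8$, giving $17$ in total.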
[3] Training.
The training phase is largely the same as for the neural network with two dense layers; the key difference is the loss function used.
Our softmax activation function assumes the cross-entropy function $L$, defined in expression $(4.2)$, as the loss function. Therefore, we use the gradient of this loss function, $dL_{i}$, given by expression $(4.3)$.
def train(x, Y, lr=0.001):
    # Forward Propagation
    y = dense.forward_prop(x)
    y = softmax.forward_prop(y)
    # Back Propagation
    loss = -np.sum(Y * np.log(y + 1e-8))  # CE loss, eq. (4.2), for measuring the training progress
    dL = -Y / (y + 1e-8)                  # dL_i = -Y_i / y_i, eq. (4.3)
    dx = softmax.back_prop(dL)
    _ = dense.back_prop(dx)
    # Weights and Bias Update
    update_weights([dense, softmax], lr=lr)
    return loss
n_epochs = 15000  # Epochs
lr = 0.1          # Learning rate
#
# Training loop
#
for epoch in range(1, n_epochs + 1):
    loss = 0.0
    for i in range(0, len(Y)):
        loss += train(X[i], Y[i], lr)
[4] Test.
Run the following command to test the model:
$ python XOR-gate-softmax.py
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 3) 9
softmax (Softmax) (None, 2) 8
=================================================================
Total params: 17
epoch: 1 / 15000 Loss = 3.108775
epoch: 1000 / 15000 Loss = 0.642858
epoch: 2000 / 15000 Loss = 0.046528
epoch: 3000 / 15000 Loss = 0.022903
epoch: 4000 / 15000 Loss = 0.015087
epoch: 5000 / 15000 Loss = 0.011222
epoch: 6000 / 15000 Loss = 0.008922
epoch: 7000 / 15000 Loss = 0.007400
epoch: 8000 / 15000 Loss = 0.006319
epoch: 9000 / 15000 Loss = 0.005512
epoch: 10000 / 15000 Loss = 0.004887
epoch: 11000 / 15000 Loss = 0.004388
epoch: 12000 / 15000 Loss = 0.003982
epoch: 13000 / 15000 Loss = 0.003644
epoch: 14000 / 15000 Loss = 0.003358
epoch: 15000 / 15000 Loss = 0.003114
-------------------------------------
x0 XOR x1 => prob(0) prob(1)
=====================================
0 XOR 0 => 0.9998 0.0002
0 XOR 1 => 0.0009 0.9991
1 XOR 0 => 0.0008 0.9992
1 XOR 1 => 0.9987 0.0013
=====================================
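The probability table above is produced by running a forward pass for each input. A test loop of this kind might look like the sketch below, which reuses the `dense` and `softmax` layers created above (it is not necessarily identical to the code in XOR-gate-softmax.py):
for i in range(len(X)):
    y = softmax.forward_prop(dense.forward_prop(X[i]))   # forward pass only
    p = y.reshape(-1)                                    # flatten (2, 1) -> (2,)
    print("{} XOR {} => {:.4f} {:.4f}".format(int(X[i][0][0]), int(X[i][1][0]), p[0], p[1]))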
4.2.3.2. Implementing a binary-to-decimal conversion model
This example demonstrates the versatility of neural networks.
By simply modifying the label data and the number of output nodes, we can transform the XOR-gate model into a binary-to-decimal converter.
Complete Python code is available at: binary-to-decimal-conversion.py
# ========================================
# XOR-gate
# ========================================
Y = np.array([[0], [1], [1], [0]])
Y = keras.utils.to_categorical(Y, 2)
"""
Y= [[1. 0.]
[0. 1.]
[0. 1.]
[1. 0.]]
"""
output_nodes = 2
# ========================================
# Binary-to-Decimal conversion
# ========================================
Y = np.array([[0], [1], [2], [3]])
Y = keras.utils.to_categorical(Y, 4)
"""
Y = [[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
"""
output_nodes = 4
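After training, the predicted decimal value for an input is simply the index of the largest output probability. Below is a minimal sketch, again reusing the layers and inputs defined above (not necessarily the exact code in binary-to-decimal-conversion.py):
for i in range(len(X)):
    y = softmax.forward_prop(dense.forward_prop(X[i]))   # forward pass only
    decimal = int(np.argmax(y))                          # index of the most probable category
    print("({} {}) => {}".format(int(X[i][0][0]), int(X[i][1][0]), decimal))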
We can test its ability to convert binary to decimal numbers using the following command:
$ python binary-to-decimal-conversion.py
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense (Dense) (None, 3) 9
softmax (Softmax) (None, 4) 16
=================================================================
Total params: 25
epoch: 1 / 15000 Loss = 6.055010
epoch: 1000 / 15000 Loss = 0.087320
epoch: 2000 / 15000 Loss = 0.038823
epoch: 3000 / 15000 Loss = 0.024832
epoch: 4000 / 15000 Loss = 0.018224
epoch: 5000 / 15000 Loss = 0.014383
epoch: 6000 / 15000 Loss = 0.011874
epoch: 7000 / 15000 Loss = 0.010107
epoch: 8000 / 15000 Loss = 0.008797
epoch: 9000 / 15000 Loss = 0.007787
epoch: 10000 / 15000 Loss = 0.006984
epoch: 11000 / 15000 Loss = 0.006330
epoch: 12000 / 15000 Loss = 0.005789
epoch: 13000 / 15000 Loss = 0.005332
epoch: 14000 / 15000 Loss = 0.004942
epoch: 15000 / 15000 Loss = 0.004605
-------------------------------------------
(x0 x1) => prob(0) prob(1) prob(2) prob(3)
===========================================
(0 0) => 0.9989 0.0008 0.0003 0.0000
(0 1) => 0.0007 0.9989 0.0000 0.0005
(1 0) => 0.0003 0.0000 0.9988 0.0009
(1 1) => 0.0000 0.0005 0.0007 0.9989
===========================================