A tutorial covering Cross Entropy Loss, complete with code in PyTorch and TensorFlow and interactive visualizations. Made by Saurav Maheshkar using W&B

One of the most common loss functions used for training neural networks is cross-entropy. In this blogpost we'll go over its derivation and implementation using different frameworks and learn how to log and visualize them using wandb.

Cross entropy loss is a metric used to measure how well a classification model in machine learning performs. The loss (or error) is a non-negative number: 0 corresponds to a perfect model, and there is no upper bound, because the loss grows without limit as the model assigns a probability close to 0 to the correct class. The goal is generally to get your model's loss as close to 0 as possible.

Cross entropy loss is often considered interchangeable with logistic loss (also called log loss, and sometimes binary cross entropy loss), but this isn't always correct.

Cross entropy loss measures the difference between the true probability distribution of the labels and the distribution predicted by a machine learning classification model. The probabilities of all possible outcomes are stored explicitly, so, for example, the odds in a coin toss would be stored as 0.5 and 0.5 (heads and tails).

Binary cross entropy loss, on the other hand, stores only one value. That means it would store only 0.5, with the other 0.5 implied. (In a different problem, if the first probability was 0.7, it would assume the other was 0.3.) It also uses a logarithm (thus "log loss").

It is for this reason that binary cross entropy loss (or log loss) is used in scenarios where there are only two possible outcomes, though it's easy to see where it would fail immediately if there were three or more. That's where cross entropy loss is often used: in models where there are three or more classification possibilities.
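To make the relationship concrete, here is a minimal sketch in plain Python. The function names `binary_cross_entropy` and `cross_entropy` are illustrative, not taken from any library. It suggests that, for two classes, the binary form (one stored probability) and the general form (both probabilities stored) give the same number:

```python
import math

def binary_cross_entropy(y, p):
    """Log loss for one example: y is 0 or 1, p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def cross_entropy(y_true, y_pred):
    """General form: -sum over classes of y_c * log(p_c)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)

p = 0.7  # predicted probability of the positive class

# Binary form: only p is stored, the other probability is implied as 1 - p.
binary = binary_cross_entropy(1, p)

# General form: both class probabilities are stored explicitly.
general = cross_entropy([1, 0], [p, 1 - p])

print(binary, general)  # the two values agree
```

With three or more classes there is no single "other" probability to imply, which is why the general form is needed there.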

Let's start from the basics. In deep learning, we typically use a gradient-based optimization strategy to train a model (say f(x)) using some loss function l(f(x_i), \, y_i), where (x_i, y_i) is an input-output pair. A loss function is used to help the model determine how "wrong" it is and, based on that "wrongness," improve itself. It's a measure of error. Our goal throughout training is to minimize this error/loss.
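As a rough illustration of that training loop, the sketch below runs plain gradient descent on a toy one-parameter model f(x) = sigmoid(w * x). The dataset, the learning rate, and the model itself are assumptions made up for this example, not something from a real project:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_fn(p, y):
    """Binary cross entropy for one (prediction, label) pair."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Toy (x_i, y_i) pairs, invented for illustration.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w = 0.0              # the single trainable parameter
learning_rate = 0.5  # assumed value, chosen small enough to be stable
history = []

for step in range(20):
    # Mean loss over the dataset at the current w.
    loss = sum(loss_fn(sigmoid(w * x), y) for x, y in data) / len(data)
    history.append(loss)

    # For this model the per-example gradient d(loss)/dw is (p - y) * x.
    grad = sum((sigmoid(w * x) - y) * x for x, y in data) / len(data)

    # Step against the gradient to reduce the loss.
    w -= learning_rate * grad

print(f"first loss: {history[0]:.3f}, last loss: {history[-1]:.3f}")
```

The recorded loss falls at every step here, which is the "improve itself based on wrongness" behaviour described above.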

The role of a loss function is an important one. If it doesn't penalize a wrong output in proportion to its magnitude, it can delay convergence and affect learning.

There's a learning paradigm called Maximum Likelihood Estimation (MLE) which trains the model to estimate its parameters so as to learn the underlying data distribution. Thus, we use a loss function to evaluate how well the model fits the data distribution.
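One way to see the MLE connection is a small sketch with a single-parameter Bernoulli model. The observations and the candidate grid below are made up for illustration. Minimizing the mean negative log-likelihood, which has the same form as binary cross entropy, recovers the observed frequency of ones:

```python
import math

# Observed binary outcomes, invented for illustration: 7 ones out of 10.
observations = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]

def negative_log_likelihood(p, data):
    """Mean negative log-likelihood of the data under a Bernoulli(p) model."""
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p) for y in data)
    return -total / len(data)

# Scan candidate parameter values and keep the one with the lowest loss.
candidates = [i / 100 for i in range(1, 100)]
best_p = min(candidates, key=lambda p: negative_log_likelihood(p, observations))

print(best_p, sum(observations) / len(observations))
```

A real model has many parameters and uses gradients rather than a grid scan, but the objective being minimized is the same kind of quantity.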

Using cross entropy we can measure the error (or difference) between two probability distributions.
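As a minimal sketch of that idea, the snippet below compares a fair coin (the 0.5 / 0.5 example from earlier) against a few predicted distributions. The helper name `cross_entropy` and the example predictions are assumptions for illustration:

```python
import math

def cross_entropy(true_dist, pred_dist):
    """-sum over outcomes of true probability * log(predicted probability)."""
    return -sum(t * math.log(q) for t, q in zip(true_dist, pred_dist) if t > 0)

fair_coin = [0.5, 0.5]  # true distribution over (heads, tails)

predictions = {
    "matches exactly": [0.5, 0.5],
    "slightly off": [0.6, 0.4],
    "far off": [0.9, 0.1],
}

results = {name: cross_entropy(fair_coin, pred) for name, pred in predictions.items()}

for name, value in results.items():
    print(f"{name:<16} {value:.3f}")
```

The value is smallest when the prediction matches the true distribution and grows as the two drift apart. For a true distribution that is not one-hot, that smallest value is the entropy of the distribution itself rather than 0.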

For example, in the case of Binary Classification, cross-entropy is given by:

l = -\left( y \log(p) + (1 - y) \log(1 - p) \right)

where:

- p is the predicted probability, and
- y is the indicator ( 0 or 1 in the case of binary classification )
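The formula translates almost directly into code. The sketch below is one possible plain-Python version. The `eps` clipping is an assumption added here so that `log` never receives exactly 0, and it is not part of the formula itself:

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Binary cross entropy for one example.

    y   -- the indicator (0 or 1)
    p   -- the predicted probability that y is 1
    eps -- assumed clipping constant that keeps p inside (0, 1)
    """
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean_binary_cross_entropy(labels, probs):
    """Average the per-example loss over a batch."""
    losses = [binary_cross_entropy(y, p) for y, p in zip(labels, probs)]
    return sum(losses) / len(losses)

batch_loss = mean_binary_cross_entropy([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.4])
print(batch_loss)
```

Without the clipping, a prediction of exactly 0 or 1 on the wrong side would raise a math domain error instead of returning a very large loss.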

Let's walk through what happens for a particular data point. Let's say the correct indicator is 1, i.e., y = 1. In this case,

l = -\left( 1 \times \log(p) + (1 - 1) \log(1 - p) \right)

l = -\left( 1 \times \log(p) \right)

The value of the loss l thus depends on the probability p. Our loss function rewards the model for a correct prediction (a high value of p) with a low loss. If the probability is lower, however, log(p) becomes a larger negative number, so the loss l = -log(p) becomes a larger positive number, and the model is penalized for a wrong outcome.
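Plugging in a few numbers for the y = 1 case makes the shape of the penalty visible. The probabilities below are arbitrary values picked for illustration:

```python
import math

# With y = 1 the loss reduces to -log(p).
probabilities = [0.99, 0.9, 0.5, 0.1, 0.01]
losses = {p: -math.log(p) for p in probabilities}

for p, loss in losses.items():
    print(f"p = {p:<5} loss = {loss:.3f}")

# p = 0.99  loss = 0.010
# p = 0.9   loss = 0.105
# p = 0.5   loss = 0.693
# p = 0.1   loss = 2.303
# p = 0.01  loss = 4.605
```

The loss is always positive, and it climbs steeply as the model becomes confidently wrong.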

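A simple extension to a multi-class classification problem (say N classes) is as follows, where y_c is 1 if c is the correct class and 0 otherwise, and p_c is the predicted probability of class c: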

-\sum_{c=1}^{N} y_c \log(p_c)
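A small sketch of the multi-class sum with N = 3, using a one-hot target. The function name and the three example predictions are made up for illustration:

```python
import math

def categorical_cross_entropy(y_true, y_pred):
    """-sum over classes c of y_c * log(p_c), for a one-hot y_true."""
    return -sum(y_c * math.log(p_c) for y_c, p_c in zip(y_true, y_pred) if y_c > 0)

y_true = [0, 0, 1]  # one-hot target: the correct class is the last of N = 3

confident = [0.05, 0.05, 0.90]  # most probability on the correct class
unsure = [0.30, 0.30, 0.40]     # probability spread almost evenly
wrong = [0.80, 0.15, 0.05]      # most probability on an incorrect class

for name, y_pred in [("confident", confident), ("unsure", unsure), ("wrong", wrong)]:
    print(f"{name:<10} {categorical_cross_entropy(y_true, y_pred):.3f}")
```

Because the target is one-hot, every term except the one for the correct class is multiplied by 0, so the sum collapses to the negative log of the probability assigned to the correct class.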

In this section, we'll go over how to use Cross Entropy loss in both TensorFlow and PyTorch and log to wandb.

```python
import tensorflow as tf
import wandb
from wandb.keras import WandbCallback

def build_model():
    ...
    # Define the Model Architecture
    model = tf.keras.Model(inputs = ..., outputs = ...)

    # Define the Loss Function -> BinaryCrossentropy or CategoricalCrossentropy
    fn_loss = tf.keras.losses.BinaryCrossentropy()

    model.compile(optimizer = ..., loss = [fn_loss], metrics= ... )

    return model

model = build_model()

# Create a W&B Run
run = wandb.init(...)

# Train the model, allowing the Callback to automatically sync loss
model.fit(... ,callbacks = [WandbCallback()])

# Finish the run and sync metrics
run.finish()
```

```python
import wandb
import torch.nn as nn

# Define the Loss Function
criterion = nn.CrossEntropyLoss()

# Create a W&B Run
run = wandb.init(...)

def train_step(...):
    ...
    loss = criterion(output, target)

    # Back-propagation
    loss.backward()

    # Log to Weights and Biases
    wandb.log({"Training Loss": loss.item()})

# Finish the run and sync metrics
run.finish()
```
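One practical detail when moving between the two frameworks: PyTorch's `nn.CrossEntropyLoss` expects raw, unnormalized scores (logits) and applies log-softmax internally, while the Keras loss classes assume probabilities unless `from_logits=True` is passed. The plain-Python sketch below is not either library's implementation. It only illustrates, with made-up logits and hypothetical helper names, what computing cross entropy directly from logits looks like:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy_from_logits(logits, target_index):
    """Negative log-probability of the target class, computed from logits."""
    return -log_softmax(logits)[target_index]

logits = [2.0, 1.0, 0.1]  # raw model outputs, invented for illustration
target = 0                # index of the correct class

loss = cross_entropy_from_logits(logits, target)

# The same value via an explicit softmax followed by -log(p).
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]
naive_loss = -math.log(probs[target])

print(loss, naive_loss)
```

Subtracting the maximum logit before exponentiating keeps the computation finite even for very large scores, which is one reason for working from logits rather than from already-normalized probabilities.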

And that wraps up our short tutorial on Cross Entropy Loss. To see the full suite of wandb features, please check out this short 5-minute guide.

- If you're wondering why we should use negative log probabilities, check out this video
- If you want a more rigorous mathematical explanation, check out these blog posts: (1) and (2)

Here's that video, to save you a click: