TensorFlow Automatic Differentiation (AutoDiff)

The Keras API can perform backpropagation easily using built-in optimizers and loss functions. However, there are cases where we want to manipulate or apply gradients explicitly. For example, to avoid exploding gradients, we may want to clip them.

In general, TensorFlow AutoDiff allows us to compute and manipulate gradients. In the example below, we compute and plot the derivative of the sigmoid function. In deep learning, we use AutoDiff to perform custom backpropagation.
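The original screenshot is lost, so here is a minimal sketch of computing the sigmoid derivative with AutoDiff (the plotting step is omitted; the number of sample points is an assumption):

```python
import tensorflow as tf

# Sample points along the x-axis (200 points is an arbitrary choice).
x = tf.linspace(-10.0, 10.0, 200)

with tf.GradientTape() as tape:
    tape.watch(x)            # x is a Tensor, so it must be watched explicitly
    y = tf.nn.sigmoid(x)

# dy/dx = sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
dy_dx = tape.gradient(y, x)
```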

Recording the Forward Pass with GradientTape

In TensorFlow (TF), tf.GradientTape records all forward-pass operations within its “with” block to a “tape”. This record is used later by tape.gradient to calculate the backpropagation gradients. The code below records the operations that compute loss from the variable w. Then, tape.gradient(loss, w) computes the loss gradient w.r.t. w (i.e. dL/dw).
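A minimal sketch of this pattern, assuming a simple loss of w² for illustration:

```python
import tensorflow as tf

w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    loss = w * w             # forward pass recorded on the tape

dl_dw = tape.gradient(loss, w)   # dL/dw = 2w = 6.0
```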

Here is an example of a dense layer in which the input x is a vector instead of a scalar. The returned gradient for w (dl_dw) is a tensor with shape (3, 2). It has the same shape as w as it contains one gradient for each element in w.
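A sketch of that dense layer, following the TensorFlow guide (the input values are arbitrary):

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1.0, 2.0, 3.0]]            # input vector

with tf.GradientTape() as tape:
    y = x @ w + b                # dense layer: y = xW + b
    loss = tf.reduce_mean(y ** 2)

# one gradient per trainable variable; dl_dw has the same shape as w
[dl_dw, dl_db] = tape.gradient(loss, [w, b])
```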

To apply the gradient descent to a neural network, we change its weights according to the learning rate and the gradients.
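A sketch of this update rule on a single variable (the learning rate of 0.1 is an assumption):

```python
import tensorflow as tf

lr = 0.1                          # assumed learning rate
w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    loss = w * w

grad = tape.gradient(loss, w)     # dL/dw = 2w = 6.0
w.assign_sub(lr * grad)           # w <- w - lr * dL/dw = 3.0 - 0.6 = 2.4
```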

AutoDiff for a Neural Network Model

Let’s complete our demonstrations with an MNIST classifier below. The Keras model provides the property mnist_model.trainable_variables, so we do not need to keep track of all trainable variables ourselves.

tape.gradient returns a list (grads) that contains one gradient Tensor for each trainable variable. Then we can apply gradient descent according to our choice of optimizer.

For completeness, here is the full training code.
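Since the screenshot is lost, here is a hedged sketch of the training step described above. The model architecture, batch shape, and choice of Adam are assumptions (random tensors stand in for the MNIST data to keep the sketch self-contained); the key lines are mnist_model.trainable_variables, tape.gradient, and optimizer.apply_gradients:

```python
import tensorflow as tf

# Assumed small classifier; the real article may use a different architecture.
mnist_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Random stand-in batch; in practice this comes from the MNIST dataset.
images = tf.random.normal((32, 28, 28, 1))
labels = tf.random.uniform((32,), maxval=10, dtype=tf.int64)

with tf.GradientTape() as tape:
    logits = mnist_model(images, training=True)
    loss = loss_fn(labels, logits)

# One gradient Tensor per trainable variable.
grads = tape.gradient(loss, mnist_model.trainable_variables)
optimizer.apply_gradients(zip(grads, mnist_model.trainable_variables))
```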

Note: I use screenshots for the code because they are better formatted. Readers who want the source code should refer to the TensorFlow guide, where most of the code here originates. With the constant changes in the TF API, it is hard to keep the code in this article up to date; readers should always refer to the latest documentation.

Vector Output

The computed y (say, a loss) shown before is a scalar value. But y can be a vector. For example, y = x*[3., 4.] is a vector with 2 components, and dy/dx = [3., 4.]. However, tape.gradient always returns a gradient with the same shape as x: it returns the sum of the per-component gradients, which equals 7 here.
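A minimal sketch of this behavior (the starting value of x is arbitrary):

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = x * [3.0, 4.0]        # y is a vector with two components

# tape.gradient sums the per-component gradients: 3 + 4 = 7
dy_dx = tape.gradient(y, x)
```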

Persistent

The resources held by the tape are released when tape.gradient is called, so we cannot call the same tape multiple times for the derivatives of different variables. To allow multiple calls, create the tape with persistent=True. Use del tape to release the resources afterward when it is no longer needed.
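A minimal sketch of a persistent tape (the functions y = x² and z = y² are assumptions for illustration):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * x
    z = y * y

dz_dx = tape.gradient(z, x)   # 4x^3 = 108.0
dy_dx = tape.gradient(y, x)   # 2x = 6.0 — a second call is allowed
del tape                      # release the tape resources when done
```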

Custom Gradient

TensorFlow computes gradients automatically. Nevertheless, some gradient calculations are not computable or are numerically unstable. In the latter case, an intermediate result may overflow to infinity even though it would eventually cancel out. For example, the gradient calculated below returns NaN (not a number) for x = 100 even though the actual gradient is 1.0.
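A sketch of the unstable case, assuming the function in question is log(1 + eˣ) as in the TensorFlow guide:

```python
import tensorflow as tf

x = tf.Variable(100.0)

with tf.GradientTape() as tape:
    y = tf.math.log(1 + tf.exp(x))   # exp(100) overflows to inf in float32

# the intermediate inf propagates, so the gradient is NaN instead of 1.0
grad = tape.gradient(y, x)
```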

To solve that, we can add the @tf.custom_gradient annotation and implement a more stable, custom gradient calculation. The decorated method returns both the loss and a custom gradient function. The caller sees only the loss, while the custom function is recorded and used by tape.gradient to calculate the gradients. For example, the gradient of log(1 + eˣ) can be implemented as 1 − 1/(1 + eˣ), which equals 1 for x = 100.
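A sketch of that custom gradient, following the log1pexp example from the TensorFlow guide:

```python
import tensorflow as tf

@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)
    def grad_fn(upstream):
        # numerically stable form: d/dx log(1 + e^x) = 1 - 1/(1 + e^x)
        return upstream * (1 - 1 / (1 + e))
    return tf.math.log(1 + e), grad_fn

x = tf.Variable(100.0)

with tf.GradientTape() as tape:
    y = log1pexp(x)

grad = tape.gradient(y, x)   # 1.0, no longer NaN
```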

Gradient Clipping by Norm

One major application of manipulating gradients explicitly is clipping them by norm to avoid gradient explosion.
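A minimal sketch of the idea using tf.clip_by_norm (the loss and the clip threshold of 5.0 are assumptions):

```python
import tensorflow as tf

x = tf.Variable([10.0, -10.0])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(x ** 2)

grad = tape.gradient(loss, x)                    # [20, -20], norm ≈ 28.28
clipped = tf.clip_by_norm(grad, clip_norm=5.0)   # rescaled so its norm is 5
```

In a training loop, the clipped gradients would then be passed to optimizer.apply_gradients instead of the raw ones.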

Getting a gradient of None

In some cases, the gradient returned is None. In this section, we will see why this happens and how to fix it when it is not the intended behavior.

Non-trainable variables, Tensors, Constants

tf.GradientTape records operations within the “with” block and automatically tracks every trainable tf.Variable those operations involve. However, non-trainable variables, Tensors, and constants are not tracked automatically. Below, these are x1 (a non-trainable tf.Variable), x2 (a Tensor), and x3 (a tf.constant); their calculated gradients are None.
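A sketch of the untracked cases, following the TensorFlow guide (the values are arbitrary):

```python
import tensorflow as tf

x0 = tf.Variable(3.0, name='x0')                   # trainable: tracked
x1 = tf.Variable(3.0, name='x1', trainable=False)  # non-trainable: not tracked
x2 = tf.Variable(2.0, name='x2') + 1.0             # a Tensor: not tracked
x3 = tf.constant(3.0, name='x3')                   # a constant: not tracked

with tf.GradientTape() as tape:
    y = (x0 ** 2) + (x1 ** 2) + (x2 ** 2) + (x3 ** 2)

grads = tape.gradient(y, [x0, x1, x2, x3])
# grads == [6.0, None, None, None]
```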

watch

If this is not desirable, we can fix it by adding them explicitly with tape.watch. In the example below, after watching the input x as well, we can compute its gradient correctly.
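A minimal sketch of watching a constant input:

```python
import tensorflow as tf

x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)            # explicitly track the constant input
    y = x * x

dy_dx = tape.gradient(y, x)  # 2x = 6.0
```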

By the way, intermediate results within the GradientTape block (like y below) are recorded automatically even though they are Tensors. The gradient dz/dy returns 18 below.
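A sketch of taking a gradient w.r.t. an intermediate result (with x = 3, y = 9 and dz/dy = 2y = 18):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x * x      # intermediate Tensor, recorded automatically
    z = y * y

dz_dy = tape.gradient(z, y)   # dz/dy = 2y = 18.0
```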

None

As expected, tape.gradient returns None when the gradient is taken w.r.t. an unrelated variable. In the code below, z is unrelated to x, so the returned gradient is None.
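A minimal sketch of this case:

```python
import tensorflow as tf

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = y * y      # z does not depend on x

dz_dx = tape.gradient(z, x)   # None: x is unrelated to z
```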

Operations performed outside of TensorFlow, such as NumPy operations, cannot be traced, so the gradient of y computed with NumPy below cannot be obtained. Also, operations involving integer or string data types are not differentiable. For example, casting an integer to a float is not differentiable, and the corresponding gradient is None.
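A sketch of both failure modes, following the TensorFlow guide (the values are arbitrary):

```python
import numpy as np
import tensorflow as tf

# 1. A NumPy operation leaves the TF graph, so the tape loses the trail.
x = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
with tf.GradientTape() as tape:
    x2 = x ** 2
    y = np.mean(x2, axis=0)        # computed outside TF: not traceable
    y = tf.reduce_mean(y, axis=0)
grad_numpy = tape.gradient(y, x)   # None

# 2. Casting an integer to a float is not differentiable.
x_int = tf.constant(10)
with tf.GradientTape() as tape:
    tape.watch(x_int)
    y = tf.cast(x_int, tf.float32) ** 2
grad_int = tape.gradient(y, x_int)  # None
```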

The assignment x = x + 1 turns x into a Tensor. So when we start the second epoch and a new taping session, x is no longer tracked automatically. Therefore, for the second epoch, the gradient w.r.t. x is None.
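A minimal sketch of this pitfall over two epochs:

```python
import tensorflow as tf

x = tf.Variable(2.0)
grads = []

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1
    grads.append(tape.gradient(y, x))
    x = x + 1   # reassignment turns x into a Tensor, untracked next epoch

# grads[0] == 1.0, grads[1] is None
```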

When TF reads from a stateful object like a tf.Variable, the tape can only see the current state, not the history that led to it. Operations that change a variable’s state block gradient propagation. Hence, the assign_add operation below results in None for the gradient calculation. (The assignment records the changed state but not the operation.) To fix that, we add the variables together and assign the result to a Tensor instead.
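A sketch of both versions, following the TensorFlow guide:

```python
import tensorflow as tf

x0 = tf.Variable(3.0)
x1 = tf.Variable(0.0)

# Broken: assign_add changes x1's state, hiding the dependence on x0.
with tf.GradientTape() as tape:
    x1.assign_add(x0)
    y = x1 ** 2
grad_blocked = tape.gradient(y, x0)   # None

# Fixed: plain addition into a Tensor keeps the history on the tape.
with tf.GradientTape() as tape:
    z = x1 + x0                       # z = 3.0 + 3.0 = 6.0
    y = z ** 2
grad_ok = tape.gradient(y, x0)        # dy/dx0 = 2z = 12.0
```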

Higher-order gradients

tape.gradient can be viewed as just another TensorFlow operation. Therefore, we can use nested GradientTapes to compute higher-order gradients. In the inner “with” block below, we compute the first-order derivative, while the outer block traces dy/dx and computes its derivative, i.e. the second-order derivative of y w.r.t. x.
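A minimal sketch with y = x³ (an assumed example function):

```python
import tensorflow as tf

x = tf.Variable(1.0)

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = x * x * x                 # y = x^3

    dy_dx = t1.gradient(y, x)         # 3x^2 = 3.0 (traced by t2)

d2y_dx2 = t2.gradient(dy_dx, x)       # 6x = 6.0
```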

Jacobian & Hessian

TensorFlow also provides an API to compute the Jacobian matrix J.

The code below computes the Jacobian w.r.t. a scalar variable.
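A sketch based on the TensorFlow guide's scalar-source example (the sigmoid target and the number of sample points are assumptions):

```python
import tensorflow as tf

x = tf.linspace(-10.0, 10.0, 5)
delta = tf.Variable(0.0)             # scalar source

with tf.GradientTape() as tape:
    y = tf.nn.sigmoid(x + delta)     # vector target

# one partial derivative per element of y, so j has the same shape as y
j = tape.jacobian(y, delta)
```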

And this one computes the Hessian.
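A sketch of a Hessian via nested tapes, with y = Σxᵢ³ as an assumed example (the guide combines tape.gradient with tape.jacobian in the same way):

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0])

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = tf.reduce_sum(x ** 3)     # y = x0^3 + x1^3

    g = t1.gradient(y, x)             # [3*x0^2, 3*x1^2]

# Jacobian of the gradient = Hessian: diag(6*x0, 6*x1)
hessian = t2.jacobian(g, x)
```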

Credits and References

TensorFlow Guide: AutoDiff and Advanced AutoDiff
