TensorFlow Automatic Differentiation (AutoDiff)

The Keras API can perform backpropagation easily using built-in optimizers and loss functions. However, there are cases where we want to manipulate or apply gradients explicitly. For example, to avoid exploding gradients, we may want to clip them.

In general, TensorFlow AutoDiff allows us to compute and manipulate gradients. In the example below, we compute and plot the derivative of the sigmoid function. In deep learning, we use AutoDiff to perform custom backpropagation.
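The original screenshot is lost, so here is a minimal sketch of computing the sigmoid derivative with AutoDiff (the plotting step is omitted; the number of sample points is an assumption):

```python
import tensorflow as tf

# Sample points along the x-axis (200 points is an arbitrary choice).
x = tf.linspace(-10.0, 10.0, 200)

with tf.GradientTape() as tape:
    tape.watch(x)            # x is a Tensor, so it must be watched explicitly
    y = tf.nn.sigmoid(x)

# dy/dx = sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0
dy_dx = tape.gradient(y, x)
```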

Recording the Forward Pass with GradientTape

In TensorFlow (TF), tf.GradientTape records all forward-pass operations within its “with” block to a “tape”. This record is used later by tape.gradient to calculate the backpropagation gradients. The code below records the operations that compute loss from the variable w. Then, tape.gradient(loss, w) computes the loss gradient w.r.t. w (i.e. dL/dw).
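A minimal sketch of this pattern, assuming a simple loss of w² for illustration:

```python
import tensorflow as tf

w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    loss = w * w             # forward pass recorded on the tape

dl_dw = tape.gradient(loss, w)   # dL/dw = 2w = 6.0
```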

Here is an example of a dense layer in which the input x is a vector instead of a scalar. The returned gradient for w (dl_dw) is a tensor with shape (3, 2). It has the same shape as w as it contains one gradient for each element in w.
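A sketch of that dense layer, following the TensorFlow guide (the input values are arbitrary):

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = [[1.0, 2.0, 3.0]]            # input vector

with tf.GradientTape() as tape:
    y = x @ w + b                # dense layer: y = xW + b
    loss = tf.reduce_mean(y ** 2)

# one gradient per trainable variable; dl_dw has the same shape as w
[dl_dw, dl_db] = tape.gradient(loss, [w, b])
```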

To apply the gradient descent to a neural network, we change its weights according to the learning rate and the gradients.
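A sketch of this update rule on a single variable (the learning rate of 0.1 is an assumption):

```python
import tensorflow as tf

lr = 0.1                          # assumed learning rate
w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    loss = w * w

grad = tape.gradient(loss, w)     # dL/dw = 2w = 6.0
w.assign_sub(lr * grad)           # w <- w - lr * dL/dw = 3.0 - 0.6 = 2.4
```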

AutoDiff for a Neural Network Model

Let’s complete our demonstrations with an MNIST classifier below. The Keras model provides the property mnist_model.trainable_variables, so we do not need to keep track of all trainable variables ourselves.

tape.gradient returns a list (grads) that contains one gradient Tensor for each trainable variable. Then we can apply gradient descent according to our choice of optimizer.

For completeness, here is the full training code.
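Since the screenshot is lost, here is a hedged sketch of the training step described above. The model architecture, batch shape, and choice of Adam are assumptions (random tensors stand in for the MNIST data to keep the sketch self-contained); the key lines are mnist_model.trainable_variables, tape.gradient, and optimizer.apply_gradients:

```python
import tensorflow as tf

# Assumed small classifier; the real article may use a different architecture.
mnist_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation='relu'),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Random stand-in batch; in practice this comes from the MNIST dataset.
images = tf.random.normal((32, 28, 28, 1))
labels = tf.random.uniform((32,), maxval=10, dtype=tf.int64)

with tf.GradientTape() as tape:
    logits = mnist_model(images, training=True)
    loss = loss_fn(labels, logits)

# One gradient Tensor per trainable variable.
grads = tape.gradient(loss, mnist_model.trainable_variables)
optimizer.apply_gradients(zip(grads, mnist_model.trainable_variables))
```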

Note: I use screenshots for the code because they are better formatted. Readers who want the source code should refer to the TensorFlow guide, where most of the code here originates. With the constant changes in the TF API, it is hard to keep the code in this article up to date; readers should always refer to the latest documentation.

Vector Output

The computed y (say, a loss) shown before is a scalar value. But y can be a vector. For example, y = x*[3., 4.] is a vector with 2 components, and dy/dx = [3., 4.]. However, tape.gradient always returns a gradient with the same shape as x: it returns the sum of the per-component gradients, which equals 7 here.
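A minimal sketch of this behavior (the starting value of x is arbitrary):

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = x * [3.0, 4.0]        # y is a vector with two components

# tape.gradient sums the per-component gradients: 3 + 4 = 7
dy_dx = tape.gradient(y, x)
```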

Persistent

The resources held by the tape are released when tape.gradient is called, so we cannot call the same tape multiple times for the derivatives of different variables. To allow multiple calls, create the tape with persistent=True. Use del tape to release the resources afterward when it is no longer needed.
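A minimal sketch of a persistent tape (the functions y = x² and z = y² are assumptions for illustration):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * x
    z = y * y

dz_dx = tape.gradient(z, x)   # 4x^3 = 108.0
dy_dx = tape.gradient(y, x)   # 2x = 6.0 — a second call is allowed
del tape                      # release the tape resources when done
```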

Custom Gradient

TensorFlow computes gradients automatically. Nevertheless, some gradient calculations are not computable or are numerically unstable. In the latter case, an intermediate result may overflow to infinity even though it would eventually cancel out. For example, the gradient calculated below returns NaN (not a number) for x = 100 even though the actual gradient is 1.0.
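A sketch of the unstable case, assuming the function in question is log(1 + eˣ) as in the TensorFlow guide:

```python
import tensorflow as tf

x = tf.Variable(100.0)

with tf.GradientTape() as tape:
    y = tf.math.log(1 + tf.exp(x))   # exp(100) overflows to inf in float32

# the intermediate inf propagates, so the gradient is NaN instead of 1.0
grad = tape.gradient(y, x)
```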

To solve that, we can add the @tf.custom_gradient annotation and implement a more stable, custom gradient calculation. The decorated method returns both the loss and a custom gradient function. The caller sees only the loss, while the custom function is recorded and used by tape.gradient to calculate the gradients. For example, the gradient of log(1 + eˣ) can be implemented as 1 − 1/(1 + eˣ), which equals 1 for x = 100.
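A sketch of that custom gradient, following the log1pexp example from the TensorFlow guide:

```python
import tensorflow as tf

@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)
    def grad_fn(upstream):
        # numerically stable form: d/dx log(1 + e^x) = 1 - 1/(1 + e^x)
        return upstream * (1 - 1 / (1 + e))
    return tf.math.log(1 + e), grad_fn

x = tf.Variable(100.0)

with tf.GradientTape() as tape:
    y = log1pexp(x)

grad = tape.gradient(y, x)   # 1.0, no longer NaN
```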

Gradient Clipping by Norm

One major application of manipulating gradients explicitly is clipping them by norm to avoid gradient explosion.
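A minimal sketch of the idea using tf.clip_by_norm (the loss and the clip threshold of 5.0 are assumptions):

```python
import tensorflow as tf

x = tf.Variable([10.0, -10.0])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(x ** 2)

grad = tape.gradient(loss, x)                    # [20, -20], norm ≈ 28.28
clipped = tf.clip_by_norm(grad, clip_norm=5.0)   # rescaled so its norm is 5
```

In a training loop, the clipped gradients would then be passed to optimizer.apply_gradients instead of the raw ones.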

Getting a gradient of None

In some cases, the gradient returned is None. In this section, we will see why this happens and how to fix it when it is not the intended behavior.

Non-trainable variables, Tensors, Constants

tf.GradientTape records operations within the “with” block and automatically tracks every trainable tf.Variable those operations involve. However, non-trainable variables, Tensors, and constants are not tracked automatically. Below, these are x1 (a non-trainable tf.Variable), x2 (a Tensor), and x3 (a tf.constant); their calculated gradients are None.
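A sketch of the untracked cases, following the TensorFlow guide (the values are arbitrary):

```python
import tensorflow as tf

x0 = tf.Variable(3.0, name='x0')                   # trainable: tracked
x1 = tf.Variable(3.0, name='x1', trainable=False)  # non-trainable: not tracked
x2 = tf.Variable(2.0, name='x2') + 1.0             # a Tensor: not tracked
x3 = tf.constant(3.0, name='x3')                   # a constant: not tracked

with tf.GradientTape() as tape:
    y = (x0 ** 2) + (x1 ** 2) + (x2 ** 2) + (x3 ** 2)

grads = tape.gradient(y, [x0, x1, x2, x3])
# grads == [6.0, None, None, None]
```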

watch

If this is not desirable, we can fix it by adding them explicitly with tape.watch. In the example below, after watching the input x as well, we can compute its gradient correctly.
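A minimal sketch of watching a constant input:

```python
import tensorflow as tf

x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)            # explicitly track the constant input
    y = x * x

dy_dx = tape.gradient(y, x)  # 2x = 6.0
```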

By the way, intermediate results within the GradientTape block (like y below) are recorded automatically even though they are Tensors. The gradient dz/dy returns 18 below.
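A sketch of taking a gradient w.r.t. an intermediate result (with x = 3, y = 9 and dz/dy = 2y = 18):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x * x      # intermediate Tensor, recorded automatically
    z = y * y

dz_dy = tape.gradient(z, y)   # dz/dy = 2y = 18.0
```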

None

As expected, tape.gradient returns None when the gradient is taken w.r.t. an unrelated variable. In the code below, z is unrelated to x, so the returned gradient is None.
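A minimal sketch of this case:

```python
import tensorflow as tf

x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as tape:
    z = y * y      # z does not depend on x

dz_dx = tape.gradient(z, x)   # None: x is unrelated to z
```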

Operations performed outside of TensorFlow, such as NumPy operations, cannot be traced, so the gradient of y computed with NumPy below cannot be obtained. Also, operations involving integer or string data types are not differentiable. For example, casting an integer to a float is not differentiable, and the corresponding gradient is None.
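A sketch of both failure modes, following the TensorFlow guide (the values are arbitrary):

```python
import numpy as np
import tensorflow as tf

# 1. A NumPy operation leaves the TF graph, so the tape loses the trail.
x = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
with tf.GradientTape() as tape:
    x2 = x ** 2
    y = np.mean(x2, axis=0)        # computed outside TF: not traceable
    y = tf.reduce_mean(y, axis=0)
grad_numpy = tape.gradient(y, x)   # None

# 2. Casting an integer to a float is not differentiable.
x_int = tf.constant(10)
with tf.GradientTape() as tape:
    tape.watch(x_int)
    y = tf.cast(x_int, tf.float32) ** 2
grad_int = tape.gradient(y, x_int)  # None
```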

The assignment x = x + 1 turns x into a Tensor. So when we start the second epoch and a new taping session, x is no longer tracked automatically. Therefore, for the second epoch, the gradient w.r.t. x is None.
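A minimal sketch of this pitfall over two epochs:

```python
import tensorflow as tf

x = tf.Variable(2.0)
grads = []

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1
    grads.append(tape.gradient(y, x))
    x = x + 1   # reassignment turns x into a Tensor, untracked next epoch

# grads[0] == 1.0, grads[1] is None
```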

When TF reads from a stateful object like a tf.Variable, the tape can only see the current state, not the history that led to it. Operations that change a variable’s state block gradient propagation. Hence, the assign_add operation below results in None for the gradient calculation. (The assignment records the changed state but not the operation.) To fix that, we add the variables together and assign the result to a Tensor instead.
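A sketch of both versions, following the TensorFlow guide:

```python
import tensorflow as tf

x0 = tf.Variable(3.0)
x1 = tf.Variable(0.0)

# Broken: assign_add changes x1's state, hiding the dependence on x0.
with tf.GradientTape() as tape:
    x1.assign_add(x0)
    y = x1 ** 2
grad_blocked = tape.gradient(y, x0)   # None

# Fixed: plain addition into a Tensor keeps the history on the tape.
with tf.GradientTape() as tape:
    z = x1 + x0                       # z = 3.0 + 3.0 = 6.0
    y = z ** 2
grad_ok = tape.gradient(y, x0)        # dy/dx0 = 2z = 12.0
```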

Higher-order gradients

tape.gradient can be viewed as just another TensorFlow operation. Therefore, we can use nested GradientTapes to compute higher-order gradients. In the inner “with” block below, we compute the first-order derivative, while the outer block traces dy/dx and computes its derivative, i.e. the second-order derivative of y w.r.t. x.
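A minimal sketch with y = x³ (an assumed example function):

```python
import tensorflow as tf

x = tf.Variable(1.0)

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = x * x * x                 # y = x^3

    dy_dx = t1.gradient(y, x)         # 3x^2 = 3.0 (traced by t2)

d2y_dx2 = t2.gradient(dy_dx, x)       # 6x = 6.0
```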

Jacobian & Hessian

TensorFlow also provides an API to compute the Jacobian matrix J.

The code below computes the Jacobian w.r.t. a scalar variable.
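A sketch based on the TensorFlow guide's scalar-source example (the sigmoid target and the number of sample points are assumptions):

```python
import tensorflow as tf

x = tf.linspace(-10.0, 10.0, 5)
delta = tf.Variable(0.0)             # scalar source

with tf.GradientTape() as tape:
    y = tf.nn.sigmoid(x + delta)     # vector target

# one partial derivative per element of y, so j has the same shape as y
j = tape.jacobian(y, delta)
```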

And this one computes the Hessian.
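A sketch of a Hessian via nested tapes, with y = Σxᵢ³ as an assumed example (the guide combines tape.gradient with tape.jacobian in the same way):

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0])

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = tf.reduce_sum(x ** 3)     # y = x0^3 + x1^3

    g = t1.gradient(y, x)             # [3*x0^2, 3*x1^2]

# Jacobian of the gradient = Hessian: diag(6*x0, 6*x1)
hessian = t2.jacobian(g, x)
```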

Credits and References

TensorFlow Guide: AutoDiff and Advanced AutoDiff
