# TensorFlow Automatic Differentiation (AutoDiff)

---

The Keras API can perform backpropagation easily using built-in optimizers and loss functions. However, there are cases where we want to manipulate or apply gradients explicitly. For example, to avoid exploding gradients, we may want to clip them.

In general, TensorFlow AutoDiff allows us to compute and manipulate gradients. In the example below, we compute and plot the derivative of the sigmoid function. In deep learning, we use AutoDiff to perform custom backpropagation.
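A sketch of that sigmoid example (the range and number of sample points are illustrative choices; plotting with matplotlib is left as a comment to keep the snippet self-contained):

```python
import tensorflow as tf

# Compute the derivative of the sigmoid over a range of x values.
x = tf.linspace(-10.0, 10.0, 201)

with tf.GradientTape() as tape:
    tape.watch(x)            # x is a Tensor, so it must be watched explicitly
    y = tf.nn.sigmoid(x)

dy_dx = tape.gradient(y, x)  # sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)), peaking at 0.25

# To plot: import matplotlib.pyplot as plt; plt.plot(x, y); plt.plot(x, dy_dx)
```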

# GradientTape: Recording the Forward Pass

In TensorFlow (TF), *tf.GradientTape* records all forward-pass operations within its “with” block to a “tape”. This record is used later by *tape.gradient* to calculate the backpropagation gradients. The code below records the operation *loss* = *w²*. Then, *tape.gradient*(*loss, w*) computes the loss gradient w.r.t. *w* (i.e. d*L*/d*w*).
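A minimal sketch of this recording (the initial value 3.0 is my choice for illustration):

```python
import tensorflow as tf

w = tf.Variable(3.0)

with tf.GradientTape() as tape:
    loss = w * w                 # forward pass recorded on the tape

dl_dw = tape.gradient(loss, w)   # dL/dw = 2w
print(dl_dw.numpy())             # 6.0
```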

Here is an example of a dense layer in which the input *x* is a vector instead of a scalar. The returned gradient for *w* (*dl_dw*) is a tensor with shape (3, 2). It has the same shape as *w* as it contains one gradient for each element in *w*.
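A sketch along those lines (the random initialization, the input values, and the mean-square loss are illustrative choices):

```python
import tensorflow as tf

w = tf.Variable(tf.random.normal((3, 2)), name='w')
b = tf.Variable(tf.zeros(2, dtype=tf.float32), name='b')
x = tf.constant([[1.0, 2.0, 3.0]])

with tf.GradientTape() as tape:
    y = x @ w + b                     # dense layer: y = xW + b
    loss = tf.reduce_mean(y ** 2)

# dl_dw has the same shape (3, 2) as w: one gradient per element.
[dl_dw, dl_db] = tape.gradient(loss, [w, b])
print(dl_dw.shape)                    # (3, 2)
```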

To apply the gradient descent to a neural network, we change its weights according to the learning rate and the gradients.
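In its simplest form, a manual gradient-descent step looks like the sketch below (the learning rate 0.1 is an arbitrary choice):

```python
import tensorflow as tf

learning_rate = 0.1
w = tf.Variable(2.0)

with tf.GradientTape() as tape:
    loss = w * w

grad = tape.gradient(loss, w)        # dL/dw = 2w = 4.0
w.assign_sub(learning_rate * grad)   # w <- w - lr * grad
print(w.numpy())                     # 2.0 - 0.1 * 4.0 = 1.6
```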

# AutoDiff for a Neural Network Model

Let’s complete our demonstrations with the MNIST classifier below. The Keras model provides the property *mnist_model.trainable_variables*, so we do not need to keep track of all trainable variables ourselves.

*tape.gradient* returns a list (*grads*) that contains one gradient Tensor for each trainable variable. Then we can apply gradient descent using our choice of optimizer.

For completeness, here is the complete code.
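Since the original shows the code as screenshots, here is a condensed sketch in the same spirit. The layer sizes and SGD settings are illustrative, and a random stand-in batch replaces the real MNIST download to keep the snippet self-contained:

```python
import tensorflow as tf

# A small classifier; the architecture is illustrative.
mnist_model = tf.keras.Sequential([
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10)])

optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# Random stand-in batch instead of the real MNIST download.
images = tf.random.normal((32, 28, 28))
labels = tf.random.uniform((32,), maxval=10, dtype=tf.int64)

with tf.GradientTape() as tape:
    logits = mnist_model(images, training=True)
    loss = loss_fn(labels, logits)

# One gradient tensor per trainable variable, applied by the optimizer.
grads = tape.gradient(loss, mnist_model.trainable_variables)
optimizer.apply_gradients(zip(grads, mnist_model.trainable_variables))
```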

Note: I use screenshots for the code because they are better formatted. Readers who want the source code can refer to the TensorFlow guide, where most of the code here originates. With the constant changes in the TF API, it is hard to keep the code in this article up to date, so readers should always refer to the latest documentation.

# Vector Output

The computed *y* (say, a loss) shown before is a scalar value. But *y* can be a vector. For example, *y* = *x* × [3., 4.] is a vector with 2 components, and d*y*/d*x* = [3., 4.]. But *tape.gradient* always returns a gradient with the same shape as *x*. Indeed, *tape.gradient* returns the sum of the components, which equals 7 here.
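A sketch of that summing behavior (the value 2.0 for *x* is arbitrary; the gradient does not depend on it here):

```python
import tensorflow as tf

x = tf.Variable(2.0)

with tf.GradientTape() as tape:
    y = x * [3.0, 4.0]       # y is a vector: [6., 8.]

# tape.gradient sums the per-component gradients: 3 + 4 = 7.
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())         # 7.0
```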

**Persistent**

The resources held by the tape are released as soon as *tape.gradient* is called, so we cannot call the same tape multiple times for the derivatives of different variables. To allow multiple calls, create the tape with *persistent=True*. Use *del tape* to release the resources afterward when the tape is no longer needed.
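A minimal sketch of a persistent tape (the functions *y* = *x²* and *z* = *x⁴* are illustrative):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape(persistent=True) as tape:
    y = x * x        # y = x^2
    z = y * y        # z = x^4

# With persistent=True the tape survives multiple gradient calls.
dz_dx = tape.gradient(z, x)   # 4x^3 = 108.0
dy_dx = tape.gradient(y, x)   # 2x   = 6.0
del tape                      # release the tape's resources
```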

**Custom Gradient**

TensorFlow computes gradients automatically. Nevertheless, some gradient calculations are not computable or are numerically unstable. In the latter case, an intermediate result may approach infinity even though it cancels out eventually. For example, the gradient calculated below returns NaN (not a number) for *x* = 100 even though the actual gradient is essentially 1.0.
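A sketch of the unstable case for *y* = log(1 + eˣ):

```python
import tensorflow as tf

x = tf.constant(100.0)

with tf.GradientTape() as tape:
    tape.watch(x)
    y = tf.math.log(1 + tf.exp(x))   # exp(100) overflows to inf in float32

# The chain rule produces inf/inf = NaN, although the true gradient is ~1.0.
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())                 # nan
```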

To solve this, we can add the *@tf.custom_gradient* decorator and implement a more stable, custom gradient calculation. The decorated method returns both the loss and the custom gradient function. But only the loss is returned to the caller; the custom function is recorded and used later by *tape.gradient* to calculate the gradients. As shown below, the gradient of *log*(*1 + eˣ*) is implemented as *1 − 1/(1 + eˣ)*, which equals 1 for *x* = 100.
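A sketch of that stable implementation:

```python
import tensorflow as tf

@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)
    def grad(upstream):
        # Numerically stable form of the derivative: 1 - 1/(1 + e^x)
        return upstream * (1 - 1 / (1 + e))
    return tf.math.log(1 + e), grad   # loss plus the custom gradient function

x = tf.constant(100.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = log1pexp(x)

dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())                  # 1.0
```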

# Gradient Clipping by Norm

One major application of manipulating gradients explicitly is clipping them by norm to avoid gradient explosion. Here is an example demonstrating the idea.
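A sketch using *tf.clip_by_norm* (the weights and the clip norm of 5.0 are illustrative choices):

```python
import tensorflow as tf

w = tf.Variable([[3.0], [4.0]])

with tf.GradientTape() as tape:
    loss = tf.reduce_sum(w * w)

grad = tape.gradient(loss, w)                  # [[6.], [8.]], norm = 10
clipped = tf.clip_by_norm(grad, clip_norm=5.0) # rescaled so the norm is 5
print(tf.norm(clipped).numpy())                # 5.0
```

For a list of gradients (one per trainable variable), *tf.clip_by_global_norm* clips them jointly while preserving their relative scale.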

# Getting a gradient of `None`

In some cases, the gradient returned will be None. In this section, we will see why, and how to fix it when this is not intended.

**Non-trainable variables, Tensors, Constants**

*tf.GradientTape* records operations within the “with” block and tracks all trainable *tf.Variable*s that these operations involve. However, non-trainable variables, Tensors, and constants are not tracked automatically. Below, they are *x1* (a non-trainable *tf.Variable*), *x2* (a Tensor), and *x3* (a *tf.constant*). Their corresponding calculated gradients are None.
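A sketch of those three cases, plus a trainable variable *x0* added here for contrast:

```python
import tensorflow as tf

x0 = tf.Variable(3.0)                   # trainable -> tracked automatically
x1 = tf.Variable(3.0, trainable=False)  # non-trainable variable -> not tracked
x2 = tf.Variable(2.0) + 1.0             # the addition yields a Tensor -> not tracked
x3 = tf.constant(3.0)                   # constant -> not tracked

with tf.GradientTape() as tape:
    y = x0**2 + x1**2 + x2**2 + x3**2

grads = tape.gradient(y, [x0, x1, x2, x3])
print([None if g is None else g.numpy() for g in grads])  # [6.0, None, None, None]
```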

**watch**

If this is not desirable, we can fix it by watching them explicitly with *tape.watch*. In the example below, after also watching the input *x*, we can compute its gradient correctly.
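A minimal sketch of watching a constant input:

```python
import tensorflow as tf

x = tf.constant(3.0)

with tf.GradientTape() as tape:
    tape.watch(x)        # explicitly track the constant input
    y = x * x

dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())     # 6.0
```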

By the way, intermediate results within the *GradientTape* block (like *y* below) are recorded automatically even though they are Tensors. The gradient *dz/dy* returns 18 below.
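A sketch of that intermediate-result case:

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x * x            # intermediate Tensor, recorded automatically
    z = y * y

dz_dy = tape.gradient(z, y)   # dz/dy = 2y = 18 at x = 3
print(dz_dy.numpy())          # 18.0
```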

**None**

As expected, *tape.gradient* returns None for gradients w.r.t. unrelated variables. In the code below, *z* is unrelated to *x*, and therefore the returned gradient is None.
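A minimal sketch (the variable names and values are illustrative):

```python
import tensorflow as tf

x = tf.Variable(2.0)
s = tf.Variable(5.0)

with tf.GradientTape() as tape:
    z = s * s                 # z does not involve x

dz_dx = tape.gradient(z, x)
print(dz_dx)                  # None
```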

Operations performed outside of TensorFlow, like NumPy operations, cannot be traced. So the gradient of *y* computed with NumPy below cannot be computed. Also, operations involving integer or string data types are not differentiable. For example, the cast from an integer to a float is not differentiable, and the corresponding gradient is None.
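A sketch of both cases (the specific values are illustrative):

```python
import tensorflow as tf
import numpy as np

# NumPy leaves the TensorFlow graph, so the tape cannot trace the operation.
x = tf.Variable([2.0, 2.0])
with tf.GradientTape() as tape:
    y = np.square(x)               # computed outside TF
    y = tf.reduce_mean(y)
dy_dx = tape.gradient(y, x)
print(dy_dx)                       # None

# Casting an integer to a float is not differentiable either.
xi = tf.constant(10)
with tf.GradientTape() as tape:
    tape.watch(xi)                 # TF warns: the watched dtype should be floating
    yf = tf.cast(xi, tf.float32) ** 2
dy_dxi = tape.gradient(yf, xi)
print(dy_dxi)                      # None
```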

The assignment *x = x + 1* turns *x* into a Tensor. So when we start the second epoch and a new taping session, *x* is no longer watched automatically. Therefore, for the second epoch, the gradient w.r.t. *x* is None.
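A sketch of that two-epoch pitfall:

```python
import tensorflow as tf

x = tf.Variable(2.0)

for epoch in range(2):
    with tf.GradientTape() as tape:
        y = x + 1

    grad = tape.gradient(y, x)
    print(type(x).__name__, grad)   # a Variable first, then a plain Tensor

    x = x + 1.0   # x is now a Tensor, no longer watched automatically
```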

When TF reads from a stateful object like a *tf.Variable*, the tape can only see the current state, not the history that led to it. Operations that change a variable’s state block gradient propagation. Hence, the *assign_add* operation on the left results in a None gradient. (It seems the assignment records the changed state but not the operation.) To fix this, we add the variables together and assign the result to a Tensor instead. This is shown on the right below.
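A sketch of the blocked case followed by the fix (the variable values are illustrative):

```python
import tensorflow as tf

x0 = tf.Variable(3.0)
x1 = tf.Variable(0.0)

# assign_add changes x1's state; the tape cannot trace through it.
with tf.GradientTape() as tape:
    x1.assign_add(x0)            # x1 becomes 3.0, but the op is not recorded
    y = x1 ** 2
g_blocked = tape.gradient(y, x0)
print(g_blocked)                 # None

# Add into a Tensor instead, so the addition itself is recorded.
x1.assign(0.0)
with tf.GradientTape() as tape:
    z = x1 + x0                  # z is a Tensor
    y = z ** 2
g_ok = tape.gradient(y, x0)
print(g_ok.numpy())              # dy/dx0 = 2(x1 + x0) = 6.0
```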

# Higher-order gradients

*tape.gradient* can be viewed as just another TensorFlow operator. Therefore, we can use nested *GradientTape*s to compute higher-order gradients. In the inner “with” block below, we compute the first-order derivative, while the outer block traces *dy/dx* and computes its derivative, i.e. the second-order derivative of *y* w.r.t. *x*.
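A sketch with *y* = *x³* as an illustrative function:

```python
import tensorflow as tf

x = tf.Variable(1.0)

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = x * x * x            # y = x^3
    dy_dx = t1.gradient(y, x)    # first order:  3x^2 = 3.0
d2y_dx2 = t2.gradient(dy_dx, x)  # second order: 6x   = 6.0

print(dy_dx.numpy(), d2y_dx2.numpy())   # 3.0 6.0
```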

# Jacobian & Hessian

TensorFlow also provides an API to compute the Jacobian matrix *J*.

The code below computes the Jacobian w.r.t. a scalar variable.
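A sketch using *tape.jacobian* (the scalar variable *w* and the input values are illustrative):

```python
import tensorflow as tf

w = tf.Variable(2.0)                 # scalar variable
x = tf.constant([1.0, 2.0, 3.0])

with tf.GradientTape() as tape:
    y = x * w                        # vector output

# One partial derivative per element of y: dy_i/dw = x_i.
j = tape.jacobian(y, w)
print(j.numpy())                     # [1. 2. 3.]
```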

And this one computes the Hessian.
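A sketch of a Hessian via nested tapes, using *y* = Σ *xᵢ³* as an illustrative function:

```python
import tensorflow as tf

x = tf.Variable([1.0, 2.0])

with tf.GradientTape() as t2:
    with tf.GradientTape() as t1:
        y = tf.reduce_sum(x ** 3)    # y = x0^3 + x1^3
    g = t1.gradient(y, x)            # gradient: [3*x0^2, 3*x1^2] = [3., 12.]
hessian = t2.jacobian(g, x)          # diagonal 6*x_i: [[6., 0.], [0., 12.]]

print(hessian.numpy())
```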

# Credits and References

TensorFlow Guide: AutoDiff and Advanced AutoDiff