TensorFlow RNN models


Keras has 3 built-in RNN layers: SimpleRNN, LSTM, and GRU.

LSTM

Starting with a vocabulary size of 1000, a word can be represented by a word index between 0 and 999. For example, the word “side” can be encoded as integer 3.


In the code example below, the input to the embedding layer is a sequence of word indexes representing a text. This layer transforms the text into a sequence of 64-D vectors — one vector per word.


Next, an LSTM layer converts this sequence of vectors into a single 128-D vector. Finally, a dense layer converts it to a 10-D vector, producing one score per class.


Here is a summary of the model.

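Based on the description above (vocabulary size 1000, 64-D embeddings, a 128-unit LSTM, and 10 classes), a minimal sketch of the model could look like this:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),        # a sequence of word indexes
    layers.Embedding(input_dim=1000, output_dim=64),  # one 64-D vector per word
    layers.LSTM(128),                                 # last hidden state: a 128-D vector
    layers.Dense(10),                                 # one score per class
])
model.summary()
```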

As a reference, an LSTM cell maintains a hidden state and a cell state across timesteps. In the code example above, the LSTM returns the hidden state (a 128-D vector) at the last timestep as the output.


GRU

Let’s replace the LSTM layer with a GRU and set return_sequences to True, which returns the hidden states from every timestep instead of just the last one. Each hidden state of the GRU is fed into the corresponding input of a SimpleRNN layer. We take the last hidden state of the SimpleRNN and feed it into a dense layer for classification.


Here is the corresponding code and model summary:

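A sketch of this stack (the GRU and SimpleRNN widths here are illustrative):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(None,), dtype="int32"),
    layers.Embedding(input_dim=1000, output_dim=64),
    # return_sequences=True: emit the hidden state at every timestep.
    layers.GRU(256, return_sequences=True),
    # SimpleRNN consumes all GRU hidden states and returns only its last one.
    layers.SimpleRNN(128),
    layers.Dense(10),
])
model.summary()
```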


return_sequences

As shown before, we set return_sequences of a GRU layer to True to return all hidden states. In fact, this parameter is supported by LSTM and SimpleRNN as well.

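A quick illustration of the difference (layer sizes here are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 5, 8))  # (batch, timesteps, features)

last = layers.LSTM(4)(x)                         # only the last hidden state
full = layers.LSTM(4, return_sequences=True)(x)  # one hidden state per timestep

print(last.shape)  # (2, 4)
print(full.shape)  # (2, 5, 4)
```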

return_state

By setting return_state to True, an LSTM/GRU/SimpleRNN layer returns the output as well as the hidden state at the last timestep. For LSTM, it also returns the cell state at the last timestep. In the example below, “output” has the same value as the last hidden state state_h, which is redundant. But if return_sequences is also True, “output” contains all hidden states, not just state_h at the last timestep.

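For example (sizes are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((2, 5, 8))

# With return_state=True, LSTM returns (output, state_h, state_c).
output, state_h, state_c = layers.LSTM(4, return_state=True)(x)

# Without return_sequences, "output" equals the last hidden state state_h.
print(bool(tf.reduce_all(output == state_h)))  # True
```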

initial_state

The initial_state tensors are the hidden state and cell state fed into the first timestep. By default, the initial state tensors of LSTM and GRU are zero-filled. But in an encoder-decoder architecture, we can use the last hidden state and cell state of the encoder (state_h and state_c) to initialize the decoder.

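A sketch of that encoder-decoder wiring (the vocabulary and layer sizes are illustrative):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Encoder: keep the last hidden state and cell state.
encoder_input = keras.Input(shape=(None,), dtype="int32")
encoder_emb = layers.Embedding(input_dim=1000, output_dim=64)(encoder_input)
encoder_out, state_h, state_c = layers.LSTM(64, return_state=True)(encoder_emb)

# Decoder: start from the encoder's states instead of zeros.
decoder_input = keras.Input(shape=(None,), dtype="int32")
decoder_emb = layers.Embedding(input_dim=1000, output_dim=64)(decoder_input)
decoder_out = layers.LSTM(64)(decoder_emb, initial_state=[state_h, state_c])

model = keras.Model([encoder_input, decoder_input], layers.Dense(10)(decoder_out))
```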

Cross-batch statefulness

By default, the initial states of the RNN cells are reset for every batch of samples. However, there are situations where we want to keep the states between batches. For example, in meta-learning, we keep learning from previous experience and do not want to reset that experience. In other cases, the input sequence may be too long, so we break it up into sub-sequences during training; in this situation, we do not reset the state between sub-sequences. To keep the state of a cell between batches, we set stateful=True. To reset it, we call lstm_layer.reset_states().


Here is an example in which we treat three paragraphs as a single sample. We keep the cell states throughout and reset them only when the sample is done.

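A sketch of this, in the spirit of the TensorFlow guide the article draws from:

```python
import numpy as np
from tensorflow.keras import layers

# Three sub-sequences of one long sample: (batch, timesteps, features).
paragraph1 = np.random.random((20, 10, 50)).astype(np.float32)
paragraph2 = np.random.random((20, 10, 50)).astype(np.float32)
paragraph3 = np.random.random((20, 10, 50)).astype(np.float32)

# stateful=True: the final states of one call seed the next call.
lstm_layer = layers.LSTM(64, stateful=True)
output = lstm_layer(paragraph1)
output = lstm_layer(paragraph2)
output = lstm_layer(paragraph3)

# The sample is done: reset the cached states back to zeros.
lstm_layer.reset_states()
```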

Bidirectional RNNs

A bidirectional RNN contains a forward LSTM and a backward LSTM. For each timestep, we merge the results from the forward pass and the backward pass to generate an output. There are different options for how the merge is done, for example concatenation, addition, or multiplication.


Here is the code for constructing a classifier using bidirectional layers.

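A sketch matching the shapes discussed below:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(5, 10)),
    # Forward and backward outputs (64 each) are concatenated -> 128-D.
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    # Only the merged last outputs are returned -> 64-D.
    layers.Bidirectional(layers.LSTM(32)),
    layers.Dense(10),
])
model.summary()
```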

The first bidirectional LSTM has an input shape of (None, 5, 10). With return_sequences=True, it outputs 5 hidden states. By default, a bidirectional LSTM concatenates the forward- and backward-pass results (merge_mode='concat'). Hence, the output of the first layer is (None, 5, 128), which doubles the output dimension of a single forward LSTM layer.


The second bidirectional layer outputs only one vector (by default, return_sequences=False): the merged last outputs of the forward pass and the backward pass. Again, by default, the merge is concatenation, so the output shape is (None, 64), since both the forward and backward LSTM output a 32-D vector.


Example

Here is an example model using TextVectorization, Embedding, bidirectional LSTM, and Dense layers to classify the sentiment of a movie review (positive or negative).
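A hedged sketch of such a model; the toy corpus and layer sizes are stand-ins for the actual dataset and hyperparameters:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Toy corpus standing in for the movie-review training texts.
reviews = tf.constant(["a great movie", "a terrible movie"])

vectorizer = layers.TextVectorization(max_tokens=1000, output_sequence_length=20)
vectorizer.adapt(reviews)  # build the vocabulary from the texts

model = keras.Sequential([
    vectorizer,                                       # raw text -> word indexes
    layers.Embedding(input_dim=1000, output_dim=64, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),            # positive vs. negative
])
```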


CuDNN performance

In TensorFlow 2, the built-in LSTM and GRU layers leverage CuDNN kernels by default when an Nvidia GPU is available. However, if any of the default configurations below are changed, CuDNN will not be used, so be aware of the performance impact of choosing a non-standard configuration:

  • Changing the activation function from tanh.
  • Changing the recurrent_activation function from sigmoid.
  • Changing recurrent_dropout from 0.
  • Changing unroll from False.
  • Changing use_bias from True.
  • Using masking without right padding (discussed later).
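For example (whether the fused kernel is actually picked also depends on the runtime and GPU, so this only illustrates the configurations):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 3, 8))

# Default configuration: eligible for the fused CuDNN kernel on a GPU.
fast_lstm = layers.LSTM(64)
fast_out = fast_lstm(x)

# Changing any of the settings above falls back to the generic kernel,
# e.g. a non-zero recurrent_dropout:
slow_lstm = layers.LSTM(64, recurrent_dropout=0.2)
slow_out = slow_lstm(x)
```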

Variable Size RNN/LSTM/GRU Time Sequence

RNN, LSTM, and GRU layers in TF can handle variable-size time sequences nicely without extra coding. You can feed data into model(input) with “input” having a different number of timesteps (sequence length) on each call. The real issue is in training: training takes an input tensor of shape (None, None, embedding_dim), in which the first dimension is the batch size and the second is the sequence length.

Padding

Unless you have a batch size of one in training, you need to pad the input to a fixed length, as in the code below.

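For instance, with pad_sequences (the word indexes are illustrative):

```python
import tensorflow as tf

raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

# Zero-pad at the end ("post") so every sequence has the same length.
padded_inputs = tf.keras.utils.pad_sequences(raw_inputs, padding="post")
print(padded_inputs.shape)  # (3, 6)
```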

The mask_zero flag in Embedding instructs the layer to treat the value zero as padding and ignore the corresponding input timesteps.


If mask_zero is true, the Embedding layer also generates a separate mask tensor masked_output._keras_mask for the corresponding input.

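For example (the vocabulary size and embedding width are arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

padded_inputs = tf.constant([
    [711, 632, 71, 0, 0],
    [73, 8, 3215, 55, 927],
])

# mask_zero=True: index 0 is treated as padding.
embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padded_inputs)

# Boolean mask: False where the input was 0.
print(masked_output._keras_mask)
```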

This mask tensor will be propagated to the next layer.


Here is the code for setting up an embedding layer with masking in a Sequential Model.

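A minimal Sequential model with masking:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Embedding creates the mask; the LSTM consumes it and skips padded steps.
model = keras.Sequential([
    layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True),
    layers.LSTM(32),
])
out = model(tf.constant([[10, 20, 0, 0]]))
```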

Custom Layer

The mask information is passed to a layer as the “mask” argument of “call”. In the code below, before computing the softmax value, the layer zeroes out the scores corresponding to the padded input.

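A sketch of such a layer, modeled on the masked temporal softmax in the TensorFlow masking guide:

```python
import tensorflow as tf
from tensorflow import keras

class TemporalSoftmax(keras.layers.Layer):
    """Softmax over features that zeroes out padded timesteps."""

    def call(self, inputs, mask=None):
        # Zero the scores of padded timesteps before normalizing.
        broadcast_float_mask = tf.expand_dims(tf.cast(mask, "float32"), -1)
        inputs_exp = tf.exp(inputs) * broadcast_float_mask
        inputs_sum = tf.reduce_sum(
            inputs_exp * broadcast_float_mask, axis=-1, keepdims=True)
        return inputs_exp / inputs_sum
```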

However, by default, the mask is passed only once and is destroyed after this layer. To pass a mask on to the next layer, set supports_masking to True.

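For example, a pass-through activation layer that keeps the mask alive:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class MyActivation(keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Forward any incoming mask to the next layer unchanged.
        self.supports_masking = True

    def call(self, inputs):
        return tf.nn.relu(inputs)

x = layers.Embedding(input_dim=10, output_dim=4, mask_zero=True)(
    tf.constant([[1, 2, 0]]))
y = MyActivation()(x)
print(y._keras_mask)  # the mask survives this layer
```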

Generate mask in a custom layer

A layer can not only consume a mask but also create one. This is done by implementing “compute_mask”.

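A sketch of a mask-producing layer, modeled on the CustomEmbedding example in the TensorFlow masking guide:

```python
import tensorflow as tf
from tensorflow import keras

class CustomEmbedding(keras.layers.Layer):
    def __init__(self, input_dim, output_dim, mask_zero=False, **kwargs):
        super().__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.mask_zero = mask_zero

    def build(self, input_shape):
        self.embeddings = self.add_weight(
            shape=(self.input_dim, self.output_dim),
            initializer="random_normal")

    def call(self, inputs):
        return tf.nn.embedding_lookup(self.embeddings, inputs)

    def compute_mask(self, inputs, mask=None):
        # True for real tokens, False for padding (index 0).
        if not self.mask_zero:
            return None
        return tf.not_equal(inputs, 0)
```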

Credits and References

TensorFlow tutorial

TensorFlow guide
