TensorFlow RNN models

Keras has three built-in RNN layers: SimpleRNN, LSTM, and GRU.


Starting with a vocabulary size of 1000, a word can be represented by a word index between 0 and 999. For example, the word “side” can be encoded as integer 3.

In the code example below, the input to the embedding layer is a sequence of word indexes representing a text. This layer transforms the text into a sequence of 64-D vectors — one vector per word.

Next, an LSTM layer converts this sequence of vectors into a single 128-D vector. Finally, a dense layer converts it into a 10-D vector that holds one prediction score for each of the 10 classes.
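The model described above can be sketched as follows. This is a minimal reconstruction from the dimensions given in the text (vocabulary of 1000, 64-D embeddings, 128 LSTM units, 10 classes). The explicit Input line and the random sample batch are assumptions added for illustration.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

model = tf.keras.Sequential([
    # Variable-length sequences of integer word indexes.
    tf.keras.Input(shape=(None,), dtype="int32"),
    # Map each word index (0..999) to a 64-D vector.
    layers.Embedding(input_dim=1000, output_dim=64),
    # Summarize the whole sequence as the last hidden state (128-D).
    layers.LSTM(128),
    # One score (logit) per class.
    layers.Dense(10),
])
model.summary()

# Hypothetical batch: 2 texts, 7 word indexes each.
batch = np.random.randint(0, 1000, size=(2, 7)).astype("int32")
logits = model(batch)
print(logits.shape)  # (2, 10)
```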

Here is a summary of the model.

For reference, the rounded rectangle below is an LSTM cell. In the code example above, the LSTM returns the hidden state (a 128-D vector) at the last timestep as its output.

Modified from source 1 & 2


Let’s replace the LSTM module with a GRU and set return_sequences to True, which returns the hidden states from every timestep instead of only the last one. In the diagram below, each hidden state of the GRU is fed into the corresponding input of the SimpleRNN layer. We take the last hidden state of the SimpleRNN and feed it into a dense layer for classification.

Here is the corresponding code:
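A minimal sketch of this stacked model is shown here. The layer sizes (256 GRU units, 128 SimpleRNN units, 10 classes) and the sample batch are assumptions chosen for illustration; the text does not state them.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

stacked_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),
    layers.Embedding(input_dim=1000, output_dim=64),
    # return_sequences=True emits one hidden state per timestep:
    # shape (batch, timesteps, 256).
    layers.GRU(256, return_sequences=True),
    # Consumes the GRU states step by step and keeps only its own
    # last hidden state: shape (batch, 128).
    layers.SimpleRNN(128),
    layers.Dense(10),
])
stacked_model.summary()

batch = np.random.randint(0, 1000, size=(2, 7)).astype("int32")
scores = stacked_model(batch)
print(scores.shape)  # (2, 10)
```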

and the model summary.


As shown before, we set return_sequences of a GRU layer to True to return all hidden states. This parameter is also supported by LSTM and SimpleRNN.


By setting return_state to True, an LSTM/GRU/SimpleRNN layer returns the output as well as the hidden state at the last timestep. An LSTM also returns the cell state at the last timestep. In the example below, “output” has the same value as the last hidden state state_h, so it is redundant. But if return_sequences is also True, “output” contains the hidden states of all timesteps, not just state_h at the last timestep.
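The behavior of return_state can be checked with a short sketch like this one. The input shape and the unit count (16) are arbitrary choices for illustration.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

# (batch, timesteps, features), random data for illustration only.
x = np.random.random((4, 5, 8)).astype("float32")

# LSTM: output plus the last hidden state and the last cell state.
lstm = layers.LSTM(16, return_state=True)
output, state_h, state_c = lstm(x)
print(output.shape, state_h.shape, state_c.shape)  # (4, 16) each
# Without return_sequences, output duplicates state_h.
output_equals_state_h = np.allclose(output, state_h)

# With return_sequences=True, output holds every timestep.
lstm_seq = layers.LSTM(16, return_sequences=True, return_state=True)
seq_output, seq_h, seq_c = lstm_seq(x)
print(seq_output.shape)  # (4, 5, 16)
# Its last timestep is the final hidden state.
last_step_equals_h = np.allclose(seq_output[:, -1, :], seq_h)

# GRU has no cell state, so it returns only output and one state.
gru_output, gru_state = layers.GRU(16, return_state=True)(x)
```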


The initial_state tensor is the input hidden state and the cell state for the first timestep. By default, the initial state tensors in LSTM and GRU are zero-filled. But in an encoder-decoder architecture, we can use the last hidden state and the cell state of the encoder (state_h and state_c) to initialize the decoder.
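As a rough sketch of the encoder-decoder case, the encoder's final states can be passed to the decoder through initial_state. The shapes, the unit count (64) and the random inputs are assumptions for illustration; the encoder and decoder must use the same number of units for the states to fit.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

# Hypothetical already-embedded inputs: (batch, timesteps, features).
encoder_input = np.random.random((2, 6, 32)).astype("float32")
decoder_input = np.random.random((2, 4, 32)).astype("float32")

# The encoder exposes its last hidden state and cell state.
encoder = layers.LSTM(64, return_state=True)
_, state_h, state_c = encoder(encoder_input)

# The decoder starts from the encoder states instead of zeros.
decoder = layers.LSTM(64)
decoder_output = decoder(decoder_input, initial_state=[state_h, state_c])
print(decoder_output.shape)  # (2, 64)

# For comparison: the same decoder with the default zero initial state.
zero_init_output = decoder(decoder_input)
```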

Cross-batch statefulness

By default, the states of the RNN cells are reset for every batch of samples. However, there are situations where we want to keep the states between batches. For example, in meta-learning, we keep learning from previous experience and do not want to reset it. In other cases, the input sequence may be too long, so we break it up into sub-sequences during training and do not reset the state between sub-sequences. To keep the state of a cell between batches, we set stateful=True. To reset it, we call lstm_layer.reset_states().

Here is an example in which we treat 3 paragraphs as a single sample. We keep the cell states throughout the process and reset them only when it is done.
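A minimal sketch of this pattern follows. The paragraph shapes (20 samples, 10 timesteps, 50 features) and the unit count are assumptions for illustration. A stateful layer needs the same batch size on every call.

```python
import numpy as np
import tensorflow as tf

# Three consecutive chunks of the same 20 long sequences.
paragraph1 = np.random.random((20, 10, 50)).astype("float32")
paragraph2 = np.random.random((20, 10, 50)).astype("float32")
paragraph3 = np.random.random((20, 10, 50)).astype("float32")

lstm_layer = tf.keras.layers.LSTM(64, stateful=True)

# The state left by each call is the starting state of the next call.
first_output = lstm_layer(paragraph1)
output = lstm_layer(paragraph2)
output = lstm_layer(paragraph3)

# Done with this sample: go back to the zero initial state.
lstm_layer.reset_states()

# After the reset, paragraph1 gives the same result as the first call.
output_after_reset = lstm_layer(paragraph1)
```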


Bidirectional RNNs

The following diagram shows a bidirectional RNN, which contains a forward LSTM and a backward LSTM. For each timestep, we merge the results of the forward pass and the backward pass to generate an output. There are different options for how the merge is done, for example concatenation, addition, or multiplication.

Here is the code for constructing a classifier using bidirectional layers.

The first bidirectional LSTM has an input shape of (None, 5, 10). With return_sequences=True, it outputs 5 hidden states. By default, a bidirectional LSTM concatenates the results of the forward and backward passes (merge_mode="concat"). Hence, the output shape of the first layer is (None, 5, 128), which doubles the 64-D output of a single forward LSTM layer.

The second bidirectional layer outputs only one vector (by default, return_sequences=False). Its output merges the last outputs of the forward pass and the backward pass. Again, the default merge is concatenation, so the output shape is (None, 64), since the forward and the backward LSTM each output a 32-D vector.
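The shapes discussed above can be reproduced with a sketch like the following. The unit counts (64 and 32) follow from the output shapes in the text; the final 10-class dense layer and the random batch are assumptions.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

bi_first = layers.Bidirectional(layers.LSTM(64, return_sequences=True))
bi_second = layers.Bidirectional(layers.LSTM(32))

bi_model = tf.keras.Sequential([
    tf.keras.Input(shape=(5, 10)),  # 5 timesteps, 10 features each
    bi_first,                       # -> (None, 5, 128): 64 forward + 64 backward
    bi_second,                      # -> (None, 64): 32 forward + 32 backward
    layers.Dense(10),
])
bi_model.summary()

x = np.random.random((3, 5, 10)).astype("float32")
h1 = bi_first(x)
h2 = bi_second(h1)
print(h1.shape, h2.shape)  # (3, 5, 128) (3, 64)

# A different merge option: summing keeps the width of a single LSTM.
summed = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True), merge_mode="sum"
)(x)
print(summed.shape)  # (3, 5, 64)
```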


Here is a model example of using TextVectorization, Embedding, Bidirectional LSTM, and Dense layer to classify the sentiment of a movie review (positive or negative).
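A rough sketch of such a classifier is given here. The tiny corpus, the layer sizes and the single-logit output are assumptions for illustration, and the model is untrained. To keep the sketch portable across Keras versions, the TextVectorization layer is applied as a preprocessing step rather than placed inside the model.

```python
import tensorflow as tf

layers = tf.keras.layers

# Hypothetical mini corpus, used only to build a vocabulary.
reviews = [
    "the movie was great",
    "the movie was terrible",
    "great acting",
    "terrible plot",
]

# Turn raw text into padded sequences of word indexes (0 = padding).
vectorizer = layers.TextVectorization(max_tokens=1000)
vectorizer.adapt(tf.data.Dataset.from_tensor_slices(reviews).batch(2))
vocab_size = len(vectorizer.get_vocabulary())

sentiment_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int64"),
    # mask_zero=True makes later layers skip the padded positions.
    layers.Embedding(vocab_size, 64, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(64, activation="relu"),
    # One logit: positive values lean positive, negative lean negative.
    layers.Dense(1),
])

token_ids = vectorizer(tf.constant(reviews))
sentiment_logits = sentiment_model(token_ids)
print(token_ids.shape, sentiment_logits.shape)  # (4, 4) (4, 1)
```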


CuDNN performance

In TensorFlow 2, the built-in LSTM and GRU layers will leverage CuDNN kernels by default when an Nvidia GPU is available. Nevertheless, if any of the default configurations below are changed, CuDNN will not be used. So be aware of the performance impact of choosing a non-standard configuration.

  • Change from the tanh activation function.
  • Change the recurrent_activation function from sigmoid.
  • Change recurrent_dropout from 0.
  • Change unroll from False.
  • Change use_bias from True.
  • If masking is used, pad the input in any way other than right padding (discussed later).
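The layer-argument conditions in this list can be checked against a layer's configuration. The helper below is hypothetical (it is not part of TensorFlow), it only compares the constructor arguments with the defaults listed above, and it does not detect whether a GPU is present or how the input is padded.

```python
import tensorflow as tf

layers = tf.keras.layers


def keeps_cudnn_defaults(layer):
    """Hypothetical helper: True if the layer keeps the listed defaults."""
    cfg = layer.get_config()
    return (
        cfg["activation"] == "tanh"
        and cfg["recurrent_activation"] == "sigmoid"
        and cfg["recurrent_dropout"] == 0
        and not cfg["unroll"]
        and cfg["use_bias"]
    )


default_lstm = layers.LSTM(64)
dropout_lstm = layers.LSTM(64, recurrent_dropout=0.2)
relu_lstm = layers.LSTM(64, activation="relu")

print(keeps_cudnn_defaults(default_lstm))  # True
print(keeps_cudnn_defaults(dropout_lstm))  # False
print(keeps_cudnn_defaults(relu_lstm))     # False
```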

Variable Size RNN/LSTM/GRU Time Sequence

RNN, LSTM, and GRU layers in TF handle variable-length time sequences without extra coding. You can call model(input) with “input” having a different number of timesteps (sequence length) each time. The real issue is in training. Training takes an input tensor of shape (None, None, embedding_dim), in which the first dimension is the batch size and the second dimension is the sequence length.
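For inference, this can be seen in a small sketch: the same model accepts sequences of different lengths, one call at a time. The layer sizes and the word indexes are arbitrary assumptions.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

var_model = tf.keras.Sequential([
    # None leaves the sequence length unspecified.
    tf.keras.Input(shape=(None,), dtype="int32"),
    layers.Embedding(input_dim=1000, output_dim=16),
    layers.LSTM(8),
])

short_text = np.array([[5, 42, 7]], dtype="int32")                  # 3 timesteps
long_text = np.array([[5, 42, 7, 99, 3, 12, 640]], dtype="int32")   # 7 timesteps

out_short = var_model(short_text)
out_long = var_model(long_text)
print(out_short.shape, out_long.shape)  # (1, 8) (1, 8)
```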


Unless you have a batch size of one in training, you need to pad the input to have a fixed length, like the code below.
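Padding can be sketched with the Keras pad_sequences utility. The word indexes are made-up values, and padding="post" pads on the right.

```python
import tensorflow as tf

# Three texts of different lengths, already encoded as word indexes.
raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

# Pad with zeros at the end so every row has the longest length (6).
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    raw_inputs, padding="post"
)
print(padded_inputs)
# [[ 711  632   71    0    0    0]
#  [  73    8 3215   55  927    0]
#  [  83   91    1  645 1253  927]]
```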

The mask_zero flag in Embedding instructs the layer to treat zero value as padding and ignore the corresponding input.

If mask_zero is true, the Embedding layer also generates a separate mask tensor masked_output._keras_mask for the corresponding input.

And this masked tensor will be propagated to the next layer.
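The mask itself can be inspected with a sketch like this. Instead of reading the private _keras_mask attribute, it calls the layer's public compute_mask method, which is assumed here to give the same boolean mask. The word indexes are made-up values.

```python
import tensorflow as tf

padded_ids = tf.constant([
    [711, 632, 71, 0, 0, 0],
    [73, 8, 3215, 55, 927, 0],
    [83, 91, 1, 645, 1253, 927],
])

embedding = tf.keras.layers.Embedding(
    input_dim=5000, output_dim=16, mask_zero=True
)
embedded = embedding(padded_ids)          # (3, 6, 16)
mask = embedding.compute_mask(padded_ids)  # (3, 6), False where the id is 0
print(mask)
```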

Here is the code for setting up an embedding layer with masking in a Sequential Model.
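A minimal sketch of such a model is shown here, with assumed sizes (vocabulary of 5000, 16-D embeddings, 32 LSTM units). It also checks the effect of the mask: a right-padded sequence yields the same LSTM output as the unpadded one.

```python
import numpy as np
import tensorflow as tf

layers = tf.keras.layers

masked_model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),
    # Index 0 is reserved for padding and produces a False mask entry.
    layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True),
    # The LSTM receives the mask and skips the padded timesteps.
    layers.LSTM(32),
])

unpadded = np.array([[711, 632, 71]], dtype="int32")
padded = np.array([[711, 632, 71, 0, 0, 0]], dtype="int32")

out_unpadded = masked_model(unpadded)
out_padded = masked_model(padded)
padding_ignored = np.allclose(out_unpadded, out_padded, atol=1e-5)
print(padding_ignored)
```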

Custom Layer

The mask information will be passed to a layer as “mask” in “call”. In the code below, before computing the softmax value, it masks all the scores corresponding to the padded input to zero.
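One way such a layer could look is sketched here. MaskedSoftmax is a hypothetical name, the softmax is taken over the time axis of a (batch, timesteps) score tensor, and the mask is passed explicitly for clarity. The exponentiated scores at padded positions are multiplied by zero before normalizing, and no numerical stabilization is attempted.

```python
import numpy as np
import tensorflow as tf


class MaskedSoftmax(tf.keras.layers.Layer):
    """Hypothetical layer: softmax over timesteps that ignores padding."""

    def call(self, inputs, mask=None):
        if mask is None:
            return tf.nn.softmax(inputs, axis=-1)
        # 1.0 for real timesteps, 0.0 for padded ones.
        float_mask = tf.cast(mask, inputs.dtype)
        # Zero out the contribution of padded positions.
        masked_exp = tf.exp(inputs) * float_mask
        return masked_exp / tf.reduce_sum(masked_exp, axis=-1, keepdims=True)


scores = tf.constant([[1.0, 2.0, 3.0, 0.0]])
score_mask = tf.constant([[True, True, True, False]])  # last step is padding

weights = MaskedSoftmax()(scores, mask=score_mask)
print(weights)  # the padded position gets weight 0; the rest sum to 1
```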

However, by default, the mask is passed only once and is destroyed after this layer. To pass a mask on to the next layer, set supports_masking to True.

Generate mask in a custom layer

A layer can consume a mask but also create a mask. This is done by implementing “compute_mask”.
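A small sketch of a mask-producing layer follows. OneHotWithMask is a hypothetical layer invented for illustration: it one-hot encodes integer ids and, through compute_mask, marks id 0 as padding. The method is called directly here to show what it returns.

```python
import tensorflow as tf


class OneHotWithMask(tf.keras.layers.Layer):
    """Hypothetical layer: one-hot encodes ids and treats id 0 as padding."""

    def __init__(self, depth, **kwargs):
        super().__init__(**kwargs)
        self.depth = depth

    def call(self, inputs):
        # (batch, timesteps) -> (batch, timesteps, depth)
        return tf.one_hot(inputs, self.depth)

    def compute_mask(self, inputs, mask=None):
        # True for real tokens, False wherever the id is 0.
        return tf.not_equal(inputs, 0)


ids = tf.constant([[3, 1, 0, 0], [2, 4, 1, 0]])
one_hot_layer = OneHotWithMask(depth=5)

encoded = one_hot_layer(ids)
generated_mask = one_hot_layer.compute_mask(ids)
print(encoded.shape)   # (2, 4, 5)
print(generated_mask)
```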

Credits and References

TensorFlow tutorial

TensorFlow guide
