TensorFlow RNN models

Keras has three built-in RNN layers: SimpleRNN, LSTM and GRU.

LSTM

In the code example below, the input to the embedding layer is a sequence of word indexes representing a text. This layer transforms the text into a sequence of 64-D vectors — one vector per word.

Next, an LSTM layer converts this sequence of vectors into a single 128-D vector. Finally, a dense layer converts it to a 10-D vector, producing one classification score per class.
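A minimal sketch of such a model (the vocabulary size of 1,000 words is an assumption; the 64-D embedding, 128-unit LSTM and 10-way dense layer follow the description above):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# Map each word index to a 64-D vector (vocabulary size of 1,000 is assumed).
model.add(layers.Embedding(input_dim=1000, output_dim=64))
# Convert the sequence of 64-D vectors into a single 128-D vector.
model.add(layers.LSTM(128))
# One score per class for a 10-class classification problem.
model.add(layers.Dense(10))
model.summary()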

Here is a summary of the model.

As a reference, the round rectangle below is an LSTM cell. In the code example above, the LSTM returns the hidden state (a 128-D vector) at the last timestep as the output.

Modified from source 1 & 2

GRU

Here is the corresponding code:
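A possible version, in the spirit of the TensorFlow guide (the 1,000-word vocabulary, the layer sizes and the SimpleRNN layer on top are assumptions; the key point is return_sequences=True on the GRU, which is referenced in a later section):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))
# Return the full sequence of hidden states instead of only the last one.
model.add(layers.GRU(256, return_sequences=True))
# A SimpleRNN layer consumes that sequence and returns only its last hidden state.
model.add(layers.SimpleRNN(128))
model.add(layers.Dense(10))
model.summary()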

and the model summary.

return_sequences

As shown before, we set return_sequences of the GRU layer to True to return all hidden states. In fact, this parameter is also supported by LSTM and SimpleRNN.
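For example, with a hypothetical batch of 32 sequences, each 10 timesteps long with 8 features:

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.random.normal([32, 10, 8])

# Only the last hidden state: shape (32, 64).
print(layers.LSTM(64)(inputs).shape)

# All hidden states: shape (32, 10, 64).
print(layers.LSTM(64, return_sequences=True)(inputs).shape)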

return_state
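With return_state=True, an LSTM also returns its final hidden state and cell state alongside its output (a GRU returns a single state tensor). A minimal sketch with assumed shapes:

import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.random.normal([32, 10, 8])

# With return_sequences=False, output equals state_h (the final hidden state);
# state_c is the final cell state.
output, state_h, state_c = layers.LSTM(64, return_state=True)(inputs)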

initial_state
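A typical use of initial_state is to start a decoder from the final states returned by an encoder. A sketch of this encoder-decoder pattern (the vocabulary sizes and dimensions are assumptions):

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

encoder_input = keras.Input(shape=(None,))
encoder_embedded = layers.Embedding(input_dim=1000, output_dim=64)(encoder_input)
# Keep the encoder's final hidden and cell states.
encoder_output, state_h, state_c = layers.LSTM(64, return_state=True)(encoder_embedded)

decoder_input = keras.Input(shape=(None,))
decoder_embedded = layers.Embedding(input_dim=1000, output_dim=64)(decoder_input)
# Start the decoder from the encoder's final states.
decoder_output = layers.LSTM(64)(decoder_embedded, initial_state=[state_h, state_c])

model = keras.Model([encoder_input, decoder_input], decoder_output)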

Cross-batch statefulness

Here is an example in which we treat 3 paragraphs as a single sample. We keep the internal states across the paragraphs and reset them only when the sample is done.
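A sketch of this pattern with a stateful LSTM (the array shapes are arbitrary placeholders):

import numpy as np
from tensorflow.keras import layers

paragraph1 = np.random.random((20, 10, 50)).astype(np.float32)
paragraph2 = np.random.random((20, 10, 50)).astype(np.float32)
paragraph3 = np.random.random((20, 10, 50)).astype(np.float32)

# stateful=True: the layer keeps its states between calls instead of resetting them.
lstm_layer = layers.LSTM(64, stateful=True)
output = lstm_layer(paragraph1)
output = lstm_layer(paragraph2)
output = lstm_layer(paragraph3)

# The sample is done: reset the states before processing the next one.
lstm_layer.reset_states()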


Bidirectional RNNs

Here is the code for constructing a classifier using bidirectional layers.
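A sketch that matches the shapes discussed below: input sequences of length 5 with 10 features, a 64-unit and a 32-unit bidirectional LSTM, and 10 output classes.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
# Forward and backward outputs are concatenated: (None, 5, 64 + 64) = (None, 5, 128).
model.add(layers.Bidirectional(layers.LSTM(64, return_sequences=True), input_shape=(5, 10)))
# Only the last output of each direction, concatenated: (None, 32 + 32) = (None, 64).
model.add(layers.Bidirectional(layers.LSTM(32)))
model.add(layers.Dense(10))
model.summary()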

The first bidirectional LSTM has an input shape of (None, 5, 10). With return_sequences=True, it outputs 5 hidden states. By default, a bidirectional LSTM concatenates the forward and backward results (merge_mode='concat'). Hence, the output of the first layer is (None, 5, 128), which doubles the output dimension of a forward-only LSTM layer.

The second bidirectional layer outputs only one vector (by default, return_sequences=False). Its output is the merged result of the last outputs from the forward pass and the backward pass. Again, the default merge is concatenation, so the output shape is (None, 64) since both the forward and backward LSTMs output a 32-D vector.

Example


CuDNN performance

The LSTM and GRU layers use the fast fused cuDNN kernel on a GPU only when the default configuration is kept. Any of the following changes forces a fallback to the slower generic implementation (a short comparison follows the list):

  • Changing the activation function from tanh.
  • Changing the recurrent_activation function from sigmoid.
  • Changing recurrent_dropout from 0.
  • Changing unroll from False.
  • Changing use_bias from True.
  • Using masking with anything other than right padding (discussed later).
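For instance, assuming a GPU is available:

from tensorflow.keras import layers

# Default configuration: eligible for the fused cuDNN kernel.
fast_lstm = layers.LSTM(64)

# recurrent_dropout is non-zero, so this layer falls back to the generic implementation.
slow_lstm = layers.LSTM(64, recurrent_dropout=0.2)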

Variable Size RNN/LSTM/GRU Time Sequence

Padding

Unless you train with a batch size of one, you need to pad the inputs to a fixed length, as in the code below.
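A sketch using pad_sequences (the word indices are arbitrary placeholders):

import tensorflow as tf

raw_inputs = [
    [711, 632, 71],
    [73, 8, 3215, 55, 927],
    [83, 91, 1, 645, 1253, 927],
]

# Pad every sequence with zeros at the end ("post") up to the longest length.
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(raw_inputs, padding="post")
print(padded_inputs)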

The mask_zero flag in Embedding instructs the layer to treat zero values as padding and ignore the corresponding inputs.

If mask_zero is True, the Embedding layer also generates a separate mask tensor, accessible as masked_output._keras_mask, for the corresponding input.

And this mask tensor will be propagated to the next layer.
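A sketch that makes the generated mask visible, reusing padded_inputs from the snippet above (the vocabulary size of 5,000 is an assumption):

from tensorflow.keras import layers

embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
masked_output = embedding(padded_inputs)

# True for real tokens, False for the zero padding.
print(masked_output._keras_mask)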

Here is the code for setting up an embedding layer with masking in a Sequential Model.
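A possible version (the vocabulary size and layer widths are assumptions):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    # mask_zero=True makes the layer emit a mask that downstream layers can use.
    layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True),
    # The LSTM automatically skips the masked (padded) timesteps.
    layers.LSTM(32),
])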

Custom Layer

The mask information is passed to a layer as the mask argument of call. In the code below, before computing the softmax value, the layer masks all the scores corresponding to the padded input to zero.
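A sketch of such a layer; the class name MaskedSoftmax and the exact normalization are illustrative:

import tensorflow as tf
from tensorflow import keras

class MaskedSoftmax(keras.layers.Layer):
    """Softmax over the time axis that ignores padded timesteps."""

    def call(self, inputs, mask=None):
        # inputs: (batch, timesteps) scores; mask: (batch, timesteps) booleans.
        scores = tf.exp(inputs)
        if mask is not None:
            # Zero out the scores that correspond to padded positions.
            scores = scores * tf.cast(mask, inputs.dtype)
        return scores / tf.reduce_sum(scores, axis=-1, keepdims=True)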

However, by default, the mask is passed only once and is destroyed after this layer. To pass the mask on to the next layer, set supports_masking to True.
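For example, with a hypothetical pass-through activation layer:

import tensorflow as tf
from tensorflow import keras

class MyActivation(keras.layers.Layer):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Declare that this layer passes the incoming mask through unchanged.
        self.supports_masking = True

    def call(self, inputs):
        return tf.nn.relu(inputs)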

Generating a mask in a custom layer

A layer can not only consume a mask but also create one. This is done by implementing compute_mask.
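A sketch of a custom embedding layer that produces its own mask (the class name and sizes are illustrative):

import tensorflow as tf
from tensorflow import keras

class CustomEmbedding(keras.layers.Layer):
    def __init__(self, input_dim, output_dim, mask_zero=False, **kwargs):
        super().__init__(**kwargs)
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.mask_zero = mask_zero

    def build(self, input_shape):
        self.embeddings = self.add_weight(
            shape=(self.input_dim, self.output_dim),
            initializer="random_normal",
            dtype="float32",
        )

    def call(self, inputs):
        return tf.nn.embedding_lookup(self.embeddings, inputs)

    def compute_mask(self, inputs, mask=None):
        # Mark every non-zero index as a valid (unmasked) position.
        if not self.mask_zero:
            return None
        return tf.not_equal(inputs, 0)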

Credits and References

TensorFlow guide

Deep Learning