TensorFlow Sequence to Sequence Model Examples

Jonathan Hui
10 min read · Feb 1, 2021


Sequence-to-sequence models are particularly popular in NLP. This article, part of the TensorFlow series, covers examples of sequence-to-sequence models.

The examples include:

  • Shakespeare text generation with GRU,
  • Seq2seq language translation with attention,
  • Image captioning with visual attention and Inception V3 for feature extraction.

Shakespeare text generation with GRU

Given a large file of Shakespeare's writings, we can use its content as a stream of characters to train a model to write like Shakespeare.

For example, during inference, we start with an initial text, say “ROMEO: ”. The model then predicts the next character, one character at a time.

This can be done with the GRU model below with embedding and dense layers. The next character is the character predicted in the last GRU cell.

(In the diagram, the rounded rectangles represent operations and the square rectangles represent tensors.)

Let's trace it from the beginning with the initial text. The logits output by the last cell are used to determine the next character after “ROMEO: ”.

We then feed the generated character back into the model as input and generate the next character.

As we keep repeating the process, the model generates a script similar to the one below. Even though the output is not grammatically sound, it can be improved with more complex designs. For now, we stick with the simple model.

Let’s go back to the code. First, we load the Shakespeare file and use its 65 unique characters to form a vocabulary (this vocabulary contains characters instead of words). Then, we create a mapping between each character and an integer index.

Next, we convert the characters in the downloaded Shakespeare text into a long sequence of integer indexes. We slice and batch it to create the dataset “sequences”. Then, we create a dataset with an additional field holding the target text as the labels.

Given an input character, the corresponding character in the target text (the character we want to predict) should be the next character of the input text. So these labels are simply formed by shifting the input left by one.
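The mapping and the shift can be sketched in plain Python. This is a minimal illustration on a toy string rather than the real corpus; the names `char2idx`, `idx2char`, `split_input_target` and the value of `seq_length` are chosen for this example and are assumptions, not taken from the article's code.

```python
# Toy stand-in for the Shakespeare file; the real corpus is far larger.
text = "ROMEO: But soft, what light"

# Vocabulary of unique characters, sorted so the mapping is deterministic.
vocab = sorted(set(text))
char2idx = {ch: i for i, ch in enumerate(vocab)}
idx2char = {i: ch for ch, i in char2idx.items()}

# The whole text as one long sequence of integer indexes.
text_as_int = [char2idx[ch] for ch in text]


def split_input_target(chunk):
    """Return (input, target), where target is the input shifted left by one."""
    return chunk[:-1], chunk[1:]


seq_length = 6  # illustrative; each chunk holds seq_length + 1 characters
chunk = text_as_int[: seq_length + 1]
input_ids, target_ids = split_input_target(chunk)

input_text = "".join(idx2char[i] for i in input_ids)
target_text = "".join(idx2char[i] for i in target_ids)
print(repr(input_text), "->", repr(target_text))
```

Each position in the target holds the character that follows the same position in the input, so every time step contributes one training label.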

Then, we shuffle and batch the dataset.

Then we build the model.

It contains an embedding layer, a GRU, and a dense layer.


And here is a summary of the model.

Then, we train the model and create checkpoints.

Next, we reload the model from the checkpoint.

With the model restored, we start generating text in the Shakespeare style from the initial text “ROMEO: ”.

We convert this text into the corresponding sequence of integer indexes. Then, we feed it into the model, which outputs a logits tensor with the shape (1, 7, 65), i.e. (batch size, sequence length, vocabulary size): one prediction for each of the 7 input characters in the initial text “ROMEO: ”.

After removing the batch dimension, we use tf.random.categorical to sample a character prediction at each position of the sequence from the logit values. The output has the shape (7, 1). We are only interested in the last GRU cell (what comes next after the last character), so we take the last sampled value [-1, 0] as the next generated character.
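To make the shapes concrete, here is a rough NumPy sketch of the same sampling step. It is not the TensorFlow call itself: `sample_categorical` is a hypothetical helper that mimics drawing one sample per row from unnormalized logits, and the logits are random numbers standing in for the model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the model output: (batch size, sequence length, vocabulary size).
logits = rng.normal(size=(1, 7, 65))


def sample_categorical(step_logits, rng):
    """Draw one class index per row from unnormalized log-probabilities."""
    # Softmax, with the row maximum subtracted for numerical stability.
    shifted = step_logits - step_logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    draws = [rng.choice(len(p), p=p) for p in probs]
    return np.array(draws).reshape(-1, 1)


# Drop the batch dimension: (1, 7, 65) -> (7, 65).
step_logits = logits[0]
sampled = sample_categorical(step_logits, rng)  # shape (7, 1)

# Only the last position predicts the character that follows the input text.
next_id = int(sampled[-1, 0])
print(sampled.shape, next_id)
```

Sampling from the distribution, rather than always taking the most likely character, keeps the generated text from getting stuck in a loop.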

For the rest of the “for” loop iterations, we just use the newly generated character as input to the model and predict the next character. We repeat the iterations until 1000 characters are generated.

Seq2seq language translation with attention

This example trains a sequence to sequence (seq2seq) model for Spanish to English translation. First, we download the dataset files. The following is part of the Spanish to English translation file.

Next, we create methods to preprocess a sentence.
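A minimal sketch of what such preprocessing might look like, using only the standard library. The exact rules here (stripping accents, lowercasing, spacing out punctuation, wrapping the sentence in `<start>` and `<end>` tokens) are assumptions for illustration and may differ from the article's methods.

```python
import re
import unicodedata


def unicode_to_ascii(s):
    """Strip accents by dropping combining marks after NFD normalization."""
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn"
    )


def preprocess_sentence(sentence):
    s = unicode_to_ascii(sentence.lower().strip())
    # Put spaces around punctuation so each mark becomes its own token.
    s = re.sub(r"([?.!,¿])", r" \1 ", s)
    # Replace everything except letters and the kept punctuation with a space.
    s = re.sub(r"[^a-z?.!,¿]+", " ", s)
    # Collapse runs of whitespace and trim the ends.
    s = re.sub(r"\s+", " ", s).strip()
    # Mark where the sentence starts and ends for the decoder.
    return "<start> " + s + " <end>"


print(preprocess_sentence("¿Puedo tomar prestado este libro?"))
print(preprocess_sentence("Hace mucho frío aquí."))
```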

Given a translation file, create_dataset generates samples of the target text and the original text, and tokenize converts the text into sequences of integers.

Now, load_dataset puts them together, loading the source (Spanish) and target (English) text into samples of source and target integer sequences. It also returns the tokenizers used.
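The tokenizing step can be illustrated with a plain-Python stand-in. `fit_tokenizer` and `texts_to_padded_sequences` are hypothetical helpers written for this sketch; a library tokenizer may assign indexes in a different order (for example by word frequency), and reserving index 0 for padding is an assumption here.

```python
def fit_tokenizer(sentences):
    """Assign each distinct word an index in first-seen order, starting at 1.

    Index 0 is reserved for padding (an assumption of this sketch).
    """
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def texts_to_padded_sequences(sentences, word_index):
    """Convert sentences to integer sequences, zero-padded at the end."""
    sequences = [[word_index[w] for w in s.split()] for s in sentences]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]


targets = ["<start> go . <end>", "<start> i am cold . <end>"]
word_index = fit_tokenizer(targets)
tensor = texts_to_padded_sequences(targets, word_index)
print(word_index)
print(tensor)
```

Keeping the word-to-index mapping around is what allows predicted integers to be converted back into words at evaluation time.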

The following is a sample of the integer sequences for the source and the target text.

Here is the boilerplate code for creating a smaller dataset for training and validation.

The diagram below contains an encoder to extract the input context for “Je suis étudiant” and a decoder to translate this context into “I am a student”.

But our example has an additional attention module. Without attention, the decoder predicts the next word based on the previously predicted word and the GRU hidden state from the previous step.

With attention, we also use the input context. However, instead of the complete input context, we use the last GRU hidden state to select the part of the input context that we should pay attention to. Similarly, in computer vision, we use attention to mask out information that is not important at the current stage.

Here is the conceptual visualization of applying attention in NLP. If we use the 16th word (Bank) as the query, it will pay more attention to the phrase “Bank of America” rather than the “river bank”.


Here is the encoder. It contains an embedding layer and a GRU.

The encoder produces “output”, the hidden states for the complete sequence, and “state”, the last hidden state.

Let’s elaborate further on how attention works. The query is the previous GRU hidden state of the decoder, and the values are the output of the encoder. The attention keeps the input context that is relevant to the query.

This is the detailed model diagram for the attention module.

“values” in our example contains a sequence of 3 vectors representing the input “Je suis étudiant”, one per word. Each vector has “hidden dim.” components. We train dense layers (W1 and W2) to suppress the vectors that are not relevant to the query. Their output passes through a tanh function and is then scored, one score per vector, by a further trained dense layer V. Then, the scores pass through a softmax function, and each vector in the sequence is multiplied by the corresponding softmax result. The model is thereby trained to zero out the vectors in the input sequence that are irrelevant to the current query. Finally, we sum up all the weighted vectors. The attention output, the context vector, has the shape (batch size, hidden dimension).

Here is the model code for the attention module.
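As a rough NumPy sketch of the computation described above (not the article's Keras layer): the weights W1, W2 and V are random stand-ins for trained dense layers, biases are omitted, and applying W1 to the values and W2 to the query is an assumption of this example. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden, units = 2, 3, 8, 4  # illustrative sizes

# Randomly initialized stand-ins for the trained dense layers W1, W2 and V.
W1 = rng.normal(size=(hidden, units))
W2 = rng.normal(size=(hidden, units))
V = rng.normal(size=(units, 1))


def attention(query, values):
    """query: (batch, hidden); values: (batch, seq_len, hidden)."""
    # Add a time axis so the query broadcasts over every input position.
    query_with_time = query[:, None, :]                       # (batch, 1, hidden)
    score = np.tanh(values @ W1 + query_with_time @ W2) @ V   # (batch, seq_len, 1)
    # Softmax over the sequence axis gives one weight per input vector.
    e = np.exp(score - score.max(axis=1, keepdims=True))
    weights = e / e.sum(axis=1, keepdims=True)                # (batch, seq_len, 1)
    # Weighted sum of the input vectors is the context vector.
    context = (weights * values).sum(axis=1)                  # (batch, hidden)
    return context, weights


query = rng.normal(size=(batch, hidden))
values = rng.normal(size=(batch, seq_len, hidden))
context, weights = attention(query, values)
print(context.shape, weights.shape)
```

The returned weights are the quantities that an attention plot visualizes: one row of weights over the input words for each decoding step.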

And the equations that we apply.
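In the usual additive-attention notation, the steps described in the text can be written as follows. The symbols are chosen for this sketch: $q$ is the query (the previous decoder hidden state), $h_s$ is the $s$-th encoder output vector in “values”, and $c$ is the context vector. Which of $W_1$ and $W_2$ is applied to the values and which to the query is an assumption.

```latex
\begin{aligned}
\text{score}_s &= v^{\top} \tanh\left(W_1 h_s + W_2 q\right) \\
\alpha_s &= \frac{\exp(\text{score}_s)}{\sum_{s'} \exp(\text{score}_{s'})} \\
c &= \sum_{s} \alpha_s \, h_s
\end{aligned}
```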

Finally, this is the decoder.

At the second time step in the decoder, here is the notation used in the code above.

Here, we define the Adam optimizer and the loss function.

Here is the training step:

And the training itself:

After the model is trained, we can evaluate the model. The evaluation step is similar to the training step, except that it feeds the previous prediction as input rather than the target word. When the model predicts the <end> token, the output for that sample is complete.
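The difference between the two input schemes can be sketched with a toy stand-in for the decoder. Everything here is hypothetical: `NEXT_WORD` is a lookup table playing the role of a trained decoder step, which in a real model would also take the hidden state and the encoder output.

```python
# Hypothetical stand-in for one decoder step: maps the previous word to the next.
NEXT_WORD = {
    "<start>": "it", "it": "is", "is": "very", "very": "cold",
    "cold": "here", "here": ".", ".": "<end>",
}


def decoder_step(prev_word):
    return NEXT_WORD[prev_word]


def greedy_decode(max_length=20):
    """Evaluation: feed each prediction back in until <end> is predicted."""
    words = []
    prev = "<start>"
    for _ in range(max_length):
        pred = decoder_step(prev)
        if pred == "<end>":
            break  # the output for this sample is complete
        words.append(pred)
        prev = pred  # the previous prediction becomes the next input
    return " ".join(words)


def teacher_forced_pairs(target_words):
    """Training: inputs are the target words themselves, one step behind the labels."""
    return target_words[:-1], target_words[1:]


print(greedy_decode())
inputs, labels = teacher_forced_pairs(["<start>", "it", "is", "cold", "<end>"])
print(inputs, labels)
```

The `max_length` cap guards against a model that never emits `<end>`.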

Finally, we plot the weights applied in the attention module when we translate “hace mucho frio aqui”.

The plot shows which part of the input context the model focuses on at each stage. When it should predict the word “cold”, the attention at that stage focuses on the word “frio”, Spanish for cold.

Image captioning with visual attention and Inception V3 for feature extraction

In this example, we generate a caption directly from an image.

Again, we will apply the attention mechanism to pay attention to smaller areas of the image when generating the next word. The attention and decoder modules will be similar to those in the translation example. The major difference is in preprocessing the data. We will use an Inception v3 model to extract image features and save them to files. The input to the encoder will be these extracted image features. We will go through the code quickly since most parts are similar to the previous translation example.

First, we download the annotation files containing the captions and about 82,000 MSCOCO images (13GB).

For faster training, however, we take only 6,000 images and prepare a list of 30,011 captions and a list of the 30,011 corresponding file names. Each image can have multiple captions.
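Because each image can have several captions, the two lists end up longer than the number of selected images. A small sketch of that bookkeeping, using made-up file names and captions (all hypothetical):

```python
import collections
import random

# Hypothetical annotations as (image file name, caption) pairs.
annotations = [
    ("img_001.jpg", "a dog runs on the beach"),
    ("img_001.jpg", "a brown dog near the sea"),
    ("img_002.jpg", "a man rides a bike"),
    ("img_003.jpg", "two cats on a sofa"),
    ("img_003.jpg", "cats sleeping indoors"),
]

# Group captions by image, since each image can have several.
image_to_captions = collections.defaultdict(list)
for file_name, caption in annotations:
    image_to_captions[file_name].append(caption)

# Take a random subset of images (6,000 in the article; 2 in this toy example).
random.seed(0)
image_names = sorted(image_to_captions)
random.shuffle(image_names)
selected = image_names[:2]

# Flatten into two parallel lists: one entry per caption, file name repeated.
captions, file_names = [], []
for name in selected:
    for caption in image_to_captions[name]:
        captions.append(caption)
        file_names.append(name)

print(len(selected), "images ->", len(captions), "captions")
```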

We will resize each image and use a pre-built Inception v3 to extract its features. The features are then stored in a file.

Next, we preprocess and tokenize the captions. We will prepare a vocabulary for the top 5,000 words and create the mapping between a word and the word integer index. We also pad all sequences to the same length as the longest one.
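Limiting the vocabulary to the most frequent words can be sketched as below. `build_vocab` and `encode_and_pad` are hypothetical helpers; mapping out-of-vocabulary words to an `<unk>` token and reserving index 0 for padding are assumptions of this example, and `top_k=2` stands in for the article's 5,000.

```python
import collections


def build_vocab(captions, top_k):
    """Keep the top_k most frequent words; 0 is padding and 1 is <unk>."""
    counts = collections.Counter(w for c in captions for w in c.split())
    word_index = {"<pad>": 0, "<unk>": 1}
    for word, _ in counts.most_common(top_k):
        word_index[word] = len(word_index)
    return word_index


def encode_and_pad(captions, word_index):
    """Map words to indexes, then pad every sequence to the longest length."""
    unk = word_index["<unk>"]
    seqs = [[word_index.get(w, unk) for w in c.split()] for c in captions]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]


captions = ["a dog runs", "a dog sleeps on a sofa", "a cat"]
word_index = build_vocab(captions, top_k=2)
padded = encode_and_pad(captions, word_index)
print(word_index)
print(padded)
```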

Next, we will split the samples between training and validation.

And we will create the datasets.

The attention model will be the same as what we already discussed.

Because our inputs are extracted features already, the encoder is just a simple dense layer followed by ReLU.
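In NumPy terms, such an encoder amounts to one matrix multiplication followed by a ReLU. This is a sketch with random stand-in weights; the sizes and the layout of the features (one vector per image location) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, locations, feature_dim, embedding_dim = 2, 4, 16, 8  # illustrative sizes

# Stand-in for the saved Inception v3 features: one vector per image location.
features = rng.normal(size=(batch, locations, feature_dim))

# Randomly initialized stand-ins for the dense layer's trained weights.
W = rng.normal(size=(feature_dim, embedding_dim))
b = np.zeros(embedding_dim)


def encode(x):
    """Dense layer followed by ReLU, applied independently at each location."""
    return np.maximum(x @ W + b, 0.0)


encoded = encode(features)  # (batch, locations, embedding_dim)
print(encoded.shape)
```

The encoded locations then play the role of “values” in the attention module, so the decoder can weight image regions instead of source words.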

The decoder is similar to the one in the language translation example, with one more dense layer and a couple of ReLU layers. It also uses an attention layer.

We define the model, an Adam optimizer, the loss function and the CheckpointManager.

In a training step, we encode the extracted image features. Then we use the decoder with the attention to predict one word at a time.

Here is the training loop,

and the evaluation.

Credits and References

All the source code originates from, or is modified from, the TensorFlow tutorials.