NLP — BERT & Transformer

Jonathan Hui
22 min read · Nov 5, 2019

Google published an article “Understanding searches better than ever before” that positioned BERT as one of the most important updates to its search algorithms in recent years. Google says 15% of Google queries have never been seen before. But the real issue is not what people ask. It is how many ways a question may be asked. Previously, Google search was keyword-based. But this is far from understanding or resolving the ambiguity in human language. That is why Google has utilized BERT in its search engine. In the demonstration below, BERT responds to the query “Can you get medicine for someone pharmacy” much better than a keyword-based search.

Source (After using BERT in understanding the query)

In the last article, we covered word embedding for representing a word. Let’s move on to BERT for representing a sentence. Strictly speaking, BERT is a training strategy, not a new architecture design. To understand BERT, we need to study another proposal from Google Brain first: the Transformer. The concept is complex and will take some time to explain. BERT needs only the encoder part of the Transformer. For completeness, we will cover the decoder as well, but feel free to skip it according to your interests.

Let’s see one application of the Transformer. OpenAI GPT-2 is a transformer-based model with 1.5 billion parameters. As I type the paragraph below, the grayed part is automatically generated with the GPT-2 model.

Generated from source using GPT-2 model

In GPT-3, the quality of writing may even reach the level of a human writer. Therefore, OpenAI has not released the model parameters, for fear of possible abuse.

Encoder-Decoder & Sequence to Sequence

To map an input sequence to an output sequence, we often apply sequence-to-sequence transformation using an encoder-decoder model. One example is the use of seq2seq to translate a sentence from one language to another.

We assume you have a basic background on this already. So we will not repeat the information. (Google the phrase “sequence to sequence” or “Seq2Seq” later if you need help).

For many years, seq2seq models used an RNN, LSTM, or GRU to parse the input sequence and to generate the output sequence. But this approach has a few drawbacks.

  • Learning the long-range context with RNN gets more difficult as the distance increases.
Source (sfgate)
  • RNN is directional. In the example below, a backward RNN may have a better chance to guess the word “win” correctly.

To avoid making the wrong choice, we can design a model including both a forward and a backward RNN (i.e. a bidirectional RNN) and then add both results together.


We can also stack pyramidal bidirectional RNN layers to explore context better.

But at some point, we may argue that in order to understand the context of the word “view” below, we should check all the words in a paragraph concurrently. That is, to know what “view” may refer to, we should apply fully-connected (FC) layers directly to all the words in the paragraph.

However, this problem involves high-dimensional vectors, which makes it like finding a needle in a haystack. But how do humans solve this problem? The answer may lie in “attention”.


Even though the picture below contains about 1M pixels, most of our attention may fall on the girl in the blue dress.

When creating a context for our query, we should not put equal weight on all the information we get. We need to focus! We should create a context of what interests us based on the query. But this query will shift in time. For example, if we are searching for the ferry, our attention may focus on the ticket booth instead. So how can we conceptualize this into equations and deep networks?

In an RNN, we make predictions based on the input xₜ and the previous hidden state hₜ₋₁.

But in an attention-based system, input x will be replaced by the attention.

We can conceptualize that the attention process keeps information that is currently important.

For example, for each input feature xᵢ, we can train an FC layer to score how important feature i is (or the pixel) under the context of the previous hidden state h.

Afterward, we normalize the score using a softmax function to form the attention weights α.

Finally, the attention Z that replaces the input x is the sum of the input features weighted by the attention weights. Let’s develop the concept further before we introduce the equations.
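The three steps above (score each feature, normalize with a softmax, take the weighted sum) can be sketched in NumPy. This is only a minimal sketch: the article says an FC layer does the scoring, and the additive tanh form used here, along with the names W_x, W_h and v, is an assumption made for illustration.

```python
import numpy as np

def softmax(x):
    # Subtracting the max keeps exp() stable without changing the result.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def soft_attention(features, h_prev, W_x, W_h, v):
    """features: (n, d_x) input features x_i; h_prev: (d_h,) previous hidden state."""
    # Assumed FC scoring layer: one score per feature, conditioned on h_prev.
    scores = np.tanh(features @ W_x + h_prev @ W_h) @ v   # shape (n,)
    alpha = softmax(scores)                               # attention weights
    z = alpha @ features                                  # weighted sum, shape (d_x,)
    return z, alpha

rng = np.random.default_rng(0)
n, d_x, d_h, d_a = 5, 4, 3, 8          # toy sizes chosen for the demo
features = rng.normal(size=(n, d_x))
h_prev = rng.normal(size=d_h)
W_x = rng.normal(size=(d_x, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)

z, alpha = soft_attention(features, h_prev, W_x, W_h, v)
```

The vector z has the same size as one input feature, so it can take the place of x in the RNN update.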

Query, Keys, Values (Q, K, V)

First, we will formalize the concept of attention with query, key, and value. So what are query, key, and value? A query is the context of what we are looking for. In the previous equations, we used the previous hidden state as the query context. We want to know what is next based on what we know already. But in searching, it can simply be a word provided by a user. Value is the input features or raw pixels. Key is simply an encoded representation for “value”. But in some cases, the “value” itself can be used as a “key”.

To create attention, we determine the relevance between the query and the keys. Then we mask out the associated values that are not relevant to the query. For example, for the query “ferry”, the attention should focus on the waiting lines and the ticket sign.

Now, let’s see how we apply attention to NLP and start our Transformer discussion. But the Transformer is pretty complex. It is a novel encoder-decoder model with attention. It will take some time to discuss it.


Many DL problems represent an input with a dense representation. This forces the model to extract critical knowledge about the input. These extracted features are often called latent features, hidden variables, or a vector representation. Word embedding creates a vector representation of a word that we can manipulate with linear algebra. However, a word can have different meanings in different contexts. In the example below, word embedding uses the same vector in representing “bank” even though they have different meanings in the sentence.

To create a dense representation of this sentence, we can parse the sentence with an RNN to form an embedding vector. In this process, we gradually accumulate information in each timestep. But one may argue that as the sentence gets longer, early information may be forgotten or overridden.

Maybe, we should convert a sentence to a sequence of vectors instead, i.e. one vector per word. In addition, the context of a word will be considered during the encoding process. For example, the word “bank” below will be treated and encoded differently according to the context.

Let’s integrate this concept with attention using query, key, and value. We decompose a sentence into single words. Each word acts as a value but also as its key.

To encode the whole sentence, we perform a query on each word. So a 21-word sentence results in 21 queries and 21 vectors. This 21-vector sequence will represent the sentence.

So, how do we encode the 16th word “bank”? We use the word itself (“bank”) as the query. We compute the relevancy of this query with each key in the sentence. The representation of this word is simply a weighted sum of the values according to the relevancy — the attention output. Conceptually, we “grey out” non-relevant values to form the attention.

By going through Q₁ to Q₂₁, we generate a sequence of 21 attention vectors that represents the sentence.
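As a rough sketch of this idea, the snippet below lets every word vector act as query, key and value at once. The random vectors and the 8-D size are placeholders, and the plain dot product as the relevancy measure is an assumption at this point in the discussion.

```python
import numpy as np

def softmax_rows(x):
    # Softmax over the last axis, so each query gets weights that sum to 1.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encode_sentence(X):
    """X: (n_words, d). Each row serves as a query, a key and a value."""
    relevancy = X @ X.T                  # (n, n): query i scored against key j
    weights = softmax_rows(relevancy)    # low weights "grey out" irrelevant values
    return weights @ X                   # (n, d): one attention vector per word

rng = np.random.default_rng(0)
X = rng.normal(size=(21, 8))             # 21 words, toy 8-D vectors
Z = encode_sentence(X)                   # Z[15] would encode the 16th word
```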

Transformer Encoder

Let’s get into more details. But in the demonstration, we use the sentence below instead which contains 13 words only.

New England Patriots win 14th straight regular-season game at home in Gillette stadium.

In the encoding step, the Transformer uses learned word embedding to convert these 13 words into 13 512-D word embedding vectors. Then they are passed into an attention-based encoder to generate the context-sensitive representation for each word. Each word-embedding will have one output vector hᵢ. In the model we built here, hᵢ is a 512-D vector. It encodes the word xᵢ with its context.

Let’s zoom into this attention-based encoder more. The encoder actually stacks up 6 encoders, with each encoder shown below. The output of an encoder is fed to the encoder above. This example takes 13 512-D vectors and outputs 13 512-D vectors. For the first encoder (encoder₁), the input is the 13 512-D word embedding vectors.

Scaled Dot-Product Attention

In each encoder, we perform attention first. In our example, we have 13 words and therefore 13 queries. But we don’t compute their attention separately.

Instead, all 13 attentions can be computed concurrently. We pack the queries, keys, and values into the matrix Q, K, and V respectively. Each matrix will have a dimension of 13 × 512 (d=512). The matrix product QKᵀ will measure the similarity among the queries and the keys.

However, when the dimension d is large, we will encounter a problem with the dot products QKᵀ. Assume each row (q and k) in these matrices contains independent random variables with mean 0 and variance 1. Then their dot product q · k will have mean 0 and variance d (512). This will push some of the dot product values to be very large. This can move the softmax output to a low-gradient zone that requires a large change in the dot product to make a noticeable change in the softmax output. This hurts the training progress. To correct that, the Transformer divides the dot product by a scale factor equal to the square root of the dimension.
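In the Transformer paper this is written as Attention(Q, K, V) = softmax(QKᵀ / √d) V. The sketch below implements that formula in NumPy and also checks the variance argument numerically. The random inputs and the sample count of 2000 are arbitrary choices for the demo.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (n_queries, n_keys), rescaled
    return softmax_rows(scores) @ V      # (n_queries, d_v)

rng = np.random.default_rng(0)
d = 512

# Variance check: entries have mean 0 and variance 1.
q = rng.normal(size=(2000, d))
k = rng.normal(size=(2000, d))
dots = np.sum(q * k, axis=1)
raw_var = dots.var()                     # roughly d
scaled_var = (dots / np.sqrt(d)).var()   # roughly 1

# 13 words, as in the running example.
Q = rng.normal(size=(13, d))
K = rng.normal(size=(13, d))
V = rng.normal(size=(13, d))
out = scaled_dot_product_attention(Q, K, V)
```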

Multi-Head Attention

In the last section, we generate one attention per query.

Multi-Head Attention generates h attentions per query. Conceptually, we just pack h scaled dot-product attentions together.


For example, the diagram below shows two attentions, one in green and the other in yellow.

In the Transformer, we use 8 attentions per query. So why do we need 8 rather than 1, given that each attention can cover multiple areas anyway? In the Transformer, we don’t feed Q, K, and V directly to the attention module. We first transform Q, K, and V with the trainable matrices Wq, Wk, and Wv respectively.

If we use 8 attentions, we will have 8 different sets of projections above. This gives us 8 different “perspectives”. This eventually pushes the overall accuracy higher, at least empirically. But, we want to keep the computation complexity similar. So instead of having the transformed Q to have a dimension of 13 × 512, we scale it down to 13 × 64. But now, we have 8 attentions and 8 transformed Qs.


The output is the concatenation of the results from all the Scaled Dot-Product Attentions. Finally, we apply a linear transformation to the concatenated result with W. Note that we describe the model as 8 separate heads, but in the code we pack all 8 heads into a multi-dimensional tensor and manipulate them as a single unit.
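Here is a minimal NumPy sketch of the packing trick just described: project once, reshape into 8 heads of 64 dimensions, attend in every head at the same time, then concatenate and apply the final linear transformation. The random weights and the name Wo for the final matrix W are assumptions. Each head scales by the square root of its own 64-D size, as in the Transformer paper.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads                  # 512 / 8 = 64

    def split(M):
        # (n, d_model) -> (heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)     # (heads, n, n)
    heads = softmax_rows(scores) @ V                        # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # (n, d_model)
    return concat @ Wo                                      # final linear layer

rng = np.random.default_rng(0)
n, d_model, h = 13, 512, 8
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                  for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
```

Slicing one 512 × 512 projection into 8 column blocks is equivalent to keeping 8 separate 512 × 64 matrices, which is why the total cost stays close to that of a single full-width attention.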

Skip connection & Layer normalization

This is the encoder using multi-head attention.

As shown, the Transformer applies a skip connection (as in the residual blocks of ResNet) to the output of the multi-head attention, followed by a layer normalization. Both techniques make training easier and more stable. In batch normalization, we normalize an output based on means and variances collected from the training batches. In layer normalization, we use values in the same layer to perform the normalization instead. We will not elaborate on them further, since understanding them in detail is not critical to learning the Transformer.
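For readers who prefer code, a small sketch of this "add & norm" step follows. The gain and bias names (gamma, beta) and the eps value are conventional choices rather than details from the article, and the random array stands in for the real attention output.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize each position over its own features (the last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # Skip connection first, then layer normalization.
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
x = rng.normal(size=(13, 512))          # input to the attention sublayer
attn_out = rng.normal(size=(13, 512))   # stand-in for the attention output
y = add_and_norm(x, attn_out, np.ones(512), np.zeros(512))
```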

Position-wise Feed-Forward Networks

Next, we apply a fully-connected layer (FC), a ReLU activation, and another FC layer to the attention results. This operation is applied to each position separately and identically (sharing the same weights). It is a position-wise feed-forward because the ith output depends on the ith attention of the attention layer only.

Similar to the attention, the Transformer also uses skip connection and layer normalization.
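A quick sketch of the position-wise network is shown below. The inner width of 2048 comes from the Transformer paper and is not mentioned above, and the random weights are placeholders. The last lines check the "position-wise" property: changing one position leaves the outputs of the others alone.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)   # FC layer + ReLU
    return hidden @ W2 + b2                 # second FC layer

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(13, d_model))          # 13 attention outputs
out = position_wise_ffn(x, W1, b1, W2, b2)

# Perturb position 0 only and run the same weights again.
x2 = x.copy()
x2[0] += 1.0
out2 = position_wise_ffn(x2, W1, b1, W2, b2)
```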

Positional Encoding

Politicians are above the law.

This sounds awfully wrong. But it demonstrates that the positions or relative positions of words matter.

Convolution layers use limited-size filters to extract local information. So, for the first sentence, “nice” will associate with “for you” instead of “requests”. The Transformer, in contrast, encodes a word with all its context at once. In the beginning, “you” will be treated similarly to “requests” in encoding the word “nice”. We just hope the model will extract and utilize the position and ordering information eventually. If it fails, the inputs behave like a bag of words, and both sentences above will be encoded similarly.

One possible solution is to provide position information as part of the word embedding.

So, how can we encode the position i into a 512-D input vector?

The equation below is used for the fixed position embedding. This position embedding vector has 512 elements, the same as the word embedding. The even elements use the first equation and the odd elements use the second equation to compute the positional value. Once it is computed, we sum the position embedding with the original word embedding to form the new word embedding.

The diagram below colorizes the values of the position embedding for the first 50 positions in a 512-D embedding. The color bar on the right indicates the values. As shown below, the early elements in the position embedding will repeat their position value more frequently than the later elements (depth). So they are tailored for a shorter position range.


For a word k positions away from word i, its PE value is a linear function of PEᵢ for any fixed offset k. This allows the model to discover and utilize the relative positions between words in generating attention.
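The fixed position embedding in the Transformer paper is PE(pos, 2i) = sin(pos / 10000^(2i/d)) for the even elements and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) for the odd ones. The sketch below computes it and verifies the linear relation for a fixed offset. The helper shift is hypothetical and exists only for this check: it rotates each (sin, cos) pair by the angle that corresponds to k positions.

```python
import numpy as np

def positional_encoding(num_positions, d_model=512):
    pos = np.arange(num_positions)[:, None]          # (P, 1)
    even = np.arange(0, d_model, 2)[None, :]         # element indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, even / d_model)  # (P, d_model / 2)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angle)                      # even elements
    pe[:, 1::2] = np.cos(angle)                      # odd elements
    return pe

def shift(pe_row, k, d_model=512):
    # Linear map taking PE(pos) to PE(pos + k); it depends on k only.
    w = 1.0 / np.power(10000.0, np.arange(0, d_model, 2) / d_model)
    s, c = pe_row[0::2], pe_row[1::2]
    out = np.empty_like(pe_row)
    out[0::2] = s * np.cos(w * k) + c * np.sin(w * k)
    out[1::2] = c * np.cos(w * k) - s * np.sin(w * k)
    return out

pe = positional_encoding(50)     # first 50 positions, 512-D
shifted = shift(pe[3], 7)        # should match pe[10]
```

The early elements use the largest frequencies, which is why they cycle fastest in the diagram.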

Even without the fixed position embedding, we can argue that the model weights will learn how to take the relative position into account eventually. Maybe, we just don’t want the same weights to serve two purposes — discovering the context and the relative position. So in the second approach, we reformulate the attention formula and introduce two parameters (one for the values and one for the keys) that take the relative position of words into consideration.

In generating the attention zᵢ for the ith word, we adjust the contribution from the jth word with aᵢⱼ below. Instead of fixing their values, we make them trainable. (details)

Modified from source

aᵢⱼ models the absolute positions — the ith and the jth word. Maybe, we only care about the relative distance. We should treat a(3, 9) to be the same as a(5, 11). So instead of modeling a(i, j), we model w(k) where k is the distance j-i. In the equations above, we simply replace a(i, j) with w(j-i).

In addition, we clip the distance. Anything farther away than k is clipped to w(k) or w(-k). Therefore, we only need to learn 2×k + 1 sets of parameters. If you want more information, please refer to the original research paper.
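The indexing behind this clipping is easy to get wrong, so here is a small sketch. It builds the table of clipped distances j - i and uses it to look up one of the 2k + 1 vectors for every word pair. The random table w stands in for trained parameters, and the vector size of 4 is a toy value.

```python
import numpy as np

def relative_position_index(n, k):
    pos = np.arange(n)
    dist = pos[None, :] - pos[:, None]   # dist[i, j] = j - i
    clipped = np.clip(dist, -k, k)       # anything beyond k is clipped
    return clipped + k                   # shift to 0 .. 2k for table lookup

n, k = 12, 4
idx = relative_position_index(n, k)

rng = np.random.default_rng(0)
w = rng.normal(size=(2 * k + 1, 4))      # 2k + 1 vectors (random stand-ins)
a = w[idx]                               # a[i, j] = w(clip(j - i)), shape (n, n, 4)
```

Pairs at the same distance, such as (3, 9) and (5, 11), share one vector.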


The Transformer uses the fixed position embedding because it performs similarly to the other approaches but can handle sequences longer than the ones seen in training.

This is the encoder. Next, we will discuss the decoder. Nevertheless, this section is optional because BERT uses the encoder only. It will be nice to know the decoder. But it is relatively long and harder to understand. So skip the next six sections if you want.

Transformer Decoder (Optional)

The encoder generates the vector representation h to represent the input sentence. This representation will be used by the decoder during training or to decode the sequence during inference.

Recall that attention is composed of a query, keys, and values. For the decoder, the vector representation h will be used as the keys and values for the attention-based decoder. In training, the first input token to the decoder will be <sos> (start of string). The rest of the input contains the target words, i.e. <sos>, Los, Patriots, de, etc. But let’s defer the discussion on the attention-based decoder and discuss something easier first.

Embedding and Softmax in Training (Optional)

We feed the output of the attention decoder to a linear layer followed by a softmax to make a word prediction. This linear layer is effectively the reverse of the embedding layer.

The encoder-decoder model contains two word-embedding layers, one for the encoder and one for the decoder. Both use the same learned embedding. For the linear layer just mentioned, we reuse the embedding weights to derive its weights (in practice, the transpose of the embedding matrix). Empirical results show improvements in accuracy when we share all these parameters.
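A toy sketch of this weight sharing is given below, assuming the shared matrix E is used as a lookup table on the way in and as Eᵀ in the pre-softmax linear layer on the way out. The vocabulary size, the token ids and the random initialization are all made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 512
E = rng.normal(scale=0.02, size=(vocab_size, d_model))   # shared embedding matrix

def embed(token_ids):
    return E[token_ids]                        # lookup: (n,) -> (n, d_model)

def output_probs(decoder_out):
    logits = decoder_out @ E.T                 # tied weights: reuse E, transposed
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)   # (n, vocab_size)

tokens = np.array([5, 42, 7])                  # hypothetical token ids
probs = output_probs(embed(tokens))            # skip the decoder for the demo
```

Feeding an embedding straight back through the tied layer scores its own token highest, which gives some intuition for calling the linear layer the "reverse" of the embedding.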

Inference (Optional)

In inference, we predict one word at a time. In the next time step, we collect all the previous predictions and feed them to the decoder. So in timestep ③ below, the input will be <sos>, Los, Patriots.
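The loop can be sketched as follows. The callable next_token is a hypothetical stand-in for the full encoder, decoder, linear layer and softmax followed by picking the most likely word. The <eos> end token is also an assumption, since only <sos> is mentioned above.

```python
def greedy_decode(next_token, max_len=10):
    """next_token: hypothetical callable mapping the tokens so far to the next token."""
    output = ["<sos>"]
    for _ in range(max_len):
        tok = next_token(output)     # the decoder sees all previous predictions
        if tok == "<eos>":           # assumed end-of-sequence marker
            break
        output.append(tok)
    return output[1:]                # drop <sos>

# Toy stand-in for the model: a fixed lookup keyed on the last token.
toy = {"<sos>": "Los", "Los": "Patriots", "Patriots": "de", "de": "<eos>"}
result = greedy_decode(lambda toks: toy[toks[-1]])
# result == ["Los", "Patriots", "de"]
```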

Encoder-decoder attention (Optional)

Let’s get back to the details of the encoder-decoder attention. Recall previously, in the encoder, we apply linear transformations to create Q, K, and V respectively from the input word embeddings X.

For the Transformer decoder, the attention is done in 2 stages.

Stage ① is similar to the encoder. K, V, and Q are derived from the input embeddings. This prepares the vector representation for the query needed for stage ②.

But in stage ②, K and V are derived from h (from the encoder).
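Stage ② can be sketched as one attention call in which the queries come from the decoder side while the keys and values come from h. This is a single-head sketch with random weights and a toy 64-D size, so treat the names and shapes as assumptions.

```python
import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec_x, h, Wq, Wk, Wv):
    Q = dec_x @ Wq                            # queries: output of stage 1
    K = h @ Wk                                # keys: encoder output h
    V = h @ Wv                                # values: encoder output h
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_target, n_source)
    return softmax_rows(scores) @ V           # (n_target, d)

rng = np.random.default_rng(0)
d = 64
h = rng.normal(size=(13, d))                  # 13 encoded source words
dec_x = rng.normal(size=(4, d))               # 4 target positions so far
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
out = encoder_decoder_attention(dec_x, h, Wq, Wk, Wv)
```

Each target position gets its own weighting over the 13 source words.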

Once the attention is computed, we pass it through the Position-wise Feed-Forward Network. The attention decoder stacks up these 6 decoders with the last output passing through a linear layer followed by a softmax in predicting the next word.

And h is fed into each decoder.

Here is the diagram for the whole Transformer.

Training (optional)

During training, we do know the ground truth. The attention model is not a time sequence model. Therefore, we can compute output predictions all at once.

But, for the prediction at position i, we make sure the attention can see the ground truth output from position 1 to i-1 only. Therefore, we add a mask in the attention to mask out information from position i and beyond when creating attention for position i.
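One common way to implement this mask is to set the blocked scores to negative infinity before the softmax, so their weights become exactly zero. In the sketch below, the decoder input is assumed to be shifted right by <sos>, so input slot t holds target word t-1. Letting slot t see slots up to and including t is then the same as seeing target words 1 to i-1.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: key slot j <= query slot i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -np.inf)   # blocked slots get zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))               # toy attention scores
weights = masked_softmax(scores, causal_mask(5))
```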

Soft Label (Optional)

To avoid overfitting, the training also uses dropout and label smoothing. Usually, we want the probability for the ground truth label to be one. But pushing it to one may also overfit the model. Label smoothing lowers the target probability for the ground truth label (say to 0.9) and raises the targets for the other labels above 0 (say 0.1 shared among them). This avoids becoming over-confident about specific data. In short, being over-confident about a data point may be a sign of overfitting and hurts generalization.
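A small sketch of the smoothed targets follows, assuming the leftover 0.1 is split equally among the other labels. Implementations differ on this detail: some spread it over all labels including the ground truth.

```python
import numpy as np

def smooth_labels(target_ids, num_classes, eps=0.1):
    # Ground truth gets 1 - eps; the other classes share eps equally (assumed).
    dist = np.full((len(target_ids), num_classes), eps / (num_classes - 1))
    dist[np.arange(len(target_ids)), target_ids] = 1.0 - eps
    return dist

def cross_entropy(probs, target_dist):
    # Mean cross-entropy between predictions and the smoothed targets.
    return -np.sum(target_dist * np.log(probs + 1e-12), axis=-1).mean()

targets = smooth_labels(np.array([2, 0]), num_classes=5)
uniform = np.full((2, 5), 0.2)                 # a maximally unsure prediction
loss = cross_entropy(uniform, targets)
```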

Congratulations! This is all about the Transformer.

NLP Tasks

So far we have focused our discussion on sequence-to-sequence learning, like language translation. While this type of problem covers a wide range of NLP tasks, there are other types of NLP Tasks. For example, in question and answer (QA), we want to spot the answer in a paragraph regarding a question being asked.

Source (SQuAD)

There is another type of NLP task called Natural Language Inference (NLI). Each problem contains a pair of sentences: a premise and a hypothesis. Given a premise, an NLI model predicts whether a hypothesis is true (entailment), false (contradiction), or undetermined (neutral).


The code below shows two more applications in NLP. The first one determines the sentiment of a sentence. The second one answers a question given a context.


We will demonstrate how BERT can solve these problems.

BERT (Bidirectional Encoder Representations from Transformers)

With word embedding, we create a dense representation of words. But in the Transformer section, we discovered that word embedding cannot capture the context of the neighboring words well. In NLI applications, we want the model to be able to handle two sentences. In addition, we want a representation model that is multi-purpose. NLP training is intense! Can we pre-train a model and repurpose it for other applications without building a new model again?

Let’s have a quick summary of BERT. In BERT, a model is first pre-trained with data that requires no human labeling. Once it is done, the pre-trained model outputs a dense representation of the input. To solve other NLP tasks, like QA, we modify the model by simply adding a shallow DL layer connecting to the output of the original model. Then, we retrain the model with data and labels specific to the task end-to-end.

In short, there is a pre-training phase in which we create a dense representation of the input (the left diagram below). The second phase retunes the model with task-specific data, like MNLI or SQuAD, to solve the target NLP problem.



BERT uses the Transformer encoder we discussed to create the vector representation.

Input/Output Representations

But first, let’s define how input is assembled and what output is expected for the pre-trained model. First, the model needs to take one or two word-sequences to handle different spectrums of NLP tasks.


All inputs start with a special token [CLS] (a special classification token). If the input is composed of two sequences, a [SEP] token is put between Sequence A and Sequence B.

If the input has T tokens, including the added tokens, the output will have T outputs also. Different parts of the output will be used to make predictions for different NLP tasks. The first output is C (or sometimes written as the output [CLS] token). It is the only output used for any NLP classification task. For non-classification tasks with only one sequence, we use the remaining outputs (without C).


So, how do we compose the input embedding? In BERT, the input embedding is composed of the word piece embedding, the segment embedding, and the position embedding, all of the same dimension. We add them together to form the final input embedding.

Modified from source

Instead of using every single word as a token, BERT breaks a word into word pieces to reduce the vocabulary size (a vocabulary of 30,000 tokens). For example, the word “helping” is decomposed into “help” and “ing”. Then it applies an embedding matrix (V × H) to convert the one-hot vector in Rⱽ for “help” to a vector in Rᴴ.

The segment embedding models which sequence a token belongs to. Does the token belong to the first sentence or the second sentence? So it has a vocabulary size of two (segment A or B). Intuitively, it adds a constant offset to the embedding, with a value based on whether the token belongs to sequence A or B. Mathematically, we apply an embedding matrix (2 × H) to convert R² to Rᴴ. The last embedding is the position embedding. It serves the same purpose as in the Transformer: identifying the absolute or relative position of words.
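The three lookups and their sum can be sketched as below. To keep the demo light, the vocabulary is cut to 1000 entries instead of 30,000. The token ids are invented, and the random tables stand in for learned ones (BERT learns its position table rather than using the fixed sinusoids).

```python
import numpy as np

rng = np.random.default_rng(0)
V, H, max_len = 1000, 768, 512       # toy vocabulary; H as in BERT base
token_emb = rng.normal(scale=0.02, size=(V, H))            # word piece table
segment_emb = rng.normal(scale=0.02, size=(2, H))          # segment A or B
position_emb = rng.normal(scale=0.02, size=(max_len, H))   # one row per position

def bert_input_embedding(token_ids, segment_ids):
    positions = np.arange(len(token_ids))
    return (token_emb[token_ids]
            + segment_emb[segment_ids]
            + position_emb[positions])

# Hypothetical ids for: [CLS] a b [SEP] c d
token_ids = np.array([101, 7, 8, 102, 9, 10])
segment_ids = np.array([0, 0, 0, 0, 1, 1])
x = bert_input_embedding(token_ids, segment_ids)            # (6, 768)
```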


BERT pre-trains the model with 2 NLP tasks.

Masked LM

The first one is the Masked LM (Masked Language Model). We use the Transformer encoder to generate a vector representation of the input in which some words are masked.

Then BERT applies a shallow decoder to reconstruct the word sequence(s), including the missing words.

In the Masked LM, BERT masks out 15% of the WordPiece tokens. Of these, 80% are replaced with a [MASK] token, 10% with a random token, and 10% keep the original word. The loss is defined as how well BERT predicts the missing words, not the reconstruction error of the whole sequence.

We do not replace 100% of the selected WordPiece tokens with the [MASK] token. Doing so would encourage the model to focus only on predicting the [MASK] positions, rather than on the final objective of creating vector representations for the sequences with context taken into consideration. Instead, BERT replaces 10% with random tokens and keeps the original words for another 10%. This encourages the model to learn what may be correct and what may be wrong for the missing words.
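The 15% and 80/10/10 rule can be sketched in a few lines. As a simplification, each token is selected independently with probability 0.15, which may differ from how the released BERT code picks positions. The vocabulary and tokens are toy values.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)          # None means no loss at this position
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue                       # special or unselected token
        labels[i] = tok                    # the model must predict the original
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, labels

vocab = ["help", "ing", "bank", "river", "win"]
tokens = ["[CLS]", "help", "ing", "bank", "[SEP]"]
inputs, labels = mask_tokens(tokens, vocab)
```

Only positions with a label contribute to the loss, which matches the point that the loss is not the reconstruction error of the whole sequence.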

Next Sentence Prediction (NSP)

The second pre-training task is NSP. The key purpose is to create a representation in the output C that will encode the relations between Sequence A and B. To prepare the training input, about 50% of the time, BERT uses two consecutive sentences as sequences A and B respectively. BERT expects the model to predict “IsNext”, i.e. sequence B should follow sequence A. For the remaining 50% of the time, BERT selects the two word sequences randomly and expects the prediction to be “NotNext”.
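Building such training pairs needs no human labels, as this sketch shows. For simplicity the random sentence is drawn from the same toy document, whereas a real pipeline would likely sample it from elsewhere in the corpus. The function name and the sentences are made up.

```python
import random

def make_nsp_example(sentences, i, rng):
    """sentences: ordered sentences of one document; i < len(sentences) - 1."""
    a = sentences[i]
    if rng.random() < 0.5:
        return a, sentences[i + 1], "IsNext"       # the true next sentence
    # Otherwise pick any sentence except A itself and its true successor.
    candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    return a, rng.choice(candidates), "NotNext"

doc = ["The game started late.",
       "The Patriots scored first.",
       "The crowd went wild.",
       "It rained after halftime."]
example = make_nsp_example(doc, 0, random.Random(0))
```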


In this training, we take the output C and then classify it with a shallow classifier.

As noted, for both pre-training tasks, we create the training data from a corpus without any human labeling.

These two training tasks help BERT to train the vector representation of one or two word-sequences. Other than the context, it likely discovers other linguistic information including semantics and coreference.

Fine-tuning BERT

Once the model is pre-trained, we can add a shallow classifier for any NLP task or a decoder, similar to what we discussed in the pre-training step.


Then, we feed in the task-related data and the corresponding labels to refine all the model parameters end-to-end. That is how the model is trained and refined. So BERT is more about the training strategy than the model architecture. Its encoder is simply the Transformer encoder.

SQuAD Fine-tuning

The fine-tuning for Q&A problems is slightly different. Given the first sentence as the question and the second sentence (paragraph) as the context, we want to find the start and the end position in the second sentence that answer the question. For example, the question is “Who is Obama?” and the context is “Obama was born in Hawaii and he served as the President of the United States.” The model would return the start and the end positions for “President of the United States”.

In the fine-tuning, we will introduce two more trainable vectors S and E. Tᵢ below is the same as T’ᵢ in the diagram above. (T’ is the output corresponding to the position of the second sentence.) S, E and Tᵢ are vectors and have the same dimension.

The dot product S · Tᵢ scores how likely the answer starts at position i, and the dot product E · Tᵢ scores how likely the answer ends at position i. We pass them to a softmax function over all positions to calculate probabilities. With these probabilities, we calculate a loss against the ground truth and train S, E and all other parameters.
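Here is a short sketch of this span scoring. T holds the output vectors for the context tokens, and S and E are the two extra trainable vectors. All of them are random here. The loss shown, the sum of the negative log-probabilities of the true start and end positions, follows the BERT paper. The positions 8 and 12 are invented.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def span_probabilities(T, S, E):
    """T: (n, H) context outputs; S, E: (H,) trainable start and end vectors."""
    p_start = softmax(T @ S)     # softmax over positions of S . T_i
    p_end = softmax(T @ E)       # softmax over positions of E . T_i
    return p_start, p_end

def span_loss(p_start, p_end, start, end):
    # Negative log-likelihood of the ground-truth start and end positions.
    return -(np.log(p_start[start]) + np.log(p_end[end]))

rng = np.random.default_rng(0)
n, H = 14, 768
T = rng.normal(scale=0.1, size=(n, H))
S = rng.normal(scale=0.1, size=H)
E = rng.normal(scale=0.1, size=H)

p_start, p_end = span_probabilities(T, S, E)
loss = span_loss(p_start, p_end, start=8, end=12)
```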


But the model configuration in BERT is different from the Transformer paper. Here is a sample configuration used for the Transformer encoder in BERT.

For example, the base model stacks up 12 encoders, instead of 6. Each output vector has 768 dimensions and the attention uses 12 heads.

Source Code

For those interested in the source code for BERT, here is the source code from Google. For Transformer, here is the source code.


NLP training is resource-intensive. Some BERT models are trained with 64 GB TPUs using multiple nodes. Here is an article on how to scale the training with Nvidia GPUs.

Credit and References

Attention Is All You Need

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding