Machine Translation with Attention

Suyash Khare
6 min read · Jan 25, 2024


Implementing machine translation with a recurrent neural network such as an LSTM works for short to medium-length sentences, but very long sequences can still cause problems such as vanishing gradients. To solve this, we will add an attention mechanism that allows the decoder to access all relevant parts of the input sentence, regardless of its length.

So essentially, we will build an LSTM with simple attention. Let's begin by understanding the architecture and shortcomings of the conventional seq2seq model.

Seq2Seq model:

  • The traditional seq2seq model was introduced by Google in 2014, and it was a revelation at the time.
  • It works by taking one sequence of items, such as words, and outputting another sequence. It does this by mapping a variable-length sequence to a fixed-length memory (a vector) that encodes the overall meaning of the sentence. So essentially, text of any length gets encoded into a vector of the same fixed size. This made the model a powerhouse for machine translation, because the input and output sequences don’t need to have matching lengths.
  • You might also know about the vanishing and exploding gradient problems. In the seq2seq model, LSTMs and GRUs are typically used to mitigate them.

Now let's look at the architecture of the Seq2Seq model:

Seq2Seq model architecture

So as you can see, in a seq2seq model, you have an encoder and a decoder. The encoder takes word tokens as input and returns its final hidden state as output. This hidden state is used by the decoder to generate the translated sentence in the target language.

The Encoder:

The encoder in a Seq2Seq model

The encoder typically consists of an embedding layer and an LSTM module with one or more layers. The embedding layer transforms tokenized words into vectors. At each step in the input sequence, the LSTM module receives the input from the embedding layer, as well as the hidden state from the previous step. The encoder returns the hidden state of the final step, shown here as h4. This final hidden state carries information from the whole sentence and encodes its overall meaning.
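
To make this concrete, here is a minimal encoder sketch in PyTorch. The class name, layer sizes, and vocabulary size are illustrative assumptions, not values from the figure:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Minimal seq2seq encoder: an embedding layer followed by a single-layer LSTM."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embedding(src_tokens)        # (batch, src_len, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)    # outputs holds the hidden state at every step
        return outputs, (h_n, c_n)                   # h_n is the final hidden state ("h4" in the figure)

encoder = Encoder(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 5))             # a toy batch: 2 sentences, 5 tokens each
all_states, (final_h, final_c) = encoder(tokens)    # shapes: (2, 5, 128) and (1, 2, 128)
```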

The Decoder:

The decoder in Seq2Seq model

The decoder architecture is similar to the encoder, with an embedding layer and an LSTM module. You use the output word of each step as the input word for the next step. You also pass the LSTM hidden state to the next step. The thing to note here is that at the start of the sequence, the inputs are the output of the encoder and a <start of sentence> token.
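
A matching decoder sketch (again PyTorch, with illustrative names and sizes) feeds each predicted word back in as the next input and carries the LSTM state across steps; in the full model, the initial state would be the encoder's final hidden state and the first input the <start of sentence> token:

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Minimal seq2seq decoder: embedding + LSTM + linear projection onto the target vocabulary."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_token, state):
        # prev_token: (batch, 1) id of the previously generated word (<sos> at the first step)
        # state: (h, c) carried over from the previous step (initially the encoder's final state)
        embedded = self.embedding(prev_token)        # (batch, 1, embed_dim)
        output, state = self.lstm(embedded, state)   # one decoding step
        logits = self.out(output.squeeze(1))         # (batch, vocab_size) scores over the target words
        return logits, state

# Greedy decoding loop: start from a stand-in encoder state and the <sos> token (id 1 here)
decoder = Decoder(vocab_size=1200)
state = (torch.zeros(1, 2, 128), torch.zeros(1, 2, 128))
token = torch.full((2, 1), 1)
for _ in range(3):
    logits, state = decoder(token, state)
    token = logits.argmax(dim=-1, keepdim=True)      # feed the predicted word back in as the next input
```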

Limitations of Seq2Seq:

Information Bottleneck in Seq2Seq

One major limitation of the traditional seq2seq model is what’s referred to as the information bottleneck. Since seq2seq uses a fixed-length memory for the hidden states, long sequences become problematic. This is because, in traditional seq2seq models, only a fixed amount of information can be passed from the encoder to the decoder no matter how much information is contained in the input sequence. So essentially the blessing of seq2seq, which allows for inputs and outputs to be different sizes, becomes a curse when the input sequence is long.

Intuitively, a straightforward solution would be to pass all the hidden states to the decoder instead of only the final one. However, this quickly becomes inefficient, as you must retain the hidden state for every input step in memory. So what can we do? 🤔

Solution: Pass all hidden states to the decoder. But how?

Optimal way of passing all hidden states to the decoder

You can combine the hidden states into one vector, typically called the context vector. The simplest way to do this is by point-wise addition: since the hidden vectors are all the same size, you can just add them up element by element to produce another vector of the same size. Now the decoder is getting information about every step. However, to predict the first word it likely only needs information from the first few input steps, yet plain addition gives every step equal influence. In essence, this isn’t that much different from just using the last hidden state of the LSTM or GRU.

The solution here is to weigh certain encoder vectors more than others before the point-wise addition, so the vectors that are more important for the next decoder output would have larger weights. This way, the context vector holds more information about the most important words and less information about other words.

Weighted pointwise addition of hidden states to create context vector c
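
As a small illustration (PyTorch, with made-up weights and sizes), here is the plain point-wise sum next to the weighted one:

```python
import torch

encoder_states = torch.randn(5, 128)          # 5 input steps, hidden size 128

# Plain point-wise addition: every step contributes equally
context_plain = encoder_states.sum(dim=0)     # shape: (128,)

# Weighted point-wise addition: steps that matter more for the next output get larger weights
weights = torch.tensor([0.6, 0.25, 0.1, 0.03, 0.02])                     # sums to one
context_weighted = (weights.unsqueeze(1) * encoder_states).sum(dim=0)    # shape: (128,)
```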

How to calculate the weights?

The decoder's previous hidden state contains information about the previous words in the output translation. This means you can compare the decoder states with each encoder state to determine the most important inputs. Intuitively, the decoder can set the weights such that it focuses on only the most important input words for the next prediction. In other words, using the decoder’s last hidden state, we can decide which parts of the input sequence to pay attention to.

Seq2Seq model with attention layer

Attention!

Finally, we have built enough intuition to dive into the attention layer and examine how the context vector is calculated.

The first step is to calculate the alignments, eᵢⱼ, which score how well the inputs around position j match the expected output at position i. The better the match, the higher the score we expect. This is done using a feedforward neural network that takes the encoder and decoder hidden states as inputs, where the weights of the feedforward network are learned along with the rest of the Seq2Seq model. (Note that this is a rudimentary way of calculating alignments; newer and more efficient methods exist as well.)

The attention layer
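
Here is one possible sketch of such a scoring network in PyTorch, using a common concatenate-and-tanh formulation; the exact layout, names, and sizes here are assumptions, since the post doesn't pin them down:

```python
import torch
import torch.nn as nn

class AlignmentScorer(nn.Module):
    """Feedforward net scoring how well encoder step j matches the decoder's state at step i."""
    def __init__(self, enc_dim=128, dec_dim=128, attn_dim=64):
        super().__init__()
        self.W = nn.Linear(enc_dim + dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, dec_dim); encoder_states: (batch, src_len, enc_dim)
        src_len = encoder_states.size(1)
        repeated = decoder_state.unsqueeze(1).expand(-1, src_len, -1)    # pair the decoder state with every input step
        energies = torch.tanh(self.W(torch.cat([repeated, encoder_states], dim=-1)))
        return self.v(energies).squeeze(-1)                              # alignments eᵢⱼ, shape (batch, src_len)

scorer = AlignmentScorer()
scores = scorer(torch.randn(2, 128), torch.randn(2, 5, 128))             # (2, 5): one score per input step
```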

The scores are then turned into weights that range from zero to one using the softmax function, so the weights can be thought of as a probability distribution that sums to one. Finally, each encoder state is multiplied by its respective weight, and the results are summed together into one context vector.

Calculating the context vector
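
A compact sketch of these last two steps, with made-up alignment scores standing in for the feedforward network's output:

```python
import torch
import torch.nn.functional as F

# Alignment scores e_ij for one decoder step over a 5-word input sentence (made-up values)
scores = torch.tensor([2.1, 0.3, -1.0, 0.5, 0.0])

# Softmax turns the scores into weights between zero and one that sum to one
weights = F.softmax(scores, dim=-1)

# Multiply each encoder state by its weight and sum into a single context vector
encoder_states = torch.randn(5, 128)                            # one hidden state per input step
context = (weights.unsqueeze(1) * encoder_states).sum(dim=0)    # shape: (128,)
```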

Conclusion

In conclusion, the implementation of an attention mechanism in machine translation, specifically in the Seq2Seq model using LSTMs, addresses the limitations associated with the traditional approach. The conventional Seq2Seq model, while groundbreaking, suffered from the information bottleneck when dealing with long sequences. By introducing attention, we overcome this hurdle and allow the decoder to selectively focus on relevant parts of the input sentence, regardless of its length.

The attention layer calculates alignments, determining the similarity between the input and output sequences. Through a feedforward neural network, scores are generated, converted into weights using the softmax function, and applied to the encoder states. This results in a context vector that encapsulates information from the input sequence, with a focus on key elements for accurate translation.

In summary, the integration of attention in machine translation facilitates the generation of more accurate and contextually relevant translations, marking a significant stride in improving the efficiency and performance of machine translation systems.
