Attention Mechanism & Code— NLP is easy

Fragkoulis Logothetis
3 min read · Jan 11, 2021

by F. N. LOGOTHETIS

“Attention is all you need”

Recurrent neural networks (RNNs), long short-term memory (LSTM) and gated recurrent neural networks (GRNN) in particular, have been established as state-of-the-art approaches in sequence modeling and related problems such as language modeling and machine translation.

Recurrent models are built from recurrent nodes which take as input the hidden state (h) of the previous node and a new embedding (v), and generate a new hidden state (h') as a function of these inputs. For each new word or character a new hidden state is produced, resulting in a long chain of recurrent nodes. This inherently sequential nature (a new hidden state for each new word) precludes parallelization and does not take advantage of modern hardware (GPUs, TPUs). Another caveat of recurrent units is the lack of long-term memory: despite the huge effort of researchers to build RNNs that are more robust over long histories, the experimental results show only minor improvements. To recap, the basic disadvantages of RNNs are the weak long-term memory and the lack of parallelization. The former can be addressed by using an attention mechanism.
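To make the sequential update concrete, here is a minimal NumPy sketch of a single recurrent step, assuming a vanilla (tanh) RNN cell; the names (rnn_step, W_h, W_v) and the sizes are illustrative, not taken from any particular library.

```python
import numpy as np

def rnn_step(h, v, W_h, W_v, b):
    """One recurrent step: the new hidden state h' is a function of the
    previous hidden state h and the new token embedding v."""
    return np.tanh(W_h @ h + W_v @ v + b)

# Illustrative dimensions: hidden size 4, embedding size 3.
rng = np.random.default_rng(0)
W_h, W_v, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)

h = np.zeros(4)                       # initial hidden state
for v in rng.normal(size=(5, 3)):     # 5 token embeddings, processed one by one
    h = rnn_step(h, v, W_h, W_v, b)   # strictly sequential: no parallelism over tokens
```

Note how each step depends on the previous hidden state, which is exactly why the loop cannot be parallelized over tokens.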

What is the attention mechanism (AM)?

Let’s think like humans. When we translate a sentence from English to Greek, we pay attention only to the words that carry the core meaning of the sentence; we never try to remember the whole text in order to translate a single sentence. This idea is the inspiration behind the attention mechanism (AM). AM has become an integral part of compelling sequence modelling. The basic idea of the attention mechanism is the ability to model dependencies without regard to their distance in the input sentence.

Let’s assume that we have a set of embedding tokens v = {v_1, v_2, …, v_n}, where each v_i is a vector of dimension (dim); stacking all n tokens gives an (n x dim) matrix.

In short, a self-attention layer projects the embeddings to the Query (Q), Key (K) and Value (V) matrices (this process is similar to answering a database query). Technically speaking, the keys are embeddings stored in a database, and we ask (using Q) to retrieve only the keys (K) that are similar to Q. But how is the similarity computed? We only have to compute the correlation matrix Z = QK^T, where ^T is the transpose operation. Z holds the similarity score of each query with every key, hence Z is an (n x n)-dimensional matrix. The scores of Z are then normalized with a softmax, so that each row lies in [0, 1] and sums to 1 (in the original paper the scores are first scaled by 1/sqrt(d_k), the dimension of the keys). The final step is to multiply the normalized scores by V, producing new embeddings that are enriched with information from the most similar tokens. The following image depicts the core steps of self-attention. It is important to remember that the inner dot-product QK^T has asymptotic memory complexity O(n^2).

Let’s now dive into the Python code.
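Since the original snippet is not reproduced here, the following is a minimal NumPy sketch of the steps described above. The function and weight names (self_attention, W_q, W_k, W_v) are my own, and the scaling by sqrt(d_k) follows the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(V_in, W_q, W_k, W_v):
    """Scaled dot-product self-attention over n token embeddings.

    V_in : (n, dim) matrix of input embeddings.
    W_q, W_k, W_v : (dim, d_k) projection matrices for Q, K, V.
    Returns the (n, d_k) matrix of enriched embeddings.
    """
    Q = V_in @ W_q                            # queries
    K = V_in @ W_k                            # keys
    V = V_in @ W_v                            # values
    Z = Q @ K.T                               # (n, n) similarity / correlation matrix
    A = softmax(Z / np.sqrt(K.shape[-1]))     # normalized attention weights
    return A @ V                              # new embeddings: weighted mix of values

# Illustrative usage: n = 5 tokens, dim = 8, d_k = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)        # shape (5, 8)
```

The (n x n) matrix Z built inside the function is exactly the term responsible for the O(n^2) memory cost mentioned above.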

What is the Multi-Head Attention (MH-AM)?

Instead of performing a single attention function, it is more beneficial to linearly project Q, K and V h times with different learned linear projections. On each of these projected versions of Q, K and V we perform the attention mechanism in parallel, yielding h different enriched embeddings. These are concatenated and projected once more, giving the final attention output, as depicted in the following figure.
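A minimal sketch of this idea, reusing the self_attention function from the previous snippet; again, the names (multi_head_attention, heads, W_o) and the dimensions are illustrative assumptions.

```python
import numpy as np
# Assumes self_attention() from the previous sketch is defined in the same script.

def multi_head_attention(X, heads, W_o):
    """Multi-head attention: run self-attention h times with different
    projections, concatenate the results, then apply a final projection.

    X     : (n, dim) input embeddings.
    heads : list of (W_q, W_k, W_v) projection triples, one per head.
    W_o   : (h * d_k, dim) output projection.
    """
    outputs = [self_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_o   # (n, dim)

# Illustrative usage: h = 2 heads, dim = 8, d_k = 4.
rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
W_o = rng.normal(size=(2 * 4, 8))
out = multi_head_attention(X, heads, W_o)           # shape (5, 8)
```

Each head works in a smaller subspace (d_k = dim / h), so the total cost stays close to that of a single full-width attention while letting the heads focus on different relations.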

Applications of Attention Models

  1. AM is used as an intermediate module in seq2seq models, in order to improve the performance of language models.
  2. AM is the main building block of the Transformer architecture.

References:

  1. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need”, 2017.
