Interview Questions on Transformers

The Transformer architecture, introduced in the paper “Attention Is All You Need”, has changed how complex and advanced NLP models are built. It is an architecture designed to solve sequence-to-sequence tasks while handling long-range dependencies with ease. The following are important questions for an interview on Transformers.

Table of Contents

Multiple Choice Questions
Descriptive Questions
Practical Questions

Multiple Choice Questions

1. What does “transfer learning” mean?

a. Transferring the knowledge of a pretrained model to a new model by training it on the same dataset.
b. Transferring the knowledge of a pretrained model to a new model by initializing the second model with the first model's weights.
c. Transferring the knowledge of a pretrained model to a new model by building the second model with the same architecture as the first model.

Ans: Option b. Transferring the knowledge of a pretrained model to a new model by initializing the second model with the first model's weights.

Explanation: When the second model is trained on a new task, it transfers the knowledge of the first model.

2. True or false? A language model usually does not need labels for its pretraining.

a. True
b. False
c. Not enough information

Ans: Option a. True

Explanation: The pretraining is usually self-supervised, which means the labels are created automatically from the inputs (like predicting the next word or filling in some masked words).
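
To make the idea of automatically created labels concrete, here is a minimal sketch for the next-word objective. The function name `next_token_pairs` and the naive whitespace split are illustrative assumptions, not part of any library.

```python
def next_token_pairs(text):
    """Build (context, label) pairs for next-word prediction from raw text."""
    tokens = text.split()  # assumption: naive whitespace tokenization
    # Each label is simply the token that follows its context,
    # so no human annotation is needed.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_token_pairs("the cat sat down")
for context, label in pairs:
    print(context, "->", label)
```

Every sentence in a raw corpus yields several training examples this way, which is why pretraining can use unlabeled text.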

3. Which of these types of models would you use for completing prompts with generated text?

a. An encoder model
b. A decoder model
c. A sequence-to-sequence model

Ans: Option b. A decoder model

Explanation: Decoder models are perfectly suited for text generation from a prompt.

4. Which of those types of models would you use for summarizing texts?

a. An encoder model
b. A decoder model
c. A sequence-to-sequence model

Ans: Option c. A sequence-to-sequence model

Explanation: Sequence-to-sequence models are perfectly suited for a summarization task.

5. Which of these types of models would you use for classifying text inputs according to certain labels?

a. An encoder model
b. A decoder model
c. A sequence-to-sequence model

Ans: Option a. An encoder model

Explanation: An encoder model generates a representation of the whole sentence which is perfectly suited for a task like classification.

6. What possible source can the bias observed in a model have?

a. The model is a fine-tuned version of a pretrained model and it picked up its bias from it.
b. The data the model was trained on is biased.
c. The metric the model was optimizing for is biased.

Ans: Option b. The data the model was trained on is biased.

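Explanation: Biased training data is the most obvious source of bias, but not the only one: a fine-tuned model can also inherit bias from its pretrained model, and the metric being optimized can itself be biased.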

7. What is the order of the language modeling pipeline?

a. First, the model, which handles text and returns raw predictions. The tokenizer then makes sense of these predictions and converts them back to text when needed.
b. First, the tokenizer, which handles text and returns IDs. The model handles these IDs and outputs a prediction, which can be some text.
c. The tokenizer handles text and returns IDs. The model handles these IDs and outputs a prediction. The tokenizer can then be used once again to convert these predictions back to some text.

Ans: Option c. The tokenizer handles text and returns IDs. The model handles these IDs and outputs a prediction. The tokenizer can then be used once again to convert these predictions back to some text.

Explanation: The tokenizer can be used for both tokenizing and de-tokenizing.

8. How many dimensions does the tensor output by the base Transformer model have, and what are they?

a. 2: The sequence length and the batch size
b. 2: The sequence length and the hidden size
c. 3: The sequence length, the batch size, and the hidden size

Ans: Option c. 3: The sequence length, the batch size, and the hidden size

Explanation: The tensor output by the Transformer module is usually large. It generally has three dimensions:

Batch size: The number of sequences processed at a time (for instance, 2).
Sequence length: The length of the numerical representation of the sequence (for instance, 16).
Hidden size: The vector dimension of each model input.

9. What is the point of applying a SoftMax function to the logits output by a sequence classification model?

a. It softens the logits so that they're more reliable.
b. It applies a lower and upper bound so that they're understandable.
c. The total sum of the output is then 1, resulting in a possible probabilistic interpretation.
d. Both B and C

Ans: Option d. Both B and C

Explanation: Softmax is an activation function that scales numbers/logits into probabilities. The resulting values are bound between 0 and 1.
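
As a rough illustration of both properties (values between 0 and 1, and a total of 1), here is a small pure-Python softmax. The logits used are made-up numbers, and subtracting the maximum is a common numerical-stability trick rather than something the question requires.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum avoids overflow and does not change the result.
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for three classes
print(probs)
print(sum(probs))
```

The largest logit still gets the largest probability, so the predicted class does not change; only the scale becomes interpretable.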

10. How does the BERT model expect a pair of sentences to be processed?

a. Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2
b. [CLS] Tokens_of_sentence_1 Tokens_of_sentence_2
c. [CLS] Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2 [SEP]
d. [CLS] Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2

Ans: Option c. [CLS] Tokens_of_sentence_1 [SEP] Tokens_of_sentence_2 [SEP]

Explanation: BERT needs the input to be massaged and decorated with some extra metadata:

Token embeddings: A [CLS] token is added to the input word tokens at the beginning of the first sentence and a [SEP] token is inserted at the end of each sentence.
Segment embeddings: A marker indicating Sentence A or Sentence B is added to each token. This allows the encoder to distinguish between sentences.
Positional embeddings: A positional embedding is added to each token to indicate its position in the sentence.
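
The layout described above can be sketched without any library. The helper `build_bert_pair` below is hypothetical and works on already-split tokens; a real tokenizer would also map tokens to vocabulary IDs.

```python
def build_bert_pair(tokens_a, tokens_b):
    """Assemble tokens, segment ids, and position ids for a sentence pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A, and its [SEP];
    # segment 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # One position index per token, counted from the start of the input.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_pair(
    ["how", "are", "you"], ["fine", "thanks"]
)
print(tokens)
print(segment_ids)
print(position_ids)
```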

11. What is the purpose of a collate function?

a. It ensures all the sequences in the dataset have the same length.
b. It puts together all the samples in a batch.
c. It preprocesses the whole dataset.
d. It truncates the sequences in the dataset.

Ans: Option b. It puts together all the samples in a batch.

Explanation: You can pass the collate function as an argument to a DataLoader. For example, DataCollatorWithPadding pads all items in a batch so they have the same length.
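
A pure-Python stand-in can show what a padding collate function does to a batch. The name `pad_collate`, the padding ID of 0, and the token IDs are assumptions for illustration; real tokenizers define their own padding ID.

```python
def pad_collate(batch, pad_id=0):
    """Pad every list of token ids in the batch to the longest length."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (made-up ids).
collated = pad_collate([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(collated["input_ids"])
print(collated["attention_mask"])
```

Padding per batch rather than over the whole dataset keeps each batch only as long as its longest sequence.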

12. What’s the purpose of TrainingArguments?

a. It contains all the hyperparameters used for training and evaluation with the Trainer.
b. It specifies the size of the model.
c. It just contains the hyperparameters used for evaluation.
d. It just contains the hyperparameters used for training.

Ans: Option a. It contains all the hyperparameters used for training and evaluation with the Trainer.

Explanation: Training arguments are a set of arguments related to the training loop that are passed into the Trainer instance. TrainingArguments is used to access all the points of customization during training.

13. Which of the following tasks can be framed as a token classification problem?

a. Find the grammatical components in a sentence.
b. Find the persons mentioned in a sentence.
c. Find whether a sentence is grammatically correct or not.
d. Both A and B

Ans: Option d. Both A and B

Explanation: Words in a sentence can be labelled as a noun, verb, etc. In a sentence, a word can be labelled as person or not person.

14. What does “domain adaptation” mean?

a. It's when we run a model on a dataset and get the predictions for each sample in that dataset.
b. It's when we train a model on a dataset.
c. It's when we fine-tune a pretrained model on a new dataset, and it gives predictions that are more adapted to that dataset
d. It's when we add misclassified samples to a dataset to make our model more robust.

Ans: Option c. It's when we fine-tune a pretrained model on a new dataset, and it gives predictions that are more adapted to that dataset

Explanation: The model adapted its knowledge to the new dataset.

15. What are the labels in a masked language modeling problem?

a. Some of the tokens in the input sentence are randomly masked and the labels are the original input tokens.
b. Some of the tokens in the input sentence are randomly masked and the labels are the original input tokens, shifted to the left.
c. Some of the tokens in the input sentence are randomly masked, and the label is whether the sentence is positive or negative.
d. Some of the tokens in the two input sentences are randomly masked, and the label is whether the two sentences are similar or not.

Ans: Option a. Some of the tokens in the input sentence are randomly masked and the labels are the original input tokens.

Explanation: Masked language modeling is also known as a fill-mask task because it predicts a masked token in a sequence.
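
The relationship between the masked input and its labels can be sketched as follows. This is a simplified illustration: `mask_tokens` is a hypothetical helper, masking probability and seed are arbitrary, and BERT's actual scheme (which sometimes keeps or randomly replaces a selected token) is left out.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_input, labels) for a simplified MLM example."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)  # the label is the original token
        else:
            masked.append(token)
            labels.append(None)  # this position is not scored
    return masked, labels


sentence = "the quick brown fox jumps over the lazy dog".split()
masked, labels = mask_tokens(sentence, mask_prob=0.5)
print(masked)
print(labels)
```

Only masked positions carry a label; libraries commonly mark the other positions with a special ignore value instead of `None`.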

16. Which of these tasks can be seen as a sequence-to-sequence problem?

a. Writing short reviews of long documents
b. Answering questions about a document
c. Translating a text in Chinese into English
d. All of the above

Ans: Option d. All of the above

Explanation: Sequence-to-sequence (often abbreviated to seq2seq) models map an input sequence to an output sequence. They were originally built on Recurrent Neural Network architectures and are now commonly Transformer encoder-decoders. We typically use them (but are not restricted to them) to solve complex language problems such as Machine Translation, Question Answering, creating Chatbots, and Text Summarization.

17. When should you pretrain a new model?

a. When there is no pretrained model available for your specific language
b. When you have concerns about the bias of the pretrained model you are using
c. When you have lots of data available, even if there is a pretrained model that could work on it
d. Both A and B

Ans: Option d. Both A and B

Explanation: Pretraining a new model makes sense in both cases, but you have to make very sure that the data you will use for training is really better.

18. Why is it often unnecessary to specify a loss when calling compile() on a Transformer model?

a. Because Transformer models are trained with unsupervised learning
b. Because the model's internal loss output is used by default
c. Because we compute metrics after training instead
d. Because loss is specified in model.fit() instead

Ans: Option b. Because the model's internal loss output is used by default

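Explanation: In Keras, the standard way to train a model is to create it, compile() it with an optimizer and a loss function, and then fit() it. Transformers models for Keras can compute a suitable loss internally, so when no loss argument is passed to compile(), that internal loss is used. If you do specify a loss argument to compile(), the model will use it instead of the default loss.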

19. What is normalization?

a. It's any cleanup the tokenizer performs on the texts in the initial stages.
b. It's a data augmentation technique that involves making the text more normal by removing rare words.
c. It's the final post-processing step where the tokenizer adds the special tokens.
d. It's when the embeddings are made with mean 0 and standard deviation 1, by subtracting the mean and dividing by the std.

Ans: Option a. It's any cleanup the tokenizer performs on the texts in the initial stages.

Explanation: For instance, normalization might involve removing accents or extra whitespace, or lowercasing the inputs.

20. How does the question-answering pipeline handle long contexts?

a. It doesn't really, as it truncates the long context at the maximum length accepted by the model.
b. It splits the context into several parts and averages the results obtained.
c. It splits the context into several parts (with overlap) and finds the maximum score for an answer in each part.
d. It splits the context into several parts (without overlap, for efficiency) and finds the maximum score for an answer in each part.

Ans: Option c. It splits the context into several parts (with overlap) and finds the maximum score for an answer in each part.

Explanation: The question-answering pipeline splits the context into smaller chunks of a specified maximum length. To make sure the context is not split at exactly the wrong place, which would make it impossible to find the answer, it also includes some overlap between the chunks.
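
The chunking idea can be sketched on a plain list of tokens. The helper `split_with_overlap` and its parameters `max_len` and `stride` (the number of tokens shared by neighbouring chunks) are assumptions for illustration, not the pipeline's actual implementation.

```python
def split_with_overlap(tokens, max_len, stride):
    """Split tokens into windows of at most max_len sharing `stride` tokens."""
    if not 0 <= stride < max_len:
        raise ValueError("stride must be non-negative and smaller than max_len")
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the context
        start += max_len - stride  # step forward, keeping `stride` tokens
    return chunks


chunks = split_with_overlap(list(range(10)), max_len=4, stride=2)
for chunk in chunks:
    print(chunk)
```

An answer that straddles the boundary between two windows still appears whole in at least one of them, as long as it is no longer than the overlap.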

Descriptive Questions

1. What is NLP?

Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. Together, these technologies enable computers to process human language in the form of text or voice data and to ‘understand’ its full meaning, complete with the speaker or writer’s intent and sentiment.

2. What is a seq2seq model?

As the name suggests, seq2seq takes a sequence of words (a sentence or several sentences) as input and generates an output sequence of words. It classically does so with a recurrent neural network (RNN). The vanilla RNN is rarely used; its more advanced variants, LSTM and GRU, are used instead. The network builds up the context of a word by taking two inputs at each point in time: one from the user and the other from its own previous output, hence the name recurrent (output goes back in as input).

It has two main components, an encoder and a decoder, and hence it is sometimes called the Encoder-Decoder Network.

Encoder: It uses deep neural network layers and converts the input words to corresponding hidden vectors. Each vector represents the current word and the context of the word.

Decoder: It is similar to the encoder. It takes as input the hidden vector generated by the encoder, its own hidden states, and the current word to produce the next hidden vector and finally predict the next word.

3. What is the basis of attention mechanism?

In psychology, attention is the cognitive process of selectively concentrating on one or a few things while ignoring others.

A neural network is considered to be an effort to mimic human brain actions in a simplified manner. Attention Mechanism is also an attempt to implement the same action of selectively concentrating on a few relevant things, while ignoring others in deep neural networks.

4. Explain the need for attention

Vanilla Encoder-Decoder architecture passes only the last hidden state from the encoder to the decoder. This leads to the problem that information has to be compressed into a fixed length vector and information can be lost in this compression. Especially information found early in the sequence tends to be “forgotten” after the entire sequence is processed. The addition of bi-directional layers remedies this by processing the input in reversed order. While this helps for shorter sequences, the problem still persists for long input sequences. The development of attention enables the decoder to attend to the whole sequence and thus use the context of the entire sequence during the decoding step.

5. Why choose Attention based models over Recurrent based ones?

An RNN works sequentially: to compute the second word of a sentence (second time step), the first hidden vector (first time step) must already have been calculated. In general, to calculate the hidden state at time t you always have to wait for the result from t-1, so the computation cannot be parallelized across time steps. Moreover, RNNs require a large number of calculations and therefore a lot of resources.

RNN with attention has improved the extraction of temporal dependencies over longer sentences but still struggles with long sequences. In a simple Encoder-Decoder architecture the decoder is supposed to start making predictions by looking only at the final output of the encoder step which has condensed information. On the other hand, attention based architecture attends every hidden state from each encoder node at every time step and then makes predictions after deciding which one is more informative.

6. What is Self Attention?

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention allows the model to look at the other words in the input sequence to get a better understanding of a certain word in the sequence.
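
A minimal sketch of scaled dot-product self-attention, following the formula softmax(QK^T / sqrt(d)) V from “Attention Is All You Need”, may help here. As a simplifying assumption, queries, keys, and values are all taken to be the input vectors themselves; a real Transformer layer first applies learned projections. The input vectors are made up.

```python
import math


def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Self-attention over a list of token vectors, with Q = K = V = x."""
    d = len(x[0])
    output, weights = [], []
    for query in x:
        # Similarity of this position to every position in the same sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x
        ]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted average of all token vectors.
        output.append(
            [sum(w_j * v[i] for w_j, v in zip(w, x)) for i in range(d)]
        )
    return output, weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
output, weights = self_attention(x)
for row in weights:
    print([round(value, 3) for value in row])
```

Each row of `weights` shows how much one position attends to every position of the same sequence, which is what "relating different positions of a single sequence" means in practice.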

7. What is a Transformer?

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing (NLP) and computer vision (CV). The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. Here, “transduction” means the conversion of input sequences into output sequences. The idea behind the Transformer is to handle the dependencies between input and output with attention alone, dispensing with recurrence completely.

8. What are the applications of the Transformer?

Law – In analysing legal records
Healthcare sector – Analysing medical records and drug interaction
Virtual assistants
Machine translation
Text summarization
Document generation
Named Entity Recognition (NER)
Biological sequence analysis
Video understanding.

9. Explain Transformer Architecture

Encoder and Decoder are building blocks of a Transformer. The encoder block turns the sequence of input words into a vector and a Decoder converts a vector into a sequence.

The encoder architecture has two layers: Self Attention and Feed Forward. The encoder’s inputs first pass through a self-attention layer, and the outputs of the self-attention layer are then fed to a feed-forward neural network. Sequential data also has an order: each word holds a position relative to the others. Because self-attention has no built-in notion of order, the Transformer adds positional encodings to the input embeddings.

The decoder architecture has three layers: Self Attention, Encoder-decoder attention, and Feed Forward. The decoder has both the self-attention and feed-forward layer which is also present in the encoder, but between them is an attention layer that helps the decoder focus on relevant parts of the input sentence.

10. Limitations of the Transformer

Transformer is undoubtedly a huge improvement over the RNN based seq2seq models. But it comes with its own share of limitations:

Attention can only deal with fixed-length text strings. The text has to be split into a certain number of segments or chunks before being fed into the system as input.
This chunking of text causes context fragmentation. For example, if a sentence is split in the middle, a significant amount of context is lost. In other words, the text is split without respecting sentence boundaries or any other semantic boundary.

11. Explain BERT

BERT is an open source machine learning framework for natural language processing (NLP). BERT is designed to help computers understand the meaning of ambiguous language in text by using surrounding text to establish context. The BERT framework was pre-trained using text from Wikipedia and can be fine-tuned with question and answer datasets.

Historically, language models could only read text input sequentially, either left-to-right or right-to-left, but couldn't do both at the same time. BERT is different because it is designed to read in both directions at once. This capability, enabled by the introduction of Transformers, is known as bidirectionality.

Using this bidirectional capability, BERT is pre-trained on two different, but related, NLP tasks: Masked Language Modeling and Next Sentence Prediction.

12. Explain Masked Language Modeling (MLM)

The masked language model randomly masks some of the tokens from the input, and the objective is to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective allows the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.

13. Explain Next Sentence Prediction

Generally, language models do not capture the relationship between consecutive sentences. BERT was pre-trained on this task as well. For language model pre-training, BERT uses pairs of sentences as its training data.

For instance, imagine we have a text dataset of 100,000 sentences and we want to pre-train a BERT language model using this dataset. So, there will be 50,000 training examples or pairs of sentences as the training data.

For 50% of the pairs, the second sentence would actually be the next sentence after the first sentence.
For the remaining 50% of the pairs, the second sentence would be a random sentence from the corpus.
The label would be ‘IsNext’ in the first case and ‘NotNext’ in the second case.
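
The pairing scheme can be sketched in a few lines. This is an illustrative simplification: `make_nsp_pairs` is a hypothetical helper, it alternates the two cases to get an exact 50/50 split instead of drawing the case at random, and it assumes the corpus has at least three sentences.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, label) examples from ordered sentences."""
    rng = random.Random(seed)
    examples = []
    # Walk the corpus two sentences at a time:
    # 100,000 sentences would give 50,000 pairs.
    for i in range(0, len(sentences) - 1, 2):
        first = sentences[i]
        if (i // 2) % 2 == 0:
            # Positive case: the true next sentence.
            examples.append((first, sentences[i + 1], "IsNext"))
        else:
            # Negative case: a random sentence that is neither the first
            # sentence itself nor its true successor.
            j = rng.randrange(len(sentences))
            while j in (i, i + 1):
                j = rng.randrange(len(sentences))
            examples.append((first, sentences[j], "NotNext"))
    return examples


corpus = [f"sentence {n}" for n in range(8)]  # toy corpus
examples = make_nsp_pairs(corpus)
for example in examples:
    print(example)
```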

14. What is Named Entity Recognition (NER)?

Named Entity Recognition is a part of information extraction: a method to locate the entities present in unstructured text and classify them into predefined categories, such as persons, locations, or organizations.

15. What is Tokenization?

Tokenization is one of the first steps in any NLP pipeline. It splits the raw text into small chunks, such as words or sentences, called tokens. If the text is split into words, it is called 'Word Tokenization', and if it is split into sentences, it is called 'Sentence Tokenization'. Generally, whitespace is used to perform word tokenization, and characters such as periods, exclamation points, and newlines are used for sentence tokenization. We have to choose the appropriate method for the task at hand.
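
Both kinds of tokenization can be approximated with the standard library. The helpers below are a naive sketch with hypothetical names; treating '?' as a sentence boundary is an added assumption, and real tokenizers handle abbreviations, punctuation, and subwords far more carefully.

```python
import re


def word_tokenize(text):
    """Split text into words on whitespace."""
    return text.split()


def sentence_tokenize(text):
    """Split text after '.', '!' or '?' and at newlines."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [part.strip() for part in parts if part.strip()]


text = "Transformers are powerful. Do they replace RNNs? Often, yes!"
print(word_tokenize(text))
print(sentence_tokenize(text))
```

Note that whitespace splitting leaves punctuation attached to words ("powerful."), which is one reason practical pipelines use dedicated tokenizers.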

Practical Questions

1. How do you set up the working environment for transformers in a Python notebook?

Head over to your Jupyter notebook, local or in Google Colab (preferred). We’ll use pip, the package manager for Python, for the installation.

!pip install transformers

You can make sure the package was correctly installed by importing it within your Python runtime:

import transformers

2. What will the following code return?

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
ner("My name is Derek and I live in Paris.")

Named entity recognition (NER) consists of extracting ‘entities’ from text. Here, the pipeline returns the words that represent persons or locations. Furthermore, with grouped_entities=True, it groups together the tokens belonging to the same entity, like "Paris".

OUTPUT:
[{'end': 16,
  'entity_group': 'PER',
  'score': 0.9991296,
  'start': 11,
  'word': 'Derek'},
 {'end': 36,
  'entity_group': 'LOC',
  'score': 0.99867034,
  'start': 31,
  'word': 'Paris'}]

3. What is the output of the following snippet?

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.tokenize("Hello!")

Tokenization is the process of breaking up a larger entity into its constituent units. The code returns a list of strings, each string being a token.

OUTPUT: ['Hello', '!']

4. Demonstrate tokenizers: tokenization, encoding and decoding

Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data.

Tokenization
The tokenization process is done by the tokenize() method of the tokenizer:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "Hello world"
tokens = tokenizer.tokenize(sequence)
print(tokens)

Output: ['Hello', 'world']

This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary.

Encoding: from tokens to input IDs
The conversion to input IDs is handled by the convert_tokens_to_ids() tokenizer method:

ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

Output: [8667, 1362]

These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model.

Decoding
Decoding is going the other way around: from vocabulary indices, we want to get a string. This can be done with the decode() method as follows:

decoded_string = tokenizer.decode(ids)
print(decoded_string)

Output: Hello world