Introduction to Multilingual BERT (M-BERT)

In the previous article, we discussed in depth how BERT works for the Natural Language Inference (NLI) task. In this article, we look at what Multilingual BERT (M-BERT) is and give a general introduction to the model.

Introduction


Deep learning has revolutionized NLP with the introduction of models such as BERT. BERT is pre-trained on huge amounts of unlabeled text using self-supervised objectives (masked language modeling and next-sentence prediction), so no human-labeled data is required. However, the original BERT was trained on English text, leaving low-resource languages such as Icelandic behind.

There are a few approaches to overcoming this problem.

One option is Machine Translation (MT): translate every language other than English into English, then apply the standard English BERT to the problem you are tackling. The drawback is that MT is expensive, since building or paying for translation for every language adds significant cost. Translation also introduces errors, and those errors creep into the inputs our BERT model sees. So this approach is not very practical.
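The translate-then-classify pipeline can be sketched as follows. This is a minimal illustration: `translate` and `english_bert_classify` are hypothetical stubs standing in for a real MT service and a real English-only BERT classifier.

```python
# Sketch of the translate-then-classify approach.
# Both functions below are stubs for illustration only.

def translate(text: str, src_lang: str, tgt_lang: str = "en") -> str:
    """Stub standing in for a real (and costly) MT system."""
    toy_mt = {("Þetta er frábært!", "is"): "This is great!"}
    return toy_mt.get((text, src_lang), text)

def english_bert_classify(text: str) -> str:
    """Stub standing in for an English-only BERT classifier."""
    return "positive" if "great" in text.lower() else "negative"

def classify_any_language(text: str, lang: str) -> str:
    # Every non-English input pays an extra MT call, and any
    # translation error propagates straight into the classifier.
    english = text if lang == "en" else translate(text, lang)
    return english_bert_classify(english)

print(classify_any_language("Þetta er frábært!", "is"))  # → positive
```

The sketch makes the two costs visible: an MT call per non-English input, and a dependence on translation quality that the downstream model cannot correct.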

A second approach is to train one BERT model per language: one for English, one for Swedish, one for German, and so on. This is also impractical because of the enormous combined size of so many BERT models and the cost of maintaining all of them.

What if we could maintain a single BERT model for all languages? This would be the best approach, but it sounds very difficult, almost impossible. Yet it turns out that if you simply train a single BERT on text from many languages, it works for all of the languages it was trained on. This is exactly how M-BERT works.

How is M-BERT trained?

Text from many languages is collected, and a single BERT model is trained on all of it; the result is M-BERT. Google took Wikipedia text from 104 different languages and trained one BERT on all of that data. But because some languages (such as English) have far more Wikipedia text than others (such as Icelandic, where data is scarce), languages with less data were over-sampled and languages with more data were under-sampled during training.
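The over/under-sampling can be done with exponentially smoothed sampling probabilities, which the M-BERT release describes (with an exponent around 0.7). Here is a minimal sketch; the article counts are made-up illustrative numbers, not the real Wikipedia sizes.

```python
# Exponentially smoothed language sampling: raising each language's
# raw data share to a power s < 1 shrinks the gap between
# high-resource and low-resource languages.

def smoothed_sampling_probs(sizes: dict, s: float = 0.7) -> dict:
    total = sum(sizes.values())
    weights = {lang: (n / total) ** s for lang, n in sizes.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Toy corpus sizes (illustrative only).
sizes = {"en": 1_000_000, "is": 1_000}
probs = smoothed_sampling_probs(sizes)
# English's raw share is 99.9%; after smoothing it drops to roughly
# 99.2%, so Icelandic is sampled about eight times more often than
# its raw share would suggest.
```

With s = 1 the probabilities equal the raw data shares (no smoothing), and with s = 0 every language is sampled equally; 0.7 sits between those extremes.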

How to test such a model?

Since M-BERT is trained on 104 languages, we need to evaluate its capacity to learn multiple languages and generalize across them. The Cross-lingual Natural Language Inference (XNLI) dataset has become the standard benchmark for exactly this, and it has been used extensively for multilingual evaluation. It models the common situation where we have plenty of English training data but very little for other languages. Concretely, XNLI provides evaluation data in 14 languages besides English (French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, and Urdu). Each example is a premise-hypothesis pair, and the task is to classify their relationship as entailment, contradiction, or neutral.
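An XNLI-style evaluation loop can be sketched in a few lines. The two examples and the trivial `predict` stub below are made up for illustration; in practice `predict` would be a fine-tuned M-BERT classifier.

```python
# Toy sketch of XNLI-style evaluation: each example is a
# premise-hypothesis pair labelled with one of three relations.

LABELS = ("entailment", "neutral", "contradiction")

test_set = [
    {"premise": "A man is playing a guitar.",
     "hypothesis": "A man is making music.",
     "label": "entailment"},
    {"premise": "A man is playing a guitar.",
     "hypothesis": "The man is asleep.",
     "label": "contradiction"},
]

def predict(premise: str, hypothesis: str) -> str:
    """Stub in place of a fine-tuned M-BERT classifier."""
    return "entailment"

correct = sum(predict(ex["premise"], ex["hypothesis"]) == ex["label"]
              for ex in test_set)
accuracy = correct / len(test_set)
print(accuracy)  # → 0.5
```

In the zero-shot cross-lingual setting, the model is fine-tuned on English pairs only and then this same loop is run on the non-English test sets.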

Some Important Observations

One important point to note is that when training on many languages, we keep a single shared vocabulary for all of them rather than a distinct vocabulary per language. This saves space and, more importantly, encourages the model to learn structure that is shared across languages, rather than just memorizing per-language vocabularies.
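The shared vocabulary is a subword (WordPiece) vocabulary, so related words in different languages can share pieces. Below is a simplified greedy longest-match-first WordPiece sketch over a tiny hand-made vocabulary; the real M-BERT vocabulary has on the order of 100k entries, and these toy pieces are assumptions for illustration.

```python
# Simplified WordPiece tokenization: repeatedly take the longest
# vocabulary entry matching at the current position ("##" marks a
# piece that continues a word).

def wordpiece(word: str, vocab: set) -> list:
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                pieces.append(sub)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched
        start = end
    return pieces

# One shared toy vocabulary serves both languages:
vocab = {"token", "##ization", "##isierung"}
print(wordpiece("tokenization", vocab))   # English
print(wordpiece("tokenisierung", vocab))  # German: shares "token"
```

Because both words reuse the piece `token`, whatever the model learns about that piece in English text can transfer to German text, which is one intuition for why a shared vocabulary helps cross-lingual generalization.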

Remarkably, this simple technique works quite well and achieves strong accuracy. It has also been observed that larger models such as BERT-Large, with far more parameters, perform noticeably better on the XNLI benchmark.

Below are the references for the M-BERT repository and the XNLI dataset.

References

  1. XNLI, by Facebook AI Research: the XNLI dataset used to evaluate M-BERT.
  2. BERT, by Google Research: the research repository and paper for M-BERT.