BERT cased vs BERT uncased

BERT (Bidirectional Encoder Representations from Transformers) is a language model introduced in a paper by researchers at Google AI Language. It is pre-trained on a large corpus of unlabeled text using self-supervised objectives (masked language modeling and next sentence prediction) rather than task-specific labels. BERT builds on the Transformer, an architecture whose attention mechanism learns contextual relations between words (or sub-words) in a text.

This article explores the difference between BERT cased and BERT uncased. The two models differ in how the text is prepared before the WordPiece tokenization step: whether the case of the text is preserved and whether accent markers are kept.

For an in-depth understanding of the BERT model, please follow this link.

BERT uncased and cased in tokenization

In BERT uncased, the text is lowercased before the WordPiece tokenization step, while in BERT cased, the text is passed to the tokenizer exactly as it appears in the input (no changes).

For example, if the input is "OpenGenus", then it is converted to "opengenus" for BERT uncased while BERT cased takes in "OpenGenus".

# BERT uncased
OpenGenus -> opengenus

# BERT cased
OpenGenus -> OpenGenus
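
The case handling described above can be sketched with a toy tokenizer. This is a minimal illustration, not BERT's actual implementation: the two vocabularies are hypothetical and tiny, and `wordpiece` and `tokenize` are names chosen for this example. The `do_lower_case` parameter name mirrors the flag used in the original BERT tokenization code, but the function here is only a stand-in.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-words."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no sub-word matched at this position
        pieces.append(piece)
        start = end
    return pieces


def tokenize(text, vocab, do_lower_case):
    """Optionally lowercase the text, then apply WordPiece to each word."""
    if do_lower_case:
        text = text.lower()
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece(word, vocab))
    return tokens


# Hypothetical toy vocabularies (real BERT vocabularies hold ~30,000 entries).
UNCASED_VOCAB = {"open", "##genus"}
CASED_VOCAB = {"Open", "##Genus", "open", "##genus"}

print(tokenize("OpenGenus", UNCASED_VOCAB, do_lower_case=True))
print(tokenize("OpenGenus", CASED_VOCAB, do_lower_case=False))
```

With these toy vocabularies, the uncased path only ever sees lowercase pieces, while the cased path keeps "Open" and "open" as distinct entries.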

Accent markers

In BERT uncased, accent markers are stripped out, while in BERT cased, accent markers are preserved.
Accent markers (diacritics) are marks placed over or under letters. They are common in many languages written in the Latin script, such as French and Spanish.

In terms of accent markers, we have:

# BERT uncased
OpènGènus -> opengenus

# BERT cased
OpènGènus -> OpènGènus

Note the letter "e" in the above example. It has an accent marker over it.
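
The accent handling can be sketched with Python's standard `unicodedata` module: decompose each character into a base letter plus combining marks (NFD normalization), then drop the combining marks. This is a minimal sketch of the idea, and `strip_accents` and `preprocess` are names chosen for this example rather than part of any library.

```python
import unicodedata


def strip_accents(text):
    """Remove combining accent marks, keeping the base letters."""
    # NFD splits "è" into "e" followed by a combining grave accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def preprocess(text, uncased):
    """Uncased: lowercase and strip accents. Cased: leave the text as-is."""
    if uncased:
        return strip_accents(text.lower())
    return text


print(preprocess("OpènGènus", uncased=True))   # opengenus
print(preprocess("OpènGènus", uncased=False))  # OpènGènus
```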

Applications (BERT uncased vs cased)

BERT uncased is typically the better choice, except in applications where the case of the text carries important information.

Named Entity Recognition and Part-of-Speech tagging are two such applications: capitalization often signals a proper noun or the start of a sentence, so BERT cased is the better choice for them.
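
To see why case matters for these tasks, the sketch below groups words that become identical after lowercasing. The word list is an illustrative assumption and `collapsed_by_lowercasing` is a name chosen for this example. Pairs such as "Apple" (a company) and "apple" (a fruit) are distinct for a cased model but indistinguishable for an uncased one.

```python
def collapsed_by_lowercasing(words):
    """Return groups of distinct words that lowercasing maps to one form."""
    groups = {}
    for word in words:
        groups.setdefault(word.lower(), set()).add(word)
    # Keep only the forms where lowercasing actually merged several words.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Illustrative word list (an assumption, not data from any benchmark).
words = ["Apple", "apple", "US", "us", "bank"]
print(collapsed_by_lowercasing(words))
```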

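The number of layers and the overall architecture are the same across BERT uncased and BERT cased. The two differ in their vocabulary, and therefore only slightly in the size of the embedding table.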

The following table lists the SQuAD 1.1 and MultiNLI scores of BERT uncased and cased:

| Model | SQuAD 1.1 F1/EM | MultiNLI Accuracy |
| --- | --- | --- |
| BERT-Large, Uncased (Original) | 91.0/84.3 | 86.05 |
| BERT-Large, Uncased (Whole Word Masking) | 92.8/86.7 | 87.07 |
| BERT-Large, Cased (Original) | 91.5/84.8 | 86.09 |
| BERT-Large, Cased (Whole Word Masking) | 92.9/86.7 | 86.46 |

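Note that for the original models, BERT cased scores slightly higher on both tasks. With Whole Word Masking, the SQuAD scores are nearly identical, while BERT uncased is ahead on MultiNLI (87.07 vs 86.46). The better choice therefore depends on the application, and there are applications where BERT uncased works well.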
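
The gaps between the cased and uncased models can be computed directly from the scores listed above. The sketch below copies those numbers and prints cased minus uncased for each metric. `SCORES` and `cased_minus_uncased` are names chosen for this example.

```python
# Scores copied from the table above: (SQuAD F1, SQuAD EM, MultiNLI accuracy).
SCORES = {
    ("Original", "Uncased"): (91.0, 84.3, 86.05),
    ("Original", "Cased"): (91.5, 84.8, 86.09),
    ("Whole Word Masking", "Uncased"): (92.8, 86.7, 87.07),
    ("Whole Word Masking", "Cased"): (92.9, 86.7, 86.46),
}


def cased_minus_uncased(variant):
    """Per-metric difference; positive means the cased model scores higher."""
    cased = SCORES[(variant, "Cased")]
    uncased = SCORES[(variant, "Uncased")]
    return tuple(round(c - u, 2) for c, u in zip(cased, uncased))


for variant in ("Original", "Whole Word Masking"):
    f1, em, mnli = cased_minus_uncased(variant)
    print(f"{variant}: F1 {f1:+.2f}, EM {em:+.2f}, MultiNLI {mnli:+.2f}")
```

A positive value favors BERT cased and a negative value favors BERT uncased. Differences this small may not be significant in practice.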