Tokenization in NLP [Complete Guide]

In this article, we will look at the different approaches to tokenization and their pros and cons in Natural Language Processing (NLP).

Table of Contents

What is Tokenization?
Rule-Based Tokenization
Dictionary-Based Tokenization
Statistical-Based Tokenization
White Space Tokenization
Penn Tree Tokenization
Moses Tokenization
Subword Tokenization
Byte-Pair Encoding

What is Tokenization?

Tokenization is an essential part of natural language processing (NLP). It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis, machine translation and text summarization.

Approaches to Tokenization

There are three primary approaches to tokenization: rule-based, dictionary-based, and statistical-based.

Rule-Based Tokenization

Rule-based tokenization involves using predefined rules to break a text into tokens. These rules are usually based on regular expressions or grammar patterns. For example, a simple rule could be to split a text into tokens by whitespace characters like spaces or tabs. Another rule could be to split a text by punctuation marks like commas or periods.

Example
Consider the following text:

The quick brown fox jumped over the lazy dog.

A rule-based tokenizer might split this text into the following tokens:

The, quick, brown, fox, jumped, over, the, lazy, dog, .
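
The two rules described above (split on whitespace, treat punctuation marks as separate tokens) can be sketched with a single regular expression. This is a minimal illustration; the name rule_based_tokenize and the pattern itself are assumptions made for this sketch, not a standard tokenizer.

```python
import re

# Rule 1: a run of word characters is one token.
# Rule 2: any single character that is neither a word character
#         nor whitespace (for example "." or ",") is its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def rule_based_tokenize(text):
    """Split text into word and punctuation tokens (illustrative rules only)."""
    return TOKEN_PATTERN.findall(text)


print(rule_based_tokenize("The quick brown fox jumped over the lazy dog."))
```

Under these rules a contraction such as "don't" comes out as three tokens (don, ', t), which shows how quickly simple rules run into cases they were not written for.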

Pros and Cons

One advantage of rule-based tokenization is that it is simple and fast. It is also easy to customize rules for specific tasks or languages. However, the downside is that it can be error-prone if the rules are not carefully crafted. Moreover, rule-based tokenization might not handle unusual cases that don't fit the rules.

Dictionary-Based Tokenization

Dictionary-based tokenization involves using a predefined list of words or phrases to match text segments with tokens. These lists are usually called dictionaries or lexicons. Dictionary-based tokenization is commonly used for handling specific types of tokens like emoticons, slang, or abbreviations.

Example
Consider the following text:

I'm feeling :-)

A dictionary-based tokenizer might split this text into the following tokens:

I'm, feeling, :-)
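
One way to sketch this idea is to try the dictionary entries first and fall back to simple rules for everything else. The lexicon below is a small hypothetical list, and dictionary_tokenize is a name chosen for this sketch only.

```python
import re

# Hypothetical lexicon of multi-character tokens that must stay intact.
LEXICON = [":-)", ":-(", ";-)", "<3"]


def build_pattern(lexicon):
    # Try longer entries first so a long entry wins over a shorter prefix of it.
    entries = sorted(lexicon, key=len, reverse=True)
    protected = "|".join(re.escape(entry) for entry in entries)
    # Fallback rules: words (with an optional apostrophe part), then single symbols.
    return re.compile(rf"{protected}|\w+(?:'\w+)?|[^\w\s]")


def dictionary_tokenize(text, lexicon=LEXICON):
    """Keep lexicon entries whole; tokenize the rest with the fallback rules."""
    return build_pattern(lexicon).findall(text)


print(dictionary_tokenize("I'm feeling :-)"))
print(dictionary_tokenize("Great news:-) <3"))
```

Without the lexicon, the fallback rules alone would break ":-)" into three separate symbols.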

Pros and Cons

The advantage of dictionary-based tokenization is that it can handle unusual cases that might not fit the rules. Moreover, it can be customized for specific tasks or domains. However, the downside is that it requires a pre-built dictionary, which might not be available for all cases. Moreover, it might miss out on new or unknown words that are not in the dictionary.

Statistical-Based Tokenization

Statistical-based tokenization involves using statistical models trained on large corpora of text to learn probabilistic patterns for splitting text into tokens. These models use various features like n-gram frequencies or part-of-speech tags to determine where to insert boundaries between tokens.

Example
Consider the following text:

Thisisthebestthingever

A statistical-based tokenizer might split this text into the following tokens:

This, is, the, best, thing, ever
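
A very small statistical segmenter can be sketched with word counts and dynamic programming: among all ways of cutting the string into known words, pick the one with the highest total log-probability. The counts below are hypothetical stand-ins for frequencies that would normally be estimated from a large corpus, and the input is lowercased for simplicity.

```python
import math

# Hypothetical word counts; a real model would estimate these from a corpus.
COUNTS = {
    "the": 200, "is": 100, "he": 60, "this": 50, "his": 40, "thing": 30,
    "ever": 25, "best": 20, "thin": 5, "eve": 3, "t": 1,
}


def segment(text, counts):
    """Return the most probable split of text into known words, or None."""
    total = sum(counts.values())
    max_len = max(len(word) for word in counts)
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in counts and best[start] > -math.inf:
                score = best[start] + math.log(counts[word] / total)
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None  # no split into known words exists
    tokens = []
    end = n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


print(segment("Thisisthebestthingever".lower(), COUNTS))
```

Splits such as "t" + "his" are possible under this vocabulary, but they score lower than "this" because every extra word multiplies in another probability below one.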

Pros and Cons

The advantage of statistical-based tokenization is that it can handle a wide range of languages and domains. Moreover, it can adapt to new or unknown words by updating the statistical model. However, the downside is that it requires a large corpus of text for training, which might not be available for all cases. Moreover, it can be computationally expensive, especially for deep learning-based models.

White Space Tokenization

In this method, tokens are separated by whitespace characters like space, tab, or newline. For example, consider the following sentence:

"The quick brown fox jumps over the lazy dog."

The white space tokenization of this sentence would be:

["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog."]

Pros:
Easy to implement and understand.
Works reasonably well for languages that separate words with spaces, such as English.

Cons:
Leaves punctuation and symbols attached to neighbouring words (for example, "dog." in the output above).
Cannot split hyphenated words or words with internal punctuation, such as contractions, into smaller units.

Penn Tree Tokenization

This method follows the Penn Treebank tokenization guidelines, which are widely used in natural language processing. Tokens are separated by whitespace, most punctuation marks become separate tokens, and contractions are split into their parts (for example, "don't" becomes "do" and "n't"). For example, consider the same sentence as above:

"The quick brown fox jumps over the lazy dog."

The Penn Tree tokenization of this sentence would be:

["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]

Pros:
More accurate than white space tokenization when handling punctuation, symbols, and hyphenated words.
Handles internal punctuation well.

Cons:
May split contractions or words with apostrophes into separate tokens, which can affect downstream natural language processing tasks.
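
The effect on contractions can be seen in a rough sketch. The function below only approximates a few Treebank-style rules with regular expressions; it is not the actual Penn Treebank rule set, and treebank_like_tokenize is a name invented for this illustration.

```python
import re


def treebank_like_tokenize(text):
    """Approximate a few Treebank-style splits (simplified, illustrative)."""
    # Split negative contractions: don't -> do n't
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Split common clitics: it's -> it 's, we'll -> we 'll
    text = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    # Separate punctuation that never occurs inside a number or word here.
    text = re.sub(r"([,;:!?])", r" \1 ", text)
    # Separate a period only when whitespace or the end of the text follows,
    # so that decimals such as 10.50 stay in one piece.
    text = re.sub(r"\.(?=\s|$)", " .", text)
    return text.split()


print(treebank_like_tokenize("I don't think it's the dog's fault."))
```

A downstream component that expects whole words then has to cope with pieces such as "n't" and "'s" as tokens in their own right.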

Moses Tokenization

This method comes from the Moses statistical machine translation toolkit. Tokens are separated by whitespace and certain punctuation marks, and language-specific rules handle Unicode characters and special cases like abbreviations and numbers. For example, consider the following sentence:

"I bought a book for $10.50 from Amazon."

The Moses tokenization of this sentence would be:

["I", "bought", "a", "book", "for", "$", "10.50", "from", "Amazon", "."]

Pros:
Handles a wide range of special characters and punctuation, including handling apostrophes and hyphens.
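Uses language-specific rules, such as lists of abbreviations that keep their trailing period, which reduces wrong splits. Note that it works at the word level and does not split words into subwords.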

Cons:
Can be slower than other tokenization methods.
May over-split text, leading to a larger number of tokens.

Subword Tokenization

This method is useful when dealing with languages with a large vocabulary or for handling out-of-vocabulary words. In this method, words are broken down into subword units based on their frequency in the corpus. For example, consider the word "unbelievable". The subword tokenization of this word could be:

["un", "be", "liev", "able"]

Pros:
Handles out-of-vocabulary words well, by breaking them down into smaller subword units.
Can be used to generate a fixed-size vocabulary.

Cons:
Can result in a large number of subwords for longer words, leading to longer token sequences and higher memory usage.
Can be computationally expensive to train, particularly for larger datasets.
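
Once a subword vocabulary exists, one common way to apply it is greedy longest-match-first segmentation. The sketch below assumes a tiny hand-written vocabulary (hypothetical, not learned from corpus frequencies) and falls back to single characters when nothing matches.

```python
# Hypothetical subword vocabulary; a real one is learned from a corpus.
VOCAB = {"un", "be", "liev", "able", "break"}


def subword_tokenize(word, vocab):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until an entry matches.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matched: emit a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(subword_tokenize("unbelievable", VOCAB))
print(subword_tokenize("unbreakable", VOCAB))
print(subword_tokenize("unseen", VOCAB))
```

The last call shows both effects listed above: a word outside the vocabulary is still tokenized, but at the cost of a longer token sequence.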

Byte-Pair Encoding

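Byte-pair encoding (BPE) started as a data compression technique and is now widely used for subword tokenization. As a compression method, it replaces the most frequent pair of adjacent bytes with a single unused byte. As a tokenizer, it starts from individual characters (or bytes), merges the most frequent pair of adjacent symbols into a new symbol, and repeats this process until a desired vocabulary size is reached. For example, consider the word "banana". If the pair ("a", "n") is the first merge learned, the byte-pair encoding of this word would be:

["b", "an", "an", "a"]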

Pros:
As a subword method, it can handle out-of-vocabulary words by breaking them down into smaller units that are in the vocabulary.
The vocabulary size is controlled directly by the number of merges, so the vocabulary can be kept compact.
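
The merge loop at the heart of BPE can be sketched in a few lines. This is a character-level toy version under stated assumptions: the function names are invented for the sketch, the corpus is a plain list of words, and ties between equally frequent pairs are broken by whichever pair was seen first.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of pair in symbols with one symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merges from a list of words."""
    vocab = Counter(tuple(word) for word in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def bpe_encode(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["banana"], num_merges=1)
print(merges)
print(bpe_encode("banana", merges))
print(bpe_encode("bandana", merges))
```

Each merge adds exactly one symbol to the vocabulary, which is why the number of merges determines the final vocabulary size.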

Note: The pros and cons listed above are not exhaustive and may vary depending on the specific use case and implementation details.

Code Implementation

# Import required libraries
import io

import nltk
import sentencepiece as spm
from nltk.tokenize import word_tokenize
from sacremoses import MosesTokenizer
from tokenizers import ByteLevelBPETokenizer

# word_tokenize needs the Punkt sentence model; newer NLTK releases
# may ask for the 'punkt_tab' resource instead of 'punkt'
nltk.download('punkt')

# Define sample text
text = "The quick brown fox jumps over the lazy dog. I bought a book for $10.50 from Amazon. This is an unbelievable achievement."

# White Space Tokenization
tokens_ws = text.split()
print("White Space Tokenization:")
print(tokens_ws)

# Penn Tree Tokenization (word_tokenize applies Treebank-style rules)
tokens_pt = word_tokenize(text)
print("Penn Tree Tokenization:")
print(tokens_pt)

# Moses Tokenization
mt = MosesTokenizer()
tokens_mt = mt.tokenize(text, return_str=False)
print("Moses Tokenization:")
print(tokens_mt)

# Subword Tokenization (SentencePiece model trained in memory on the sample text)
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter([text]),
    model_writer=model,
    vocab_size=50,
    hard_vocab_limit=False,  # treat vocab_size as a soft limit for this tiny corpus
)
tokenizer_subword = spm.SentencePieceProcessor(model_proto=model.getvalue())
tokens_subword = tokenizer_subword.encode_as_pieces(text)
print("Subword Tokenization:")
print(tokens_subword)

# Byte-Pair Encoding
# train() expects file paths, so train_from_iterator() is used for in-memory text
tokenizer_bpe = ByteLevelBPETokenizer()
tokenizer_bpe.train_from_iterator([text], min_frequency=1)
tokens_bpe = tokenizer_bpe.encode(text).tokens
print("Byte-Pair Encoding:")
print(tokens_bpe)


# Output:


White Space Tokenization:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'I', 'bought', 'a', 'book', 'for', '$10.50', 'from', 'Amazon.', 'This', 'is', 'an', 'unbelievable', 'achievement.']

Penn Tree Tokenization:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'I', 'bought', 'a', 'book', 'for', '$', '10.50', 'from', 'Amazon', '.', 'This', 'is', 'an', 'unbelievable', 'achievement', '.']

Moses Tokenization:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'I', 'bought', 'a', 'book', 'for', '$', '10.50', 'from', 'Amazon', '.', 'This', 'is', 'an', 'unbelievable', 'achievement', '.']

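Subword Tokenization (illustrative; the exact pieces depend on the trained model and vocab_size, and a model trained on one sentence with a very small vocabulary will split most words into shorter pieces than shown here):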
['▁The', '▁quick', '▁brown', '▁fox', '▁jumps', '▁over', '▁the', '▁lazy', '▁dog', '.', '▁I', '▁bought', '▁a', '▁book', '▁for', '▁$', '1', '0', '.', '5', '0', '▁from', '▁A', 'm', 'a', 'zon', '.', '▁This', '▁is', '▁an', '▁un', 'b', 'e', 'l', 'iev', 'a', 'ble', '▁a', 'c', 'h', 'i', 'e', 'v', 'e', 'ment', '.']

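Byte-Pair Encoding (illustrative; the learned merges depend on the training text, vocab_size and min_frequency; "Ġ" marks a token that was preceded by a space):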
['The', 'Ġquick', 'Ġbrown', 'Ġfox', 'Ġjumps', 'Ġover', 'Ġthe', 'Ġlazy', 'Ġdog', '.', 'ĠI', 'Ġbought', 'Ġa', 'Ġbook', 'Ġfor', 'Ġ$', '10', '.', '50', 'Ġfrom', 'ĠAmazon', '.', 'ĠThis', 'Ġis', 'Ġan', 'Ġun', 'believable', 'Ġachievement', '.']

Conclusion

With this article at OpenGenus, you should now have a clear picture of tokenization in NLP.

Tokenization is an important step in NLP because it affects the accuracy and efficiency of downstream tasks. Rule-based, dictionary-based, and statistical-based tokenization are the most common approaches. Each has its own pros and cons, so the right tokenization method for an NLP pipeline depends on the specific requirements of the task at hand.
