Tokenization in NLP [Complete Guide]
In this article, we will look at the different approaches to tokenization and their pros and cons in Natural Language Processing (NLP).
Table of Contents
- What is Tokenization
- Rule-Based Tokenization
- Dictionary-Based Tokenization
- Statistical-Based Tokenization
- White Space Tokenization
- Penn Tree Tokenization
- Moses Tokenization
- Subword Tokenization
- Byte-Pair Encoding
What is Tokenization?
Tokenization is an essential part of natural language processing (NLP). It involves splitting a text into smaller pieces, known as tokens. These tokens can be words, phrases or even characters and are the basis for any NLP task such as sentiment analysis, machine translation and text summarization.
Approaches to Tokenization
There are three primary approaches to tokenization: rule-based, dictionary-based, and statistical-based.
Rule-Based Tokenization
Rule-based tokenization involves using predefined rules to break a text into tokens. These rules are usually based on regular expressions or grammar patterns. For example, a simple rule could be to split a text into tokens by whitespace characters like spaces or tabs. Another rule could be to split a text by punctuation marks like commas or periods.
Example
Consider the following text:
The quick brown fox jumped over the lazy dog.
A rule-based tokenizer might split this text into the following tokens:
The, quick, brown, fox, jumped, over, the, lazy, dog, .
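As a minimal sketch, a rule like this can be written with Python's re module (the regular expression here is just one possible rule, not a complete tokenizer):

import re

text = "The quick brown fox jumped over the lazy dog."
# Rule: a token is either a run of word characters or a single punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']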
Pros and Cons
One advantage of rule-based tokenization is that it is simple and fast. It is also easy to customize rules for specific tasks or languages. However, the downside is that it can be error-prone if the rules are not carefully crafted. Moreover, rule-based tokenization might not handle unusual cases that don't fit the rules.
Dictionary-Based Tokenization
Dictionary-based tokenization involves using a predefined list of words or phrases to match text segments with tokens. These lists are usually called dictionaries or lexicons. Dictionary-based tokenization is commonly used for handling specific types of tokens like emoticons, slang, or abbreviations.
Example
Consider the following text:
I'm feeling :-)
A dictionary-based tokenizer might split this text into the following tokens:
I'm, feeling, :-)
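A minimal sketch of this idea, assuming a small hand-built lexicon that protects tokens such as contractions and emoticons (the lexicon and the fallback rule are purely illustrative):

import re

lexicon = {"I'm", ":-)", ":-("}  # hypothetical dictionary of tokens to keep intact
text = "I'm feeling :-)"

tokens = []
for chunk in text.split():
    if chunk in lexicon:
        tokens.append(chunk)  # matched a dictionary entry, keep it whole
    else:
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))  # fall back to a simple rule
print(tokens)  # ["I'm", 'feeling', ':-)']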
Pros and Cons
The advantage of dictionary-based tokenization is that it can handle unusual cases that might not fit the rules. Moreover, it can be customized for specific tasks or domains. However, the downside is that it requires a pre-built dictionary, which might not be available for all cases. Moreover, it might miss out on new or unknown words that are not in the dictionary.
Statistical-Based Tokenization
Statistical-based tokenization involves using statistical models trained on large corpora of text to learn probabilistic patterns for splitting text into tokens. These models use various features like n-gram frequencies or part-of-speech tags to determine where to insert boundaries between tokens.
Example
Consider the following text:
Thisisthebestthingever
A statistical-based tokenizer might split this text into the following tokens:
This, is, the, best, thing, ever
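A toy sketch of the idea, assuming a small hand-made unigram frequency table and a Viterbi-style search for the most probable split (real systems estimate these statistics from large corpora):

import math

# Hypothetical unigram counts; a real model would learn these from a large corpus
freq = {"this": 5, "is": 10, "the": 20, "best": 3, "thing": 4, "ever": 3}
total = sum(freq.values())

def segment(s, max_len=10):
    # best[i] holds (log-probability, tokens) for the best segmentation of s[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * len(s)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            piece = s[j:i].lower()
            if piece in freq and best[j][0] > -math.inf:
                score = best[j][0] + math.log(freq[piece] / total)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [s[j:i]])
    return best[len(s)][1]

print(segment("Thisisthebestthingever"))  # ['This', 'is', 'the', 'best', 'thing', 'ever']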
Pros and Cons
The advantage of statistical-based tokenization is that it can handle a wide range of languages and domains. Moreover, it can adapt to new or unknown words by updating the statistical model. However, the downside is that it requires a large corpus of text for training, which might not be available for all cases. Moreover, it can be computationally expensive, especially for deep learning-based models.
White Space Tokenization
In this method, tokens are separated by whitespace characters like space, tab, or newline. For example, consider the following sentence:
"The quick brown fox jumps over the lazy dog."
The white space tokenization of this sentence would be:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog."]
Pros:
Easy to implement and understand.
Splits text based on natural language word boundaries.
Cons:
Fails to separate punctuation and symbols from adjacent words (for example, "dog." stays a single token).
Fails to tokenize words that are hyphenated or contain internal punctuation.
Penn Tree Tokenization
This method is based on the Penn Treebank tokenization guidelines, which are widely used in natural language processing. In this method, tokens are separated by whitespace characters, but some punctuation marks are treated as separate tokens. For example, consider the same sentence as above:
"The quick brown fox jumps over the lazy dog."
The Penn Tree tokenization of this sentence would be:
["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
Pros:
More accurate than white space tokenization when handling punctuation, symbols, and hyphenated words.
Handles internal punctuation well.
Cons:
May split contractions or words with apostrophes into separate tokens (as shown below), which can affect downstream natural language processing tasks.
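For instance, NLTK's Treebank-style word_tokenize (assuming NLTK is installed) splits contractions like this:

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models; newer NLTK versions may also need 'punkt_tab'
print(word_tokenize("I don't like it."))  # ['I', 'do', "n't", 'like', 'it', '.']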
Moses Tokenization
This method is a more advanced form of tokenization, where tokens are separated by whitespace characters and certain punctuation marks. In addition, it can handle Unicode characters and special cases like abbreviations and numbers. For example, consider the following sentence:
"I bought a book for $10.50 from Amazon."
The Moses tokenization of this sentence would be:
["I", "bought", "a", "book", "for", "__aSyNcId_<_VRYrT_Bw__quot;, "10.50", "from", "Amazon", "."]
Pros:
Handles a wide range of special characters and punctuation, including handling apostrophes and hyphens.
Uses language-specific rules (such as non-breaking prefixes) so that abbreviations and numbers are not split incorrectly.
Cons:
Can be slower than other tokenization methods.
May over-split text, leading to a larger number of tokens.
Subword Tokenization
This method is useful when dealing with languages with a large vocabulary or for handling out-of-vocabulary words. In this method, words are broken down into subword units based on their frequency in the corpus. For example, consider the word "unbelievable". The subword tokenization of this word could be:
["un", "be", "liev", "able"]
Pros:
Handles out-of-vocabulary words well, by breaking them down into smaller subword units.
Can be used to generate a fixed-size vocabulary.
Cons:
Can result in a large number of subwords for longer words, leading to longer token sequences and higher memory usage.
Can be computationally expensive to train, particularly for larger datasets.
Byte-Pair Encoding
This is a data compression technique that can also be used for tokenization. It works by repeatedly merging the most frequently occurring pair of symbols (initially individual bytes or characters) into a single new symbol, and iterating until a desired vocabulary size is reached. For example, consider the word "banana". After merging the most frequent pair "an", the encoding of this word would be:
["b", "an", "an", "a"]
Pros:
Similar to subword tokenization, it can handle out-of-vocabulary words by breaking them down into smaller subword units.
It can learn a more compact vocabulary as compared to subword tokenization.
Cons:
Can result in a large number of subwords for longer words, leading to longer token sequences and higher memory usage.
Can be computationally expensive to train, particularly for larger datasets.
Note: The pros and cons listed above are not exhaustive and may vary depending on the specific use case and implementation details.
Code Implementation
# Import required libraries
import re
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')
from sacremoses import MosesTokenizer
import io
import sentencepiece as spm
from tokenizers import ByteLevelBPETokenizer
# Define sample text
text = "The quick brown fox jumps over the lazy dog. I bought a book for $10.50 from Amazon. This is an unbelievable achievement."
# White Space Tokenization
tokens_ws = text.split()
print("White Space Tokenization:")
print(tokens_ws)
# Penn Tree Tokenization
tokens_pt = word_tokenize(text)
print("Penn Tree Tokenization:")
print(tokens_pt)
# Moses Tokenization
mt = MosesTokenizer()
tokens_mt = mt.tokenize(text, return_str=False)
print("Moses Tokenization:")
print(tokens_mt)
# Subword Tokenization
model = io.BytesIO()
# Train a small SentencePiece model directly on the in-memory sample text; vocab_size is treated as a soft limit because the corpus is tiny
spm.SentencePieceTrainer.train(sentence_iterator=iter([text]), model_writer=model, vocab_size=50, hard_vocab_limit=False)
tokenizer_subword = spm.SentencePieceProcessor(model_proto=model.getvalue())
tokens_subword = tokenizer_subword.encode_as_pieces(text)
print("Subword Tokenization:")
print(tokens_subword)
# Byte-Pair Encoding
tokenizer_bpe = ByteLevelBPETokenizer()
tokenizer_bpe.train_from_iterator([text], vocab_size=100, min_frequency=1)  # train from the raw string rather than from files
tokens_bpe = tokenizer_bpe.encode(text).tokens
print("Byte-Pair Encoding:")
print(tokens_bpe)
# Output:
White Space Tokenization:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.', 'I', 'bought', 'a', 'book', 'for', '$10.50', 'from', 'Amazon.', 'This', 'is', 'an', 'unbelievable', 'achievement.']
Penn Tree Tokenization:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.', 'I', 'bought', 'a', 'book', 'for', '$', '10.50', 'from', 'Amazon', '.', 'This', 'is', 'an', 'unbelievable', 'achievement', '.']
The Moses, subword, and byte-pair encoding outputs follow in the same format; the exact subword and BPE pieces depend on the vocabulary learned from the sample text.