Token Classification in Python with HuggingFace
In this article, we will learn about token classification, its applications, and how it can be implemented in Python using the HuggingFace library.
Table of Contents
- What are tokens?
- What is token classification, and how is it used?
- IOB Tagging Format
- Token classification implementation using HuggingFace
- Conclusion
What are tokens?
Tokens are smaller units of a piece of text, separated by punctuation (period, comma, etc.) or whitespace (space, newline, etc.).
As an example, let's take a look at the following sentence:
"I like natural language processing! It is fun."
The tokens of this sentence are:
"I", "like", "natural", "language", "processing", "!", "It", "is", "fun", "."
What is token classification, and how is it used?
Token classification is a natural language understanding task in which a label is predicted for each token in a piece of text. This is different from text classification because each token within the text receives a prediction. Some common token classification tasks include:
- Named Entity Recognition (NER) - used to identify specific entities in a piece of text (e.g. person, location, company, etc.)
- Part-of-Speech (POS) Tagging - classify each token as the correct part of speech (e.g. noun, adjective, verb, etc.).
Here's an example of how token classification can be applied for Named-Entity Recognition (NER).
This kind of output can be produced by the spaCy library in Python: the entities in the text (Apple, U.K., $1 billion) are highlighted, along with the type of each entity (organization, geopolitical entity, and money).
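As an aside, here is a minimal sketch of how such an output can be produced with spaCy. It assumes the en_core_web_sm model has been downloaded, and uses spaCy's standard demo sentence; the exact labels may vary by model version.

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

# Print each recognized entity and its label.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Apple ORG
# U.K. GPE
# $1 billion MONEY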
Our focus in this article, however, is not specifically on NER. It is on token classification, and how we can create our own token classification model using the HuggingFace Python library. This token classification model can then be used for NER.
Inside-Outside-Beginning (IOB) Tagging Format
IOB is a common tagging format used for token classification tasks. It assigns labels to tokens that are part of a specific "chunk". A chunk is essentially a group of words that corresponds to a specific class. For example, in Named Entity Recognition (NER), not all entities are one token. Entities could belong to a chunk - for example, "New York City" is one entity that consists of multiple tokens.
IOB tags tokens that are part of a chunk by adding a prefix, either "I-" or "B-". The I- prefix is added to tokens inside a chunk, and an O tag is assigned to tokens that are not part of any chunk. The B- prefix denotes that the token is at the beginning of a chunk, only if this chunk immediately follows another chunk without O tags in between them. Another very similar format, called IOB2, adds the B- prefix to the beginning of all chunks, regardless of their previous chunks.
Here's an example of IOB tagging for named entity recognition:
Michael I-PER
lives O
in O
New I-LOC
York I-LOC
City I-LOC
With IOB2, the tags would look like this:
Michael B-PER
lives O
in O
New B-LOC
York I-LOC
City I-LOC
For IOB2, notice how the beginning of each chunk has a B- prefix, regardless of whether or not it immediately follows another chunk.
The IOB and IOB2 formats are used by token classification models. An understanding of these tagging formats makes it much easier to understand how token classification models work.
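To make the connection concrete, here is a minimal sketch (the function name is illustrative) that groups IOB2-tagged tokens back into entity chunks:

def iob2_to_chunks(tokens, tags):
    """Group IOB2-tagged tokens into (entity_type, text) chunks."""
    chunks, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):  # a new chunk begins
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the current chunk
        else:  # "O" tag (or an inconsistent I- tag) closes the current chunk
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks

tokens = ["Michael", "lives", "in", "New", "York", "City"]
tags   = ["B-PER", "O", "O", "B-LOC", "I-LOC", "I-LOC"]
print(iob2_to_chunks(tokens, tags))
# [('PER', 'Michael'), ('LOC', 'New York City')]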
Token classification implementation using HuggingFace
We will use the HuggingFace Python library for this part of the article.
Installation
The following pip commands will install the necessary Python libraries.
pip install datasets
pip install transformers
If needed, use this link to see additional installation instructions.
This code works very well in a notebook-based environment, like Google Colab.
Dataset
We will be using the conll2003 dataset. This dataset is used for Named Entity Recognition (NER). It contains four types of entities: persons, locations, organizations, and miscellaneous entities that don't belong to any of the other three categories.
Below is the code for downloading and loading the dataset.
from datasets import load_dataset
raw_dataset = load_dataset("conll2003")
Printing this dataset allows us to see its columns and splits (training, validation, and test).
print(raw_dataset)
Output:
DatasetDict({
train: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
num_rows: 14042
})
validation: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
num_rows: 3251
})
test: Dataset({
features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
num_rows: 3454
})
})
Let's take a look at the first element of the dataset.
item = raw_dataset['train'][0] # returns a dict object
print(item['tokens'])
print(item['ner_tags'])
Output:
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]
The tokens key of each element in the dataset returns a list of tokens, and the ner_tags key returns a list of each token's NER tag.
This dataset uses IOB2 Tagging, and each NER tag has an index.
We can see a full list of NER labels in this dataset using the following code:
print(raw_dataset["train"].features["ner_tags"].feature.names)
Output:
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
- B-ORG/I-ORG - Organization
- B-PER/I-PER - Person
- B-LOC/I-LOC - Location
- B-MISC/I-MISC - Miscellaneous
- O - doesn't correspond to any entity (as explained in the IOB tagging section)
These NER tags are referenced by their index. That means [3, 0, 7, 0, 0, 0, 7, 0, 0] (the NER tags for the first element in the dataset) corresponds to ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'].
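A short snippet makes this index-to-label conversion explicit:

# Look up the label name for each NER tag index of the first example.
label_names = raw_dataset["train"].features["ner_tags"].feature.names

item = raw_dataset["train"][0]
print([label_names[tag] for tag in item["ner_tags"]])
# ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']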
Tokenization
Before the data can be understood by a model, the texts need to be converted into token IDs. Token IDs are numerical representations of tokens that will be used as input by the model.
There are several methods that can be used for tokenization. Each works a little differently, but the goal is the same: split the text into units that can be mapped to IDs. We will use the tokenizer of the pre-trained BERT model, which is a WordPiece tokenizer. WordPiece was introduced in this paper.
This code creates an AutoTokenizer object using the BERT pre-trained model.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
To tokenize our input, we call tokenizer on the input data. Below is an example of how we would tokenize the first element in the dataset. Notice how the parameter is_split_into_words is True - this is because the dataset has already been split into words (as we saw previously, where each element had a list of tokens).
inputs = tokenizer(raw_dataset["train"][0]["tokens"], is_split_into_words=True)
print(inputs.tokens())
Output:
['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']
As shown above, the tokenizer added the token [CLS] to the beginning and [SEP] to the end. [CLS] is a special token that indicates the start of a new input, and [SEP] indicates the end of the input. These special tokens are used by the token classification model to determine when a new input starts.
Also, the tokenizer tokenized 'lamb' into two subtokens: 'la' and '##mb'. This is because BERT's vocabulary is fixed (around 30,000 tokens), and tokens that are not found in its vocabulary are represented as subtokens and characters. This creates a length mismatch between the labels and the tokens, since there are more tokens than labels.
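We can see this mismatch directly through the tokenizer's word_ids() method, which maps each token back to the index of the word it came from (None for special tokens). The output in the comments is what we'd expect for this example:

# Map each token back to its original word index; None marks special tokens.
print(inputs.word_ids())
# [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]

# 12 tokens, but only 9 labels in the original example.
print(len(inputs.tokens()), len(raw_dataset["train"][0]["ner_tags"]))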
To fix this issue, HuggingFace provides a helpful function called tokenize_and_align_labels. In this function, special tokens get a label of -100, because -100 is ignored by the loss function (cross-entropy) we will use. Also, only the first token of each word gets its original label; all other subtokens of a word get the label -100.
Here's their implementation:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
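As a quick sanity check (the expected output in the comment follows from the alignment rules above), we can call the function on a small batched slice of the training split:

# A "batch" of one example, in the dict-of-lists format that map(batched=True) uses.
sample = raw_dataset["train"][:1]
aligned = tokenize_and_align_labels(sample)
print(aligned["labels"][0])
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]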
The dataset has a map() method that can apply the tokenize_and_align_labels function to each entry in the dataset.
tokenized_dataset = raw_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_dataset["train"].column_names,
)
Now, if we inspect our tokenized dataset, we will see some new columns:
print(tokenized_dataset)
Output:
DatasetDict({
train: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
num_rows: 14042
})
validation: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
num_rows: 3251
})
test: Dataset({
features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
num_rows: 3454
})
})
- input_ids are required parameters to be passed to the model as input. They are numerical representations of the tokens (decoded back into tokens in the sketch below).
- labels contains the correct class for each token. It is the column we added in the tokenize_and_align_labels() function.
- attention_mask is an optional argument used when batching sequences together. 1 marks a token that should be attended to, and 0 is assigned to padded indices. Padding is done to make each sequence in a batch the same length (it happens in the next step).
- token_type_ids are typically used in next-sentence-prediction tasks, where two sentences are given. Unless we specify two token types, the tokenizer assigns 0 to each token.
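As a quick illustration of the input_ids column, we can convert the IDs of the first training example back into tokens; the decoded tokens should match the tokenizer output we saw earlier.

sample = tokenized_dataset["train"][0]

# input_ids is a list of integer token IDs...
print(sample["input_ids"])

# ...which map back to the tokens produced by the tokenizer.
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))
# ['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']

# labels aligns with input_ids, with -100 for special tokens and subtokens.
print(sample["labels"])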
Data Collator
We will use DataCollatorForTokenClassification to create a batch of examples. This pads the text and labels to the length of the longest element in its batch, so each sample is a uniform length. We use the following code to create a DataCollatorForTokenClassification object:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
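As a quick check of what the collator does (a sketch; the exact shapes depend on the examples chosen), we can collate the first two tokenized examples into one batch:

# Collate two examples of different lengths into one padded batch of tensors.
batch = data_collator([tokenized_dataset["train"][i] for i in range(2)])

print(batch["input_ids"].shape)  # both sequences are padded to the longer length
print(batch["labels"])           # the shorter sequence's labels are padded with -100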
Training
We will use the AutoModelForTokenClassification class since this is a token classification task. When we define this model, we need to pass in the name of the pre-trained model we want to use as well as the number of classes. One way to do this is to simply pass the number of classes into the constructor. However, a more effective way is to create an ID-to-label mapping and a label-to-ID mapping using dictionaries. From these two dictionaries, the model can determine how many classes there are, and they are very useful when testing the model on unseen data.
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
label_names = raw_dataset["train"].features["ner_tags"].feature.names
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", id2label=id2label, label2id=label2id)
In the code below, we first create a TrainingArguments object that contains some training parameters and configurations for our model. This is then passed into a Trainer object, and when the train() method is called on that object, it trains the model. After each epoch (there are 3 in this example), the model is evaluated on the validation data, and model checkpoints are saved in the results directory. These checkpoints can then be loaded and used later without having to retrain. We also save the final model in the saved_model directory.
training_args = TrainingArguments(
output_dir="./results",
evaluation_strategy="epoch",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=3,
weight_decay=0.01,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["validation"],
tokenizer=tokenizer,
data_collator=data_collator,
)
trainer.train()
trainer.save_model('./saved_model') #For reuse
If a GPU is found, HuggingFace should use it by default, and the training process should take a few minutes to complete. Without a GPU, training can take several hours to complete.
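If training is interrupted, the checkpoints saved in the results directory can be used to pick up where training left off instead of starting over. A minimal sketch (the specific checkpoint folder name is illustrative):

# Resume from the latest checkpoint found in output_dir ("./results").
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint folder (name is illustrative).
# trainer.train(resume_from_checkpoint="./results/checkpoint-500")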
Loading/Testing the Model
Now, we can load the trained Token Classifier from its saved directory with the following code:
from transformers import pipeline
# Replace this with your own checkpoint
model_checkpoint = "./saved_model"
token_classifier = pipeline(
"token-classification", model=model_checkpoint, aggregation_strategy="simple"
)
It's very easy to test it on a new sample:
print(token_classifier("Hello, my name is Joe and I live in Los Angeles"))
Output:
[{'end': 21,
'entity_group': 'PER',
'score': 0.99576664,
'start': 18,
'word': 'Joe'},
{'end': 47,
'entity_group': 'LOC',
'score': 0.9986893,
'start': 36,
'word': 'Los Angeles'}]
The model correctly determined "Joe" to be a person entity and "Los Angeles" to be a location entity.
For each entity the model recognized, it outputs the word, the entity group, and the confidence score. This is where the id2label and label2id mappings are very useful. Since the ID-to-label mapping was passed to the model, the model knows what entity each ID represents, so it is able to output the actual class name instead of the ID of each entity class. Also, notice how the entity group doesn't contain the B- or I- prefix. This is because the pipeline recognizes that beginning tags and inside tags of the same type represent the same class; for example, B-PER and I-PER both mark a token that is part of a person entity.
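Because we passed aggregation_strategy="simple", the pipeline merges the B-/I- tagged subtokens into whole entity groups. To see the raw per-token predictions with their prefixes intact, the pipeline can be created without an aggregation strategy; a brief sketch reusing the saved model:

# Default aggregation ("none") returns one prediction per subtoken,
# with the B-/I- prefix still attached (e.g. B-PER for 'Joe', B-LOC for 'Los', I-LOC for 'Angeles').
raw_classifier = pipeline("token-classification", model=model_checkpoint)
print(raw_classifier("Hello, my name is Joe and I live in Los Angeles"))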
Conclusion
In this article at OpenGenus, we learned what token classification is, why it is used, the steps involved, and how it can be implemented in Python using HuggingFace.
That's it for this article! Thanks for reading.