Token Classification in Python with HuggingFace

In this article, we will learn about token classification, its applications, and how it can be implemented in Python using the HuggingFace library.

Table of Contents

  • What are tokens?
  • What is token classification, and how is it used?
  • IOB Tagging Format
  • Token classification implementation using HuggingFace
  • Conclusion

What are tokens?

Tokens are smaller units of a piece of text, separated by punctuation (period, comma, etc.) or whitespace (space, newline, etc.).

As an example, let's take a look at the following sentence:
"I like natural language processing! It is fun."

The tokens of this sentence are:
"I", "like", "natural", "language", "processing", "!", "It", "is", "fun", "."

What is token classification, and how is it used?

Token classification is a natural language understanding task in which a label is predicted for each token in a piece of text. This is different from text classification because each token within the text receives a prediction. Some common token classification tasks include:

  • Named Entity Recognition (NER) - used to identify specific entities in a piece of text (e.g. person, location, company, etc.)
  • Part-of-Speech (POS) Tagging - classify each token as the correct part of speech (e.g. noun, adjective, verb, etc.).

Here's an example of how token classification can be applied for Named-Entity Recognition (NER).

The following output was produced by the spaCy library in Python.

[Image: spaCy NER output with the entities in the sentence highlighted and labeled]

As shown in the picture above, the entities in the text (Apple, U.K., $1 billion) are highlighted along with their entity types (organization, geopolitical entity, and money).

Our focus in this article, however, is not specifically on NER. It is on token classification, and how we can create our own token classification model using the HuggingFace Python library. This token classification model can then be used for NER.

Inside-Outside-Beginning (IOB) Tagging Format

IOB is a common tagging format used for token classification tasks. It assigns labels to tokens that are part of a specific "chunk". A chunk is essentially a group of words that corresponds to a specific class. For example, in Named Entity Recognition (NER), not all entities are one token. Entities could belong to a chunk - for example, "New York City" is one entity that consists of multiple tokens.

IOB marks tokens that are part of a chunk with a prefix, either "I-" or "B-". The I- prefix is added to tokens inside a chunk, and the O tag is assigned to tokens that are not part of any chunk. The B- prefix is only used when a token begins a chunk that immediately follows another chunk of the same type, with no O tags in between. Another very similar format, called IOB2, adds the B- prefix to the first token of every chunk, regardless of what comes before it.

Here's an example of IOB tagging for named entity recognition:

Michael I-PER
lives O
in O
New I-LOC
York I-LOC
City I-LOC

With IOB2, the tags would look like this:

Michael B-PER
lives O
in O
New B-LOC
York I-LOC
City I-LOC

For IOB2, notice how the beginning of each chunk has a B- prefix, regardless of whether or not it immediately follows another chunk.

The IOB or IOB2 formats are used by token classification models. An understanding of these tagging formats makes it much easier to understand how token classification models work.
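
To make the difference between the two formats concrete, here is a small sketch (our own helper, not part of any library) that converts I-only IOB tags into IOB2 by adding the B- prefix to the first token of every chunk:

def iob_to_iob2(tags):
    # Mark the first token of every chunk with B- (the IOB2 convention).
    iob2 = []
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            prev = tags[i - 1] if i > 0 else "O"
            # A new chunk starts if the previous tag is O or belongs to a different entity type.
            if prev == "O" or prev[2:] != tag[2:]:
                tag = "B-" + tag[2:]
        iob2.append(tag)
    return iob2

print(iob_to_iob2(["I-PER", "O", "O", "I-LOC", "I-LOC", "I-LOC"]))
# ['B-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC']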

Token classification implementation using HuggingFace

We will use the HuggingFace Python library for this part of the article.

Installation

The following pip commands will install the necessary Python libraries.

pip install datasets
pip install transformers

If needed, use this link to see additional installation instructions.

This code works very well in a notebook-based environment, like Google Colab.
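
To quickly confirm that the installation worked, you can print the installed versions (the exact version numbers will vary depending on when you install):

import transformers
import datasets

print(transformers.__version__)
print(datasets.__version__)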

Dataset

We will be using the conll2003 dataset. This dataset is used for Named Entity Recognition (NER). It contains four types of entities: persons, locations, organizations, and miscellaneous entities that don't belong to any of the other three categories.

Below is the code for downloading and loading the dataset.

from datasets import load_dataset

raw_dataset = load_dataset("conll2003")

Printing this dataset allows us to see the columns and splits of the dataset (training, validation, and testing).

print(raw_dataset)
Output:
DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3454
    })
})

Let's take a look at the first element of the dataset.

item = raw_dataset['train'][0] # returns a dict object
print(item['tokens'])
print(item['ner_tags'])
Output:
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]

The tokens key of each element in the dataset returns a list of tokens.
The ner_tags key of each element returns a list of each token's NER tag.

This dataset uses IOB2 Tagging, and each NER tag has an index.

We can see a full list of NER labels in this dataset using the following code:

print(raw_dataset["train"].features["ner_tags"].feature.names)
Output:
['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
  • B-ORG/I-ORG - Organization
  • B-PER/I-PER - Person
  • B-LOC/I-LOC - Location
  • B-MISC/I-MISC - Miscellaneous
  • O - doesn't correspond to any entity (as explained in the IOB tagging section)

These NER tags are referenced by their index. That means
[3, 0, 7, 0, 0, 0, 7, 0, 0] (the NER tags for the first element in the dataset) corresponds to ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
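
We can verify this ourselves by converting the indices of the first training example back into label names:

label_names = raw_dataset["train"].features["ner_tags"].feature.names
item = raw_dataset["train"][0]
print([label_names[idx] for idx in item["ner_tags"]])
# ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']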

Tokenization

Before the data can be understood by a model, the texts need to be converted into token IDs. Token IDs are numerical representations of tokens that will be used as input by the model.

There are several methods that can be used for tokenization. Each works a bit differently, but the goal is the same: converting text into units the model can look up. We will use the tokenizer of the pre-trained BERT model, which is a WordPiece tokenizer. WordPiece was introduced in this paper.

This code creates an AutoTokenizer object using the BERT pre-trained model.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

To tokenize our input, we call tokenizer on the input data. Below is an example of how we would tokenize the first element in the dataset. Notice how the parameter is_split_into_words is True - this is because the dataset has already been split into words (as we saw previously, where each element had a list of tokens).

inputs = tokenizer(raw_dataset["train"][0]["tokens"], is_split_into_words=True)
print(inputs.tokens())
Output:
['[CLS]', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'la', '##mb', '.', '[SEP]']

As shown above, the tokenizer added the token [CLS] to the beginning and [SEP] to the end. [CLS] is a special token that indicates the start of a new input, and [SEP] indicates the end of the input. These special tokens are used by the token classification model to determine when a new input starts.

Also, the tokenizer tokenized 'lamb' into two subtokens: 'la' and '##mb'. This is because BERT's vocabulary is fixed (around 30,000 tokens), and tokens that are not found in its vocabulary are represented as subtokens and characters. This creates a length mismatch between the labels and the tokens, since there are more tokens than labels.
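
We can see this mismatch directly with the tokenizer's word_ids() method, which returns None for the special tokens and the index of the original word for every other token. Note how word index 7 ('lamb') appears twice:

print(inputs.word_ids())
# [None, 0, 1, 2, 3, 4, 5, 6, 7, 7, 8, None]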

To fix this issue, HuggingFace has provided a helpful function called tokenize_and_align_labels. In this method, special tokens get a label of -100, because -100 is ignored by the loss function (cross entropy) we will use. Also, only the first token of each word gets its original label. All other subtokens of a word get the label -100.

Here's their implementation:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
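
As a quick sanity check, here is what this alignment produces for the first training example. The special tokens and the second subtoken of 'lamb' get the label -100, while every other token keeps its original NER tag:

example = raw_dataset["train"][0]
aligned = tokenize_and_align_labels({"tokens": [example["tokens"]], "ner_tags": [example["ner_tags"]]})
print(aligned["labels"][0])
# [-100, 3, 0, 7, 0, 0, 0, 7, 0, -100, 0, -100]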

The dataset has a map() method that can apply the tokenize_and_align_labels function to each entry in the dataset.

tokenized_dataset = raw_dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_dataset["train"].column_names
)

Now, if we inspect our tokenized dataset, we will see some new columns:

print(tokenized_dataset)
Output: 
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 14042
    })
    validation: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3251
    })
    test: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 3454
    })
})
  • input_ids are required parameters to be passed to the model as input. They are numerical representations of the tokens.
  • labels contains the correct class for each token. It is the column we changed in the tokenize_and_align_labels() function.
  • attention_mask is an optional argument used when batching sequences together. 1 describes a token that should be attended to, and 0 is assigned to padded indices. Padding is done in order to make each sequence the same length (it will be done in the next step).
  • token_type_ids are typically used in next sentence prediction tasks, where two sentences are given. Since we only pass a single sequence per example, the tokenizer assigns 0 to every token.

Data Collator

We will use DataCollatorForTokenClassification to create a batch of examples. This pads the text and labels to the length of the longest element in its batch, so each sample is a uniform length.

We use the following code to create a DataCollatorForTokenClassification object:

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
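
To see the collator in action, we can pass it the first two tokenized training examples. They have different lengths, so the shorter one gets padded, and the padded label positions are filled with -100 so they are ignored by the loss:

batch = data_collator([tokenized_dataset["train"][i] for i in range(2)])
print(batch["labels"])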

Training

We will use the AutoModelForTokenClassification class since this is a token classification task. When we define this model, we need to pass in the name of the pre-trained model we want to use as well as the number of classes. One way to do this is to simply pass the number of classes into the constructor. However, a more effective way is to create an ID-to-label mapping and a label-to-ID mapping using dictionaries. From these two dictionaries, the model can determine how many classes there are, and the mappings will be very useful when testing the model on unseen data.

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

label_names = raw_dataset["train"].features["ner_tags"].feature.names
id2label = {str(i): label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}

model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", id2label=id2label, label2id=label2id)
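
We can confirm that the model inferred the right number of classes from these mappings:

print(model.config.num_labels)  # 9
print(model.config.id2label)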

In the code below, we are first creating a TrainingArguments object that contains some training parameters and configurations for our model. This is then passed into a Trainer object, and when the train() method is called on this object, it trains the model. After each epoch (there are 3 in this example), the model will be evaluated on the validation data and the model checkpoints will be saved in the results directory. These model checkpoints can then be loaded and used later without having to retrain. We also save the model in the saved_model directory.

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
trainer.save_model('./saved_model') #For reuse

If a GPU is found, HuggingFace uses it by default, and the training process should take only a few minutes to complete. Without a GPU, training can take several hours.
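
Once training has finished, we can also run the model on the held-out test split. Without a custom metrics function the Trainer only reports the loss (plus some runtime statistics), but it is a quick way to check that nothing is broken:

metrics = trainer.evaluate(tokenized_dataset["test"])
print(metrics)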

Loading/Testing the Model

Now, we can load the trained Token Classifier from its saved directory with the following code:

from transformers import pipeline

# Replace this with your own checkpoint
model_checkpoint = "./saved_model"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

It's very easy to test it on a new sample:

print(token_classifier("Hello, my name is Joe and I live in Los Angeles"))
Output:
[{'end': 21,
  'entity_group': 'PER',
  'score': 0.99576664,
  'start': 18,
  'word': 'Joe'},
 {'end': 47,
  'entity_group': 'LOC',
  'score': 0.9986893,
  'start': 36,
  'word': 'Los Angeles'}]

The model correctly determined "Joe" to be a person entity and "Los Angeles" to be a location entity.

For each entity the model recognized, it outputs the word, the entity group, and the confidence score. This is where the id2label and label2id mappings are very useful. Since the ID-to-label mapping was passed to the model, the model knows which entity each ID represents, so it can output the actual class name instead of the ID of each entity class. Also, notice how the entity group doesn't contain the prefix. This is because the aggregation recognizes that beginning tags and inside tags represent the same class. For example, B-PER and I-PER both represent a token that is part of a person entity.
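
If you want to double-check this mapping, it is stored in the model configuration that was saved alongside the weights, and the pipeline exposes it directly:

print(token_classifier.model.config.id2label)
# {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}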

Conclusion

In this article at OpenGenus, we learned about what token classification is, why it is used, the various steps involved, and how it can be implemented in Python using HuggingFace.

That's it for this article! Thanks for reading.

Reyansh Bahl

Reyansh has been a Machine Learning Developer, Intern at OpenGenus. He is pursuing his High School Diploma from North Carolina School of Science and Mathematics in Computer Science.

Improved & Reviewed by: OpenGenus Tech Review Team