BERT for Legal Document Classification: A Study on Adaptation and Pretraining


1. Introduction

The field of natural language processing (NLP) has seen remarkable progress in recent years, particularly in the area of deep learning. This progress has contributed to the effective performance of NLP tasks on legal text documents, including violation prediction, overruling prediction, legal judgment prediction, legal information extraction, and court opinion generation. One of the most promising models in NLP is Bidirectional Encoder Representations from Transformers (BERT), which has gained widespread attention due to its effectiveness on various NLP tasks.

BERT is capable of leveraging semantic and syntactic knowledge from pre-training on a large non-labeled corpus, making it a powerful tool for NLP tasks. However, its effectiveness on long legal documents remains a challenge. In addition, pre-training BERT is a costly process that requires access to specialized machines.

In this work, we aim to address these challenges by investigating how to effectively adapt BERT to handle long legal documents, and how important pre-training on in-domain documents is. We will focus on two legal document prediction tasks:

  • the European Court of Human Rights (ECHR) Violation Dataset
  • the Overruling Task Dataset.

Our main objectives are to effectively adapt BERT to deal with long documents, analyze the impact of pre-training on different types of documents on the performance of a fine-tuned BERT model, and evaluate the best practices for using BERT in legal document classification tasks.

2. Background

BERT has been shown to be effective in various NLP tasks but it may not handle long documents well, as seen in previous work (Chalkidis et al., 2019). This work investigates the best ways to adapt BERT for handling long legal documents and the impact of pre-training on different types of documents, specifically in-domain documents. The two main tasks in focus are the ECHR Violation Dataset and Overruling Task Dataset. The study will evaluate different approaches to adapt BERT on long documents and pre-training models to determine the best practices for using BERT in legal document classification tasks. The results of the experiments and insights gained from them will be discussed in detail.

3. Research Questions

This section discusses the research questions that we aim to answer in this paper.

RQ1 For legal text classification, does pre-training on in-domain documents lead to more effective performance than pre-training on general documents?

RQ2 How can BERT-based models be adapted to effectively deal with long documents in legal text classification?

These questions are answered in the Discussion section (Section 6).

4. Experimental Setup

4.1 Hyperparameters
The models are fine-tuned using the AdamW optimizer, with a learning rate of 5e-5 and a linear learning-rate scheduler. The batch size used for fine-tuning is 16, and the models are fine-tuned for 5 epochs on individual tasks.
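To make this setup concrete, the following is a minimal sketch of such a fine-tuning loop with the Hugging Face transformers library; the bert-base-uncased checkpoint, the number of labels, and the `train_dataset` object are placeholders rather than the exact code used in the study.

```python
# Minimal fine-tuning sketch matching the hyperparameters above
# (AdamW, lr 5e-5, linear schedule, batch size 16, 5 epochs).
# NOTE: the checkpoint, num_labels and `train_dataset` are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# `train_dataset` is assumed to yield dicts with input_ids, attention_mask and labels tensors.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

num_epochs = 5
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=num_epochs * len(train_loader),
)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model(**batch).loss   # the model returns a loss when labels are provided
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```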

4.2 Datasets
1. ECHR Violation (Multi-Label) Dataset
The ECHR Violation (Multi-Label) Dataset contains 11,000 cases from the public database of the European Court of Human Rights. The task is to predict which articles of the European Convention on Human Rights are violated in each case. The dataset is split into three parts, with 9,000 cases in the training set, 1,000 cases in the development set, and 1,000 cases in the test set. The average number of tokens in a case ranges from 1,619 to 1,926, which is more than the 512 tokens supported by BERT. This is a multi-label classification task, and performance is evaluated in terms of micro-F1 score, following Chalkidis et al. (2021).
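Because each case can violate several articles at once, the model head must predict a binary vector over the articles. The following is a hedged sketch of such a multi-label setup using the transformers `problem_type` option; the checkpoint, the label count, and the example text are assumptions for illustration only.

```python
# Sketch of a multi-label classification head for the ECHR task.
# The checkpoint, NUM_ARTICLES and the example inputs are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_ARTICLES = 10  # hypothetical number of Convention articles considered

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_ARTICLES,
    problem_type="multi_label_classification",  # sigmoid + binary cross-entropy per label
)

inputs = tokenizer("The applicant alleged a violation of ...", return_tensors="pt",
                   truncation=True, max_length=512)
labels = torch.zeros(1, NUM_ARTICLES)
labels[0, 2] = 1.0                               # hypothetical: article at index 2 is violated
outputs = model(**inputs, labels=labels)         # outputs.loss uses BCEWithLogitsLoss
predicted = (torch.sigmoid(outputs.logits) > 0.5).int()
```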

2. Overruling Task Dataset
The Overruling Task Dataset is a binary classification dataset with 2,400 data points. Each data point is a legal statement that is either overruled or not overruled by the same or a higher-ranked court. The average and maximum numbers of tokens in a statement are 21.94 and 204, respectively, so BERT can process this dataset directly without any alteration. The task is evaluated using 10-fold cross-validation, and the average F1-score across the 10 folds, together with its standard deviation, is reported.
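The evaluation protocol can be illustrated as follows; the `texts`/`labels` lists and the `train_and_predict` helper are placeholders standing in for the fine-tuning procedure of Section 4.1, and stratification by label is our assumption.

```python
# Sketch of the 10-fold cross-validation protocol for the Overruling task.
# texts/labels and train_and_predict are placeholders; stratification is assumed.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def train_and_predict(train_texts, train_labels, test_texts):
    """Placeholder: fine-tune a BERT classifier on this fold and return test predictions."""
    raise NotImplementedError

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kfold.split(texts, labels):   # texts/labels assumed loaded
    preds = train_and_predict([texts[i] for i in train_idx],
                              [labels[i] for i in train_idx],
                              [texts[i] for i in test_idx])
    fold_scores.append(f1_score([labels[i] for i in test_idx], preds))

print(f"mean F1 = {np.mean(fold_scores):.4f} +/- {np.std(fold_scores):.3f}")
```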

4.3 Model Variants
Four variants of the pre-trained BERT-based model were used to deal with long documents (a code sketch of all four strategies follows the list):

RR-Model: Remove tokens at the rear of the input text if its length exceeds 512 tokens, then fine-tune the model on each classification task.
RF-Model: Remove tokens at the front of the input text if its length exceeds 512 tokens, then fine-tune the model on each classification task.
MeanPool-Model: Apply the model to every chunk of n tokens, then average the output vector representations of the chunks dimension-wise to obtain a single document vector, which is passed to a classification layer for each classification task. In this work, we set n = 200.

MaxPool-Model: Apply the model to every chunk of n tokens, then take the dimension-wise maximum over the output vector representations of the chunks as the final document vector, which is passed to a classification layer for each classification task.
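The four strategies can be summarized in code. The following is a rough sketch under our own simplifying assumptions: bert-base-uncased is a placeholder checkpoint, each chunk's [CLS] vector serves as the chunk representation, and `num_labels` is assumed to be defined by the task.

```python
# Rough sketch of the four long-document strategies (RR, RF, MeanPool, MaxPool).
# Assumptions: bert-base-uncased as placeholder encoder, chunk [CLS] vectors as
# chunk representations, n = 200 tokens per chunk, num_labels defined elsewhere.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(encoder.config.hidden_size, num_labels)
MAX_LEN = 512

def truncate(text, keep="front"):
    """RR keeps the front (drops the rear); RF keeps the rear (drops the front)."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    budget = MAX_LEN - 2                                  # room for [CLS] and [SEP]
    if len(ids) > budget:
        ids = ids[:budget] if keep == "front" else ids[-budget:]
    return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]

def chunk_and_pool(text, n=200, pooling="max"):
    """MeanPool/MaxPool: encode n-token chunks, pool their [CLS] vectors dimension-wise."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_vectors = []
    for start in range(0, len(ids), n):
        chunk = [tokenizer.cls_token_id] + ids[start:start + n] + [tokenizer.sep_token_id]
        hidden = encoder(torch.tensor([chunk])).last_hidden_state   # (1, len, hidden)
        chunk_vectors.append(hidden[0, 0])                          # [CLS] vector of the chunk
    stacked = torch.stack(chunk_vectors)                            # (num_chunks, hidden)
    doc_vector = stacked.max(dim=0).values if pooling == "max" else stacked.mean(dim=0)
    return classifier(doc_vector)                                   # task logits
```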

In addition, we include two other baselines that use different attention techniques in order to cope with documents longer than 512 tokens (a loading example follows the list):

BigBird: Fine-tuning the BigBird model from Zaheer et al. (2020), which was pre-trained on English corpora such as BookCorpus and the English portion of CommonCrawl News, on each classification task. BigBird is a variant of BERT that uses several attention techniques, such as random attention, window attention, and global attention, so that it can deal with documents longer than 512 tokens.
LongFormer: Fine-tuning the LongFormer model from Beltagy et al. (2020), which was pre-trained on BookCorpus and English Wikipedia, on each classification task. LongFormer is a variant of BERT that uses several attention techniques, such as sliding window attention, dilated sliding window attention, and global attention, so that it can handle documents longer than 512 tokens.
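Both baselines are available as public checkpoints and can be fine-tuned with the same loop as in Section 4.1. The hub identifiers below are the commonly used public releases and are our assumption, not necessarily the exact checkpoints used in the study; `num_labels` is assumed to be set by the task.

```python
# Loading the two long-document baselines; hub ids are assumptions
# (commonly used public checkpoints), num_labels is defined by the task.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

bigbird = AutoModelForSequenceClassification.from_pretrained(
    "google/bigbird-roberta-base", num_labels=num_labels)      # random + window + global attention
longformer = AutoModelForSequenceClassification.from_pretrained(
    "allenai/longformer-base-4096", num_labels=num_labels)     # sliding-window + global attention

bigbird_tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
longformer_tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

# Both models accept sequences of up to 4,096 tokens, so long documents can be
# fed directly without the truncation or chunking used for vanilla BERT.
```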

4.4 Evaluation Metrics
The evaluation metrics used in our case study are as follows (a short scikit-learn sketch follows the list):

1. F1 Score: The F1 score is the harmonic mean of precision and recall, summarizing a model's accuracy in a single number.
F1 score = 2 * (Precision * Recall) / (Precision + Recall)

Where precision refers to the ratio of true positive predictions over the total number of positive predictions, and recall refers to the ratio of true positive predictions over the total number of actual positive cases.

2. Micro F1: The micro F1 score is used for evaluating multi-class and multi-label classification models, such as the ECHR violation task. It is computed by aggregating the true positive, false positive, and false negative counts of all classes (labels) into global counts, and then applying the standard F1 formula to those aggregated counts.

3. Mean F1: The mean F1-score (also known as the average F1-score) is calculated by averaging F1-scores, either across all classes in a multi-class problem or, as in the Overruling task, across the folds of a cross-validation. It provides a single number that summarizes the overall performance of the model, which is useful for comparing different models or for tuning hyperparameters.
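As a small illustration, all three metrics can be computed with scikit-learn; `y_true`/`y_pred` and their multi-label counterparts are placeholder arrays, not data from the study.

```python
# Sketch of the three evaluation metrics using scikit-learn.
# y_true / y_pred are placeholder binary labels; the *_multilabel arrays are
# placeholder (n_samples, n_labels) 0/1 matrices for the ECHR task.
from sklearn.metrics import f1_score

binary_f1 = f1_score(y_true, y_pred)                                         # 1. plain F1
micro_f1 = f1_score(y_true_multilabel, y_pred_multilabel, average="micro")   # 2. micro F1
per_class_mean_f1 = f1_score(y_true_multilabel, y_pred_multilabel, average="macro")  # 3. mean F1 over classes
# On the Overruling task, the reported mean F1 is instead the average of the
# per-fold binary F1 scores from the 10-fold cross-validation.
```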

5. Experimental Results

The following pre-trained BERT-based models were used in the experiments (a loading sketch follows the list):

BERT: The BERT model (bert-base-uncased) from Devlin et al. (2019), which was pre-trained on BookCorpus and English Wikipedia.
ECHR-Legal-BERT: The BERT model (bert-base-uncased architecture) from Chalkidis et al. (2020), which was pre-trained on legal documents, including the ECHR dataset.
Harvard-Law-BERT: The BERT model (bert-base-uncased architecture) from Zheng et al. (2021), which was pre-trained on the entire Harvard Law case corpus.
RoBERTa: The RoBERTa model (roberta-base) from Liu et al. (2019), which was pre-trained on English corpora such as BookCorpus and the English portion of CommonCrawl News. RoBERTa is a variant of BERT trained only with a dynamic masked language modeling objective.
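For reference, loading such checkpoints follows the same pattern as in Section 4.1. The Hugging Face hub identifiers below are our assumptions about publicly released versions of these models, not necessarily the exact checkpoints used here.

```python
# Sketch of loading the pre-trained encoders; hub ids for the legal models
# are assumptions (publicly released variants), not necessarily the study's checkpoints.
from transformers import AutoModel

bert = AutoModel.from_pretrained("bert-base-uncased")        # Devlin et al. (2019)
roberta = AutoModel.from_pretrained("roberta-base")          # Liu et al. (2019)
legal_bert = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")  # Chalkidis et al. (2020), assumed id
# Harvard-Law-BERT (Zheng et al., 2021): a public checkpoint exists on the hub,
# but its exact identifier is omitted here rather than guessed.
```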

5.1 ECHR Violation Dataset

Approach          Micro F1
RR-BERT           0.6466
RF-BERT           0.6803
MeanPool-BERT     0.7075
MaxPool-BERT      0.7110

The table above shows the results of each model variant on the ECHR Violation Dataset in terms of micro-F1 score. It is observed that all BERT-based model variants outperform the baseline models. Moreover, the MeanPool-BERT and MaxPool-BERT models outperform the RR-BERT and RF-BERT models. This indicates that splitting a document into smaller chunks and aggregating the chunk representations is an effective way to improve BERT's performance on long documents.

5.2 Overruling Task Dataset

Approach      Mean F1 ± STD
BERT          0.9656 ± 0.010
BigBird       0.9570 ± 0.010
LongFormer    0.9569 ± 0.009

The table above shows the results on the Overruling Task Dataset in terms of mean F1-score. It is observed that the models perform similarly to each other, with plain BERT achieving the highest score and BigBird and LongFormer performing almost identically. This indicates that, because the overruling statements fit within BERT's 512-token limit, the BERT-based models can handle this dataset without additional attention techniques, while BigBird and LongFormer are not necessarily better on this task.

6. Discussion

We provide further discussion of the experimental results from Section 5 in order to answer the research questions posed in Section 3.

RQ1 For legal text classification, does pre-training on in-domain documents lead to more effective performance than pre-training on general documents?
The results of the experiments suggest that pre-training on in-domain documents can lead to better performance than pre-training on general documents. The models pre-trained on in-domain legal text, such as ECHR-Legal-BERT and Harvard-Law-BERT, performed better than models pre-trained on general text, such as the general BERT model. This indicates that pre-training on in-domain documents helps the model capture the specific language and concepts used in legal texts.

RQ2 How can BERT-based models be adapted to effectively deal with long documents in legal text classification?

The results of the experiments suggest that, among the methods tested for handling long legal documents, the MaxPool approach provides a good trade-off between computational efficiency and performance. MaxPool-BERT outperformed the other methods, including the BigBird and LongFormer models, on both the ECHR Violation and Overruling Task datasets. This indicates that the MaxPool method, which splits the document into chunks and uses a max-pooling operation to aggregate their features, is effective for dealing with long legal documents.

7. Conclusion

In this study, we investigated the performance of pre-trained BERT-based models on two legal text classification tasks. The model variants, including RR-BERT, RF-BERT, MeanPool-BERT, MaxPool-BERT, BigBird, and LongFormer, were compared on the ECHR Violation and Overruling Task datasets. The results showed that pre-training on in-domain legal documents led to more effective performance than pre-training on general documents. Furthermore, MaxPool-BERT was found to be the most effective approach for dealing with long documents in legal text classification tasks.

In conclusion, the results of this study demonstrate the effectiveness of BERT-based models for legal text classification and highlight the importance of pre-training on in-domain documents for improved performance. Future research may investigate other techniques for dealing with long documents in legal text classification and evaluate their performance on larger datasets.
