
How to Use the Hugging Face Tokenizers Library to Preprocess Text Data

 
If you have studied NLP, you might have heard about the term “tokenization.” It is an important step in text preprocessing, where we transform our textual data into something that machines can understand. It does so by breaking down the sentence into smaller chunks, known as tokens. These tokens can be words, subwords, or even characters, depending on the tokenization algorithm being used. In this article, we will see how to use the Hugging Face Tokenizers Library to preprocess our textual data.
 
Setting Up Hugging Face Tokenizers Library
 
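To start using the Hugging Face Tokenizers library, you’ll need to install it first. The examples in this article also import from the transformers package, so install both using pip:

pip install tokenizers transformers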
 
The Hugging Face library supports various tokenization algorithms, but the three main types are:

Byte-Pair Encoding (BPE): Merges the most frequent pairs of characters or subwords iteratively, creating a compact vocabulary. It is used by models like GPT-2.
WordPiece: Similar to BPE but focuses on probabilistic merges (doesn’t choose the pair that is the most frequent but the one that will maximize the likelihood of the corpus once merged), commonly used by models like BERT.
SentencePiece: A more flexible tokenizer that can handle different languages and scripts, often used with models like ALBERT, XLNet, or the Marian framework. It treats spaces as characters rather than word separators.
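
To make the BPE description above more concrete, here is a minimal pure-Python sketch of a single merge step. It is not how the Tokenizers library implements BPE internally; the toy word counts and the helper names most_frequent_pair and merge_pair are made up for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical toy corpus: each word is split into characters and mapped to its count.
toy_words = {
    ("h", "u", "g"): 10,
    ("p", "u", "g"): 5,
    ("p", "u", "n"): 12,
    ("b", "u", "n"): 4,
    ("h", "u", "g", "s"): 5,
}

best = most_frequent_pair(toy_words)
toy_words_after = merge_pair(toy_words, best)
print(best)
print(toy_words_after)
```

A full BPE trainer would repeat these two steps until the vocabulary reaches a target size, recording each merge in order.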

The Hugging Face Transformers library provides an AutoTokenizer class that automatically loads the tokenizer matching a given pre-trained model checkpoint. This is a convenient way to get the correct tokenizer for a specific model, and it can be imported from the transformers library. However, for the sake of our discussion regarding the Tokenizers library, we will not follow this approach.
We will use the pre-trained BERT-base-uncased tokenizer. This tokenizer was trained on the same data and using the same techniques as the BERT-base-uncased model, which means it can be used to preprocess text data compatible with BERT models:

# Import the necessary components
from tokenizers import Tokenizer  # not used directly below; BertTokenizer is loaded via transformers
from transformers import BertTokenizer

# Load the pre-trained BERT-base-uncased tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

 
Single Sentence Tokenization
 
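Now, let’s encode a simple sentence using this tokenizer. The encode_plus method returns the token IDs along with the token type IDs and the attention mask (in recent transformers releases, calling the tokenizer directly as tokenizer(text) is the recommended equivalent):

# Tokenize a single sentence
encoded_input = tokenizer.encode_plus("This is sample text to test tokenization.")
print(encoded_input)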

 
Output:

{'input_ids': [101, 2023, 2003, 7099, 3793, 2000, 3231, 19204, 3989, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

 
To ensure correctness, let’s decode the tokenized input:

print(tokenizer.decode(encoded_input["input_ids"]))

 
Output:

[CLS] this is sample text to test tokenization. [SEP]

 
In this output, you can see two special tokens. [CLS] marks the start of the input sequence, and [SEP] marks the end, indicating a single sequence of text.
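
The way a sentence becomes subword tokens wrapped in [CLS] and [SEP] can be sketched as follows. This is a toy illustration, assuming the greedy longest-match-first lookup that WordPiece tokenizers apply at encoding time; the tiny vocabulary and the helper names wordpiece_split and encode_single are hypothetical, and unlike the real BERT tokenizer this version only lowercases and splits on whitespace.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


def encode_single(text, vocab):
    """Lowercase, split on whitespace, split each word, and add the special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece_split(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical miniature vocabulary, far smaller than the real BERT vocabulary.
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "test", "token", "##ization", "##s"}

print(encode_single("Test tokenization", toy_vocab))
```

With this toy vocabulary, "tokenization" has no entry of its own, so it is split into "token" and "##ization".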
 
Batch Tokenization
 
Now, let’s tokenize a corpus of text instead of a single sentence using batch_encode_plus:

corpus = [
    "Hello, how are you?",
    "I am learning how to use the Hugging Face Tokenizers library.",
    "Tokenization is a crucial step in NLP."
]
encoded_corpus = tokenizer.batch_encode_plus(corpus)
print(encoded_corpus)

 
Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

 
For better understanding, let’s decode the batch-encoded corpus as we did in the case of the single sentence. This recovers the original sentences, lowercased and wrapped in the special tokens.

tokenizer.batch_decode(encoded_corpus["input_ids"])

 
Output:

['[CLS] hello, how are you? [SEP]',
'[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
'[CLS] tokenization is a crucial step in nlp. [SEP]']

 
Padding and Truncation
 
When preparing data for machine learning models, ensuring all input sequences have the same length is often necessary. Two methods to accomplish this are:
 
1. Padding
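Padding works by adding the special token [PAD] at the end of the shorter sequences. With padding=True, every sequence is padded to the length of the longest sequence in the batch; with padding='max_length', sequences are padded to max_length if it is given, or otherwise to the maximum length supported by the model. You can pad to the longest sequence like this: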

encoded_corpus_padded = tokenizer.batch_encode_plus(corpus, padding=True)
print(encoded_corpus_padded)

 
Output:

{'input_ids': [[101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102], [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102, 0, 0, 0, 0]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]]}

 
Now, you can see that extra 0s are appended to the shorter sequences, and that the attention mask is 0 at those same positions so the model can ignore them. For better understanding, let’s decode to see where the tokenizer has placed the [PAD] tokens:

tokenizer.batch_decode(encoded_corpus_padded["input_ids"], skip_special_tokens=False)

 
Output:

['[CLS] hello, how are you? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]',
'[CLS] i am learning how to use the hugging face tokenizers library. [SEP]',
'[CLS] tokenization is a crucial step in nlp. [SEP] [PAD] [PAD] [PAD] [PAD]']
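
The padding logic itself is simple enough to sketch in plain Python. The function below is an illustrative stand-in, not the tokenizer's own code; the name pad_batch is hypothetical, and pad_id=0 is taken from the output above, where the appended 0s decode to [PAD].

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one and build matching attention masks."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the unpadded batch output shown earlier in the article.
batch = [
    [101, 7592, 1010, 2129, 2024, 2017, 1029, 102],
    [101, 1045, 2572, 4083, 2129, 2000, 2224, 1996, 17662, 2227, 19204, 17629, 2015, 3075, 1012, 102],
    [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102],
]

padded = pad_batch(batch)
print(padded["input_ids"][0])
print(padded["attention_mask"][0])
```

The first sequence has 8 tokens and the longest has 16, so it receives 8 padding IDs and 8 zeros in its attention mask, matching the tokenizer output above.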

 
2. Truncation
Many NLP models have a maximum input sequence length, and truncation works by chopping off the end of longer sequences to meet this maximum. It reduces memory usage and keeps inputs within the length the model can process. In the example below, max_length=5 counts the special tokens too, so only three content tokens survive per sentence.

encoded_corpus_truncated = tokenizer.batch_encode_plus(corpus, truncation=True, max_length=5)
print(encoded_corpus_truncated)

 
Output:

{'input_ids': [[101, 7592, 1010, 2129, 102], [101, 1045, 2572, 4083, 102], [101, 19204, 3989, 2003, 102]], 'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 1], [1, 1, 1, 1, 1]]}

 
Now, you can also use the batch_decode method, but for better understanding, let’s print this information in a different way:

for i, sentence in enumerate(corpus):
    print(f"Original sentence: {sentence}")
    print(f"Token IDs: {encoded_corpus_truncated['input_ids'][i]}")
    print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded_corpus_truncated['input_ids'][i])}")
    print()

 
Output:

Original sentence: Hello, how are you?
Token IDs: [101, 7592, 1010, 2129, 102]
Tokens: ['[CLS]', 'hello', ',', 'how', '[SEP]']

Original sentence: I am learning how to use the Hugging Face Tokenizers library.
Token IDs: [101, 1045, 2572, 4083, 102]
Tokens: ['[CLS]', 'i', 'am', 'learning', '[SEP]']

Original sentence: Tokenization is a crucial step in NLP.
Token IDs: [101, 19204, 3989, 2003, 102]
Tokens: ['[CLS]', 'token', '##ization', 'is', '[SEP]']
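
As the output shows, truncation drops tokens from the end of the content while keeping [CLS] and [SEP] in place. A minimal sketch of that behavior follows; the function name truncate_with_specials is hypothetical, the IDs 101 and 102 for [CLS] and [SEP] are read off the outputs above, and the sketch assumes max_length is at least 2.

```python
def truncate_with_specials(ids, max_length, cls_id=101, sep_id=102):
    """Keep the leading content tokens so that [CLS] + content + [SEP] fits max_length."""
    content = ids[1:-1]              # strip the existing [CLS] and [SEP]
    budget = max(max_length - 2, 0)  # two slots are reserved for the special tokens
    return [cls_id] + content[:budget] + [sep_id]


# Full token IDs for two of the sentences, copied from the batch output earlier in the article.
hello_ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102]
nlp_ids = [101, 19204, 3989, 2003, 1037, 10232, 3357, 1999, 17953, 2361, 1012, 102]

print(truncate_with_specials(hello_ids, max_length=5))
print(truncate_with_specials(nlp_ids, max_length=5))
```

Both results reproduce the truncated IDs returned by the tokenizer: three content tokens framed by 101 and 102.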

 
This article is part of our amazing series on Hugging Face. If you want to explore more about this topic, here are some references to help you out:

  
Kanwal Mehreen
Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook “Maximizing Productivity with ChatGPT”. As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She’s also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.
