Introduction to Embeddings and Tokenization

Joel Kowalewski, PhD

Finetuning LLMs for Specific Tasks


At the core of modern large language models (LLMs) like GPT-3/4 and BERT are the concepts of embeddings and tokenization. Tokenization began as a simple method for splitting text into smaller units that are easier to process, yielding better predictive models. As LLMs developed alongside a specific artificial neural network architecture called the transformer, analysts proposed various annotation methods to mark syntactic and semantic features in text. This is effectively like saying, "Look here!" The optimization task (the process by which the AI algorithm assigns values to its numerous parameters by iteratively parsing training data) could then become more efficient and focused. Analysts could also use these annotations to build more versatile models. In the past, an analyst needed to collect a separate training set for each prediction task. By annotating text data with labels that effectively instruct the algorithm to "Look here!" or "Keep track of this!", they instruct it to retain information that can be used to predict the next word or sentence, or the sentiment or topic of a sentence. So "Look here!" becomes "Look here! This is the end of the sentence." Of course, the literal phrase "Look here!" is not used. Instead, special tokens are used; two examples are [SEP] and [CLS]. A [SEP] token marks the "separation" between text strings, which we might call Sentence #1 and Sentence #2, whereas a [CLS] token marks the start of the input and helps the algorithm track information about the subsequent text that is useful in classification tasks (e.g., what is it about? Dogs, cats, or some other animal?). Before diving into special tokens like [SEP] and [CLS], it's crucial to understand some foundational concepts. In this article, we will cover the basics first and build our knowledge base from the ground up.
By the end, we will have worked through a toy case study using Python code to demonstrate how you can fine-tune an LLM for new classification tasks. This is powerful: an LLM trained with these annotations in place (like [SEP] and [CLS]) can handle virtually any classification task, assuming quality data exists and the task can be expressed as text. Let's get into the details.

Embeddings: The first thing we need to do is develop an understanding of how language is represented in an LLM. Our experience with language is unusual in that we rarely think about what is said or written; it feels implicit and automatic. Even from our own experience, there is something mysterious about language, as if it were a world, an entire cosmos, unto itself that we see only dimly and whose inner workings we must approximate or infer. Interestingly, that is the same situation faced by an AI algorithm, initially naive, that is trained on text data. The question, then, is how the algorithm reconstructs (infers or approximates) the mysterious language engine hidden in the human brain. We are not in a much more favorable position ourselves: we cannot definitively state how language works, or whether the AI algorithm approximates our brains correctly, but we can compare the logic (the mathematics, that is) underlying AI algorithms with real data from humans and other animals. Through such data, we have begun to find similarities between how AI algorithms process and represent information and how biological systems like us do. When learning about how AI handles language, we are perhaps lifting the veil, chasing mysteries that are far more intimate and natural than artificial.

Where does that leave us? Imagine a vast, multi-dimensional space where each word or phrase in a human language is a point. This space is designed so that words with similar meanings are closer together, while words with different meanings are further apart. This is what embeddings do: they map words to vectors such that the geometric relationships between the vectors capture the semantic relationships between the words. It's like assigning each word a unique, multi-dimensional address based on its meaning.
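To make the geometric picture concrete, here is a minimal sketch using made-up three-dimensional vectors. Real embeddings have hundreds of dimensions, and these numbers are purely illustrative:

```python
import numpy as np

# Toy 3-dimensional "embeddings" (made-up numbers for illustration only)
embeddings = {
    "dog": np.array([0.9, 0.8, 0.1]),
    "cat": np.array([0.85, 0.75, 0.2]),
    "car": np.array([0.1, 0.2, 0.95]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means same direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically related words sit closer together in the space
print(cosine_similarity(embeddings["dog"], embeddings["cat"]))  # high
print(cosine_similarity(embeddings["dog"], embeddings["car"]))  # lower
```

Cosine similarity is one common way to measure "closeness" in embedding spaces; the key point is that distance in the space tracks similarity in meaning.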

Tokenization: This is the process of converting text into tokens (words, characters, or subwords) that can be fed into a language model. Consider tokenization as breaking down a sentence into individual pieces, like dissecting a train into its separate cars, so each piece can be analyzed individually.
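Here is a crude sketch of the idea in Python, splitting text into word and punctuation pieces. Real models use learned subword tokenizers such as WordPiece, so this is only an analogy:

```python
import re

def toy_tokenize(text):
    # Lowercase, then split into word and punctuation pieces -- a crude
    # stand-in for the learned subword tokenizers real LLMs use
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(toy_tokenize("The weather is sunny."))
# → ['the', 'weather', 'is', 'sunny', '.']
```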

Special Tokens: [SEP] and [CLS]

In models like BERT, special tokens such as [SEP] and [CLS] are used to provide the model with additional instruction or structure.

[SEP]: This token, short for “separator,” is used to mark boundaries between sentences or segments in a text. For example, in a two-sentence input, [SEP] would be placed between the two sentences to tell the model, “Hey, these are two separate pieces of text.”

[CLS]: Short for “classification,” this token is added to the start of the input and its corresponding embedding is used as the aggregate sequence representation for classification tasks. It’s like placing a label on top of a file that summarizes the entire content inside the file.
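In token form, a two-sentence BERT-style input is framed like this (a pure-Python sketch with hypothetical word-level tokens; a real tokenizer would produce subwords):

```python
# Hypothetical word-level tokens for two sentences
sentence_1 = ["the", "dog", "barked"]
sentence_2 = ["the", "cat", "ran"]

# [CLS] opens the sequence; [SEP] closes each segment
model_input = ["[CLS]"] + sentence_1 + ["[SEP]"] + sentence_2 + ["[SEP]"]
print(model_input)
# → ['[CLS]', 'the', 'dog', 'barked', '[SEP]', 'the', 'cat', 'ran', '[SEP]']
```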

How [SEP] and [CLS] Facilitate Multi-Task Learning

Language models like BERT are trained on a variety of tasks simultaneously, a process known as multi-task learning. The [SEP] and [CLS] tokens are crucial in this process as they provide the model with clear, structured information about the input.

  • [SEP]: By clearly demarcating segments within the input data, [SEP] helps the model understand and maintain the context of different text segments, which is especially important in tasks involving multiple sentences, like question answering or natural language inference.

  • [CLS]: The embedding corresponding to this token is used for classification tasks. During training, the model learns to encode information relevant to the classification task into this embedding. It’s like teaching the model to compress the essence of the text into this special [CLS] token.

Matrix Algebra of Special Tokens

When the model encounters a [CLS] or [SEP] token, it processes them through its layers of neural networks, transforming their embeddings at each layer. The transformation can be conceptualized as a series of matrix multiplications and non-linear activations, where the embedding of the token is multiplied by weight matrices, with bias terms added, and then passed through an activation function.

For a [CLS] token, if we denote its embedding at layer 0 (input layer) as \(CLS_0\), the transformation at the first layer can be mathematically represented as:

\[CLS_1 = Activation(W_1 * CLS_0 + b_1)\]

Here, \(W_1\) represents the weight matrix, \(b_1\) represents the bias term, and \(Activation\) represents the activation function used in the neural network (like ReLU or sigmoid).
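The equation above can be sketched in NumPy with hypothetical dimensions (a 4-dimensional embedding and randomly initialized weights, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

H = 4                          # hypothetical embedding size
CLS_0 = rng.normal(size=H)     # [CLS] embedding at the input layer
W_1 = rng.normal(size=(H, H))  # first-layer weight matrix
b_1 = rng.normal(size=H)       # first-layer bias term

def relu(x):
    # ReLU activation: zero out negative components
    return np.maximum(0.0, x)

# CLS_1 = Activation(W_1 * CLS_0 + b_1)
CLS_1 = relu(W_1 @ CLS_0 + b_1)
print(CLS_1.shape)  # → (4,)
```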

Special Tokens and Embeddings

In LLMs like BERT, each token, including special tokens such as [CLS] and [SEP], is assigned an initial embedding before being processed by the model’s layers. These initial embeddings are learned during the pre-training phase and are part of what makes the model powerful. The interpretation of these embeddings, particularly for special tokens, is an interesting and somewhat abstract concept.

Embedding of [CLS] Token

The [CLS] token is special in that it doesn’t represent a word from the input text but serves as an aggregate representation of the entire sequence for classification tasks. Here’s how it’s represented and understood:

Initial Embedding

  1. Initial Embedding Matrix: In a pre-trained model, there is an embedding matrix, \(E\), where each row corresponds to a token’s embedding. The size of this matrix is \(V \times H\), where \(V\) is the vocabulary size and \(H\) is the hidden size (the size of the embeddings).

  2. Lookup for [CLS]: The embedding for [CLS] is one row in this matrix, say \(E_{CLS}\). Initially, this embedding is just a vector of numbers, learned during pre-training.
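As a sketch, the lookup is just row indexing into the matrix. The vocabulary below is hypothetical and \(E\) is randomly initialized; in a real model, it would be learned during pre-training:

```python
import numpy as np

# Hypothetical vocabulary of V = 4 tokens with H = 3 dimensional embeddings
vocab = {"[CLS]": 0, "[SEP]": 1, "dog": 2, "cat": 3}
V, H = len(vocab), 3

rng = np.random.default_rng(42)
E = rng.normal(size=(V, H))  # stand-in for a learned embedding matrix

# The embedding of [CLS] is simply the row of E at its vocabulary index
E_cls = E[vocab["[CLS]"]]
print(E_cls.shape)  # → (3,)
```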

Processing through Layers

  1. Input to the Model: When an input sequence is passed through the model, [CLS] is the first token. Its embedding, \(E_{CLS}\), along with embeddings of other tokens, is processed through multiple layers of the model.

  2. Transformation: At each layer, the embedding is transformed. The transformation can be viewed as a function of the embedding itself and the context (embeddings of other tokens in the sequence). If we consider a simple model with one layer and no attention mechanism, the new embedding at layer \(l\) could be represented as \(E_{CLS}^{(l)} = f(E_{CLS}^{(l-1)}, W, b)\), where \(W\) is the weight matrix, \(b\) is a bias term, and \(f\) is a non-linear activation function. In actual models like BERT, the transformation is much more complex and involves attention mechanisms that allow [CLS] to aggregate information from the entire sequence.

Interpretation

  1. Contextual Representation: After processing through the model’s layers, the final embedding of [CLS] (say \(E_{CLS}^{(final)}\)) contains information aggregated from the entire sequence. It’s no longer just a static embedding but a contextual representation of the sequence concerning the task the model is trained on (e.g., classification).

  2. Usage in Tasks: In classification tasks, \(E_{CLS}^{(final)}\) is used as the input to a classifier (usually a simple neural network layer) that outputs probabilities for each class. The model learns to adjust the weights such that \(E_{CLS}^{(final)}\) contains the necessary information for accurate classification.

Example

Consider a toy example with a tiny vocabulary containing only three words and the [CLS] token. Our vocabulary is {“[CLS]”, “dog”, “cat”}. The embedding matrix \(E\) might initially look like this (randomly initialized for illustration):

\[ E = \begin{pmatrix} 1 & 0.5 \\ 0.8 & 0.2 \\ 0.3 & 0.9 \\ \end{pmatrix} \]

Rows correspond to [“[CLS]”, “dog”, “cat”], and we have embeddings of size 2 for simplicity.

When you input a sequence like “[CLS] dog cat” to the model:

  1. Initial Embeddings: You get initial embeddings from \(E\). \(E_{CLS} = [1, 0.5]\), \(E_{dog} = [0.8, 0.2]\), and \(E_{cat} = [0.3, 0.9]\).

  2. Processing Through the Model: These embeddings are then processed through the model’s layers, where they are transformed and interact with each other (especially in attention layers). The [CLS] embedding aggregates information from the entire sequence.

  3. Final [CLS] Embedding: The final embedding of [CLS] after processing (say \(E_{CLS}^{(final)} = [0.7, 0.6]\)) is then used for classification.

In this toy example, the numbers are arbitrary, but they illustrate how the [CLS] token starts with a learned embedding and ends as a transformed, contextually rich representation of the entire sequence.
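The toy matrix can be reproduced in code. Since the real aggregation uses attention, which is beyond this sketch, a simple average over the sequence stands in for it here (so the exact output differs from the hypothetical \([0.7, 0.6]\) above):

```python
import numpy as np

# The toy embedding matrix from the text: rows are [CLS], dog, cat
E = np.array([[1.0, 0.5],
              [0.8, 0.2],
              [0.3, 0.9]])

E_cls, E_dog, E_cat = E[0], E[1], E[2]

# Stand-in for attention: average all sequence embeddings so that the
# final [CLS] vector mixes in information from the whole sequence
E_cls_final = E.mean(axis=0)
print(E_cls_final)  # roughly [0.7, 0.53]
```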

Big Picture: From Input to Output for a Classification Task

Consider a hypothetical sentence: “The weather is sunny.” For a sentiment classification task, this sentence is prepended with a [CLS] token.

  1. Input: The input sequence becomes “[CLS] The weather is sunny.”

  2. Tokenization and Embedding: Each token, including [CLS], is converted into an embedding (a vector of numbers).

  3. Processing through the Model: The embeddings pass through multiple layers of the model. At each layer, the [CLS] embedding is transformed, capturing more abstract and high-level features of the input sequence.

  4. Output of [CLS] Token: At the final layer, the embedding of the [CLS] token contains the aggregated information of the entire sentence, tailored for the classification task.

  5. Classification Layer: This final [CLS] embedding is then passed to a classification layer, which typically is a simple neural network layer that outputs probabilities for each class (e.g., positive or negative sentiment).

  6. Decision: The class with the highest probability is chosen as the model’s prediction for the sentiment of the sentence.
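Steps 5 and 6 can be sketched in NumPy: a single linear layer maps the final [CLS] embedding to class logits, and a softmax turns the logits into probabilities. All numbers here are hypothetical:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical final [CLS] embedding and classifier parameters (2 classes)
E_cls_final = np.array([0.7, 0.6])
W = np.array([[ 1.2, -0.4],   # shape (num_classes, H)
              [-0.8,  1.1]])
b = np.array([0.1, -0.1])

logits = W @ E_cls_final + b      # raw class scores
probs = softmax(logits)           # probabilities summing to 1
prediction = int(probs.argmax())  # e.g. 0 = negative, 1 = positive
print(probs, prediction)
```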

Walkthrough: Fine-Tuning a BERT Model on a Classification Task Using Python

Implementing a Large Language Model (LLM) like BERT from scratch can be quite complex and resource-intensive. However, using open-source libraries like Hugging Face’s transformers, it’s possible to leverage pre-trained models and fine-tune them for specific tasks like text classification. Below is a Python example illustrating this process, including text input, tokenization, adding special tokens, and making predictions using a pre-trained BERT model.

Setup

First, ensure that you have the necessary libraries installed. You can install them using pip:

pip install transformers torch

Python Code

from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch

# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

Code Explanation

  1. Setup & Imports: We import the required classes and functions from the transformers and torch libraries.

  2. Loading Model and Tokenizer: We load a pre-trained BERT model (BertForSequenceClassification) and its corresponding tokenizer (BertTokenizer).

Example Classification with Toy Data

Outlining the Task

The BERT model for sequence classification, as loaded from transformers using BertForSequenceClassification, is by default configured for binary classification, meaning it predicts between two classes. These classes are often represented as 0 and 1, where the specific meaning of these labels depends on the dataset and task on which the model was fine-tuned.

For instance: - In a sentiment analysis task, class 0 might represent “negative” sentiment, and class 1 might represent “positive” sentiment. - In a spam detection task, class 0 might represent “not spam”, and class 1 might represent “spam”.

However, the pre-trained bert-base-uncased model loaded directly from the Hugging Face model repository without any fine-tuning doesn’t have a specific task it’s trained for. It’s just the base BERT model, pre-trained on a large corpus of text for masked language modeling and next sentence prediction. To use it for a specific classification task, like sentiment analysis or spam detection, you would need to fine-tune it on a labeled dataset for that task.

During fine-tuning, the model learns to associate certain patterns in the input text with the specific class labels from the training dataset. After fine-tuning, the output logits from the model can be interpreted as raw scores indicating how strongly the model believes the input text belongs to each class. These logits are usually passed through a softmax function to convert them to probabilities, which sum to 1 and are easier to interpret.

If you have a specific classification task with more than two classes (multi-class classification), you would adjust the num_labels parameter of BertForSequenceClassification when loading the model. You would also ensure that your training data includes examples of each class and fine-tune the model on that data. After fine-tuning, the model would then predict probabilities across all the classes you defined.

Here’s a modified version of the code snippet to handle a multi-class classification scenario with a hypothetical number of classes:

from transformers import BertTokenizer, BertForSequenceClassification
from torch.nn.functional import softmax
import torch

# Define the number of classes (change this to the number of classes in your specific task)
num_classes = 4  # For example, in a task with four classes

# Load pre-trained model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=num_classes)

# Function to classify text
def classify_text(text):
    # Encode text
    inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors='pt')

    # Get predictions
    with torch.no_grad():
        logits = model(**inputs).logits

    # Softmax to get probabilities
    probabilities = softmax(logits, dim=1)

    # Get the predicted class and its probability
    predicted_class = probabilities.argmax().item()
    predicted_probability = probabilities[0, predicted_class].item()
    
    return predicted_class, predicted_probability

In this code, num_classes should be set to match the number of classes in your specific classification task, and the model should be fine-tuned accordingly. The output will then provide the predicted class and the model’s confidence in this prediction, across the number of classes you’ve defined.

Putting the Pieces Together with Data

Fine-tuning BERT for a multi-class classification task involves using a dataset that consists of text inputs and their corresponding class labels. Each class label corresponds to a category you want the model to learn. The dataset is usually split into training and validation sets, with each entry in the dataset containing a piece of text and a label.

Here’s a more detailed walkthrough, starting from the raw representation of the training data, processing it through the tokenizer, and preparing it for model training.

1. Raw Representation of Training Data

Let’s say we’re dealing with a simple news categorization task, with four categories: “World”, “Sports”, “Business”, and “Science/Technology”. A raw representation of the training data might look like this:

# Example raw training data
raw_training_data = [
    {"text": "The international community is addressing climate change", "label": "World"},
    {"text": "The stock market had a significant rally today", "label": "Business"},
    {"text": "Advancements in AI are transforming the tech industry", "label": "Science/Technology"},
    {"text": "The sports team secured a victory in the final game", "label": "Sports"},
    # ... more data ...
]

# Mapping of labels to integers
label_to_int = {
    "World": 0,
    "Business": 1,
    "Science/Technology": 2,
    "Sports": 3
}

# Convert labels to integers
for data in raw_training_data:
    data["label"] = label_to_int[data["label"]]

2. Tokenizing the Training Data

Before the data can be used for training, it needs to be tokenized. The tokenizer converts the raw text into a format that the model can understand (input IDs, attention masks, and token type IDs):

from transformers import BertTokenizer
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the training data
tokenized_inputs = tokenizer(
    [data["text"] for data in raw_training_data],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors='pt'
)

# Add labels to the tokenized inputs
tokenized_inputs["labels"] = torch.tensor([data["label"] for data in raw_training_data])

After this step, tokenized_inputs will be a dictionary containing the following keys: - input_ids: Indices of input sequence tokens in the vocabulary. - attention_mask: Mask to avoid performing attention on padding token indices. - token_type_ids: Segment token indices to indicate first and second portions of the inputs. - labels: The labels for each input.
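A pure-Python sketch of what padding and the attention mask accomplish. The special-token IDs follow bert-base-uncased's conventions ([CLS] = 101, [SEP] = 102, padding = 0), but the other IDs are arbitrary and the exact values don't matter for the sketch:

```python
# Two tokenized sequences of different lengths
# (101 = [CLS], 102 = [SEP], 0 = padding; middle IDs are arbitrary)
seqs = [[101, 7592, 102],
        [101, 7592, 2088, 999, 102]]

max_len = max(len(s) for s in seqs)

# Pad every sequence to the same length; mark real tokens with 1, padding with 0
input_ids = [s + [0] * (max_len - len(s)) for s in seqs]
attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]

print(input_ids)       # → [[101, 7592, 102, 0, 0], [101, 7592, 2088, 999, 102]]
print(attention_mask)  # → [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask tells the model which positions carry real content, so attention never "looks at" the padding.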

3. Training the Model

Now, you can use the tokenized data to train or fine-tune the model. Here’s a simplified example of how to do this using the Hugging Face Trainer API:

from transformers import BertForSequenceClassification, Trainer, TrainingArguments
import torch

# Load a pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

# Wrap the tokenized inputs in a torch Dataset so the Trainer can iterate over it
class NewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["labels"])

    def __getitem__(self, idx):
        return {key: tensor[idx] for key, tensor in self.encodings.items()}

train_dataset = NewsDataset(tokenized_inputs)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # the wrapped tokenized training data
    # You can also add an evaluation dataset and a compute_metrics function
)

# Train the model
trainer.train()

This code fine-tunes the BERT model on your specific classification task. Note that for a real-world application, you would need a much larger dataset, and you might need to adjust various training parameters based on the specifics of your task and the resources (like GPUs) available to you. Also, in practice, you’d want to evaluate your model’s performance on a separate validation dataset that the model has not seen during training.

Conclusion

Special tokens like [SEP] and [CLS] are integral to the structure and functioning of large language models like BERT. They provide the framework the model needs to handle complex multi-task learning and enable it to perform a wide array of NLP tasks effectively. Understanding these concepts is key to leveraging the full power of modern NLP technology.

