Creating an Emotion Dataset Class
To use BERT effectively for emotion classification, we need to prepare our data in a suitable format. This involves creating a custom dataset class that handles loading, tokenizing, and encoding our text data and labels, ensuring everything is in the right shape for training our BERT model.
Here’s how you can define an EmotionDataset
class using Python and PyTorch:
import torch
from torch.utils.data import Dataset


class EmotionDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        # Tokenize the text, add [CLS]/[SEP], and pad or truncate to max_len.
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
In this class, __init__
initializes the dataset with texts, labels, a tokenizer (from Hugging Face Transformers), and a maximum sequence length. The __len__
method returns the number of examples in the dataset, and __getitem__
retrieves a single item by its index: it tokenizes the text, adds special tokens, pads or truncates the sequence to the maximum length, and returns a dictionary containing the input IDs, attention mask, and label. Tokenization converts the words in a text into a sequence of integers, each corresponding to an index in the tokenizer's vocabulary. Special tokens, such as BERT's [CLS] and [SEP], mark the beginning and end of the sequence.
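To make the tokenization step concrete, here is a small sketch; the 'bert-base-uncased' checkpoint, the example sentence, and the max_length of 12 are illustrative assumptions rather than part of our pipeline. It calls encode_plus the same way the dataset class does and decodes the result so the special and padding tokens are visible:
from transformers import BertTokenizer

# Assumes the 'bert-base-uncased' checkpoint; any BERT variant behaves the same way.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

encoding = tokenizer.encode_plus(
    'I am so happy today!',   # made-up example sentence
    add_special_tokens=True,
    max_length=12,
    padding='max_length',
    truncation=True,
    return_attention_mask=True,
    return_tensors='pt',
)

# The decoded tokens start with [CLS], end the sentence with [SEP],
# and fill the rest of the sequence with [PAD] up to max_length.
print(tokenizer.convert_ids_to_tokens(encoding['input_ids'][0].tolist()))
print(encoding['attention_mask'])  # 1 for real tokens, 0 for padding positions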
Loading Your Emotion Dataset
With the EmotionDataset
class defined, we can now load our emotion data from a CSV file. Using pandas, we can easily read the CSV file into a DataFrame and extract the text and labels.
Here's an example of how to load data:
import pandas as pd


def load_data(csv_file):
    df = pd.read_csv(csv_file)
    return df['text'].tolist(), df['label'].tolist()


train_texts, train_labels = load_data('emotion_dataset.csv')
In this example, load_data
reads the CSV file and extracts the 'text' and 'label' columns into lists. This function is then called to load the data into train_texts
and train_labels
. Ensure your CSV file is properly formatted with text and corresponding emotion labels. You may need to add file path info to ensure that pandas can correctly call the csv.
Evaluating Model Performance
To gauge the effectiveness of our trained model, we need to define evaluation metrics such as accuracy, precision, recall, and F1 score. These metrics provide insights into different aspects of model performance, helping us identify areas for improvement.
Here’s how you can define a compute_metrics
function to calculate these metrics:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1,
    }
In this function, we take the true labels from the prediction object and obtain the predicted labels by applying argmax to the model's output logits. Then, we calculate accuracy using accuracy_score
and precision, recall, and F1 score using precision_recall_fscore_support
. The 'weighted' average weights each class's score by its number of examples, so the metrics reflect the class distribution in the dataset. These metrics will guide us in refining our model and improving its performance.
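This compute_metrics function follows the signature that the Hugging Face Trainer expects (an evaluation prediction object with predictions and label_ids), so it can be plugged in directly. The sketch below shows the wiring; the checkpoint, num_labels, training arguments, and val_dataset are assumptions you would replace with your own values:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Assumed checkpoint and label count; set num_labels to the number of emotion classes in your data.
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=6)

training_args = TrainingArguments(
    output_dir='./results',          # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy='epoch',     # run compute_metrics after every epoch
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,     # the EmotionDataset built earlier
    eval_dataset=val_dataset,        # assumed validation split, built the same way
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())            # reports accuracy, precision, recall, and f1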