19 Model Inference
19.1 Introduction
19.1.1 Background
Model inference is the process of applying a trained model to new data to make predictions on that data. This section of the handbook goes through some of the theory behind what we do, and why, when using a model to make inferences.
19.1.2 Existing Models
At the time of writing, we have two main generalised models:

| Model | Type | Embedding Model | Location | Template |
|---|---|---|---|---|
| Spam Classifier | Transformer model | NA | Hugging Face: `sharecreative/spam_classifier_v1` | |
| Peaks and Pits | Regression | text-embedding-3-large (OpenAI) | `data_science_project_work/internal_projects/generalised_peaks_pits/data/model_training/0309_best_LR_model_openai_large.pkl` | |
19.2 Transformer API
The Transformers API allows us to download and use pretrained models from the Hugging Face Hub.
19.2.1 GPU Acceleration
Running models over large amounts of data can be quite time consuming, so you should try to use a processor that will optimise for speed. There are a few options available (a quick check of what your machine supports is shown after this list):

- Use Google Colab: depending on the specs of your laptop, you may or may not have the discrete GPU necessary to run processes like this quickly. Luckily, if you don't have a discrete GPU with sufficient compute power, we have Google Colab Pro accounts. Make sure you are connected to a GPU runtime in Colab before running any code (Runtime -> Change Runtime Type). You will also need to make sure that your model is using cuda when you import it; we will go through this later.
- Use MPS or MLX accelerators: if you do want to run the model locally, make sure that you are using your machine's computing power efficiently. When importing the model, make sure that it is on the system's MPS or MLX device. (We have not tested MLX here, but in theory it should work.)
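Before loading anything, it is worth confirming which accelerators PyTorch can actually see. A minimal check, using standard PyTorch calls and nothing project-specific:

```{python}
import torch

# Which accelerators can PyTorch see on this machine or runtime?
print("CUDA available:", torch.cuda.is_available())         # True on a Colab GPU runtime
print("MPS available:", torch.backends.mps.is_available())  # True on Apple silicon with MPS support
```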
19.2.2 Method
Load the Model
This might change based on the model you are using and where it is saved. As we are focusing on transformer-based models, we will use the spam classification model as an example.
The model is saved on the company organisation profile on Hugging Face. You can load it using a transformer pipeline. This should abstract away each step of the classification process and allow you to return only the classification label and the model score for that label.
As mentioned in Helpful Tips, you will want to load the model on the appropriate device: `cuda` if working in Colab, or `mps`/`mlx` if working locally on your Apple machine.
```{python}
import torch

# Use CUDA if it is available (e.g. on a Colab GPU runtime), otherwise fall back
# to Apple's MPS, and finally to the CPU
device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_built()
    else "cpu"
)
device
```
```{python}
from transformers import pipeline

pipe = pipeline(model="sharecreative/spam_classifier_v1", device=device)
pipe
```
Prepare the Data
The pipeline will have trouble tokenising our text later in the workflow if we don't first drop any None values from the data. If the text we want to classify is in the "message" column of our df, we would run:
```{python}
df = df.dropna(subset="message")
```
At this point we want to turn the pandas DataFrame into a Dataset. Datasets use Arrow to store data in a columnar format, which allows for more efficient model computation. Hugging Face has thorough documentation on using datasets.
```{python}
from datasets import Dataset

df = df[["message"]]  # we don't need extra columns for this
df_dataset = Dataset.from_pandas(df)
```
Classify
We can now use the pipeline to classify the data. The pipeline returns a list of dicts, each containing the classification label and the model score associated with that label. High scores indicate a high level of label certainty, while low scores indicate a low level of certainty. For example, running `pipe("hello world")` might return `[{'label': 'not_spam', 'score': 0.9992544054985046}]`.
We want to map both the label and its score onto each text in the dataset. To do this, we first define a function that returns those model outputs.
```{python}
def add_predictions(examples):
    preds = pipe(examples["message"], truncation=True, padding="max_length", max_length=512)
    return {
        "label": [pred["label"] for pred in preds],
        "score": [pred["score"] for pred in preds],
    }
```
We can then use `map()` to run this over the entire dataset.
```{python}
df_dataset = df_dataset.map(add_predictions, batched=True)
```
19.2.3 Interrogating Results
You should now critically interrogate the output of the model and decide whether you are happy with how it has labelled the data. Most models use a classification threshold of 0.5, but you can change this if the resulting precision or recall is too low or too high for the task at hand.
Investigating the labels attached to some of the lower-scored posts is particularly important, as these are the posts where the model may be getting confused. Confusion could arise because a post is genuinely difficult to classify or because the model hasn't seen that type of post before. Would changing the score threshold at which a post gets labelled as spam or not spam help your analysis? Which matters more for your analysis, precision or recall? You can shift the balance between them by adjusting the classification threshold.
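For example, a minimal sketch of that kind of review, assuming the `df_dataset` produced above (with the `message` column plus the `label` and `score` columns added by `add_predictions()`) and a purely illustrative threshold of 0.7:

```{python}
# Pull the results back into pandas to inspect the posts the model is least sure about
results = df_dataset.to_pandas()
print(results.sort_values("score").head(10)[["message", "label", "score"]])

# Only keep the model's label when the score clears a custom threshold,
# otherwise flag the post for manual review
threshold = 0.7
results["final_label"] = results["label"].where(results["score"] >= threshold, "needs_review")
```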
19.3 Traditional Machine Learning
This section focuses on how we can use pretrained traditional ML models to classify text.
The basic steps are:
- Embed your text using the same embedding model as the classification model was trained on.
- Use the model to predict labels on the embedded text.
The following code chunks show how to perform inference with different model types (logistic regression, SVM, random forest, naive Bayes, and gradient boosting). Across all of them, you should embed your text using the same embedding model that the classification model was trained with.
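The chunks below assume that an embedding model is already loaded as `model` and that a fitted classifier is loaded as `classifier` (or `svm_classifier`, `rf_classifier`, and so on). As a rough sketch only, assuming the classifier was pickled and the embeddings come from a Sentence Transformers model (swap in the OpenAI embeddings client if, as with the Peaks and Pits model, that is what the classifier was trained on), loading might look like this; the path and model name below are placeholders:

```{python}
import pickle
from sentence_transformers import SentenceTransformer

# Placeholder embedding model: use whichever embedding model the classifier
# was actually trained with (see the models table above)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Placeholder path to a pickled scikit-learn classifier
with open("path/to/classifier.pkl", "rb") as f:
    classifier = pickle.load(f)
```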
```{python}
# Embed the new texts and predict with the fitted classifier (loaded as `classifier`)
new_texts = [
    "I absolutely love the polar express, it's easily Tom Hanks' best movie.",
    "The Marvel film was boring and too long.",
]

# Generate embeddings for new texts
new_embeddings = model.encode(new_texts, convert_to_tensor=True)

# Predict
new_predictions = classifier.predict(new_embeddings.cpu().numpy())

# Map predictions to labels
label_mapping = {0: "Negative", 1: "Positive"}
predicted_labels = [label_mapping[pred] for pred in new_predictions]

for text, label in zip(new_texts, predicted_labels):
    print(f"Text: {text}\nPredicted Sentiment: {label}\n")
```
```{python}
# Same workflow with an SVM classifier (loaded as `svm_classifier`)
new_texts = [
    "I absolutely love the polar express, it's easily Tom Hanks' best movie.",
    "The Marvel film was boring and too long.",
]

# Generate embeddings for new texts
new_embeddings = model.encode(new_texts, convert_to_tensor=True)

# Predict
new_predictions = svm_classifier.predict(new_embeddings.cpu().numpy())

# Map predictions to labels
label_mapping = {0: "Negative", 1: "Positive"}
predicted_labels = [label_mapping[pred] for pred in new_predictions]

for text, label in zip(new_texts, predicted_labels):
    print(f"Text: {text}\nPredicted Sentiment: {label}\n")
```
```{python}
# Same workflow with a random forest classifier (loaded as `rf_classifier`)
new_texts = [
    "I absolutely love the polar express, it's easily Tom Hanks' best movie.",
    "The Marvel film was boring and too long.",
]

# Generate embeddings for new texts
new_embeddings = model.encode(new_texts, convert_to_tensor=True)

# Predict
new_predictions = rf_classifier.predict(new_embeddings.cpu().numpy())

# Map predictions to labels
label_mapping = {0: "Negative", 1: "Positive"}
predicted_labels = [label_mapping[pred] for pred in new_predictions]

for text, label in zip(new_texts, predicted_labels):
    print(f"Text: {text}\nPredicted Sentiment: {label}\n")
```
```{python}
# Same workflow with a naive Bayes classifier (loaded as `nb_classifier`)
new_texts = [
    "I absolutely love the polar express, it's easily Tom Hanks' best movie.",
    "The Marvel film was boring and too long.",
]

# Generate embeddings for new texts
new_embeddings = model.encode(new_texts, convert_to_tensor=True)

# Predict
new_predictions = nb_classifier.predict(new_embeddings.cpu().numpy())

# Map predictions to labels
label_mapping = {0: "Negative", 1: "Positive"}
predicted_labels = [label_mapping[pred] for pred in new_predictions]

for text, label in zip(new_texts, predicted_labels):
    print(f"Text: {text}\nPredicted Sentiment: {label}\n")
```
```{python}
# Same workflow with a gradient boosting classifier (loaded as `gb_classifier`)
new_texts = [
    "I absolutely love the polar express, it's easily Tom Hanks' best movie.",
    "The Marvel film was boring and too long.",
]

# Generate embeddings for new texts
new_embeddings = model.encode(new_texts, convert_to_tensor=True)

# Predict
new_predictions = gb_classifier.predict(new_embeddings.cpu().numpy())

# Map predictions to labels
label_mapping = {0: "Negative", 1: "Positive"}
predicted_labels = [label_mapping[pred] for pred in new_predictions]

for text, label in zip(new_texts, predicted_labels):
    print(f"Text: {text}\nPredicted Sentiment: {label}\n")
```
19.3.1 Interrogating Results
In the above chunks we have used `classifier.predict()` to return labels, so you will need to do a more qualitative review of the results to see if you are happy with them. If you want to take a more structured approach to validating your results, you can use `classifier.predict_proba()` instead. For each datapoint this returns a vector of probabilities of length n, where n is the number of possible classes, which allows you to rank the model's certainty for any given label.
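As a minimal sketch, assuming the fitted classifier exposes `predict_proba()` (logistic regression, random forest, and naive Bayes classifiers do by default; an SVM only does if it was fitted with `probability=True`), and reusing `new_embeddings`, `new_texts`, and `label_mapping` from the chunks above:

```{python}
# Probability of each class for every datapoint, shape (n_samples, n_classes)
probs = classifier.predict_proba(new_embeddings.cpu().numpy())

for text, p in zip(new_texts, probs):
    label = label_mapping[p.argmax()]  # most likely class
    confidence = p.max()               # model certainty for that class
    print(f"{label} ({confidence:.2%}): {text}")
```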
As with deep learning models, investigating the labels attached to some of the lower-scored posts can reveal cases where the model may be getting confused. Confusion could arise because a post is genuinely difficult to classify or because the model hasn't seen that type of post before. Would changing the score threshold at which a post gets labelled as spam or not spam help your analysis? Which matters more for your analysis, precision or recall? You can shift the balance between them by adjusting the classification probability threshold.