5 Spam Classifier
5.1 Introduction
In the Peaks and Pits, and Conversation Landscape case studies, we walked through some of our fundamental project offerings. Here we walk through a different type of project, where the goal of the research was twofold:
- To produce thought leadership-style guidance for preserving the quality of data on social.
- To create a model, or a set of models & heuristics, to identify high-quality data on social.
This document covers the second of these goals. The deck we presented to the client can be found here.
5.2 Spam Classifier
The vast majority of our work is centred around answering research questions which are prescribed to us by stakeholders. To achieve this aim, we try to extract, understand, and represent the organic opinions of real people. Our work is good in proportion to how accurately we can do this: the more organic & accurate the data we extract, the better our answers, and vice versa.
As practitioners we have all lived through the frustration of arriving at the end of a research project only to find that an overlooked cluster of spam is skewing our results, sending us back to the beginning of the process to clean the data and repeat the analysis. When this happens it can drain both time and motivation, and if we’re pressed for either then we may end up producing work that isn’t accurate, or isn’t as accurate as it could be.
5.3 Motivation
Take the following sentiment distribution:
The chart indicates an approximately even split between Positive and Negative, and a large proportion of Neutral. This distribution is common and unremarkable. We could make some reasonable inferences about the data based on this distribution. However, lurking under the surface is a problem similar to Simpson’s Paradox.
For example, if we separate our dataset into groups of ‘Spam’ and ‘Not Spam’, the balance between Positive and Negative disappears entirely. The ‘Spam’ section of the dataset has ~4x as many Positive mentions as Negative, whereas the ‘Not Spam’ section has ~2x as many Negative as Positive; an ~8x swing.
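A minimal pandas sketch of the comparison described here, using a fabricated dataframe with hypothetical sentiment and is_spam columns rather than our production schema:

```python
import pandas as pd

# Fabricated mentions dataframe; column names are illustrative only.
mentions = pd.DataFrame({
    "sentiment": ["Positive", "Negative", "Neutral", "Positive", "Negative", "Positive"],
    "is_spam":   [True, False, False, True, False, True],
})

# Overall sentiment split...
print(mentions["sentiment"].value_counts(normalize=True))

# ...versus the split within 'Spam' and 'Not Spam' separately.
print(pd.crosstab(mentions["is_spam"], mentions["sentiment"], normalize="index"))
```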
Practically speaking, failure to remove spam could be the difference between deriving an accurate picture of reality and not, or between making a correctly-timed strategic intervention and continuing as-is, assuming everything is ok.
We also want to:
- Reduce time spent cleaning data
- Increase consistency of output
- Try our best to comply with the assumptions of the algorithms we use, e.g. for clustering or topic modelling
5.4 What is Spam?
Broadly speaking, we can define spam as ‘Unwanted, irrelevant, or unsolicited mentions sent to large audiences.’ However, whether something is unwanted or irrelevant is open to interpretation and varies with context; so precisely stating what each of these words means is difficult. Despite this difficulty, we will mostly tend to agree with one another when we actually see spam. There is a certain ‘know it when I see it’ aspect of spam.
It should be quite clear which of the following two documents is spam and which is not:
“Need help with online exams, assignments, research projects, or dissertations? Look no further! I can assist with proofreading, personal statements, and more. Let’s tackle #MachineLearning, #DataScience, #Python, #Cybersecurity, #BigData, #AI, #IoT, #DeepLearning, #NLP together!”
“I think one of the big open questions is whether anyone will challenge the order in court. It uses presidential emergency powers to require red-teaming of foundation models. Initially that may only affect companies like OpenAI and Anthropic that have been asking for regulation.”
5.5 Methodology
With our loose definition of ‘Spam’ in place, we used its ‘know it when I see it’ property to curate a corpus of ‘Spam’ and ‘Not Spam’ mentions. We then used a Transfer Learning1 approach to fine-tune a foundational Language Model to classify data into ‘Spam’ and ‘Not Spam’.
When labelling ‘Spam’ and ‘Not Spam’ we added optional labels as the need arose. For example, we saw a lot of spam regarding wallet scams, memecoins, crypto price updates, sustainable coins, and a whole lot more relating to cryptocurrencies, so we added a label for ‘Crypto’.
In a project like this, creating new labels as we went was inevitable: we were discovering/learning about the nature of Spam as we labelled, and, as part of the research brief, we were to create a taxonomy of spam - meaning we needed to identify groups of spam from the data.
That said, if we introduce a new label at data point 1,000, are we really going to go back and check which of the previous 1,000 data points may fit into that label? The answer is most likely no. Depending on what you’re using these labels for, this may or may not be harmful. For our purposes, the additional labels were of secondary importance, so we could tolerate them not being consistent throughout the dataset. We also set out to identify cases of LLM-generated slop 2.
To be clear, the important label for the model was ‘Spam’ vs ‘Not Spam’.
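To make that concrete, a sketch of what a single labelled record might look like, with the binary label as the primary field and any optional labels carried alongside; the field names and example values are purely illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class LabelledMention:
    """Illustrative record; field names are assumptions, not our exact schema."""
    text: str
    is_spam: bool                                          # the primary label the model learns
    secondary_labels: list = field(default_factory=list)   # optional labels, e.g. ["crypto"]

example = LabelledMention(
    text="New sustainable memecoin launching soon, don't miss the 100x! #crypto #AI",
    is_spam=True,
    secondary_labels=["crypto"],
)
```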
Visit the Data Labelling Strategy document for more information on strategies for curating datasets.
Creating the additional labels served two main purposes:
- Priors for a Taxonomy of Spam
- Discrete variables to systematically improve model performance
We were delivering the Taxonomy as a scoped commitment, and we discuss the second purpose in more detail in the Systematically Improving a Model section.
When selecting models to test we needed to consider the following constraints:
- Weights are open-source
- Permissive License
- Capable of performing the task at hand
- Can be fine-tuned on consumer hardware
- Simple to deploy and run inference at scale
We tested a variety of candidate models and ultimately opted for a fine-tune of ‘Roberta-base’ as it performed the best in all metrics.
[TODO: get gt table for initial results]
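For orientation, the sketch below shows what a fine-tune of roberta-base for binary ‘Spam’ / ‘Not Spam’ classification can look like with the Hugging Face transformers Trainer. It is not our exact training code: the two-row dataset, the label encoding, and the hyperparameters are placeholder assumptions.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholder corpus; in practice this is the labelled 'Spam' / 'Not Spam' dataset.
data = Dataset.from_dict({
    "text": [
        "Need help with online exams or dissertations? DM me! #AI #NLP",
        "I think the big open question is whether anyone will challenge the order in court.",
    ],
    "label": [1, 0],  # 1 = Spam, 0 = Not Spam
})

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def tokenise(batch):
    # Truncate to the model's 512-token limit (see the Long Mentions discussion below).
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenise, batched=True)

args = TrainingArguments(output_dir="spam-classifier",
                         num_train_epochs=3,                # placeholder hyperparameters
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=data, tokenizer=tokenizer).train()
```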
5.6 Corpus
Each tab houses a selection of the texts which have been assigned that label inside the corpus. Click through the tabs to start building a mental model of the type of data that will be removed by the classifier.
It is crucial when using the model that you understand what you are removing, and that you are happy that what you are removing is not indispensable for your research question(s). Never blindly remove data.
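One way to honour this in practice is to run the classifier over a sample and read what it flags before dropping anything. A minimal sketch, assuming a fine-tuned checkpoint saved at a placeholder path and a checkpoint config whose positive label is named ‘SPAM’:

```python
from transformers import pipeline

# Path and label name are placeholders for the fine-tuned checkpoint's actual config.
classify = pipeline("text-classification", model="path/to/fine-tuned-spam-classifier")

mentions = [
    "Need help with online exams, assignments, or dissertations? DM me! #AI #NLP",
    "I think one of the big open questions is whether anyone will challenge the order in court.",
]
predictions = classify(mentions, truncation=True)

# Read what would be removed before actually removing it.
for mention, pred in zip(mentions, predictions):
    if pred["label"] == "SPAM":   # assumes the checkpoint's id2label uses this name
        print(f"Would remove ({pred['score']:.2f}): {mention}")
```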
Something worth detailing here is that in the early labels, before it was clear that people sharing their AI Generations might be their own type of spam, I was considering links to AI images as a form of promotion. If the goal was to separate promotions from AI generations this would be problematic.
However, this corpus is for separating ‘Spam’ from ‘Not Spam’, so we can accept some fuzziness between categories.
It is not always possible to tell whether something was written by an AI, a person, or a person using AI. Some slop, however, is much more obvious than others.
SEO will tend to be generic content (sometimes including links) with excessive use of keywords or hashtags to boost content visibility.
Quite similar to ‘article_link’, except these are mainly the content of reports, and they have often been truncated (ending in …) by Sprinklr’s scrapers. They are potentially tricky to deal with because they will often not look like spam.
We started out collecting these because quotes can have unwanted effects in many of our downstream tasks, and then settled on labelling texts where people are just parroting a quote as Spam. I strongly suspect that some of these quotes will not be spam, and should be revisited.
5.7 Data Challenges
Now that you have seen some of the data, let’s talk about a few of the data-related challenges we encountered. We include this section only to reassure you that you will encounter problems, so try not to get too worried when you inevitably do.
Articles, Links, Announcements
A particular challenge we faced was how to separate Articles, Reports, and Announcements that are included by a user as part of a discussion or organic conversation, from those that are merely a subset of promotion, or those that are a mix of both. Ultimately we elected to relabel a subset (~500) of the mentions of these classes.
We made the choice to relabel because 1) we had found it difficult to be consistent in our labelling, which meant that the model was struggling to learn to distinguish these classes, and 2) after discussing the issue as a group we planned to explore how we could bracket ‘Not Spam’ mentions into distinct categories.
Although laborious, the net impact of revisiting labels was positive: we ended up with a clearer idea of what constitutes spam, and our classifier’s performance improved.
It’s important to make space for self-correcting mechanisms along the way.
Long Mentions
It takes a long time to read the blog-length AI-infused mentions from social_network == "WEB". Many of these are half-written by AI, or contain a lot of SEO; others have paragraphs full of adverts. If we knew in advance which portion of the text would contain the advert then we could remove just that portion, but we don’t, so we are forced to remove the whole mention. Whilst this may seem undesirable, the adverts are often identical or nearly identical, and may feature 100s, or 1,000s of times per dataset, skewing our downstream analyses if not treated with care.
To add to this difficulty, the roberta-base model that we ended up using has a maximum token limit of 512: any token after the 512th will be truncated. Tokens are not 1:1 mappable to words; a word will often map to several tokens 3. It follows that a document which has no spam signals until the 513th token, but many thereafter, will be labelled as ‘Spam’ by a human but classified as ‘Not Spam’ by the model.
One solution would be to use a model with a larger context window, but this would present other difficulties. Another would be to break up each mention into parts of length < 512 tokens, but this again presents its own difficulties. Ultimately we chose to accept that some documents will be classified incorrectly due to length.
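If it is useful to know how often this happens, footnote 3’s approach of tokenising each document and counting tokens is cheap to run up-front. A minimal sketch with the roberta-base tokenizer (the over-long document below is fabricated):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def token_count(text: str) -> int:
    # Count tokens without truncating, so we can see how much the model would never read.
    return len(tokenizer(text, truncation=False)["input_ids"])

doc = "A blog-length, advert-laden mention. " * 400  # fabricated long document
n_tokens = token_count(doc)
if n_tokens > tokenizer.model_max_length:  # 512 for roberta-base
    print(f"{n_tokens} tokens: everything after token {tokenizer.model_max_length} is invisible to the model")
```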
Slop
Some slop is instantly recognisable: repeated structures/constructs, stock terms & phrases, unnatural rhythm/musicality, or clear register mismatch. 4 Other slop is difficult to identify, and we risk a high % of false positives if we’re not careful in how we label.
Setting aside the difficult cases, we have to read a lot of slop before we start recognising the patterns. Exposure to slop will vary across labellers, and this will inevitably lead to some inconsistency in the labelled data. Again, it’s important to introduce self-correcting mechanisms later in the pipeline; see the Modelling Outputs Logits & Uncertainty section for more information.
5.8 Data & Labelling Strategy
From a combination of the Spam Classifier project and other tangentially-related projects, e.g. the ‘Generalised Peaks & Pits Classifier’, we have created an entire document on the topic. We expect this document to grow as we continue to learn new things about data labelling strategies.
At a very high level:
- Data is the most important part of any machine learning project.
- We aim for a high quantity of high-quality data.
- It will often take \(\approx\) 2 to 5x the amount of data to halve the number of errors, so plan accordingly.
- We expect to revisit our data labelling strategy throughout the project as we learn new things.
For information on the practical steps involved in actually labelling data, visit the Data Labelling Stack document.
5.9 Training the Model
[TODO: Pending merge with Aoife’s document]
5.10 Systematically Improving a Model
Similarly to the Data & Labelling strategy document, we have created a separate document regarding systematically improving a machine learning model, particularly for text classification.
At a very high level:
- Look at your data, a lot.
- Break your data into groups (discrete variables) and calculate metrics for each group
- Target (find more data) high-frequency, or under-performing groups
- Calculate the loss for each training sample and identify incorrect labels, or grave model errors
- Identify uncertainty in the model with the logits or the softmax’d logits (softmax’d values between 0.4 and 0.6 indicate the model is quite uncertain); see the sketch after this list
- Continuously monitor and improve your model by returning to the first step and working through the process again.
- Data work will never be over - but when you start to exhaust the opportunities for improvement from data, you can allocate more time to model selection, hyper-parameters etc. to squeeze out the last few tenths of performance gains.
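As a rough sketch of the grouping, per-sample loss, and uncertainty checks above (the evaluation dataframe, its column names, and the 0.4-0.6 band are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical evaluation frame: one row per labelled mention, with the model's
# softmax'd probability of 'Spam' alongside the human label and a grouping variable.
eval_df = pd.DataFrame({
    "group":  ["crypto", "crypto", "seo", "quote", "report"],
    "label":  [1, 1, 0, 0, 1],            # 1 = Spam, 0 = Not Spam
    "p_spam": [0.97, 0.55, 0.08, 0.91, 0.46],
})
eval_df["pred"] = (eval_df["p_spam"] >= 0.5).astype(int)

# 1. Metrics per group: simple accuracy here, but any metric works.
print(eval_df.assign(correct=eval_df["pred"] == eval_df["label"])
             .groupby("group")["correct"].mean())

# 2. Per-sample cross-entropy: large values flag possible label errors or grave model errors.
p = eval_df["p_spam"].clip(1e-7, 1 - 1e-7)
eval_df["loss"] = -(eval_df["label"] * np.log(p) + (1 - eval_df["label"]) * np.log(1 - p))
print(eval_df.sort_values("loss", ascending=False).head())

# 3. Uncertainty: softmax'd probabilities between 0.4 and 0.6 are worth a second look.
print(eval_df[eval_df["p_spam"].between(0.4, 0.6)])
```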
5.11 Final Results
[TODO: Pending the gt table from Aoife]
5.12 Reflections
Taking a model which has already been trained for general purposes, and training it on a specific task.↩︎
“…slop generated by large language models, written by no one to communicate nothing.”, Robyn Speer, WordFreq maintainer↩︎
we can tokenise each document and count the tokens if we want to be precise↩︎
Where the language used is not appropriate for the thing being described. For example, when describing a calendar app: “This trailblazing calendar app is going to turn the meeting scheduling world upside down!”↩︎