9  Data Cleaning

When we receive a dataset from the Insight team, the first step that we must take involves cleaning and pre-processing the text data.

Broadly, this data cleaning for unstructured textual data can be categorised into levels of cleaning:

9.1 Dataset-level cleaning

Goal: Ensure the dataset as a whole is relevant and of high quality

The main steps that we take for this level of cleaning is spam removal, uninformative content removal and deduplication

Spam Removal

We use the term “spam” quite loosely in our data pre-processing workflows. Whilst the strict definition of “spam” could be something like “unsolicited, repetitive, unwanted content”, we can think of it more broadly any post that displays irregular posting patterns or is not going to provide analytical value to our research project.

Hashtag filtering

There are multiple ways we can identify spam to remove it. The simplest is perhaps something like hashtag spamming, where an excessive number of hashtags, often unrelated to the content of the post, can be indicative of spam.

We can identify posts like this by counting the number of hashtags, and then filtering out posts that reach a certain (subjective) threshold.

cleaned_data <- data %>% 
  mutate(extracted_hashtags = str_extract_all(message_column, "#\\S+"),
         number_of_hashtags = lengths(extracted_hashtags)) %>% 
  filter(number_of_hashtags < 5)

In the example above we have set the threshold to be 5 (so any post that has 5 or more hashtags will be removed), however whilst this is a valid starting point, it is highly recommend to treat each dataset uniquely in determining which threshold to use.

Spam-grams

Often-times spam can be identified by repetitive posting of the same post, or very similar posts, over a short period of time.

We can identify these posts by breaking down posts into n-grams, and counting up the number of posts that contain each n-gram. For example, we might find lots of posts with the 6-gram “Click this link for amazing deals”, which we would want to be removed.

To do this, we can unnest our text data into n-grams (where we decide what value of n we want), count the number of times each n-gram appears in the data, and filter out any post that contains an n-gram above this filtering threshold.

Thankfully, we have a function within the LimpiaR package called limpiar_spam_grams() which aids us with this task massively. With this function, we can specify the value of n we want and the minimum number of times an n-gram should occur to be removed. We are then able to inspect the different n-grams that are removed by the function (and their corresponding post) optionally changing the function inputs if we need to be more strict or conservative with our spam removal.

spam_grams <- data %>% 
  limpiar_spam_grams(text_var = message_column,
                     n_gram = 6,
                     min_freq = 6)

# see remove spam_grams
spam_grams %>% 
  pluck("spam_grams")

# see deleted posts
spam_grams %>% 
  pluck("deleted")

# save 'clean' posts
clean_data <- spam_grams %>% 
  pluck("data")

Filter by post length

Depending on the specific research question or analysis we will be performing, not all posts are equal in their analytical potential. For example, if we are investigating what specific features contribute to the emotional association of a product with a specific audience, a short post like “I love product” (three words) won’t provide the level of detail required to answer the question.

While there is no strict rule for overcoming this, we can use a simple heuristic for post length to determine the minimum size a post needs to be before it is considered informative. For instance, a post like “I love product, the features x and y excite me so much” (12 words) is much more informative than the previous example. We might then decide that any post containing fewer than 10 words (or perhaps 25 characters) can be removed from downstream analysis.

On the other end of the spectrum, exceedingly long posts can also be problematic. These long posts might contain a lot of irrelevant information, which could dilute our ability to extract the core information we need. Additionally, long posts might be too lengthy for certain pipelines. Many embedding models, for example, have a maximum token length and will truncate posts that are longer than this, meaning we could lose valuable information if it appears at the end of the post. Also, from a practical perspective, longer posts take more time to analyse and require more cognitive effort to read, especially if we need to manually identify useful content (e.g. find suitable verbatims).

# Remove posts with fewer than 10 words
cleaned_data <- data %>% 
  filter(str_count(message_column, "\\w+") >= 10)

# Remove posts with fewer than 25 characters and more than 2500 characters
cleaned_data <- data %>% 
  filter(str_length(message_column) >= 25 & str_length(message_column) <= 2500)

Deduplication

While removing spam often addresses repeated content, it’s also important to handle cases of exact duplicates within our dataset. Deduplication focuses on eliminating instances where entire data points, including all their attributes, are repeated.

A duplicated data point will not only have the same message_column content but also identical values in every other column (e.g., universal_message_id, created_time, permalink). This is different from spam posts, which may repeat the same message but will differ in attributes like universal_message_id and created_time.

Although the limpiar_spam_grams() function can help identify spam through frequent n-grams, it might not catch these exact duplicates if they occur infrequently. Therefore, it is essential to use a deduplication step to ensure we are not analysing redundant data.

To remove duplicates, we can use the distinct() function from the dplyr package, ensuring that we retain only unique values of universal_message_id. This step guarantees that each post is represented only once in our dataset.

data_no_duplicates <- data %>% 
  distinct(universal_message_id, .keep_all = TRUE)

9.2 Document-level cleaning

Goal: Prepare each individual document (post) for text analysis.

At a document-level (or individual post level), the steps that we take are more small scale. The necessity to perform each cleaning step will depend on the downstream analysis being performed, but in general the different steps that we can undertake are:

Remove punctuation

Often times we will want punctuation to be removed before performing an analysis because they tend to not be useful for text analysis. This is particularly the case with more ‘traditional’ text analytics, where an algorithm will assign punctuation marks a unique numeric identify just like a word. By removing punctuation we create a cleaner dataset by reducing noise.

For more complex models, such as those that utilise word or sentence embeddings, we often keep punctuation in. This is because punctuation is key to understanding a sentences context (which is what sentence embeddings can do).

For example, there is a big difference between the sentences “Let’s eat, Grandpa” and “Let’s eat Grandpa”, which is lost if we remove punctuation.

Remove stopwords

Stopwords are extremely common words such as “and,” “the,” and “is” that often do not carry significant meaning. In text analysis, these words are typically filtered out to improve the efficiency of text analytical models by reducing the volume of non-essential words.

Removing stopwords is particularly useful in our projects for when we are visualising words, such as a bigram network or a WLO plot, as it is more effective if precious informative space on the plots is not occupied by these uninformative terms.

Similarly to the removal of punctuation, for more complex models (those that utilise word or sentence embeddings) we often keep stopwords in. This is because these stopwords can be key to understanding a sentences context (which is what sentence embeddings can do).

For example, imagine if we removed the stopword “not” from the sentence “I have not eaten pizza”- it would become “I have eaten pizza” and the whole context of the sentence would be different.

Another time to be aware of stopwords is if a key term related to a project is itself a stopword. For example, the stopwords list SMART treats the term “one” as a stopword. If we were studying different Xbox products, then the console “Xbox One” would end up being “Xbox” and we would lose all insight referring to that specific model. For this reason it is always worth double checking which stopwords get removed and whether it is actually suitable.

Lowercase text

Converting all text to lowercase standardises the text data, making it uniform. This helps in treating words like “Data” and “data” as the same word, and is especially useful when an analysis requires an understanding of the frequency of a term (we rarely want to count “Data” and “data” as two different things) such as bigram networks.

Remove mentions

Mentions (e.g., @username) are specific to social media platforms and often do not carry significant meaning for text analysis, and in fact may be confuse downstream analyses. For example, if there was a username called @I_love_chocolate, upon punctuation remove this might end up confusing a sentiment algorithm. Removing mentions therefore helps in focusing on the actual content of the text.

We often perform analyses that involve network analyses. For these, we need to have information of usernames because they appear when users are either mentioned or retweeted. In this case we do not want to remove the @username completely, but rather we can store this information elsewhere in the dataframe.

However, broadly speaking if the goal is to analyse the content/context of a paste, removing mentions is very much necessary.

Remove URLs

URLs in posts often point to external content and generally do not provide meaningful information for text analysis. Removing URLs helps to clean the text by eliminating these irrelevant elements.

Remove emojis/special characters

Emojis and special characters can add noise to the text data. While they can be useful for certain analyses (like emoji-specific sentiment analysis - though we rarely do this), they are often removed to simplify text and focus on word-based analysis.

Stemming/Lemmatization

Stemming and lemmatization are both techniques used to reduce words to their base or root form and act as a text normalisation technique.

Stemming trims word endings to their most basic form, for example changing “clouds” to “cloud” or “trees” to “tree”. However, sometimes stemming reduces words to a form that doesn’t make total sense such as “little” to “littl” or “histories” to “histori”.

Lemmatization considers the context and grammatical role when normalising words, producing dictionary definition version of words. For example “histories” would become “history”, and “caring” would become “car” (whereas for stemming it would become “car”).

We tend to use lemmatization over stemming- despite it being a bit slower due to a more complex model, the benefit of lemmitization outweighs this. Similar to lowercasing the text, lemmitization is useful when we need to normalise text where having distinct terms like “change”, “changing”, “changes”, and “changed” isn’t necessary and just “change” is suitable.

9.3 Conclusion

Despite all of these different techniques, it is important to remember these are not mutually exclusive, and do not always need to be performed. It may very well be the case where a specific project actually required us to mine through the URLs in social posts to see where users a linking too, or perhaps keeping text as all-caps is important for how a specific brand or product is mentioned online. Whilst we can streamline the cleaning steps by using the ParseR function above, it is always worth spending time considering the best cleaning steps for each specific part of a project. It is much better spending more time at the beginning of the project getting this right, than realising that the downstream analysis are built on dodgy foundations and the data cleaning step needs to happen again later in the project, rendering intermediate work redundant.