12  Data Labelling Stack



Labelling data is fundamentally labour-intensive. Looking for shortcuts is often misguided, because you learn a lot about the task by labelling the data, and this knowledge will be instrumental in Systematically Improving your Model. What you learn while labelling will nearly always be worth more than the time you would save. Even so, it makes sense to invest in tools which speed the process up or make it more enjoyable.

For collaborative labelling, a Google Sheet is a good place to start. It’s easy to get your data in and out of a Google Sheet, everybody knows how to use one, and collaboration is straightforward. For text classification tasks, you can usually get away with three columns: id, data, label. If the task is binary classification you can quickly type ‘0’ or ‘1’ in the label column and hit the arrow key to move down to the next sample.
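Once a sheet like this is downloaded as a .csv, a couple of terminal commands give a quick progress check. The sketch below is illustrative only: the file name and the three rows are made up, and it assumes the text column contains no commas (cut splits on every comma, so quoted commas would break it).

```shell
# Made-up stand-in for a sheet downloaded as CSV, with columns id, data, label.
SHEET="$(mktemp -d)/labels.csv"
cat > "$SHEET" <<'EOF'
id,data,label
1,great product,1
2,arrived broken,0
3,would buy again,1
EOF

# Skip the header row, keep the third column, and count how often each label occurs.
COUNTS="$(tail -n +2 "$SHEET" | cut -d, -f3 | sort | uniq -c)"
echo "$COUNTS"

# Rows whose label cell is still empty have not been labelled yet.
UNLABELLED="$(tail -n +2 "$SHEET" | awk -F, '$3 == ""' | wc -l | tr -d ' ')"
echo "unlabelled rows: $UNLABELLED"
```

For anything beyond a quick sanity check, reading the file into R is the safer route, because it handles quoted text properly.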

As the task grows in sophistication, e.g. in a multi-label classification task, Google Sheets become increasingly cumbersome to use. It is tiresome to separate your labels with delimiters, or select the right option from a drop-down with many options. When you get to labelling spans, segments, or images, Google Sheets will absolutely not be the right tool for the job. Eventually you will want something more flexible.

12.1 Doccano - AWS

To this end, we now have a Doccano EC2 instance hosted on AWS at a company-specific domain. Hosting the instance makes collaborative labelling significantly smoother, and the data is backed up.

Note

If you need access, message the team for details. You will be given the URL and login details.

12.2 Doccano - Local

Setting up a tool like Doccano requires:

  1. Managing a Python environment
  2. A working Docker installation
  3. Setting up a Doccano-specific Docker container from your Python environment

Python Environments

Note

If you feel confident setting up and managing Python environments, skip to the ‘Installing Docker and Doccano’ section.

If you are unfamiliar with Python environments and have come from an R background, you are in for a tiny treat. We can often get very far in R without knowing anything about virtual environments. This is primarily due to CRAN - R’s package management archive. CRAN adds a layer of friction to uploading and updating packages, which helps to protect the ecosystem, trading off developer productivity for usability. It does a great job of ensuring backwards compatibility and compatibility between packages.

PyPI, on the other hand, is not so fussed about ensuring backwards compatibility, and there are significantly fewer hoops to jump through to submit a package to PyPI than there are for CRAN. Whilst this is beneficial for the rate of innovation, the lack of guardrails can be painful for us as users.

The consequences of failing to manage your Python environment range from ‘minor’ to ‘actually quite severe’. It is not exceptionally rare to need to delete an entire Python installation, or even reinstall your whole operating system, to resolve Python environment issues! So with that in mind, let’s set ourselves up for success. We’ll look at venv and miniconda.

Note

If you are used to something more modern like poetry or uv then please do add a tab to the steps outlined below.

Before following these steps - consider reading the official docs and building the environment from first principles.

venv (virtual env) is a built-in Python module. When using venv to manage our environments, we run Python code with a specific Python interpreter: the one associated with our environment. Under the hood, activating the environment puts its bin folder at the front of our shell’s PATH; visit the official docs for more detail.

Creating the environment:

  1. Create a folder for your project
  2. Navigate to your project folder: cd <~/root/to/folder>
  3. Create an environment named ‘.venv’ in the terminal: python -m venv .venv
  4. Activate the environment: source .venv/bin/activate
  5. Check the environment is running with: echo $VIRTUAL_ENV
  6. Now you can install packages using pip (or other package management software):
     1. If you are inheriting a project and the provider has included a ‘requirements.txt’ file, install from it with: pip install -r requirements.txt
     2. If you did not receive a ‘requirements.txt’ file, or are creating the project yourself and adding dependencies, save your installed packages with: pip freeze > requirements.txt
     3. List all installed packages with: pip list

Every time you want to use this virtual environment, return to its folder, activate it and run your code.
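The steps above can be strung together in one terminal session. This is a minimal sketch rather than a prescribed setup: it assumes python3 is on your PATH and ships with the venv module, and the folder name ‘doccano-labelling’ is a placeholder created in a throwaway location so the sketch is safe to run.

```shell
# Placeholder project folder in a temporary location; use your real project path instead.
PROJECT_DIR="$(mktemp -d)/doccano-labelling"
mkdir -p "$PROJECT_DIR"
cd "$PROJECT_DIR"

# Create the environment in a hidden '.venv' folder inside the project.
python3 -m venv .venv

# Activate it ('.' is the portable spelling of 'source').
. .venv/bin/activate

# Prints the path to .venv if activation worked.
echo "$VIRTUAL_ENV"

# Record whatever is currently installed so others can rebuild the environment.
python -m pip freeze > requirements.txt
```

In a fresh environment the requirements file will be empty or nearly so; re-run the last line after installing packages.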

If you intend to use conda to manage your environments, we recommend installing Miniconda rather than the full Anaconda distribution, because it is a lightweight version which plays nicely with macOS. Unlike venv environments, which live inside a project folder, conda environments are stored centrally, so we can activate them from any directory and reuse them across projects more easily. However, it’s still recommended to create specific environments for projects.

As ever, before following the steps below, consider reading the official docs:

  - Miniconda installation
  - Miniconda docs

Creating the environment:

  1. Create a ‘pretendenv’ environment with Python version 3.10: conda create --name pretendenv python=3.10
  2. Activate the environment: conda activate pretendenv
  3. Install packages:
     1. With conda: conda install pandas
     2. Sometimes you’ll need to use pip when working with conda: conda install pip, then pip install pandas
  4. List installed packages: conda list
  5. List environments: conda env list
  6. When working with conda we save our project’s requirements to a .yml file: conda env export > environment.yml
     1. New users can then install our requirements: conda env create -f environment.yml

You can deactivate the environment with: conda deactivate

Task

Create a folder and a virtual environment for your pending Doccano installation

Installing Docker and Doccano

Steps for installing Docker (desktop) and Doccano:

  1. Follow the official Docker documentation installation steps.
  2. Check you have it installed with docker --version, then start Docker, either via the desktop app or from the terminal with: open /Applications/Docker.app
  3. If you haven’t already, open up a terminal and execute: docker pull doccano/doccano
  4. Set up a Python environment:
     1. Name your environment ‘doccano’.
     2. Activate it.
     3. pip install doccano or conda install doccano. If conda doesn’t find doccano: conda install pip, then pip install doccano
  5. Create a Docker container for Doccano in the terminal (the trailing backslashes continue the command onto the next line):

       docker container create --name doccano \
         -e "ADMIN_USERNAME=admin" \
         -e "ADMIN_EMAIL=your.email@sharecreative.com" \
         -e "ADMIN_PASSWORD=yourpassword" \
         -v doccano-db:/data \
         -p 8000:8000 doccano/doccano

     This will take a minute or two.
  6. Check the image is present with docker images or docker images | grep doccano. If the return is empty, you don’t have it! Pull the image again.
  7. Start the container: docker container start doccano
  8. Open http://localhost:8000
  9. Log in (top right) with the ADMIN_USERNAME and ADMIN_PASSWORD you set. Doccano has a fully-functional Django backend which takes care of authorisation (among other things).

To kill the container: docker kill doccano
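If the page at localhost does not load, it helps to ask Docker what state the container is in. The sketch below is one hedged way to do that; it assumes the container was created with the name ‘doccano’ as above, and it falls back to a placeholder value when Docker or the container is missing.

```shell
CONTAINER="doccano"   # the name passed to --name when the container was created

if command -v docker >/dev/null 2>&1; then
    # Prints a status such as 'created', 'running' or 'exited'.
    STATE="$(docker inspect --format '{{.State.Status}}' "$CONTAINER" 2>/dev/null)" || STATE="not-found"
else
    STATE="docker-unavailable"
fi
echo "Container '$CONTAINER' state: $STATE"

# Typical follow-ups (commented out so the sketch has no side effects):
# docker container start "$CONTAINER"   # start a created or stopped container
# docker logs --tail 20 "$CONTAINER"    # look at recent server output
# docker container stop "$CONTAINER"    # stop gracefully instead of 'docker kill'
```

Because the create command mounts the named volume doccano-db, stopping and restarting the container should not lose your projects.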

Tip

docker --help will give you the tools to navigate Docker from the terminal.

Using Doccano

As ever, first read the official doccano-mentation. Next, read through Doccano’s official tutorial, which will show you how to create projects, add datasets, define labels, add members to a project, and annotate data.

For the next steps you will need to have Doccano and R open.

Task
  1. Set up a Doccano project for text classification

Doccano accepts data in the following formats: TextFile, TextLine, CSV, fastText, JSON, JSONL. If you are unfamiliar with any format, it’s simple enough to look up examples or read the specifications online, so we’ll focus on the filetype we use most often, CSV, and show how to import it.

Let’s create a .csv with one row and one string of text placed in a column named ‘message’.

library(tidyverse) # provides tibble(), write_csv() and %>%

tibble(message = "test") %>%
  write_csv("~/Downloads/doccano_insufficient.csv")

Task

1. Set up a Doccano project for text classification
2. Try importing your ‘doccano_insufficient.csv’ file, how does Doccano tell you that something is wrong?

Answer

Directly above the ‘import’ button we should see a table rendered with ‘Filename’, ‘Line’, ‘Message’ columns. This table should have 2 rows, which tell us: ‘Column text not found in the file’ and ‘Column label not found in the file’.

To fix the issue directly in Doccano, note that the import form has input fields for both the data column and the label column. Changing the text in the ‘Column Data’ field to ‘message’ shows Doccano the structure of our data and, provided we have added some labels, we’ll be able to start labelling.
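A quick way to see which column names Doccano will find is to look at the first line of the file before importing it. The sketch below recreates the one-row file in a temporary folder (so it runs without R) and checks for the two default column names, ‘text’ and ‘label’, mentioned in the error table; the variable names are illustrative.

```shell
# Illustrative copy of the one-row file, written to a temporary folder.
CSV_FILE="$(mktemp -d)/doccano_insufficient.csv"
printf 'message\ntest\n' > "$CSV_FILE"

# The first line of a CSV holds the column names.
HEADER="$(head -n 1 "$CSV_FILE")"
echo "Columns: $HEADER"

# Report any of the default column names that are missing from the header.
MISSING=""
for COLUMN in text label; do
    case ",$HEADER," in
        *",$COLUMN,"*) echo "found: $COLUMN" ;;
        *)             echo "missing: $COLUMN"; MISSING="${MISSING}${COLUMN} " ;;
    esac
done
```

For this file both names are reported as missing, mirroring the two rows in Doccano’s error table.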

Task

1. Set up a Doccano project for text classification
2. Try importing your ‘doccano_insufficient.csv’ file, how does Doccano tell you that something is wrong?
3. Add labels for ‘Positive’, ‘Negative’, and ‘Neutral’ via the ‘Label’ tab in the sidebar.
4. Click ‘Start Annotation’ and label the sample.
5. Export your data and check that the label column is present.

When using the .csv filetype, everything except ‘text’ and ‘label’ will be stored under ‘metadata’. These columns will be present when we label.

tibble(text = c("text number 1", "text number 2"),
       label = "", # R will recycle this value for every row
       id = c("xx1", "xx2"),
       sentiment = c("neutral", "neutral")
       ) %>%
  write_csv("~/Downloads/doccano_extra_fields.csv")

Import this file, label the 2 samples, and then export the data as a .csv. What happens to the metadata?

We should see that our export does not have the metadata. If we want the metadata to be included, we need to upload a .jsonl file, with our additional columns (id and sentiment) in a metadata node.

tibble(text = c("text number 1", "text number 2"),
       label = "", # R will recycle this value for every row
       id = c("xx1", "xx2"),
       sentiment = c("neutral", "neutral")
       ) %>%
  nest(metadata = c("id", "sentiment")) %>%
  jsonlite::toJSON()

You can then save the file with the jsonlite::stream_out function, which requires a file() connection rather than a file path as input.
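stream_out writes newline-delimited JSON, so the saved file has one record per line, and that makes it easy to inspect from the terminal. The file below is a hand-written stand-in, assuming each record carries text, label and a metadata node as described above; the exact serialisation jsonlite produces for the nested column may differ, so treat the shape as an assumption.

```shell
# Hand-written stand-in for the saved file (assumed shape, not real jsonlite output).
JSONL_FILE="$(mktemp -d)/doccano_jsonl_meta.jsonl"
cat > "$JSONL_FILE" <<'EOF'
{"text":"text number 1","label":"","metadata":[{"id":"xx1","sentiment":"neutral"}]}
{"text":"text number 2","label":"","metadata":[{"id":"xx2","sentiment":"neutral"}]}
EOF

# One JSON object per line, so the line count is the record count.
RECORDS="$(wc -l < "$JSONL_FILE" | tr -d ' ')"
echo "records: $RECORDS"

# Count the lines that mention a metadata node; it should match the record count.
WITH_METADATA="$(grep -c '"metadata"' "$JSONL_FILE")"
echo "records with metadata: $WITH_METADATA"
```

If the two counts differ on a real file, some records have lost their metadata and are worth checking before import.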

Import the following data into Doccano, label it, and then export the data as .jsonl.

tibble(text = c("text number 1", "text number 2"),
       label = "", # R will recycle this value for every row
       id = c("xx1", "xx2"),
       sentiment = c("neutral", "neutral")
       ) %>%
  nest(metadata = c("id", "sentiment")) %>%
  jsonlite::stream_out(file("~/Downloads/doccano_jsonl_meta.jsonl"))

Tip

Local versions of Doccano export data to your ~/Downloads as a .zip file, whose contents are named ‘admin’, ‘admin 1’, ‘admin 2’ and so on. Rename your files as they are exported to avoid later confusion, and then move them to the appropriate folder.
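Renaming and moving an export takes a single mv. The sketch below uses a temporary folder in place of ~/Downloads and assumes the unzipped export is called ‘admin.jsonl’; both the file extension and the ‘project/data’ destination are hypothetical, so adjust them to what you actually see.

```shell
# Temporary stand-in for ~/Downloads containing one unzipped export (hypothetical name).
DOWNLOADS="$(mktemp -d)"
touch "$DOWNLOADS/admin.jsonl"

# Hypothetical destination folder for labelled data.
DEST_DIR="$DOWNLOADS/project/data"
mkdir -p "$DEST_DIR"

# Rename the export to something descriptive while moving it.
mv "$DOWNLOADS/admin.jsonl" "$DEST_DIR/doccano_jsonl_meta_labelled.jsonl"
ls "$DEST_DIR"
```

Doing this straight after each export stops ‘admin 1’ and ‘admin 2’ from piling up.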

When you read this data back in, you should now have the original ID column, text, original sentiment label, and a label column requiring only minimal cleaning:

jsonlite::stream_in(file("~/Downloads/doccano_jsonl_meta_labelled.jsonl")) %>%
  tibble() %>%
  unnest(label) %>%
  select(-Comments)