21 Package Development

One of the beautiful things about Data Science at SAMY is that you get to choose your path between research, development, or a hybrid. If you’re reading this, you’ve probably decided that you want to explore some development - what better way to start than building your own package?

Note

For a high-level overview of our existing packages, refer to our packages document.

We build packages because they are convenient ways to share our code & data and democratise access to our tools & workflows - besides, that Google Doc of functions is getting heavy, and copying & pasting code from project to project is getting tiresome. Eventually you’ll want the familiar library() syntax.

Historically our packages have been built in R for two key reasons:

The profile/experience of people in the team
The ecosystem for building packages is well-maintained and documented

In recent times we have moved to more of a hybrid approach between R & Python - we find the former considerably easier to use for data wrangling and visualisation, and the latter better suited to modelling and anything to do with LLMs. Internal development for Python has lagged behind R, but we expect this to change over time as we seek to be tool agnostic and focus on the right tool for the job at hand.

Reticulate

Reticulate is an R package that allows us to import Python packages and functions in R. Currently this is a one-way street - we can’t use reticulate to import R functions and packages into Python. This has influenced our decisions in the past, e.g. with BertopicR. We envisioned Insight Analysts using BertopicR as a drop-in replacement for topic modelling with SegmentR. Weighing up the additional difficulty in development against the time and resources Analysts would need to learn Python as well as R, we opted for reticulate.

Using reticulate requires managing Python environments from R, which brings difficulties of its own.
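As a rough illustration of what importing Python from R looks like, here is a minimal sketch. The helper name import_python_module() and its envname argument are hypothetical, made up for this example; only reticulate::use_virtualenv() and reticulate::import() come from reticulate itself. It assumes reticulate and a working Python installation are available.

```r
# Hypothetical helper: pin a Python virtual environment (optional) and
# import a Python module so it can be used from R.
import_python_module <- function(module, envname = NULL) {
  if (!requireNamespace("reticulate", quietly = TRUE)) {
    stop("The reticulate package is required for this example.")
  }
  if (!is.null(envname)) {
    # Fail loudly if the named environment does not exist
    reticulate::use_virtualenv(envname, required = TRUE)
  }
  reticulate::import(module)
}

# Example usage (not run here, as it needs a Python installation):
# py_math <- import_python_module("math")
# py_math$sqrt(16)
```

Choosing the environment before the first import matters because reticulate binds to a single Python for the whole R session.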

22 R

Here we’ll look at how to get off the ground in R using the R package stack - {usethis}, {pkgdown}, {devtools}, {testthat} and {roxygen2}.

22.2 Development workflow

Once you’ve built the package there are some things you will want to do regularly to ensure your package stays in good shape. This is by no means an exhaustive list - be sure to add your tips & tricks as you amass them.

Run devtools::test() often to check for regressions in your code (testthat::test_package() runs the tests of an installed package and needs the package name)
Run devtools::check() occasionally to make sure you haven’t made any obvious mistakes - try to keep notes, warnings and errors to 0!
Use devtools::load_all() to reload the package when you’ve made changes. (devtools::document() also calls load_all() when called)
Run roxygen2::roxygenise(clean = TRUE) if your documentation doesn’t look as you expect
Use pkgdown::build_site() when you expect to see changes in your package’s website
Use pkgdown::clean_site() followed by pkgdown::build_site() when expected changes aren’t showing up in your preview
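The routine above could be bundled into a small convenience function. This is only a sketch: dev_cycle() and its rebuild_site argument are hypothetical names, not part of any of these packages, and it assumes {devtools} and {pkgdown} are installed and that you call it from the package root.

```r
# Hypothetical wrapper around the regular development checks.
dev_cycle <- function(pkg = ".", rebuild_site = FALSE) {
  devtools::document(pkg)  # regenerates docs; also calls load_all()
  devtools::test(pkg)      # run the test suite to catch regressions
  devtools::check(pkg)     # full check: aim for 0 errors, warnings, notes

  if (rebuild_site) {
    # A clean rebuild helps when changes are not showing in the preview
    pkgdown::clean_site(pkg)
    pkgdown::build_site(pkg)
  }
  invisible(pkg)
}

# Example usage from the package root (not run here):
# dev_cycle()
# dev_cycle(rebuild_site = TRUE)
```

In day-to-day work you will probably call the individual functions as needed, since devtools::check() is much slower than devtools::test().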

22.3 Contributing to existing packages

Pull current state of repo/package from origin
Create a new branch - usethis::pr_init() makes this a bit easier, otherwise use git checkout -b "branch_name"
Run devtools::test() and devtools::check() at regular intervals, and keep errors, warnings and notes to a minimum
Build out logic for new changes, adding to R/ where necessary. Use usethis::use_r() to add new scripts properly
Build out tests for new logic in tests/
Ensure function-level documentation is added to any new logic, including @title, @description, @details, @param, @returns, @examples and @export if the function is to be exported, or @keywords internal otherwise. Let roxygen2 take care of @usage.
Keep re-running tests and check!
If you’re introducing something new, update package-level documentation e.g. vignettes and/or readme explaining what you’ve introduced and how it should be used. Provide examples where possible. If you’re building out a new capability you may need a whole new vignette, use the usethis::use_vignette() function.

What are vignettes?

Vignettes are long-form guides that provide in-depth documentation for your package. They go beyond the basic function documentation and explain how to use the package to solve specific problems, often with detailed examples and code. Vignettes showcase your package’s full range of capabilities and help users understand how to effectively utilise its features.

If you’re updating legacy code, check that vignettes are up-to-date with the changes you’ve made - we want to avoid code-documentation drift where possible.
Add your function to the reference section in _pkgdown.yml if it’s being exported.
Add data objects, .Rhistory, *.Rproj, .Rprofile and .DS_Store to .gitignore
Run pkgdown::clean_site() and pkgdown::build_site(), then visually inspect each section of the site

Pull request when ready.
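The first few steps of a contribution could be scripted roughly as follows. This is a sketch under assumptions: start_contribution() and its arguments are hypothetical, and it assumes {usethis} is installed, the working directory is the package root, and the repo is connected to GitHub.

```r
# Hypothetical helper that sets up a branch, a script, and its test file.
start_contribution <- function(branch, script, vignette = NULL) {
  usethis::pr_init(branch)   # create and switch to a new branch
  usethis::use_r(script)     # creates R/<script>.R
  usethis::use_test(script)  # creates tests/testthat/test-<script>.R

  if (!is.null(vignette)) {
    # Only needed when the change introduces a whole new capability
    usethis::use_vignette(vignette)
  }
  invisible(branch)
}

# Example usage (not run here, the names are made up):
# start_contribution("add-hashtag-counts", "count_hashtags")
# ...write code, tests and docs, re-run test() and check()...
# usethis::pr_push()  # push the branch and open the pull request page
```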

Code

Generally code should sit in the R/ folder. You can choose between one script per function or scripts as modules, where a module is a particular use case or logical unit. Historically we sided with the former, but as a package grows it can become difficult to manage and navigate, and related logic can end up scattered across files. Ultimately this is a matter of taste in R.

Exercises - code

Exercises

You may need to consult external resources to answer the exercises, we’ve tried to provide links to help you along the way, but we encourage you to embrace the joy of discovery and find relevant sources/fill in the gaps where necessary!

What are the practical differences between .gitignore and .Rbuildignore?

What objects should go in .gitignore but not .Rbuildignore, and vice versa?

What does the DESCRIPTION file do?
Write your own description for each of the following packages, detailing what they are for and where they sit in the R package stack:

Tests

In a perfect world, every ~~dog~~ implementation detail would have a ~~home~~ test and every ~~home~~ test would have an implementation detail.

There is a balance to be struck between testing absolutely everything and testing what needs to be tested. Before we get into the finer details, let’s establish why we’re writing tests in the first place. The first reason for writing tests is to help you write software that works. The second reason is to help you do this fast, and with confidence.

Testing is not to prove that your code has no bugs, or cannot have any bugs in the future. Whenever you do find a bug, or someone reports one, write a test as you fix the issue.
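To make the "write a test as you fix the issue" habit concrete, here is a minimal sketch. The function count_words() and the bug it describes are invented for illustration, and the test assumes {testthat} is installed. In a package the function would live in R/ and the test in tests/testthat/.

```r
# Hypothetical function. Reported bug: leading whitespace produced an
# empty first token, so "  hello world" was counted as three words.
count_words <- function(text) {
  stopifnot(is.character(text))
  # Fix: trim surrounding whitespace before splitting
  lengths(strsplit(trimws(text), "\\s+"))
}

# Regression test written alongside the fix, so the bug cannot
# quietly come back.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("count_words() ignores leading whitespace", {
    testthat::expect_equal(count_words("  hello world"), 2L)
    testthat::expect_equal(count_words(""), 0L)
  })
}
```

The test is shorter than the function and pins down exactly the behaviour that was broken.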

For more information and another opinion, check out the R Packages testing section, and the Testing document

Tip

Don’t let testing paralyse your development process - tests are there to help, not hinder. As a rule-of-thumb, if your tests for a function are more complex than your function, you’ve gone too far.

Documentation

We use {roxygen2} tags to document our functions. Visit the documenting functions article for a primer.

Using the roxygen2 skeleton promotes consistent documentation. Check out a function’s help page (e.g. ?ParseR::count_ngram) to see how rendered documentation looks - do this regularly with your own functions.

We tend to find our documentation could always be better, more complete. You can’t hope to cover everything a user could do with your function, but make sure it’s clear from the documentation what your function is for and what its primary uses are.

Warning

Most people will scroll straight past the @description and @details and go directly to your code examples.

Guidelines for commonly-used Roxygen tags

Tag	Description
@title	One-line description of what your function does
@description	A paragraph elaborating on your title
@details	A more detailed description of the function e.g. explaining how its arguments interact, or other key implementation details.
@param	A description of the function’s parameters
@return	A description of the function’s return value
@examples	Examples of how to use the function
@export	Marks the function as exported, so it is added to NAMESPACE and available to users (takes no description)
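Put together, the tags might look like this on a function. This is an illustrative sketch: count_hashtags() and has_hashtag() are made-up functions, not part of any SAMY package.

```r
#' @title Count hashtags in text
#'
#' @description
#' Counts how many hashtags appear in each element of a character vector.
#'
#' @details
#' A hashtag is taken to be `#` followed by one or more letters, digits
#' or underscores. Elements with no hashtags return 0.
#'
#' @param text A character vector, one document per element.
#'
#' @return An integer vector the same length as `text`.
#'
#' @examples
#' count_hashtags(c("#rstats is fun", "no tags here"))
#'
#' @export
count_hashtags <- function(text) {
  stopifnot(is.character(text))
  matches <- gregexpr("#[[:alnum:]_]+", text)
  # gregexpr() returns -1 when there is no match, so count positives only
  vapply(matches, function(m) sum(m > 0), integer(1))
}

#' Does the text contain any hashtag?
#'
#' @param text A character vector.
#' @return A logical vector the same length as `text`.
#' @keywords internal
has_hashtag <- function(text) {
  count_hashtags(text) > 0
}
```

The second function is not exported, so it carries @keywords internal instead of @export, which keeps it out of the documentation index.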

Exercises - Documentation

What is the title of {dplyr}’s mutate() function?
@examples must be self-contained. Create an example that is not self-contained, and one that is.
Which package(s) (any programming language) stick out in your mind as being well-documented and easy to use, what did the creators do well?
Audit SAMY’s R packages, find a function with sub-par documentation and upgrade it. Then fire in a Pull Request!

Data

You’re probably going to need some package-level data for your @examples or your vignettes. Before going off and finding or creating a new data set:

Check whether you can demonstrate what you need with existing datasets - call data() in your console
Make sure the dataset you have chosen comes from a package your package explicitly Imports or Suggests

If you still can’t find the right dataset, create one!

Load the dataset into memory
Call usethis::use_data(dataset_variable_name)
Document the columns
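The three steps above can be sketched as follows. The dataset example_posts and its columns are invented for illustration. The usethis::use_data() call is left commented out because it needs to be run from inside a package, and the convention of keeping dataset documentation in R/data.R is an assumption based on common practice.

```r
# 1. Load (or build) the dataset in memory - a made-up example here
example_posts <- data.frame(
  post_id = 1:3,
  text = c("Loving the new release", "Not sure about this", "#rstats"),
  stringsAsFactors = FALSE
)

# 2. From the package root, save it to data/example_posts.rda:
# usethis::use_data(example_posts, overwrite = TRUE)

# 3. Document the columns, typically in R/data.R. The roxygen block sits
#    above the dataset's name written as a string.

#' Example social media posts
#'
#' A small, made-up dataset used in examples and vignettes.
#'
#' @format A data frame with 3 rows and 2 columns:
#' \describe{
#'   \item{post_id}{Integer identifier for each post.}
#'   \item{text}{Character, the text of the post.}
#' }
"example_posts"
```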

If you choose this route, some interesting problems may lie in wait. Skip to Exercises 1.

To go deeper view the R Packages Dataset Section

Datasets from the {datasets} package come with base R

Exercises - Data

Why might you add your data artefacts to .Rbuildignore or .gitignore?
Which package does the diamonds dataset ship with?

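In short: use existing data where you can. When adding new data, usethis::use_data_raw() sets up a data-raw/ script for the code that creates the dataset, and usethis::use_data() saves the finished object to data/.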

Website

pkgdown .nojekyll

Exercises - Website

Explain in your own words what .nojekyll is for.

Where should it be placed in your package?
What problems arise when you don’t have one?

24 Continuous Integration/Continuous Deployment

R templates etc. from RStudio Python templates
