5 Version Control & Collaboration

Good Software Engineering Practice for R Packages

Matt Secrest

July 20, 2023

  • Overview, demo, practical
  • Can only scratch the surface
  • More resources on website

Trade-offs in code development


Working alone

  • no coordination overhead
  • no review
  • can slack on documentation
  • fragile long-term maintenance

Working in a team

  • coordination overhead
  • mutual review of code
  • forced to document
  • more robust long-term maintenance

Version control systems (VCS)

  • Manage different versions of a piece of work
  • Compare and merge diverged versions effectively
flowchart LR
  A[Matt v1] --> B[Ya v2]
  B --> C[Matt v3]
  B --> D[Ya v1]
  D --> E[Ya+Matt v4]
  C --> E
  • Code is a complex system \(\leadsto\) ideal application of VCS
  • Compounded by multiple people ‘fiddling’ with it!

git basics

Enter git, the ‘Latin of data science’

  • Created by Linus Torvalds for work on the Linux kernel
  • Essentially a database with snapshots of a monitored ‘repository’ (directory)
  • Optimized to compute line-based changes
  • Integrated in RStudio IDE, Visual Studio Code
  • De facto standard not just in the R world
  • Alternatives: mercurial, SVN, …

Stage & commit

gitGraph
   commit
   commit
   commit
   commit
   commit
  1. ‘Stage’ changes for inspection
    • lets you inspect proposed changes before locking them in
  2. Permanently ‘commit’ changes to git

\(\leadsto\) Chain of versions with incremental changes
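The two steps can be tried in a throwaway repository. This is a minimal sketch, not part of the original slides: the file names, the commit messages and the identity are made up for illustration, and `git init -b` assumes Git 2.28 or newer.

```shell
set -eu
repo=$(mktemp -d)                 # throwaway directory standing in for a repository
cd "$repo"
git init -q -b main               # assumes Git >= 2.28 for -b
git config user.name "Matt"       # hypothetical identity, local to this repo
git config user.email "matt@example.com"

echo 'add <- function(x, y) x + y' > add.R
git add add.R                     # 1. stage the change
git status --short                # inspect what is staged: "A  add.R"
git commit -q -m "Add add()"      # 2. commit the staged snapshot

echo 'sub <- function(x, y) x - y' > sub.R
git add sub.R
git commit -q -m "Add sub()"

git log --oneline                 # chain of versions, newest first
```

Nothing enters the history until `git commit` runs, so the staging step is the place to catch unintended changes.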

Line-based differences - the ‘diff’

  • Changes in git are line-based
  • Additions (green) & deletions (red) between commits
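A line-based diff can be reproduced on the command line. The sketch below is an illustration under assumptions (hypothetical file `script.R`, made-up identity, Git 2.28 or newer for `git init -b`); in a terminal, the removed line is typically shown in red and the added line in green.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q -b main               # assumes Git >= 2.28 for -b
git config user.name "Matt"       # hypothetical identity
git config user.email "matt@example.com"

printf 'x <- 1\ny <- 2\n' > script.R
git add script.R
git commit -q -m "Initial version"

printf 'x <- 1\ny <- 3\n' > script.R   # edit the second line only
git diff                          # working copy vs. last commit
git commit -q -am "Change y"

git diff HEAD~1 HEAD              # difference between the last two commits
```

Only the second line is reported as changed: one deletion (`-y <- 2`) and one addition (`+y <- 3`).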

Going back in time

  • Every commit has unique hash value
  • Can ‘checkout’ old commit (browse history)
git checkout [commit hash to browse]
  • Can ‘reset’ changes
git reset --hard [commit hash to reset to]
  • Removes need for my-file_final_v2_2019.R
  • Time travelling has its dangers…
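The commands above can be tried safely in a scratch repository. This is a sketch with assumed names (`notes.txt`, the identity and the commit messages are hypothetical; `git init -b` assumes Git 2.28 or newer). Note that `git reset --hard` also throws away uncommitted changes to tracked files, which is one reason to use it with care.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q -b main               # assumes Git >= 2.28 for -b
git config user.name "Matt"       # hypothetical identity
git config user.email "matt@example.com"

echo "v1" > notes.txt
git add notes.txt
git commit -q -m "Version 1"
first=$(git rev-parse HEAD)       # hash of the first commit

echo "v2" > notes.txt
git commit -q -am "Version 2"

git checkout -q "$first"          # browse the old snapshot (detached HEAD)
cat notes.txt                     # prints: v1
git checkout -q main              # return to the newest commit
cat notes.txt                     # prints: v2

git reset -q --hard "$first"      # move main back; "Version 2" leaves the branch
git log --oneline                 # only "Version 1" remains
```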

Branching

gitGraph
   commit
   commit
   branch feature
   checkout feature
   commit
   commit
   checkout main
   commit
  • Variations of repository: ‘branches’
git checkout -b [my new branch]
  • Quick switching between branches
git checkout [branch name]
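A short sketch of creating a branch and switching back and forth. The branch name `feature`, the file name and the identity are illustrative assumptions, and `git init -b` assumes Git 2.28 or newer.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q -b main               # assumes Git >= 2.28 for -b
git config user.name "Matt"       # hypothetical identity
git config user.email "matt@example.com"

echo "base" > file.txt
git add file.txt
git commit -q -m "Base"

git checkout -q -b feature        # create branch 'feature' and switch to it
echo "feature work" >> file.txt
git commit -q -am "Work on feature"

git checkout -q main              # switch back: working copy follows the branch
cat file.txt                      # prints: base
git branch --list                 # lists feature and main, current one marked with *
```

Switching branches rewrites the files in the working directory, so the feature work is not visible on `main` until the branches are merged.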

‘Merging’ two branches

gitGraph
   commit
   commit
   branch feature
   checkout feature
   commit
   commit
   checkout main
   commit
   merge feature
  • Consolidate diverged ‘branches’
  • Usually merged automatically
  • Conflicting changes need a manual decision
    • Line edited in both source and target branch: which version to keep?
  • Resolving merge conflicts is beyond today’s scope
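Both cases can be reproduced in a scratch repository. This is a sketch under assumptions: the file names, the branch names and the identity are made up, and `git init -b` assumes Git 2.28 or newer. The conflict is only provoked and then aborted, not resolved.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q -b main               # assumes Git >= 2.28 for -b
git config user.name "Matt"       # hypothetical identity
git config user.email "matt@example.com"

printf 'line 1\n' > a.txt
printf 'line 1\n' > b.txt
git add a.txt b.txt
git commit -q -m "Base"

# Diverge: 'feature' edits a.txt, 'main' edits b.txt
git checkout -q -b feature
printf 'line 1 (feature)\n' > a.txt
git commit -q -am "Feature edits a.txt"
git checkout -q main
printf 'line 1 (main)\n' > b.txt
git commit -q -am "Main edits b.txt"

# Different files were touched, so git merges without help
git merge -q --no-edit feature
git log --oneline --graph

# Same line edited on both sides: git stops and asks which version to keep
git checkout -q -b feature-2
printf 'line 1 (feature-2)\n' > a.txt
git commit -q -am "Feature-2 edits a.txt"
git checkout -q main
printf 'line 1 (main again)\n' > a.txt
git commit -q -am "Main edits a.txt"
if ! git merge --no-edit feature-2; then
  git status --short              # "UU a.txt" marks the conflicted file
  git merge --abort               # back out; resolving is out of scope here
fi
```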

Example of ‘gitflow’

gitGraph
   commit tag: "v0.0.1"
   commit
   branch feature-1
   checkout feature-1
   commit
   commit
   checkout main
   branch feature-2
   checkout feature-2
   commit
   checkout feature-1
   commit
   checkout main
   commit tag: "bugfix"
   merge feature-1 tag: "v0.1.0"
   checkout feature-2
   commit
  • ‘gitflow’: specific workflow for git repositories
  • features developed on branches, then merged into ‘main’

Version Control & Collaboration

  • git itself is command line tool for version control
  • git platforms add UI for collaboration
  • git + GitHub
    • VCS (git)
    • Web hosting of code (GitHub)
    • Organisation with issues, discussions (GitHub)
    • Automation of checks/tests (GitHub)

git platforms

GitHub.com

  • Huge number of R packages developed there
  • 100 million developers on GitHub.com (Jan ’23)
  • 372 million repositories, 28 million public (Jan ’23)
  • ‘Facebook’ of developers / social coding
  • Discuss problems / propose changes

Branches & pull requests

  • Branches are a git concept
  • Git platforms add concept of ‘pull request’ (PR)
    • PR is ‘suggested merge’ from branch A to B
    • Usually from ‘feature A’ to ‘main’
  • Allows previewing problems and discussing changes before the merge
  • Once everyone is happy, a pull request can be merged
  • Every PR has an associated branch, but not every branch has a PR
  • More in the demo!

Automating things with GitHub

  • GitHub provides ‘GitHub Actions’
  • Allows task automation, e.g.
    • run unit tests
    • build & host documentation
    • static code analysis (linting)
  • Most important actions for R: github.com/r-lib/actions
  • Extremely useful to enforce best-practices & quality
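As an illustration, automation on GitHub is configured through workflow files under `.github/workflows/` in the repository. The sketch below writes a trimmed-down check workflow modelled on the examples in github.com/r-lib/actions. The file name, the trigger branches and the pinned version tags (`@v2`, `@v4`) are assumptions and may need adjusting for a real package.

```shell
set -eu
pkg=$(mktemp -d)                  # stand-in for the root of an R package repository
mkdir -p "$pkg/.github/workflows"

# Quoted 'EOF' keeps ${{ ... }} literal so the shell does not expand it
cat > "$pkg/.github/workflows/R-CMD-check.yaml" <<'EOF'
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

name: R-CMD-check

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest
    env:
      GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
          needs: check
      - uses: r-lib/actions/check-r-package@v2
EOF

cat "$pkg/.github/workflows/R-CMD-check.yaml"
```

Once such a file is committed and pushed, GitHub runs the listed steps on every push to `main` and on every pull request targeting it.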

A typical GitHub workflow

sequenceDiagram
    participant M as Matt
    participant GH as GitHub server
    participant Y as Ya
    M->>M: make change locally & commit to <feature>
    M->>GH: push commit
    M->>GH: open pull request
    GH->>GH: run automated checks
    M->>Y: request review
    Y->>Y: review code
    Y->>M: request changes
    M->>M: implement changes locally & commit
    M->>GH: push commit
    GH->>GH: run automated checks
    M->>Y: request review
    Y->>Y: review code
    Y->>GH: approve changes, unblocking merge
    M->>GH: merge <feature> into <main>
    GH->>GH: run automated checks on <main>
    Y->>GH: pull newest version of <main>
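The git side of this sequence can be rehearsed locally. In the sketch below a bare repository stands in for the GitHub server, and the pull request, automated checks and review are only marked as comments because they happen on the platform, not in git. All paths, file names and identities are hypothetical, and `git init -b` assumes Git 2.28 or newer.

```shell
set -eu
work=$(mktemp -d)
git init -q --bare -b main "$work/server.git"   # local stand-in for GitHub

# Matt: create the repository and publish main
git init -q -b main "$work/matt"
cd "$work/matt"
git config user.name "Matt"                     # hypothetical identity
git config user.email "matt@example.com"
echo "# demo" > README.md
git add README.md
git commit -q -m "Initial commit"
git remote add origin "$work/server.git"
git push -q -u origin main

# Ya: get a copy of the repository
git clone -q "$work/server.git" "$work/ya"

# Matt: make change locally, commit to a feature branch, push
git checkout -q -b feature
echo 'add <- function(x, y) x + y' > add.R
git add add.R
git commit -q -m "Add add()"
git push -q -u origin feature
# On GitHub: open pull request, automated checks run, Ya reviews and approves

# Merge feature into main (on GitHub: the merge button of the pull request)
git checkout -q main
git merge -q --no-ff --no-edit feature
git push -q origin main

# Ya: pull newest version of main
cd "$work/ya"
git pull -q --ff-only origin main
git log --oneline --graph
```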

Looks awfully complicated, why?

  • Efficient collaboration with novice/untrusted contributors
    • Maintainer: automated checks reduce review burden
    • Contributor: no need to check manually
  • Branching promotes asynchronous work on features
  • Full history - can always go back

\(\leadsto\) making code-collaboration scalable

Demo

Practical - collaboration on GitHub

  • Work in teams of ~ 3 or 4
  • Go to https://github.com/kkmann/simulatr and read through the instructions in the README.md file
  • The repository is a template to practice collaboration on GitHub
  • Only one member per team needs to use the template and invite the others as collaborators!
  • Take some time to check out the README.md file and set up your environment in Posit Cloud
  • Can you fix the errors with some pull requests?
  • The purpose of this exercise is to explore the collaboration functionality of GitHub - not to produce a perfect package ;)
