5 Version Control & Collaboration

BBS Course: Good Software Engineering Practice for R Packages


March 24, 2023


Any opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of their respective employer or company.

  • Overview, demo, practical
  • Can only scratch surface
  • More resources on website

Trade-offs in code development

Working alone

  • no coordination overhead
  • no review
  • lack of diversity
  • can slack on documentation
  • fragile long-term maintenance

Working in a team

  • coordination overhead
  • mutual review of code
  • different approaches
  • forced to document
  • more robust long-term maintenance

Key issue:
Manage complexity over time or between people

Version control systems (VCS)

  • Manage different versions of a piece of work
  • Compare and merge diverged versions effectively
flowchart LR
  A[Shuang v1] --> B[Shuang v2]
  B --> C[Shuang v3]
  B --> D[Joe v1]
  D --> E[Joe + Shuang v4]
  C --> E
  • Code is a complex system \(\leadsto\) ideal application of VCS
  • Compounded by multiple people ‘fiddling’ with it!

git basics

Enter git, the ‘Latin of data science’

  • Author Linus Torvalds, for work on Linux kernel
  • Essentially a database with snapshots of a monitored ‘repository’ (directory)
  • Optimized to compute line-based changes
  • Integrated in RStudio IDE, Visual Studio Code
  • De facto standard not just in the R world
  • Alternatives: mercurial, SVN, …

Stage & commit

  1. ‘Stage’ changes for inspection
    • lets you inspect proposed changes before locking them in
  2. Permanently ‘commit’ changes to git

\(\leadsto\) Chain of versions with incremental changes
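The two steps can be sketched on the command line as follows. This is a minimal illustration, assuming git is installed; the throwaway directory, the file names and the author identity are hypothetical and only serve the example.

```shell
set -eu
repo="$(mktemp -d)"                      # hypothetical throwaway directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the initial branch 'main' regardless of local defaults
git config user.name "Shuang"            # hypothetical identity, local to this repository
git config user.email "shuang@example.com"
git config commit.gpgsign false

# 1. stage a change and inspect it before locking it in
echo 'add <- function(x, y) x + y' > add.R
git add add.R
git status --short                       # lists what is staged
git --no-pager diff --staged             # shows the staged change line by line

# 2. commit the staged change
git commit -q -m "Add add() function"

# a second stage & commit extends the chain of versions
echo 'sub <- function(x, y) x - y' > sub.R
git add sub.R
git commit -q -m "Add sub() function"

git --no-pager log --oneline             # one line per commit, newest first
```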

Line-based differences - the ‘diff’

  • Changes in git are line-based
  • Additions (green) & deletions (red) between commits
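A small sketch of what such a diff looks like in the terminal, assuming git is installed; the file `notes.txt` and its contents are made up for illustration. Deleted lines are prefixed with `-`, added lines with `+`.

```shell
set -eu
repo="$(mktemp -d)"                      # hypothetical throwaway directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Shuang"            # hypothetical identity
git config user.email "shuang@example.com"
git config commit.gpgsign false

printf 'line one\nline two\nline three\n' > notes.txt
git add notes.txt
git commit -q -m "Add notes"

# change one line and append another, then commit again
printf 'line one\nline 2\nline three\nline four\n' > notes.txt
git commit -q -am "Edit notes"

# compare the previous commit (HEAD~1) with the current one (HEAD)
git --no-pager diff HEAD~1 HEAD -- notes.txt
```

An edited line shows up as one deletion plus one addition, because git compares whole lines.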

Going back in time

  • Every commit has unique hash value
  • Can ‘checkout’ old commit (browse history)
git checkout [commit hash to browse]
  • Can ‘reset’ changes
git reset --hard [commit hash to reset to]
  • Removes need for my-file_final_v2_2019.R
  • Time travelling has its dangers…
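The difference between browsing and resetting can be tried out safely in a scratch repository. The sketch below assumes git is installed; file name, contents and identity are hypothetical. Note that `git reset --hard` discards work, which is one of the dangers mentioned above.

```shell
set -eu
repo="$(mktemp -d)"                      # hypothetical throwaway directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Shuang"            # hypothetical identity
git config user.email "shuang@example.com"
git config commit.gpgsign false

echo "v1" > file.txt; git add file.txt; git commit -q -m "v1"
echo "v2" > file.txt; git commit -q -am "v2"
echo "v3" > file.txt; git commit -q -am "v3"

first="$(git rev-list --max-parents=0 HEAD)"   # hash of the very first commit

git checkout -q "$first"                 # browse history: files now look as in the first commit
cat file.txt
git checkout -q main                     # return to the newest commit, nothing is lost

git reset -q --hard HEAD~1               # move 'main' back by one commit and discard the changes of "v3"
cat file.txt
```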


Branches

   branch feature
   checkout feature
   checkout main
  • Variations of repository: ‘branches’
git checkout -b [my new branch]
  • Quick switching between branches
git checkout [branch name]
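A minimal sketch of creating a branch and switching back and forth, assuming git is installed; the branch name `feature`, the file and the identity are hypothetical. Work committed on one branch does not show up on the other until the branches are merged.

```shell
set -eu
repo="$(mktemp -d)"                      # hypothetical throwaway directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Shuang"            # hypothetical identity
git config user.email "shuang@example.com"
git config commit.gpgsign false

echo "base" > file.txt
git add file.txt
git commit -q -m "Initial commit"

git checkout -q -b feature               # create the branch 'feature' and switch to it
echo "feature work" >> file.txt
git commit -q -am "Work on feature"

git checkout -q main                     # switch back: file.txt has its 'main' content again
cat file.txt
git --no-pager branch --list             # the current branch is marked with '*'
```

Newer git versions also offer `git switch` (and `git switch -c` to create a branch) for the same purpose.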

‘Merging’ two branches

   branch feature
   checkout feature
   checkout main
   merge feature
  • Consolidate diverged ‘branches’
  • Usually merged automergically
  • Conflicting changes need manual resolution
  • Same line edited in both source and target branch - which version to keep?
  • Resolving merge conflicts beyond today’s scope
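The automatic case can be sketched as below, assuming git is installed; file names, branch name and identity are hypothetical. The two branches touch different files, so git can combine them without asking.

```shell
set -eu
repo="$(mktemp -d)"                      # hypothetical throwaway directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Shuang"            # hypothetical identity
git config user.email "shuang@example.com"
git config commit.gpgsign false

printf 'a\n' > a.txt
git add a.txt
git commit -q -m "Add a.txt"

git checkout -q -b feature               # diverge: new file on 'feature'
printf 'b\n' > b.txt
git add b.txt
git commit -q -m "Add b.txt on feature"

git checkout -q main                     # meanwhile, a different change on 'main'
printf 'a\na2\n' > a.txt
git commit -q -am "Extend a.txt on main"

git merge -q --no-edit feature           # no overlapping lines, so the merge is automatic
git --no-pager log --oneline --graph     # the merge commit joins both lines of history
```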

Example of ‘gitflow’

   commit tag: "v0.0.1"
   branch feature-1
   checkout feature-1
   checkout main
   branch feature-2
   checkout feature-2
   checkout feature-1
   checkout main
   commit tag: "bugfix"
   merge feature-1 tag: "v0.1.0"
   checkout feature-2
  • ‘gitflow’: specific workflow for git repositories
  • features developed on branches, then merged into ‘main’
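The diagram can be replayed roughly as follows. This is only a sketch of the pattern, assuming git is installed; the file names and commit messages are hypothetical, while branch and tag names follow the diagram.

```shell
set -eu
repo="$(mktemp -d)"                      # hypothetical throwaway directory
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Shuang"            # hypothetical identity
git config user.email "shuang@example.com"
git config commit.gpgsign false

echo "core" > core.R
git add core.R
git commit -q -m "Initial release"
git tag v0.0.1                           # mark the released state

git checkout -q -b feature-1             # first feature, developed on its own branch
echo "f1" > feature1.R
git add feature1.R
git commit -q -m "Start feature 1"

git checkout -q main
git checkout -q -b feature-2             # second feature, independent of the first
echo "f2" > feature2.R
git add feature2.R
git commit -q -m "Start feature 2"

git checkout -q main
echo "core fixed" > core.R
git commit -q -am "Bugfix"               # small fix directly on 'main'

git merge -q --no-edit feature-1         # feature 1 is finished and goes into 'main'
git tag v0.1.0

git --no-pager log --oneline --graph --all   # feature-2 is still unmerged
```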

Version Control & Collaboration

  • git itself is command line tool for version control
  • git platforms add UI for collaboration
  • git + GitHub
    • VCS (git)
    • Web hosting of code (GitHub)
    • Organisation with issues, discussions (GitHub)
    • Automation of checks/tests (GitHub)

git platforms


  • GitHub.com hosts a huge number of R packages
  • 100 million developers on GitHub.com (Jan ’23)
  • 372 million repositories, 28 million public (Jan ’23)
  • ‘Facebook’ of developers / social coding
  • Discuss problems / propose changes

Branches & pull requests

  • Branches are a git concept
  • Git platforms add concept of ‘pull request’ (PR)
    • PR is ‘suggested merge’ from branch A to B
    • Usually from ‘feature A’ to ‘main’
  • PRs let you preview problems before merging and discuss changes
  • Once everyone is happy, a pull request can be merged
  • Every PR has an associated branch, but not every branch has a PR
  • More in the demo!

Automating things with GitHub

  • GitHub Actions allows task automation, e.g.
    • run unit tests
    • build & host documentation
    • static code analysis (linting)
  • Most important actions for R: github.com/r-lib/actions
  • Extremely useful to enforce best-practices & quality
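As an illustration, a check workflow for an R package could be set up along these lines. This is a sketch under assumptions, not a definitive configuration: the file name `R-CMD-check.yaml`, the trigger branches and the use of the `v2` tags of r-lib/actions are choices made for this example, and the package directory is a hypothetical stand-in.

```shell
set -eu
pkg="$(mktemp -d)"                       # hypothetical stand-in for the root of an R package repository
mkdir -p "$pkg/.github/workflows"        # GitHub looks for workflow files in this directory

# Write a workflow that runs R CMD check on pushes and pull requests to 'main'
cat > "$pkg/.github/workflows/R-CMD-check.yaml" <<'EOF'
name: R-CMD-check

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  R-CMD-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
      - uses: r-lib/actions/setup-r-dependencies@v2
        with:
          extra-packages: any::rcmdcheck
          needs: check
      - uses: r-lib/actions/check-r-package@v2
EOF

cat "$pkg/.github/workflows/R-CMD-check.yaml"
```

Once such a file is committed and pushed, the checks run on GitHub's servers for every push and pull request that matches the triggers.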

A typical GitHub workflow

    participant S as Shuang
    participant GH as GitHub server
    participant J as Joe
    S->>S: make change locally & commit to <feature>
    S->>GH: push commit
    S->>GH: open pull request
    GH->>GH: run automated checks
    S->>J: request review
    J->>J: review code
    J->>S: request changes
    S->>S: implement changes locally & commit
    S->>GH: push commit
    GH->>GH: run automated checks
    S->>J: request review
    J->>J: review code
    J->>GH: approve changes, unblocking merge
    S->>GH: merge <feature> into <main>
    GH->>GH: run automated checks on <main>
    J->>GH: pull newest version of <main>
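The git side of this sequence can be rehearsed locally. In the sketch below a bare repository in a temporary directory stands in for the GitHub server, which is an assumption made so that the example runs offline; pull requests, automated checks and reviews happen on GitHub itself and appear only as comments. All paths, file names and identities are hypothetical.

```shell
set -eu
work="$(mktemp -d)"                                  # hypothetical scratch area
git init -q --bare "$work/server.git"                # stand-in for the GitHub server
git -C "$work/server.git" symbolic-ref HEAD refs/heads/main

# Shuang sets up her local repository and publishes 'main'
git init -q "$work/shuang"
cd "$work/shuang"
git symbolic-ref HEAD refs/heads/main
git config user.name "Shuang"
git config user.email "shuang@example.com"
git config commit.gpgsign false
git remote add origin "$work/server.git"
echo "base" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q -u origin main

# Joe clones the repository
git clone -q "$work/server.git" "$work/joe"

# Shuang: make change locally & commit to <feature>, then push
git checkout -q -b feature
echo "new feature" > feature.R
git add feature.R
git commit -q -m "Add feature"
git push -q -u origin feature
# On GitHub: open pull request, automated checks run, Joe reviews and approves

# Shuang: merge <feature> into <main> (done here locally, on GitHub via the PR page)
git checkout -q main
git merge -q --no-ff --no-edit feature
git push -q origin main

# Joe: pull newest version of <main>
cd "$work/joe"
git pull -q origin main
ls
```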

Looks awfully complicated, why?

  • Efficient collaboration with novice/untrusted contributors
    • Maintainer: automated checks reduce review burden
    • Contributor: no need to check manually
  • Branching promotes asynchronous work on features
  • Full history - can always go back

\(\leadsto\) making code-collaboration scalable


Practical - collaboration on GitHub

  • Work in teams of ~ 3 or 4
  • Go to https://github.com/kkmann/simulatr and read through the instructions in the README.md file
  • The repository is a template to practice collaboration on GitHub
  • Only one member per team needs to use the template and invite the others as collaborators!
  • Take some time to check out the README.md file and set up your environment in Posit Cloud
  • Can you fix the errors with some pull requests?
  • The purpose of this exercise is to explore the collaboration functionality of GitHub - not to produce a perfect package ;)
