5 Version Control & Collaboration

openstatsware Workshop: Good Software Engineering Practice for R Packages

Daniel

October 17, 2023

Disclaimer




Any opinions expressed in this presentation and on the following slides are solely those of the presenter and not necessarily those of their respective employer or company.

  • Overview, demo, practical
  • Can only scratch surface
  • More resources on website

Trade-offs in code development


Working alone

  • no coordination overhead
  • no review
  • lack of diversity
  • can slack on documentation
  • fragile long-term maintenance

Working in a team

  • coordination overhead
  • mutual review of code
  • different approaches
  • forced to document
  • more robust long-term maintenance

Key issue:
Manage complexity over time or between people

Version control systems (VCS)

  • Manage different versions of a piece of work
  • Compare and merge diverged versions effectively
flowchart LR
  A[Daniel v1] --> B[Daniel v2]
  B --> C[Daniel v3]
  B --> D[Doug v1]
  D --> E[Doug + Daniel v4]
  C --> E
  • Code is complex system \(\leadsto\) ideal application of VCS
  • Compounded by multiple people ‘fiddling’ with it!

git basics

Enter git, the ‘Latin of data science’

  • Created by Linus Torvalds for work on the Linux kernel
  • Essentially a database with snapshots of a monitored ‘repository’ (directory)
  • Optimized to compute line-based changes
  • Integrated in RStudio IDE, Visual Studio Code
  • De facto standard not just in the R world
  • Alternatives: Mercurial, SVN, …

Stage & commit

gitGraph
   commit
   commit
   commit
   commit
   commit
  1. ‘Stage’ changes for inspection
    • lets you inspect proposed changes before locking them in
  2. Permanently ‘commit’ changes to git

\(\leadsto\) Chain of versions with incremental changes
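
The two steps can be sketched in a throwaway repository. This is a minimal illustration, not part of the workshop material: the temporary directory, the file name `analysis.R`, the commit message and the placeholder identity are all assumptions.

```shell
set -eu
export GIT_PAGER=cat                      # print directly instead of opening a pager

# Throwaway repository in a temporary directory (hypothetical example setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity, local to this repo
git config user.email "user@example.com"

# Make a change in the working directory
echo 'x <- 1' > analysis.R

# 1. Stage the change and inspect what would be committed
git add analysis.R
git status --short                        # the staged new file is marked with 'A'
git diff --staged                         # content of the staged change

# 2. Commit the staged snapshot
git commit -q -m "Add analysis script"
git log --oneline                         # one line per commit in the chain
```

Repeating the edit, stage, commit cycle adds one link to the chain of versions each time.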

Line-based differences - the ‘diff’

  • Changes in git are line-based
  • Additions (green) & deletions (red) between commits
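
A small sketch of what a line-based diff between two commits looks like on the command line, where additions are prefixed with `+` and deletions with `-` (colours only appear in a terminal or IDE). File name, contents and identity are made up for illustration.

```shell
set -eu

# Throwaway repository (hypothetical example setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# First commit: a file with two lines
printf 'x <- 1\ny <- 2\n' > analysis.R
git add analysis.R
git commit -q -m "Add analysis script"

# Second commit: change only the second line
printf 'x <- 1\ny <- 3\n' > analysis.R
git commit -q -a -m "Change y"

# Compare the previous commit with the current one.
# The changed line shows up as one deletion ('-') and one addition ('+');
# the untouched first line is only shown as context.
changes=$(git diff HEAD~1 HEAD)
echo "$changes"
```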

Going back in time

  • Every commit has unique hash value
  • Can ‘checkout’ old commit (browse history)
git checkout [commit hash to browse]
  • Can ‘reset’ changes
git reset --hard [commit hash to reset to]
  • Removes need for my-file_final_v2_2019.R
  • Time travelling has its dangers…
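
The two commands above can be tried safely in a throwaway repository. A minimal sketch, assuming a made-up file `notes.txt` and a branch named `main`; the hash is captured with `git rev-parse` instead of being typed by hand.

```shell
set -eu

# Throwaway repository (hypothetical example setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the initial branch 'main'
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'version 1' > notes.txt
git add notes.txt
git commit -q -m "First version"
first=$(git rev-parse HEAD)               # hash of the first commit

echo 'version 2' > notes.txt
git commit -q -a -m "Second version"

# Browse the old state: files on disk now match the first commit
git checkout -q "$first"
browsed=$(cat notes.txt)

# Return to the tip of the branch
git checkout -q main

# Reset the branch to the first commit. Use with care:
# the second commit is no longer part of 'main' afterwards.
git reset -q --hard "$first"
after_reset=$(cat notes.txt)
echo "browsed: $browsed / after reset: $after_reset"
```

Checking out an old commit only looks at history, while `git reset --hard` moves the branch itself and overwrites the files in the working directory.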

Branching

gitGraph
   commit
   commit
   branch feature
   checkout feature
   commit
   commit
   checkout main
   commit
  • Variations of repository: ‘branches’
git checkout -b [my new branch]
  • Quick switching between branches
git checkout [branch name]
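
A short sketch of creating a branch and switching back and forth. The branch name `feature`, the file `file.txt` and the identity are assumptions for illustration only.

```shell
set -eu

# Throwaway repository (hypothetical example setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the initial branch 'main'
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'base' > file.txt
git add file.txt
git commit -q -m "Base version"

# Create a new branch and switch to it in one step
git checkout -q -b feature
echo 'feature work' >> file.txt
git commit -q -a -m "Work on feature"

# Switch back: the working directory shows the state of 'main' again
git checkout -q main
on_main=$(cat file.txt)
echo "$on_main"

git branch                                # lists both branches, '*' marks the current one
```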

‘Merging’ two branches

gitGraph
   commit
   commit
   branch feature
   checkout feature
   commit
   commit
   checkout main
   commit
   merge feature
  • Consolidate diverged ‘branches’
  • Usually merged automatically
  • Conflicting changes need a manual decision
    • Same line edited in both source and target branch: keep which?
  • Resolving merge conflicts beyond today’s scope
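
A sketch of the conflict-free case: two branches diverge by touching different files and git combines them without manual input. File names, branch names and the commit messages are made up for illustration.

```shell
set -eu
export GIT_PAGER=cat                      # print directly instead of opening a pager

# Throwaway repository (hypothetical example setup)
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main     # name the initial branch 'main'
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo 'a' > a.txt
git add a.txt
git commit -q -m "Add a.txt"

# Work on the feature branch
git checkout -q -b feature
echo 'b' > b.txt
git add b.txt
git commit -q -m "Add b.txt on feature"

# Meanwhile, 'main' moves on as well: the branches have diverged
git checkout -q main
echo 'c' > c.txt
git add c.txt
git commit -q -m "Add c.txt on main"

# Consolidate: no overlapping edits, so git creates the merge commit on its own
git merge -q -m "Merge feature into main" feature
git log --oneline --graph                 # history with both lines of work joined
```

Had both branches edited the same line of the same file, `git merge` would stop and ask for a decision instead.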

Example of ‘gitflow’

gitGraph
   commit tag: "v0.0.1"
   commit
   branch feature-1
   checkout feature-1
   commit
   commit
   checkout main
   branch feature-2
   checkout feature-2
   commit
   checkout feature-1
   commit
   checkout main
   commit tag: "bugfix"
   merge feature-1 tag: "v0.1.0"
   checkout feature-2
   commit
  • ‘gitflow’: specific workflow for git repositories
  • features developed on branches, then merged into ‘main’

Version Control & Collaboration

  • git itself is command line tool for version control
  • git platforms add UI for collaboration
  • git + GitHub
    • VCS (git)
    • Web hosting of code (GitHub)
    • Organisation with issues, discussions (GitHub)
    • Automation of checks/tests (GitHub)

git platforms

GitHub.com

  • Huge number of R packages developed there
  • 100 million developers on GitHub.com (Jan ’23), see GitHub blog
  • 372 million repositories, 28 million public (Jan ’23)
  • ‘Facebook’ of developers / social coding
  • Discuss problems / propose changes

Branches & pull requests

  • Branches are a git concept
  • Git platforms add concept of ‘pull request’ (PR)
    • PR is ‘suggested merge’ from branch A to B
    • Usually from ‘feature A’ to ‘main’
  • Allows previewing problems and discussing changes before the merge
  • Once everyone is happy, a pull request can be merged
  • Every PR has an associated branch, but not every branch has a PR
  • More in the demo!

Automating things with GitHub

  • GitHub provides ‘GitHub Actions’
  • Allows task automation, e.g.
    • run unit tests
    • build & host documentation
    • static code analysis (linting)
  • Most important actions for R: github.com/r-lib/actions
  • Extremely useful to enforce best practices & quality

A typical GitHub workflow

sequenceDiagram
    participant A as Daniel
    participant GH as GitHub server
    participant B as Doug
    A->>A: make change locally & commit to <feature>
    A->>GH: push commit
    A->>GH: open pull request
    GH->>GH: run automated checks
    A->>B: request review
    B->>B: review code
    B->>A: request changes
    A->>A: implement changes locally & commit
    A->>GH: push commit
    GH->>GH: run automated checks
    A->>B: request review
    B->>B: review code
    B->>GH: approve changes, unblocking merge
    A->>GH: merge <feature> into <main>
    GH->>GH: run automated checks on <main>
    B->>GH: pull newest version of <main>
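
The git side of this sequence can be rehearsed offline. The sketch below makes strong simplifying assumptions: a local bare repository stands in for the GitHub server, and the pull request, the automated checks and the review only appear as comments because they happen on the platform, not in git. All paths, file names and identities are hypothetical.

```shell
set -eu
work=$(mktemp -d)

# Bare repository standing in for the GitHub server (assumption for an offline demo)
git init -q --bare "$work/server.git"
git --git-dir="$work/server.git" symbolic-ref HEAD refs/heads/main

# Daniel: local repository connected to the 'server'
git init -q "$work/daniel"
cd "$work/daniel"
git symbolic-ref HEAD refs/heads/main
git config user.name "Daniel"             # placeholder identity
git config user.email "daniel@example.com"
git remote add origin "$work/server.git"
echo 'f <- function(x) x' > f.R
git add f.R
git commit -q -m "Initial commit"
git push -q -u origin main

# Doug: gets his own copy of the repository
git clone -q "$work/server.git" "$work/doug"

# Daniel: make a change locally, commit to <feature>, push
git checkout -q -b feature
echo 'g <- function(x) x + 1' > g.R
git add g.R
git commit -q -m "Add g()"
git push -q -u origin feature

# On GitHub (not git): open pull request, automated checks run,
# Doug reviews, requests changes or approves.

# Merge <feature> into <main>; done locally here as a stand-in
# for the merge button of the pull request
git checkout -q main
git merge -q --no-ff -m "Merge feature into main" feature
git push -q origin main

# Doug: pull the newest version of <main>
git -C "$work/doug" pull -q --ff-only origin main
ls "$work/doug"
```

On GitHub the merge is normally triggered from the pull request page once checks and review have passed, so the local `git merge` above is only a substitute for that step.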

Looks awfully complicated, why?

  • Efficient collaboration with novice/untrusted contributors
    • Maintainer: automated checks reduce review burden
    • Contributor: no need to check manually
  • Branching promotes asynchronous work on features
  • Full history - can always go back

\(\leadsto\) making code-collaboration scalable

Demo

Practical - collaboration on GitHub

  • Work in teams of ~ 3 or 4
  • Go to https://github.com/kkmann/simulatr and read through the instructions in the README.md file
  • The repository is a template to practice collaboration on GitHub
  • Only one member per team needs to use the template and invite the others as collaborators!
  • Take some time to check out the README.md file and set up your environment in Posit Cloud
  • Can you fix the errors with some pull requests?
  • The purpose of this exercise is to explore the collaboration functionality of GitHub - not to produce a perfect package ;)

License information