Cost distribution among software process activities
openstatsware
Workshop: Good Software Engineering Practice for R Packages
October 16, 2023
From an idea to a production-grade R package
Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.
The idea of creating an R package was born because you understood that copying and pasting R scripts is inefficient, and on top of that, you want to share your helpful R functions with colleagues and the world…
Bad practice!
Why?
Cost distribution among software process activities
Origin of errors in system development
Invest time in
… but in many cases the workflow must be workable for a single developer or a small team.
Let’s assume that you used some lines of code to create simulated data in multiple projects:
Idea: put the code into a package
Obligation level | Key word | Description |
---|---|---|
Duty | must, shall | “must have” |
Desire | should | “nice to have” |
Intention | may | “optional” |
Purpose and Scope
The R package simulatr shall enable the creation of reproducible fake data.
Package Requirements
simulatr shall provide a function to generate normally distributed random data for two independent groups. The function must allow flexible definition of sample size, mean, and standard deviation per group. The reproducibility of the simulated data must be ensured via an optional seed. It should be possible to print the function result. The package may also facilitate graphical presentation of the simulated data.
Useful formats / tools for design docs:
UML Diagram
R package programming
One-off script as starting point:
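As a rough illustration, such a one-off script might look like the sketch below. The variable names and the hard-coded values are assumptions, chosen to match the parameters used later in this section.

```r
# One-off script: all settings are hard-coded and every object
# lives in the global environment.
n1 <- 50
n2 <- 50

group <- c(rep(1, n1), rep(2, n2))
values <- c(
    rnorm(n = n1, mean = 5, sd = 3),
    rnorm(n = n2, mean = 7, sd = 4)
)

simulatedData <- data.frame(group = group, values = values)
head(simulatedData)
```

Reusing this in another project means copying the whole script and editing the numbers by hand.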
Refactored script:
Almost all functions, arguments, and objects should be self-explanatory due to their names.
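A refactored version could wrap the simulation in a single function with descriptive argument names. The sketch below is illustrative rather than the package's definitive code. The optional `seed` argument follows the package requirements, and its default of `NULL` is an assumption.

```r
# Sketch of a refactored simulation function (the body is illustrative).
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2, seed = NULL) {
    # Fix the random number generator state only if a seed is supplied
    if (!is.null(seed)) {
        set.seed(seed)
    }
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
}

# Two calls with the same seed produce identical data
a <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7,
    sd1 = 3, sd2 = 4, seed = 123)
b <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7,
    sd1 = 3, sd2 = 4, seed = 123)
identical(a, b)
```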
Define the result as a list and assign it a class:
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}
The output is impractical, e.g., we need to scroll down:
$n1
[1] 50
$n2
[1] 50
$mean1
[1] 5
$mean2
[1] 7
$sd1
[1] 3
$sd2
[1] 4
$data
group values
1 1 7.1141343
2 1 4.9315856
3 1 6.8486559
4 1 5.4232044
5 1 2.0878172
6 1 7.9193614
7 1 5.0556225
8 1 3.4090045
9 1 9.7855215
10 1 3.8303040
11 1 3.7302685
12 1 4.3866876
13 1 9.8628574
14 1 4.1734530
15 1 2.6518951
16 1 2.2552217
17 1 11.0663756
18 1 3.9420508
19 1 2.7530683
20 1 1.8982550
21 1 9.7836621
22 1 7.7106822
23 1 1.4399343
24 1 3.3501130
25 1 4.2583536
26 1 3.8472150
27 1 0.2564559
28 1 6.5326474
29 1 6.5880382
30 1 9.4028417
31 1 -0.9343407
32 1 1.9990465
33 1 4.0288828
34 1 7.4840710
35 1 0.1842904
36 1 5.9709388
37 1 7.6466984
38 1 9.3550466
39 1 2.0881370
40 1 -2.1755447
41 1 2.9977860
42 1 6.7294280
43 1 8.3300580
44 1 4.2954388
45 1 2.6732828
46 1 8.7928899
47 1 1.3333301
48 1 0.6293590
49 1 9.5145804
50 1 11.6575330
51 2 2.4621112
52 2 10.3836529
53 2 4.5212149
54 2 3.9959644
55 2 4.5571831
56 2 1.7348850
57 2 4.0642302
58 2 6.3167981
59 2 1.4109089
60 2 6.9872869
61 2 12.2999664
62 2 10.4135432
63 2 8.2688656
64 2 8.3836911
65 2 10.0665325
66 2 6.8688208
67 2 0.6668342
68 2 3.5497866
69 2 6.5461608
70 2 9.0016588
71 2 -0.6333340
72 2 5.9230671
73 2 9.8329140
74 2 6.6980313
75 2 2.8278715
76 2 9.7458649
77 2 15.4828345
78 2 7.0957971
79 2 1.5220685
80 2 1.9195722
81 2 9.9000047
82 2 1.4549744
83 2 6.6066112
84 2 8.0371548
85 2 10.1152841
86 2 12.3816661
87 2 4.7092046
88 2 12.5869406
89 2 11.9338095
90 2 9.2895885
91 2 4.9470729
92 2 2.4782172
93 2 5.0117435
94 2 12.4354892
95 2 7.2910415
96 2 2.7287502
97 2 3.4961628
98 2 9.4721105
99 2 8.7174763
100 2 6.7803903
attr(,"class")
[1] "SimulationResult"
Solution: implement the generic function print
Generic function print:
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#'     mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
$args
n1 n2 mean1 mean2 sd1 sd2
"50" "50" "5" "7" "3" "4"
$data
# A tibble: 100 × 2
group values
<dbl> <dbl>
1 1 7.11
2 1 4.93
3 1 6.85
4 1 5.42
5 1 2.09
6 1 7.92
7 1 5.06
8 1 3.41
9 1 9.79
10 1 3.83
# ℹ 90 more rows
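A print method that yields output along these lines could be sketched as follows. This is an assumption about one possible implementation, not necessarily the workshop's code. It collects the arguments into a named character vector and shows the data as a tibble when the tibble package is installed, falling back to the first ten rows otherwise.

```r
print.SimulationResult <- function(x, ...) {
    argNames <- c("n1", "n2", "mean1", "mean2", "sd1", "sd2")
    # Format each argument on its own so that no padding is added
    args <- vapply(unclass(x)[argNames], format, character(1))

    data <- x$data
    if (requireNamespace("tibble", quietly = TRUE)) {
        # Tibbles print only the first rows by default
        data <- tibble::as_tibble(data)
    } else {
        data <- utils::head(data, 10)
    }

    print(list(args = args, data = data), ...)
    # Return the object invisibly, as print methods conventionally do
    invisible(x)
}

# Build a small example object by hand and print it
set.seed(123)
x <- structure(
    list(n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4,
        data = data.frame(
            group = c(rep(1, 50), rep(2, 50)),
            values = c(rnorm(50, mean = 5, sd = 3), rnorm(50, mean = 7, sd = 4))
        )),
    class = "SimulationResult"
)
print(x)
```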
Add assertions to improve the usability and user experience
Tip on assertions
Use the package checkmate to validate input arguments.
Example:
Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
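A function that fails in this way could look like the sketch below. The body of `playWithAssertions` is an assumption. It uses `checkmate::assertInt` with a lower bound of 1, so the checkmate package must be installed when the function is called.

```r
playWithAssertions <- function(n1) {
    # Fail early with a readable message unless n1 is a single
    # whole number greater than or equal to 1
    checkmate::assertInt(n1, lower = 1)
    invisible(n1)
}

# Capture the error message instead of stopping the script
message <- tryCatch(
    playWithAssertions(-1),
    error = function(e) conditionMessage(e)
)
message
```

Checking arguments at the top of a function keeps the error close to its cause, rather than letting an invalid value fail deep inside `rnorm` or `rep`.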
Add three additional results:
Tip on creation time
Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
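The three calls return different kinds of objects, which matters when the creation time is stored in a result. A brief sketch, with hypothetical variable names:

```r
# Three ways to record when a result was created
createdAt <- Sys.time()                          # POSIXct date-time
createdLabel <- format(Sys.time(), "%B %d, %Y")  # character string; month name depends on the locale
createdOn <- Sys.Date()                          # Date without time of day

class(createdAt)
class(createdLabel)
class(createdOn)
```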
Add an additional result: t.test result
Add an optional alternative argument and pass it through to t.test:
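One way to pass the argument through is sketched below. The helper name `getTTestResult` is hypothetical, and the sketch assumes a data frame with the columns `group` and `values`, as produced by the simulation function.

```r
# Hypothetical helper: runs a two-sample t-test on simulated data
getTTestResult <- function(data, alternative = c("two.sided", "less", "greater")) {
    # Restrict the argument to the values that t.test accepts
    alternative <- match.arg(alternative)
    stats::t.test(values ~ group, data = data, alternative = alternative)
}

set.seed(123)
simulatedData <- data.frame(
    group = c(rep(1, 50), rep(2, 50)),
    values = c(rnorm(50, mean = 5, sd = 3), rnorm(50, mean = 7, sd = 4))
)

testResult <- getTTestResult(simulatedData, alternative = "less")
testResult$alternative
```

Using `match.arg` makes `"two.sided"` the default and rejects unsupported values before `t.test` is called.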
Implement the generic functions print and plot.
Tip on print
Use the plot example function from above and extend it.
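A plot method could start from something like the sketch below, which draws one box per group with base graphics. The choice of a box plot and the `main` argument are assumptions made for illustration.

```r
plot.SimulationResult <- function(x, ..., main = "Simulated values by group") {
    # One box per group; further arguments are passed on to boxplot()
    graphics::boxplot(values ~ group, data = x$data, main = main, ...)
    invisible(x)
}

# Build a small example object by hand
set.seed(123)
x <- structure(
    list(data = data.frame(
        group = c(rep(1, 50), rep(2, 50)),
        values = c(rnorm(50, mean = 5, sd = 3), rnorm(50, mean = 7, sd = 4))
    )),
    class = "SimulationResult"
)
```

Calling `plot(x)` then dispatches to `plot.SimulationResult` because of the class attribute.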
Optional extra tasks:
Implement the generic functions summary and cat
Implement the function kable known from the package knitr as generic. Tip: use
to define kable as generic
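One common base R pattern for this, sketched here as an assumption rather than as the workshop's solution, is a `UseMethod` generic with a default method that delegates to `knitr::kable`. The method for `SimulationResult` is a hypothetical design that renders only the simulated data.

```r
# Define kable as an S3 generic; dispatch is on the class of x
kable <- function(x, ...) UseMethod("kable")

# All other objects fall back to the knitr implementation
kable.default <- function(x, ...) knitr::kable(x, ...)

# Hypothetical method: render only the data of a SimulationResult
kable.SimulationResult <- function(x, ...) kable(x$data, ...)

# Build a small example object by hand
x <- structure(
    list(data = data.frame(group = c(1, 2), values = c(0.5, 1.5))),
    class = "SimulationResult"
)
```

With knitr installed, `kable(x)` dispatches to the `SimulationResult` method, which in turn reaches `knitr::kable` through the default method.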
Optional extra task:
Document your functions with Roxygen2