Good Software Engineering Practice for R Packages
July 20, 2023
From an idea to a production-grade R package
Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.
The idea of creating an R package is born: you understand that “copy and paste” of R scripts is inefficient and, on top of that, you want to share your helpful R functions with colleagues and the world…
Bad practice!
Why?
Cost distribution among software process activities
Origin of errors in system development
Invest time in
… but in many cases the workflow must be workable for a single developer or a small team.
Let’s assume that you used some lines of code to create simulated data in multiple projects:
Idea: put the code into a package
Obligation level | Key word | Description |
---|---|---|
Duty | shall | “must have” |
Desire | should | “nice to have” |
Intention | will | “optional” |
Purpose and Scope
The R package simulatr shall enable the creation of reproducible fake data.
Package Requirements
simulatr shall provide a function to generate normally distributed random data for two independent groups. The function shall allow flexible definition of sample size per group, mean per group, and standard deviation per group. The reproducibility of the simulated data shall be ensured via an optional seed. It should be possible to print the function result. A graphical presentation of the simulated data will also be possible.
Useful formats / tools for design docs:
UML Diagram
R package programming
One-off script as starting point:
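The original script is not reproduced in this text. A one-off script of this kind might look like the following sketch; all values and variable names are illustrative assumptions.

```r
# Hypothetical one-off script: every value is hard-coded, nothing is reusable
n1 <- 50
n2 <- 50
group1 <- rnorm(n1, 5, 3) # group 1: mean 5, standard deviation 3
group2 <- rnorm(n2, 7, 4) # group 2: mean 7, standard deviation 4
dat <- data.frame(
    group = c(rep(1, n1), rep(2, n2)),
    values = c(group1, group2)
)
head(dat)
```

Changing a sample size or a mean means editing the script itself, which is what makes copying it from project to project error-prone.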
Refactored script:
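The refactored code is likewise missing from this text. As a minimal sketch, the hard-coded values can be turned into named function arguments; the function name `createTwoGroupData` is a hypothetical choice made for this illustration.

```r
# Hypothetical refactoring: hard-coded values become function arguments
createTwoGroupData <- function(n1, n2, mean1, mean2, sd1, sd2) {
    data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
}

simulatedData <- createTwoGroupData(
    n1 = 50, n2 = 50, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4
)
```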
Almost all functions, arguments, and objects should be self-explanatory due to their names.
Define the result as a list and assign it a class:
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(
        n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2
    )
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}
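The requirements call for an optional seed, and the documentation example later in this deck calls the function with `seed = 123`, which the version above does not accept. One possible extension is sketched below; the default `seed = NA` and the way the seed is stored in the result are assumptions, not the authors' implementation.

```r
# Sketch only: `seed = NA` (meaning "do not set a seed") is an assumed convention
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2, seed = NA) {
    if (!is.na(seed)) {
        set.seed(seed) # makes the random draws below reproducible
    }
    result <- list(
        n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2,
        seed = seed
    )
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    structure(result, class = "SimulationResult")
}

# Two calls with the same seed should yield the same simulated data
a <- getSimulatedTwoArmMeans(50, 50, 5, 7, 3, 4, seed = 123)
b <- getSimulatedTwoArmMeans(50, 50, 5, 7, 3, 4, seed = 123)
sameData <- identical(a$data, b$data)
```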
The output is impractical, e.g., we need to scroll down:
$n1
[1] 50
$n2
[1] 50
$mean1
[1] 5
$mean2
[1] 7
$sd1
[1] 3
$sd2
[1] 4
$data
group values
1 1 7.0281182
2 1 4.7167934
3 1 2.6593427
4 1 3.3326923
5 1 7.1089343
6 1 3.9374091
7 1 -2.6558819
8 1 7.4137655
9 1 12.8378118
10 1 3.6062436
11 1 4.9619009
12 1 4.2551670
13 1 1.5080146
14 1 3.1735796
15 1 6.1488509
16 1 3.4872802
17 1 6.7043747
18 1 2.7923323
19 1 4.1961741
20 1 8.3717551
21 1 2.5582443
22 1 6.7043561
23 1 5.4989139
24 1 0.9570065
25 1 0.7223103
26 1 7.2238237
27 1 9.9270725
28 1 3.0195263
29 1 3.0369482
30 1 6.2840526
31 1 7.2418561
32 1 10.5536089
33 1 3.3502600
34 1 3.2060282
35 1 3.2798853
36 1 4.9179777
37 1 6.7411980
38 1 1.5128982
39 1 8.8264813
40 1 8.8089520
41 1 3.5516595
42 1 3.2816173
43 1 4.3935428
44 1 2.2323695
45 1 6.3237185
46 1 6.5585424
47 1 11.2479162
48 1 4.4931835
49 1 -2.2389764
50 1 4.5573045
51 2 7.8273819
52 2 17.9614373
53 2 0.7638064
54 2 9.0019230
55 2 4.8588554
56 2 8.5782451
57 2 5.5699496
58 2 11.0832320
59 2 3.2395378
60 2 3.0717101
61 2 4.8687135
62 2 10.5013276
63 2 10.7507994
64 2 0.7116354
65 2 -0.3240161
66 2 6.5673132
67 2 9.3468357
68 2 1.7374684
69 2 2.9907986
70 2 9.4712455
71 2 13.6657864
72 2 9.7110561
73 2 4.7583670
74 2 2.9924309
75 2 -0.5531282
76 2 14.8055951
77 2 10.2466311
78 2 5.5927371
79 2 6.9752851
80 2 3.8954783
81 2 6.6941757
82 2 3.9243086
83 2 8.2709970
84 2 6.2967705
85 2 8.1296116
86 2 4.0202595
87 2 9.3050481
88 2 3.2827444
89 2 9.6713897
90 2 4.7520853
91 2 1.7033363
92 2 0.9541522
93 2 11.0308979
94 2 7.7506800
95 2 4.7493888
96 2 5.6110117
97 2 9.4462422
98 2 6.9096046
99 2 6.1759390
100 2 12.4056079
attr(,"class")
[1] "SimulationResult"
Solution: implement the generic function print
Generic function print:
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#' mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
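The method definition that belongs to this documentation block is not included in this text. A minimal sketch of such a method follows. It uses base R's `head()` to shorten the data, whereas the output shown in this deck appears to rely on a tibble, so the details are assumptions.

```r
# Hypothetical print method; the source does not show its implementation
print.SimulationResult <- function(x, ...) {
    argNames <- c("n1", "n2", "mean1", "mean2", "sd1", "sd2")
    # Named character vector of the design arguments
    args <- vapply(unclass(x)[argNames], format, character(1))
    # Show only the first rows instead of the full data set
    print(list(args = args, data = head(x$data, 10)), ...)
    invisible(x)
}

# Small hand-made object to try the method without the simulation function
x <- structure(
    list(
        n1 = 2, n2 = 2, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4,
        data = data.frame(group = c(1, 1, 2, 2), values = c(4.2, 6.1, 7.5, 8.3))
    ),
    class = "SimulationResult"
)
print(x)
```

Returning the object invisibly follows the usual convention for print methods, so that the object can still be used in a pipeline after printing.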
$args
n1 n2 mean1 mean2 sd1 sd2
"50" "50" "5" "7" "3" "4"
$data
# A tibble: 100 × 2
group values
<dbl> <dbl>
1 1 7.03
2 1 4.72
3 1 2.66
4 1 3.33
5 1 7.11
6 1 3.94
7 1 -2.66
8 1 7.41
9 1 12.8
10 1 3.61
# ℹ 90 more rows
Add assertions to improve the usability and user experience
Tip on assertions
Use the package checkmate to validate input arguments.
Example:
Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
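A function that produces an error of this form could look like the sketch below. The body of `playWithAssertions` is not shown in this text, so the use of `checkmate::assertInt()` with `lower = 1` is an assumption inferred from the message.

```r
# Assumed implementation: only the function name and the error message
# come from the source; requires the checkmate package to be installed
playWithAssertions <- function(n1) {
    # Fails unless n1 is a single whole number >= 1
    checkmate::assertInt(n1, lower = 1)
    invisible(n1)
}

# Capture the error message instead of stopping the script
msg <- tryCatch(
    playWithAssertions(-1),
    error = function(e) conditionMessage(e)
)
```

Checking arguments at the top of a function gives users a message that names the offending argument instead of an obscure failure further down.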
Add three additional results:
Tip on creation time
Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
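The three calls return different types, which matters when the value is stored in the result object. A short sketch, in which the field name `creationTime` is an assumption:

```r
now <- Sys.time()                   # POSIXct date-time
today <- Sys.Date()                 # Date, without time of day
label <- format(now, "%B %d, %Y")   # character; the month name depends on the locale

# Storing the date-time keeps the most information; format it only for display
result <- list(creationTime = now)
```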
Add an additional result: t.test result
Add an optional alternative argument and pass it through to t.test:
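One way to approach this task is sketched below. The field name `tTest` and the use of `match.arg()` to validate `alternative` are assumptions; the allowed values are those documented for `stats::t.test()`.

```r
# Sketch: `tTest` is an assumed field name for the stored test result
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2,
        alternative = c("two.sided", "less", "greater")) {
    alternative <- match.arg(alternative) # rejects any other value
    result <- list(
        n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2
    )
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # Pass the argument through to t.test
    result$tTest <- t.test(values ~ group,
        data = result$data, alternative = alternative
    )
    structure(result, class = "SimulationResult")
}

x <- getSimulatedTwoArmMeans(50, 50, 5, 7, 3, 4, alternative = "less")
x$tTest$alternative
```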
Implement the generic functions print and plot.
Tip on print
Use the print example function from above and extend it.
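For the plot part of the task, a minimal sketch with base graphics could look as follows. The choice of a boxplot and the axis labels are assumptions; any graphical presentation of the simulated data would meet the stated requirement.

```r
# Hypothetical plot method using base graphics
plot.SimulationResult <- function(x, ...) {
    boxplot(values ~ group,
        data = x$data,
        xlab = "Group", ylab = "Simulated values", ...
    )
    invisible(x)
}

# Small hand-made object to try the method
x <- structure(
    list(data = data.frame(
        group = rep(1:2, each = 20),
        values = c(rnorm(20, mean = 5, sd = 3), rnorm(20, mean = 7, sd = 4))
    )),
    class = "SimulationResult"
)
plot(x, main = "Simulated data")
```

Because `plot` is already a generic, defining `plot.SimulationResult` is enough for `plot(x)` to dispatch to it.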
Optional extra tasks:
Implement the generic functions summary and cat
Implement the function kable, known from the package knitr, as a generic. Tip: first define kable as a generic function.
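The code belonging to this tip is missing from this text. One common way to turn an existing function into an S3 generic is sketched below; it is an assumption about what the tip intended, and the default method assumes that knitr is installed.

```r
# Sketch: define a generic named kable that masks knitr::kable
kable <- function(x, ...) UseMethod("kable")

# Default method: fall back to the knitr implementation
kable.default <- function(x, ...) knitr::kable(x, ...)

# Method for the class used in this deck: tabulate the simulated data
kable.SimulationResult <- function(x, ...) kable(x$data, ...)
```

With these definitions, `kable(x)` on a `SimulationResult` dispatches to the class method, while data frames and other objects still reach `knitr::kable()` through the default method.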
Optional extra task:
Document your functions with Roxygen2