Cost distribution among software process activities
openstatsware
Course: Good Software Engineering Practice for R Packages
September 26, 2023
From an idea to a production-grade R package
Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.
The idea of creating an R package was born because you understood that copying and pasting R scripts is inefficient, and on top of that, you want to share your helpful R functions with colleagues and the world…
Bad practice!
Why?
Cost distribution among software process activities
Origin of errors in system development
Invest time in
… but in many cases the workflow must be workable for a single developer or a small team.
Let’s assume that you used some lines of code to create simulated data in multiple projects:
Idea: put the code into a package
Obligation level | Key word | Description |
---|---|---|
Duty | must, shall | “must have” |
Desire | should | “nice to have” |
Intention | may | “optional” |
Purpose and Scope
The R package simulatr shall enable the creation of reproducible fake data.
Package Requirements
simulatr shall provide a function to generate normally distributed random data for two independent groups. The function must allow flexible definition of the sample size, mean, and standard deviation per group. The reproducibility of the simulated data must be ensured via an optional seed. It should be possible to print the function result. The package may also facilitate graphical presentation of the simulated data.
Useful formats / tools for design docs:
UML Diagram
R package programming
One-off script as starting point:
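The script itself is not reproduced here. As a rough sketch, a one-off script of this kind might look like the following; the numbers are taken from the example output later in this section, and the object name `simData` is hypothetical.

```r
# One-off script: simulate normally distributed values for two groups.
# All numbers are hard-coded, which is what makes the script hard to reuse.
set.seed(123)  # fixed seed so that repeated runs give the same data

simData <- data.frame(
    group = c(rep(1, 50), rep(2, 50)),
    values = c(
        rnorm(n = 50, mean = 5, sd = 3),
        rnorm(n = 50, mean = 7, sd = 4)
    )
)

head(simData)
```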
Refactored script:
Almost all functions, arguments, and objects should be self-explanatory due to their names.
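To illustrate self-explanatory names, a refactored script could replace the hard-coded numbers with named variables. This is only a sketch: the variable names follow the function arguments used later in this section, and `simData` is a hypothetical object name.

```r
# Refactored script: every parameter is named once and then reused,
# so each number has an obvious meaning.
n1 <- 50     # sample size in group 1
n2 <- 50     # sample size in group 2
mean1 <- 5   # mean in group 1
mean2 <- 7   # mean in group 2
sd1 <- 3     # standard deviation in group 1
sd2 <- 4     # standard deviation in group 2

simData <- data.frame(
    group = c(rep(1, n1), rep(2, n2)),
    values = c(
        rnorm(n = n1, mean = mean1, sd = sd1),
        rnorm(n = n2, mean = mean2, sd = sd2)
    )
)

# Quick plausibility check: the group sizes should match n1 and n2
table(simData$group)
```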
Define the result as a list and assign it a class:
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}
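The requirements call for an optional seed, which the function above does not take. One possible way to add it is sketched below; the function name `getSimulatedTwoArmMeansSeeded` and the `seed = NULL` default are assumptions made for illustration, not the course's solution.

```r
# Hypothetical variant with an optional seed (illustration only).
getSimulatedTwoArmMeansSeeded <- function(n1, n2, mean1, mean2, sd1, sd2,
        seed = NULL) {
    # Only touch the random number generator state if a seed was supplied
    if (!is.null(seed)) {
        set.seed(seed)
    }
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2, seed = seed)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    structure(result, class = "SimulationResult")
}

# Two calls with the same seed should yield the same simulated data
a <- getSimulatedTwoArmMeansSeeded(50, 50, 5, 7, 3, 4, seed = 123)
b <- getSimulatedTwoArmMeansSeeded(50, 50, 5, 7, 3, 4, seed = 123)
identical(a$data, b$data)
```

Printing such an object with the default print method lists every element in full, including all 100 data rows.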
The output is impractical, e.g., we need to scroll down:
$n1
[1] 50
$n2
[1] 50
$mean1
[1] 5
$mean2
[1] 7
$sd1
[1] 3
$sd2
[1] 4
$data
group values
1 1 6.8143088
2 1 6.7027084
3 1 5.0624613
4 1 4.8429073
5 1 1.8923379
6 1 6.8793894
7 1 2.3624141
8 1 2.7008845
9 1 11.0375760
10 1 8.4516463
11 1 -2.1102606
12 1 4.4634979
13 1 2.6464570
14 1 10.1069589
15 1 3.0181308
16 1 10.8251506
17 1 1.0832926
18 1 -1.2993955
19 1 7.2437695
20 1 8.9862444
21 1 7.5435374
22 1 8.4346854
23 1 4.5906002
24 1 3.1702120
25 1 10.5416017
26 1 2.9037398
27 1 7.5879017
28 1 10.6926076
29 1 3.4119186
30 1 6.0795877
31 1 2.5414520
32 1 2.3853539
33 1 8.5676864
34 1 6.0742087
35 1 9.7760418
36 1 7.1662956
37 1 3.6204577
38 1 7.0437362
39 1 1.2627534
40 1 5.3347365
41 1 7.2839945
42 1 3.1837129
43 1 4.9894090
44 1 9.9235388
45 1 1.1274604
46 1 3.3649279
47 1 2.0999076
48 1 2.2575926
49 1 1.8454953
50 1 2.8895249
51 2 4.6287768
52 2 2.5844105
53 2 15.0851098
54 2 7.9826031
55 2 3.7372436
56 2 3.1805337
57 2 3.2847155
58 2 6.4770257
59 2 0.3777277
60 2 2.7061537
61 2 9.2341045
62 2 5.3303552
63 2 3.3224427
64 2 6.4391640
65 2 2.6990366
66 2 3.1454953
67 2 2.3661033
68 2 8.7384057
69 2 3.2240407
70 2 7.6678349
71 2 8.5490952
72 2 -3.5020631
73 2 10.1986100
74 2 1.8458824
75 2 7.7267376
76 2 -1.0470594
77 2 5.8402022
78 2 13.7301076
79 2 4.2301757
80 2 6.1232703
81 2 -1.5290626
82 2 0.6754454
83 2 12.7269561
84 2 9.2303134
85 2 7.2066353
86 2 11.7436247
87 2 1.3246189
88 2 13.1009005
89 2 13.8964064
90 2 4.4684005
91 2 4.1861256
92 2 9.0239456
93 2 6.5067711
94 2 7.9104612
95 2 14.3794221
96 2 4.4405906
97 2 12.6840010
98 2 3.5826331
99 2 11.6056682
100 2 5.5633731
attr(,"class")
[1] "SimulationResult"
Solution: implement generic function print
Generic function print:
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#' mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
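The method definition that this roxygen2 block documents is not shown here. A minimal sketch that produces output of a similar shape (a formatted `$args` vector and a shortened `$data` table) could look as follows. It assumes that the tibble package is used for the compact data display when it is installed and falls back to `head()` otherwise; the course's own implementation may differ.

```r
# Sketch of an S3 print method for objects of class "SimulationResult".
print.SimulationResult <- function(x, ...) {
    args <- list(n1 = x$n1, n2 = x$n2,
        mean1 = x$mean1, mean2 = x$mean2, sd1 = x$sd1, sd2 = x$sd2)
    # Assumption: show the data as a tibble when the tibble package is
    # installed; otherwise fall back to the first ten rows.
    data <- if (requireNamespace("tibble", quietly = TRUE)) {
        tibble::as_tibble(x$data)
    } else {
        head(x$data, 10)
    }
    print(list(args = format(args), data = data), ...)
    invisible(x)  # print methods conventionally return their input invisibly
}

# Small example object built by hand so that the block is self-contained
x <- structure(
    list(n1 = 3, n2 = 3, mean1 = 5, mean2 = 7, sd1 = 3, sd2 = 4,
        data = data.frame(group = c(1, 1, 1, 2, 2, 2), values = rnorm(6))),
    class = "SimulationResult"
)
print(x)  # dispatches to print.SimulationResult
```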
$args
n1 n2 mean1 mean2 sd1 sd2
"50" "50" "5" "7" "3" "4"
$data
# A tibble: 100 × 2
group values
<dbl> <dbl>
1 1 6.81
2 1 6.70
3 1 5.06
4 1 4.84
5 1 1.89
6 1 6.88
7 1 2.36
8 1 2.70
9 1 11.0
10 1 8.45
# ℹ 90 more rows
Add assertions to improve the usability and user experience
Tip on assertions
Use the package checkmate to validate input arguments.
Example:
Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
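An error of this kind can be produced with `checkmate::assert_int()`. The sketch below assumes that the checkmate package is installed; only the function name `playWithAssertions` and the error message come from the example, the function body is an assumption.

```r
# Requires the checkmate package (install.packages("checkmate")).
playWithAssertions <- function(n1) {
    # n1 must be a single integer-like number that is at least 1
    checkmate::assert_int(n1, lower = 1)
    invisible(n1)
}

playWithAssertions(50)  # a valid value passes silently

# An invalid value raises an error; tryCatch captures its message
msg <- tryCatch(
    playWithAssertions(-1),
    error = function(e) conditionMessage(e)
)
msg
```

Failing early with a message that names the offending argument is what makes such assertions helpful for users.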
Add three additional results:
Tip on creation time: Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
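The three calls from the tip return different types, which matters when the creation time is stored in a result. A short sketch; the variable names and the `creationTime` element are hypothetical.

```r
# Three base R ways to record when a result was created
createdAt <- Sys.time()                         # POSIXct date-time
createdLabel <- format(createdAt, "%B %d, %Y")  # character: month day, year
createdDate <- Sys.Date()                       # Date without time of day

# Hypothetical: store the creation time next to the other results
result <- list(n1 = 50, n2 = 50)
result$creationTime <- createdAt
class(result$creationTime)
```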
Add an additional result: the t.test result
Add an optional alternative argument and pass it through to t.test:
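One way to pass an `alternative` argument through to `stats::t.test()` is sketched below. The helper name `addTTest` and the data object `simData` are hypothetical; in the package, the same logic would presumably live inside the simulation function.

```r
# Hypothetical helper: run a two-sample t-test on simulated data and
# pass the 'alternative' argument through to stats::t.test().
addTTest <- function(data,
        alternative = c("two.sided", "less", "greater")) {
    # match.arg() validates the value and picks the first one as default
    alternative <- match.arg(alternative)
    stats::t.test(values ~ group, data = data, alternative = alternative)
}

set.seed(123)
simData <- data.frame(
    group = c(rep(1, 50), rep(2, 50)),
    values = c(rnorm(50, mean = 5, sd = 3), rnorm(50, mean = 7, sd = 4))
)

tTestResult <- addTTest(simData, alternative = "less")
tTestResult$alternative
```

Using `match.arg()` means that a misspelled alternative fails inside the package with a clear message rather than deep inside `t.test()`.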
Implement the generic functions print and plot.
Tip on print
Use the plot example function from above and extend it.
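For the plot part of this task, a minimal S3 method based on base graphics might look like the sketch below. The choice of a boxplot and the `main` default are assumptions; any display of the two groups would satisfy the requirement.

```r
# Sketch of an S3 plot method for "SimulationResult" objects (base graphics).
plot.SimulationResult <- function(x, main = "Simulated data", ...) {
    graphics::boxplot(values ~ group, data = x$data,
        xlab = "Group", ylab = "Value", main = main, ...)
    invisible(x)
}

# Small example object built by hand so that the block is self-contained
x <- structure(
    list(data = data.frame(
        group = c(rep(1, 20), rep(2, 20)),
        values = c(rnorm(20, mean = 5, sd = 3), rnorm(20, mean = 7, sd = 4))
    )),
    class = "SimulationResult"
)
plot(x)  # dispatches to plot.SimulationResult
```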
Optional extra tasks:
Implement the generic functions summary and cat
Implement the function kable, known from the package knitr, as a generic. Tip: use
to define kable as generic
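A common S3 idiom for turning an existing function into a generic is sketched below. This is an assumption about how the task could be approached, not necessarily what the tip refers to, and it requires the knitr package to be installed.

```r
# Define a generic with the same name as knitr's function
kable <- function(x, ...) UseMethod("kable")

# Fall back to knitr's function for everything without a specific method
kable.default <- function(x, ...) knitr::kable(x, ...)

# Method for simulation results: render the simulated data as a table
kable.SimulationResult <- function(x, ...) knitr::kable(x$data, ...)

# Small example object built by hand so that the block is self-contained
x <- structure(
    list(data = data.frame(group = c(1, 2), values = c(5.1, 7.3))),
    class = "SimulationResult"
)
kable(x)
```

With the default method in place, existing calls such as `kable(head(cars))` keep working, while `SimulationResult` objects get their own rendering.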
Optional extra task:
Document your functions with roxygen2