BBS Course: Good Software Engineering Practice for R Packages
March 24, 2023
From an idea to a production-grade R package
Example scenario: in your daily work, you notice that you need certain one-off scripts again and again.
The idea of creating an R package was born because you understood that copying and pasting R scripts is inefficient and, on top of that, you want to share your helpful R functions with colleagues and the world…
Bad practice!
Why?
Cost distribution among software process activities
Origin of errors in system development
Invest time in
… but in many cases the workflow must be workable for a single developer or a small team.
Let’s assume that you used some lines of code to create simulated data in multiple projects:
Idea: put the code into a package
Obligation level | Key word | Description |
---|---|---|
Duty | shall | “must have” |
Desire | should | “nice to have” |
Intention | will | “optional” |
Purpose and Scope
The R package simulatr shall enable the creation of reproducible fake data.
Package Requirements
simulatr shall provide a function to generate normally distributed random data for two independent groups.
The function shall allow flexible definition of sample size per group, mean per group, and standard deviation per group.
The reproducibility of the simulated data shall be ensured via an optional seed.
It should be possible to print the function result.
A graphical presentation of the simulated data will also be possible.
Useful formats / tools for design docs:
UML Diagram
R package programming
One-off script as starting point:
Refactored script:
Almost all functions, arguments, and objects should be self-explanatory due to their names.
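To illustrate the naming point, the sketch below contrasts a terse one-off script with a refactored version. Both snippets are assumptions written for this example, not the scripts used in the course; only the idea of simulating normally distributed values for two groups is taken from the text.

```r
# One-off style (hypothetical): short names hide the intent
set.seed(123)
x <- c(rnorm(50, 5, 3), rnorm(50, 7, 4))
g <- c(rep(1, 50), rep(2, 50))
d <- data.frame(g, x)

# Refactored style (hypothetical): names state what each value means
set.seed(123)
n1 <- 50     # sample size of group 1
n2 <- 50     # sample size of group 2
mean1 <- 5   # mean of group 1
mean2 <- 7   # mean of group 2
sd1 <- 3     # standard deviation of group 1
sd2 <- 4     # standard deviation of group 2

simulatedData <- data.frame(
    group = c(rep(1, n1), rep(2, n2)),
    values = c(
        rnorm(n = n1, mean = mean1, sd = sd1),
        rnorm(n = n2, mean = mean2, sd = sd2)
    )
)
```

Both versions draw the same numbers because they use the same seed and the same order of `rnorm` calls; only the readability differs.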
Define that the result is a list which is defined as a class:
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # set the class attribute
    result <- structure(result, class = "SimulationResult")
    return(result)
}
The output is impractical, e.g., we need to scroll down:
$n1
[1] 50
$n2
[1] 50
$mean1
[1] 5
$mean2
[1] 7
$sd1
[1] 3
$sd2
[1] 4
$data
group values
1 1 5.98879547
2 1 8.45406384
3 1 6.33393873
4 1 4.73534951
5 1 6.22901913
6 1 4.46917585
7 1 9.57319989
8 1 2.99212334
9 1 10.28045738
10 1 5.45014836
11 1 1.73438805
12 1 4.97937338
13 1 1.87011579
14 1 5.91523715
15 1 3.24274769
16 1 6.97725695
17 1 3.17082970
18 1 6.99807551
19 1 9.80686354
20 1 3.28699553
21 1 5.97037685
22 1 4.22627350
23 1 2.56687195
24 1 1.57033105
25 1 7.55921355
26 1 2.74321576
27 1 2.03677499
28 1 9.07443238
29 1 5.70732620
30 1 8.96886654
31 1 7.59082745
32 1 1.84782111
33 1 1.37512371
34 1 -0.90151532
35 1 5.51888192
36 1 5.75549760
37 1 -0.35299132
38 1 5.81325625
39 1 2.09980309
40 1 3.24322503
41 1 7.89070780
42 1 6.32204874
43 1 6.69188477
44 1 4.23160024
45 1 7.39511417
46 1 0.13608340
47 1 2.30461443
48 1 5.21956189
49 1 -1.40200237
50 1 -0.74881237
51 2 3.89425007
52 2 10.49286520
53 2 6.32054559
54 2 4.16492185
55 2 1.16340748
56 2 5.08499817
57 2 11.58772555
58 2 4.24707494
59 2 17.91217539
60 2 6.28250406
61 2 9.42065413
62 2 -0.36096911
63 2 7.52635022
64 2 7.43643349
65 2 6.95277173
66 2 10.78475009
67 2 7.15120533
68 2 11.43059638
69 2 0.01178977
70 2 6.06776505
71 2 6.11764297
72 2 8.72388776
73 2 7.31624670
74 2 3.66203314
75 2 15.15989740
76 2 5.50338198
77 2 8.75154282
78 2 12.19994927
79 2 8.24046933
80 2 0.30104210
81 2 3.78153051
82 2 4.53308119
83 2 5.02717233
84 2 1.22982600
85 2 6.02142265
86 2 8.78826498
87 2 5.99681883
88 2 9.04900599
89 2 16.18401144
90 2 9.06763288
91 2 8.22686111
92 2 6.41426579
93 2 10.12519902
94 2 2.62685492
95 2 4.35392643
96 2 11.78124397
97 2 9.95442351
98 2 14.75851204
99 2 8.11630910
100 2 1.04221831
attr(,"class")
[1] "SimulationResult"
Solution: implement generic function print
Generic function print:
#' @title
#' Print Simulation Result
#'
#' @description
#' Generic function to print a `SimulationResult` object.
#'
#' @param x a \code{SimulationResult} object to print.
#' @param ... further arguments passed to or from other methods.
#'
#' @examples
#' x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
#' mean2 = 7, sd1 = 3, sd2 = 4, seed = 123)
#' print(x)
#'
#' @export
$args
n1 n2 mean1 mean2 sd1 sd2
"50" "50" "5" "7" "3" "4"
$data
# A tibble: 100 × 2
group values
<dbl> <dbl>
1 1 5.99
2 1 8.45
3 1 6.33
4 1 4.74
5 1 6.23
6 1 4.47
7 1 9.57
8 1 2.99
9 1 10.3
10 1 5.45
# ℹ 90 more rows
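The body of the print method is not shown alongside its documentation. A minimal base-R sketch of such a method could look as follows, assuming the object layout produced by `getSimulatedTwoArmMeans`. The output above relies on a tibble to shorten the data; this sketch swaps that for `head()` to avoid extra dependencies, and the `maxRows` argument is a hypothetical addition.

```r
# Constructor as defined earlier, repeated so that the example runs on its own
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2) {
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    structure(result, class = "SimulationResult")
}

# Sketch of an S3 print method; maxRows is an assumed extra argument
print.SimulationResult <- function(x, ..., maxRows = 10) {
    args <- c(n1 = x$n1, n2 = x$n2,
        mean1 = x$mean1, mean2 = x$mean2, sd1 = x$sd1, sd2 = x$sd2)
    cat("Simulation arguments:\n")
    print(args, ...)
    cat("\nSimulated data (first", min(maxRows, nrow(x$data)),
        "of", nrow(x$data), "rows):\n")
    print(head(x$data, maxRows), ...)
    # print methods conventionally return their argument invisibly
    invisible(x)
}

x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
    mean2 = 7, sd1 = 3, sd2 = 4)
print(x)
```

Because the method is named `print.SimulationResult`, calling `print(x)` or simply typing `x` at the console dispatches to it.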
Add assertions to improve the usability and user experience
Tip on assertions
Use the package checkmate to validate input arguments.
Example:
Error in playWithAssertions(-1) : Assertion on ‘n1’ failed: Element 1 is not >= 1.
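An error of this kind could come from a function along the following lines. The name `playWithAssertions` is taken from the error message, but its body is not shown, so the sketch below is an assumption that uses `assertInt` from checkmate.

```r
library(checkmate)

# Hypothetical function: only its name appears in the error message above
playWithAssertions <- function(n1) {
    # n1 must be a single whole number that is at least 1
    assertInt(n1, lower = 1)
    invisible(n1)
}

playWithAssertions(50)    # passes silently
# playWithAssertions(-1)  # stops with an assertion error on 'n1'
```

Checking arguments at the top of a function gives the user an early, readable message instead of a failure deep inside `rnorm` or `data.frame`.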
Add three additional results:
Tip on creation time
Sys.time(), format(Sys.time(), '%B %d, %Y'), Sys.Date()
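The three calls return different types, which matters when a creation time is stored in a result object. The following sketch shows the difference; the variable names and the small `result` list are assumptions for illustration.

```r
creationTime <- Sys.time()    # POSIXct: date and time
creationDate <- Sys.Date()    # Date: date only
# character string; the month name depends on the current locale
creationLabel <- format(creationTime, "%B %d, %Y")

# Hypothetical result object that records when it was created
result <- list(n1 = 50, n2 = 50)
result$creationTime <- creationTime

class(creationTime)
class(creationDate)
class(creationLabel)
```

Storing the `POSIXct` value and formatting it only when printing keeps the full timestamp available for later use.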
Add an additional result: t.test result
Add an optional alternative argument and pass it through t.test:
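One possible way to do this is sketched below. It is an assumption, not the course solution: the element name `tTest`, the default `seed = NA_integer_`, and the use of `match.arg` are choices made for this example.

```r
getSimulatedTwoArmMeans <- function(n1, n2, mean1, mean2, sd1, sd2,
        seed = NA_integer_,
        alternative = c("two.sided", "less", "greater")) {
    # restrict alternative to the values accepted by t.test
    alternative <- match.arg(alternative)
    # set the seed only if the user supplied one
    if (!is.na(seed)) {
        set.seed(seed)
    }
    result <- list(n1 = n1, n2 = n2,
        mean1 = mean1, mean2 = mean2, sd1 = sd1, sd2 = sd2)
    result$data <- data.frame(
        group = c(rep(1, n1), rep(2, n2)),
        values = c(
            rnorm(n = n1, mean = mean1, sd = sd1),
            rnorm(n = n2, mean = mean2, sd = sd2)
        )
    )
    # pass the alternative through to t.test
    result$tTest <- stats::t.test(values ~ group,
        data = result$data, alternative = alternative)
    structure(result, class = "SimulationResult")
}

x <- getSimulatedTwoArmMeans(n1 = 50, n2 = 50, mean1 = 5,
    mean2 = 7, sd1 = 3, sd2 = 4, seed = 123, alternative = "less")
x$tTest$alternative
```

Listing the allowed values in the signature and resolving them with `match.arg` makes `"two.sided"` the default and rejects misspelled alternatives before any data is simulated.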
Implement the generic functions print and plot.
Tip on print
Use the plot example function from above and extend it.
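The plot example referred to above is not included here. As a starting point, a minimal plot method could be sketched with base graphics as follows; the box plot, the `main` argument, and the small hand-built object are assumptions for illustration.

```r
# Sketch of an S3 plot method based on a box plot per group
plot.SimulationResult <- function(x, ..., main = "Simulated values by group") {
    graphics::boxplot(values ~ group, data = x$data,
        main = main, xlab = "Group", ylab = "Values", ...)
    invisible(x)
}

# Small hand-built object so that the example runs on its own
x <- structure(
    list(data = data.frame(
        group = rep(1:2, each = 20),
        values = c(rnorm(20, mean = 5, sd = 3), rnorm(20, mean = 7, sd = 4))
    )),
    class = "SimulationResult"
)
plot(x)
```

Further arguments such as `col` are forwarded to `boxplot` through `...`.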
Optional extra tasks:
Implement the generic functions summary and cat
Implement the function kable known from the package knitr as generic. Tip: use … to define kable as generic
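For the summary task, `summary` is already a generic in base R, so only a method is needed. A minimal sketch, assuming that a per-group table of sample size, mean, and standard deviation is a useful summary, could look like this:

```r
# Sketch of an S3 summary method returning one row per group
summary.SimulationResult <- function(object, ...) {
    values <- object$data$values
    group <- object$data$group
    data.frame(
        group = sort(unique(group)),
        n = as.integer(table(group)),
        mean = as.numeric(tapply(values, group, mean)),
        sd = as.numeric(tapply(values, group, stats::sd))
    )
}

# Small hand-built object so that the example runs on its own
x <- structure(
    list(data = data.frame(
        group = c(1, 1, 1, 2, 2),
        values = c(1, 2, 3, 4, 6)
    )),
    class = "SimulationResult"
)
summary(x)
```

The first argument is called `object` to match the signature of the `summary` generic.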
Optional extra task:
Document your functions with Roxygen2