Sourcing structure from database

When pulling schema structure directly from database, you may decide which schema information should be saved in the configuration yaml file. The proper configuration defined with set_faker_opts should be passed to faker_opts parameters of schema_source function:

schema <- source_schema(
  source = conn,
  schema = "public",
  faker_opts = set_faker_opts(...)
)

DataFakeR currently offers two configuration types:

column-type specific - allow to configure what type of information should be stored for each column type,
table specific - allow to configure what table information should be stored.

Column specific configuration

The current version of DataFakeR package supports five types (R target types) of columns:

character
numeric
integer
logical
date

Each column-type configuration is done by setting:

set_faker_opts(opt_pull_<type> = opt_pull_<type>(...))

The possible configurable parameters are (with supported types):

values (all types) Should column unique values be sourced? If so the ones are stored as an array withing values parameter.
max_uniq_to_pull (all types) Pull unique values only when the distinct number of them is less than provided value. The parameter prevents for sourcing large amount of values to configuration file for example when dealing with ids column.
nchar (character) Should maximum number of characters in column be pulled? Is so stored as nchar parameter in configuration YAML file.
range (numeric, integer, date) Should column range be sourced? Is so stored as range parameter in configuration YAML file.
na_ratio (all types) Should column source ratio of NA values existing in the original column?

The information stored by the above parameters may then be used in the simulation methods.

The default parameters can be accessed respectively from default_faker_opts object, for example:

character columns:

default_faker_opts$opt_pull_character
#> $values
#> [1] TRUE
#> 
#> $max_uniq_to_pull
#> [1] 10
#> 
#> $nchar
#> [1] TRUE
#> 
#> $na_ratio
#> [1] TRUE
#> 
#> $levels_ratio
#> [1] TRUE

means, by default we save in the existing column values only when number of its unique values is less than 10. We will be also storing maximum number of character for strings in column.

integer columns:

default_faker_opts$opt_pull_integer
#> $values
#> [1] TRUE
#> 
#> $max_uniq_to_pull
#> [1] 10
#> 
#> $range
#> [1] TRUE
#> 
#> $na_ratio
#> [1] TRUE
#> 
#> $levels_ratio
#> [1] FALSE

means the same for sourcing possible values as for character type, more to that we will source the column values range.

Such configuration for sample book authors table, may result with the below structure:

ID	Author	Digest
1	Miss Madelyn Crist MD	Digest A
2	Merritt Gislason IV	Digest A
3	Linton Botsford	Digest A
4	Isam Bins-Shanahan	Digest A
5	Ora Stark	Digest A
6	Priscila Auer	Digest A
7	Ms. Addie Grady DDS	Digest B
8	Dr. Wayman Halvorson V	Digest B
9	Kesha Legros	Digest B
10	Gay Hoppe	Digest B
11	Yolanda Greenholt	Digest B

authors:
  columns:
    ID:
      type: serial
      unique: true
      not_null: true
      default: na.integer
      range: [1, 11]
    author:
      type: varchar
      unique: true
      not_null: true
      default: na.character
      nchar: 23
    digest:
      type: varchar
      unique: false
      not_null: true
      default: na.character
      values: [Digest A, Digest B]
      nchar: 8

type, unique, not_null and default are always sourced,
range was sourced form integer column,
source table stored information only about two digests (what is less than 10) so the values are saved,
source table stored information about more than 10 authors (and ID’s) so such information was not saved,
nchar = 23 means max string length in the author column was 23 characters.

If we want to not source range and nchar information just precise:

my_opts <- set_faker_opts(
  opt_pull_integer = opt_pull_integer(range = FALSE),
  opt_pull_character = opt_pull_character(nchar = FALSE)
)

and pass my_opts to faker_opts parameter of schema_source function.

Table specific configuration

Can be achieved by specifying opt_pull_table option with the method of the same name.

In the current version of DataFakeR package only one parameter (nrows) can be configured, with the three values:

exact - the number of rows of original table will be stored (and saved in configuration as nrows value),
ratio - the ratio of rows for table across all the tables in schema will be stored (and saved in configuration as nrows value),
none - number of rows will not be sourced.

Such information can be further used to define number of rows in simulated table (see simulation options).

Note: In the future DataFakeR releases the option to define custom parameters will be enabled.

Krystian Igras

2023-02-08

Column specific configuration

Table specific configuration