Subsampling

Subsamples a Task to use a fraction of the rows.

Sampling happens only during the training phase. Subsampling a Task can reduce training time, at a cost in predictive performance that may be negligible depending on the size of the original Task.
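As a rough illustration of the training-time use case, the sketch below places the subsampling step in front of a learner. The choice of `classif.rpart`, the `iris` task, and `frac = 0.5` are assumptions made for the example and are not part of this page.

```r
library("mlr3")
library("mlr3pipelines")

# Subsample to half of the rows before fitting a decision tree.
# The learner and the fraction are arbitrary choices for illustration.
graph = po("subsample", frac = 0.5) %>>% lrn("classif.rpart")
glrn = GraphLearner$new(graph)

task = tsk("iris")
glrn$train(task)

# Prediction uses every row that is passed in; nothing is subsampled here.
pred = glrn$predict(task)
length(pred$row_ids)
```

Because only the training step sees fewer rows, predictions are still produced for all rows passed to `$predict()`.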

Format

R6Class object inheriting from PipeOpTaskPreproc/PipeOp.

Construction

PipeOpSubsample$new(id = "subsample", param_vals = list())

id :: character(1)
Identifier of the resulting object, default "subsample".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
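
A minimal sketch of direct construction follows. The id `"subsample_half"` and the value `frac = 0.5` are made up for the example.

```r
library("mlr3")
library("mlr3pipelines")

# Construct with the R6 generator and override one hyperparameter.
# The id and the frac value are illustrative assumptions.
pop = PipeOpSubsample$new(id = "subsample_half", param_vals = list(frac = 0.5))

pop$id
pop$param_set$values$frac
```

The `po("subsample", ...)` shorthand used in the Examples section should give an equivalent object.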

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output during training is the input Task with added or removed rows according to the sampling. The output during prediction is the unchanged input.
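
The difference between the two phases can be checked with a short sketch. The `iris` task and `frac = 0.5` are assumptions chosen for illustration.

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("iris")
pop = po("subsample", frac = 0.5)

# Training returns a Task with fewer rows than the input.
trained = pop$train(list(task))$output

# Prediction passes the Task through without changing its rows.
predicted = pop$predict(list(task))$output

trained$nrow
predicted$nrow
```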

State

The $state is a named list with the $state elements inherited from PipeOpTaskPreproc.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc; however, the affect_columns parameter is not present. Further parameters are:

frac :: numeric(1)
Fraction of rows in the Task to keep. May only be greater than 1 if replace is TRUE. Initialized to 1 - exp(-1), which is approximately 0.6321.
stratify :: logical(1)
Should the subsamples be stratified by target? Initialized to FALSE. May only be TRUE for TaskClassif input and if use_groups = FALSE.
use_groups :: logical(1)
If TRUE and if the Task has a column with role group, grouped observations are kept together during subsampling. In case of sampling with
replace :: logical(1)
Sample with replacement? Initialized to FALSE.
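
As a hedged sketch of how `frac` and `replace` interact, the example below oversamples a Task. The value `frac = 2` and the `iris` task are assumptions for illustration.

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("iris")

# frac > 1 is only allowed together with replace = TRUE,
# so the output can contain more rows than the input.
pop = po("subsample", frac = 2, replace = TRUE)
out = pop$train(list(task))$output

out$nrow
```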

Internals

Uses task$filter() to remove rows. If replace is TRUE and identical rows are added, task$row_roles$use cannot be used to duplicate rows because of [inaudible]; instead, task$rbind() is used to attach a new data.table that contains each duplicated row exactly as many times as it is being added.
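
The row-removal path can be approximated by hand as in the sketch below. This is not the package's internal code; it only illustrates the `task$filter()` call for sampling without replacement, and the fraction of 0.7 is an assumption.

```r
library("mlr3")

task = tsk("iris")
n_before = task$nrow

# Draw the row ids to keep, without replacement.
keep = sample(task$row_ids, size = ceiling(0.7 * n_before))

# filter() modifies the Task in place and keeps only the given rows.
task$filter(keep)

task$nrow
```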

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpTaskPreproc/PipeOp.

Examples

library("mlr3")
library("mlr3pipelines")  # provides po() and PipeOpSubsample

# Subsample with stratification
pop = po("subsample", frac = 0.7, stratify = TRUE, use_groups = FALSE)
pop$train(list(tsk("iris")))
#> $output
#> 
#> ── <TaskClassif> (105x5): Iris Flowers ─────────────────────────────────────────
#> • Target: Species
#> • Target classes: setosa (33%), versicolor (33%), virginica (33%)
#> • Properties: multiclass
#> • Features (4):
#>   • dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
#> 

# Subsample, respecting grouping
df = data.frame(
  target = runif(3000),
  x1 = runif(3000),
  x2 = runif(3000),
  grp = sample(paste0("g", 1:100), 3000, replace = TRUE)
)
task = TaskRegr$new(id = "example", backend = df, target = "target")
task$set_col_roles("grp", "group")

pop = po("subsample", frac = 0.7, use_groups = TRUE)
pop$train(list(task))
#> $output
#> 
#> ── <TaskRegr> (2092x3) ─────────────────────────────────────────────────────────
#> • Target: target
#> • Properties: groups
#> • Features (2):
#>   • dbl (2): x1, x2
#> • Groups: grp
#>