Subsamples a Task to use a fraction of the rows.
Sampling happens only during training phase. Subsampling a Task may be
beneficial for training time at possibly (depending on original Task size)
negligible cost of predictive performance.
Format
R6Class object inheriting from PipeOpTaskPreproc/PipeOp.
Construction
id::character(1)Identifier of the resulting object, default"subsample"param_vals:: namedlist
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Defaultlist().
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc.
The output during training is the input Task with added or removed rows according to the sampling.
The output during prediction is the unchanged input.
State
The $state is a named list with the $state elements inherited from PipeOpTaskPreproc.
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc; however, the affect_columns parameter is not present. Further parameters are:
frac::numeric(1)
Fraction of rows in theTaskto keep. May only be greater than 1 ifreplaceisTRUE. Initialized to(1 - exp(-1)) == 0.6321.stratify::logical(1)
Should the subsamples be stratified by target? Initialized toFALSE. May only beTRUEforTaskClassifinput and ifuse_groups = FALSE.use_groups::logical(1)
IfTRUEand if theTaskhas a column with rolegroup, grouped observations are kept together during subsampling. In case of sampling withreplace::logical(1)
Sample with replacement? Initialized toFALSE.
Internals
Uses task$filter() to remove rows. If replace is TRUE and identical rows are added, then the task$row_roles$use can not be used
to duplicate rows because of [inaudible]; instead the task$rbind() function is used, and
a new data.table is attached that contains all rows that are being duplicated exactly as many times as they are being added.
Fields
Only fields inherited from PipeOp.
Methods
Only methods inherited from PipeOpTaskPreproc/PipeOp.
See also
https://mlr-org.com/pipeops.html
Other PipeOps:
PipeOp,
PipeOpEncodePL,
PipeOpEnsemble,
PipeOpImpute,
PipeOpTargetTrafo,
PipeOpTaskPreproc,
PipeOpTaskPreprocSimple,
mlr_pipeops,
mlr_pipeops_adas,
mlr_pipeops_blsmote,
mlr_pipeops_boxcox,
mlr_pipeops_branch,
mlr_pipeops_chunk,
mlr_pipeops_classbalancing,
mlr_pipeops_classifavg,
mlr_pipeops_classweights,
mlr_pipeops_colapply,
mlr_pipeops_collapsefactors,
mlr_pipeops_colroles,
mlr_pipeops_copy,
mlr_pipeops_datefeatures,
mlr_pipeops_decode,
mlr_pipeops_encode,
mlr_pipeops_encodeimpact,
mlr_pipeops_encodelmer,
mlr_pipeops_encodeplquantiles,
mlr_pipeops_encodepltree,
mlr_pipeops_featureunion,
mlr_pipeops_filter,
mlr_pipeops_fixfactors,
mlr_pipeops_histbin,
mlr_pipeops_ica,
mlr_pipeops_imputeconstant,
mlr_pipeops_imputehist,
mlr_pipeops_imputelearner,
mlr_pipeops_imputemean,
mlr_pipeops_imputemedian,
mlr_pipeops_imputemode,
mlr_pipeops_imputeoor,
mlr_pipeops_imputesample,
mlr_pipeops_kernelpca,
mlr_pipeops_learner,
mlr_pipeops_learner_pi_cvplus,
mlr_pipeops_learner_quantiles,
mlr_pipeops_missind,
mlr_pipeops_modelmatrix,
mlr_pipeops_multiplicityexply,
mlr_pipeops_multiplicityimply,
mlr_pipeops_mutate,
mlr_pipeops_nearmiss,
mlr_pipeops_nmf,
mlr_pipeops_nop,
mlr_pipeops_ovrsplit,
mlr_pipeops_ovrunite,
mlr_pipeops_pca,
mlr_pipeops_proxy,
mlr_pipeops_quantilebin,
mlr_pipeops_randomprojection,
mlr_pipeops_randomresponse,
mlr_pipeops_regravg,
mlr_pipeops_removeconstants,
mlr_pipeops_renamecolumns,
mlr_pipeops_replicate,
mlr_pipeops_rowapply,
mlr_pipeops_scale,
mlr_pipeops_scalemaxabs,
mlr_pipeops_scalerange,
mlr_pipeops_select,
mlr_pipeops_smote,
mlr_pipeops_smotenc,
mlr_pipeops_spatialsign,
mlr_pipeops_targetinvert,
mlr_pipeops_targetmutate,
mlr_pipeops_targettrafoscalerange,
mlr_pipeops_textvectorizer,
mlr_pipeops_threshold,
mlr_pipeops_tomek,
mlr_pipeops_tunethreshold,
mlr_pipeops_unbranch,
mlr_pipeops_updatetarget,
mlr_pipeops_vtreat,
mlr_pipeops_yeojohnson
Examples
library("mlr3")
# Subsample with stratification
pop = po("subsample", frac = 0.7, stratify = TRUE, use_groups = FALSE)
pop$train(list(tsk("iris")))
#> $output
#>
#> ── <TaskClassif> (105x5): Iris Flowers ─────────────────────────────────────────
#> • Target: Species
#> • Target classes: setosa (33%), versicolor (33%), virginica (33%)
#> • Properties: multiclass
#> • Features (4):
#> • dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
#>
# Subsample, respecting grouping
df = data.frame(
target = runif(3000),
x1 = runif(3000),
x2 = runif(3000),
grp = sample(paste0("g", 1:100), 3000, replace = TRUE)
)
task = TaskRegr$new(id = "example", backend = df, target = "target")
task$set_col_roles("grp", "group")
pop = po("subsample", frac = 0.7, use_groups = TRUE)
pop$train(list(task))
#> $output
#>
#> ── <TaskRegr> (2111x3) ─────────────────────────────────────────────────────────
#> • Target: target
#> • Properties: groups
#> • Features (2):
#> • dbl (2): x1, x2
#> • Groups: grp
#>
