Collapses factors of type factor, ordered: Collapses the rarest factors in the training samples, until target_level_count
levels remain. Levels that have prevalence strictly above no_collapse_above_prevalence or absolute count strictly above no_collapse_above_absolute
are retained, however. For factor variables, these are collapsed to the next larger level, for ordered variables, rare variables
are collapsed to the neighbouring class, whichever has fewer samples.
In case both no_collapse_above_prevalence and no_collapse_above_absolute are given, the less strict threshold of the two will be used, i.e. if
no_collapse_above_prevalence is 1 and no_collapse_above_absolute is 10 for a task with 100 samples, levels that are seen more than 10 times
will not be collapsed.
Levels not seen during training are not touched during prediction; Therefore it is useful to combine this with the
PipeOpFixFactors.
Format
R6Class object inheriting from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.
Construction
id::character(1)
Identifier of resulting object, default"collapsefactors".param_vals:: namedlist
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Defaultlist().
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with rare affected factor and ordered feature levels collapsed.
State
The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:
collapse_map:: namedlistof namedlistofcharacter
List of factor level maps. For each factor,collapse_mapcontains a namedlistthat indicates what levels of the input task get mapped to what levels of the output task. Ifcollapse_maphas an entryfeat_1with an entrya = c("x", "y"), it means that levels"x"and"y"get collapsed to level"a"in feature"feat_1".
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:
no_collapse_above_prevalence::numeric(1)
Fraction of samples below which factor levels get collapsed. Default is 1, which causes all levels to be collapsed untiltarget_level_countremain.no_collapse_above_absolute::integer(1)
Number of samples below which factor levels get collapsed. Default isInf, which causes all levels to be collapsed untiltarget_level_countremain.target_level_count::integer(1)
Number of levels to retain. Default is 2.
Internals
Makes use of the fact that levels(fact_var) = list(target1 = c("source1", "source2"), target2 = "source2") causes
renaming of level "source1" and "source2" both to "target1", and also "source2" to "target2".
Fields
Only fields inherited from PipeOp.
Methods
Only methods inherited from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.
See also
https://mlr-org.com/pipeops.html
Other PipeOps:
PipeOp,
PipeOpEncodePL,
PipeOpEnsemble,
PipeOpImpute,
PipeOpTargetTrafo,
PipeOpTaskPreproc,
PipeOpTaskPreprocSimple,
mlr_pipeops,
mlr_pipeops_adas,
mlr_pipeops_blsmote,
mlr_pipeops_boxcox,
mlr_pipeops_branch,
mlr_pipeops_chunk,
mlr_pipeops_classbalancing,
mlr_pipeops_classifavg,
mlr_pipeops_classweights,
mlr_pipeops_colapply,
mlr_pipeops_colroles,
mlr_pipeops_copy,
mlr_pipeops_datefeatures,
mlr_pipeops_decode,
mlr_pipeops_encode,
mlr_pipeops_encodeimpact,
mlr_pipeops_encodelmer,
mlr_pipeops_encodeplquantiles,
mlr_pipeops_encodepltree,
mlr_pipeops_featureunion,
mlr_pipeops_filter,
mlr_pipeops_fixfactors,
mlr_pipeops_histbin,
mlr_pipeops_ica,
mlr_pipeops_imputeconstant,
mlr_pipeops_imputehist,
mlr_pipeops_imputelearner,
mlr_pipeops_imputemean,
mlr_pipeops_imputemedian,
mlr_pipeops_imputemode,
mlr_pipeops_imputeoor,
mlr_pipeops_imputesample,
mlr_pipeops_kernelpca,
mlr_pipeops_learner,
mlr_pipeops_learner_pi_cvplus,
mlr_pipeops_learner_quantiles,
mlr_pipeops_missind,
mlr_pipeops_modelmatrix,
mlr_pipeops_multiplicityexply,
mlr_pipeops_multiplicityimply,
mlr_pipeops_mutate,
mlr_pipeops_nearmiss,
mlr_pipeops_nmf,
mlr_pipeops_nop,
mlr_pipeops_ovrsplit,
mlr_pipeops_ovrunite,
mlr_pipeops_pca,
mlr_pipeops_proxy,
mlr_pipeops_quantilebin,
mlr_pipeops_randomprojection,
mlr_pipeops_randomresponse,
mlr_pipeops_regravg,
mlr_pipeops_removeconstants,
mlr_pipeops_renamecolumns,
mlr_pipeops_replicate,
mlr_pipeops_rowapply,
mlr_pipeops_scale,
mlr_pipeops_scalemaxabs,
mlr_pipeops_scalerange,
mlr_pipeops_select,
mlr_pipeops_smote,
mlr_pipeops_smotenc,
mlr_pipeops_spatialsign,
mlr_pipeops_subsample,
mlr_pipeops_targetinvert,
mlr_pipeops_targetmutate,
mlr_pipeops_targettrafoscalerange,
mlr_pipeops_textvectorizer,
mlr_pipeops_threshold,
mlr_pipeops_tomek,
mlr_pipeops_tunethreshold,
mlr_pipeops_unbranch,
mlr_pipeops_updatetarget,
mlr_pipeops_vtreat,
mlr_pipeops_yeojohnson
Examples
library("mlr3")
op = PipeOpCollapseFactors$new()
# Create example training task
df = data.frame(
target = runif(100),
fct = factor(rep(LETTERS[1:6], times = c(25, 30, 5, 15, 5, 20))),
ord = factor(rep(1:6, times = c(20, 25, 30, 5, 5, 15)), ordered = TRUE)
)
task = TaskRegr$new(df, target = "target", id = "example_train")
# Training
train_task_collapsed = op$train(list(task))[[1]]
train_task_collapsed$levels(c("fct", "ord"))
#> $fct
#> [1] "A" "B"
#>
#> $ord
#> [1] "2" "3"
#>
# Create example prediction task
df_pred = data.frame(
target = runif(7),
fct = factor(LETTERS[1:7]),
ord = factor(1:7, ordered = TRUE)
)
pred_task = TaskRegr$new(df_pred, target = "target", id = "example_pred")
# Prediction
pred_task_collapsed = op$predict(list(pred_task))[[1]]
pred_task_collapsed$levels(c("fct", "ord"))
#> $fct
#> [1] "A" "B" "G"
#>
#> $ord
#> [1] "2" "3" "7"
#>
