Feature filtering using a mlr3filters::Filter object, see the
mlr3filters package.
If a Filter can only operate on a subset of columns based on column type, then only these features are considered and filtered.
nfeat and frac count only the features of the types that the Filter can operate on;
for example, setting nfeat to 0 removes only the features of types that the Filter can work with and keeps all others.
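The type restriction described above can be illustrated with a minimal sketch. It assumes the "german_credit" task shipped with mlr3 (a two-class task that mixes integer and factor features) and the "auc" filter, which is assumed here to operate only on the types listed in its feature_types field. The variable names are illustrative only.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

task = tsk("german_credit")
auc_filter = flt("auc")

# features whose type the filter declares it can operate on
ft = task$feature_types
operable = ft$id[ft$type %in% auc_filter$feature_types]

# nfeat counts only operable features: keep one of them
pof = po("filter", filter = auc_filter, filter.nfeat = 1)
out = pof$train(list(task))[[1]]

# features the filter cannot score pass through unchanged
setdiff(task$feature_names, operable)

# the single operable feature that survived
intersect(out$feature_names, operable)

# scores exist only for operable features
names(pof$state$scores)
```

Choosing nfeat = 1 rather than a larger value makes it easy to see that the count applies to the operable features alone, not to the task as a whole.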
Format
R6Class object inheriting from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.
Construction
filter :: Filter
Filter used for feature filtering. This argument is always cloned; to access the Filter inside PipeOpFilter by-reference, use $filter.
id :: character(1)
Identifier of the resulting object, defaulting to the id of the Filter being used.
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
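As a short sketch of the construction arguments, the following builds a PipeOpFilter through the po() shorthand and checks that the Filter was cloned. The variable names are illustrative, and the "auc" filter is just one possible choice.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

flt_auc = flt("auc")

# param_vals sets hyperparameters at construction time
pof = po("filter", filter = flt_auc, param_vals = list(filter.nfeat = 3))

# the PipeOp holds its own copy of the Filter, not the object passed in
identical(pof$filter, flt_auc)

# the id defaults to the id of the Filter
pof$id

# the hyperparameter given through param_vals
pof$param_set$values$filter.nfeat
```

Because of the cloning, later changes to flt_auc do not affect the PipeOp; changes meant for the PipeOp go through pof$filter or pof$param_set.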
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with features removed that were filtered out.
State
The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:
scores :: named numeric
Scores calculated for all features of the training Task which are being used as cutoff for feature filtering. If frac or nfeat is given, the underlying Filter may choose to not calculate scores for all features that are given. This only includes features on which the Filter can operate; e.g. if the Filter can only operate on numeric features, then scores for factor features will not be given.
features :: character
Names of features that are being kept. Features of types that the Filter cannot operate on are always kept.
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as the parameters of the Filter
used by this object. In addition, the following parameters are introduced:
filter.nfeat :: numeric(1)
Number of features to select. Mutually exclusive with frac, cutoff, and permuted.
filter.frac :: numeric(1)
Fraction of features to keep. Mutually exclusive with nfeat, cutoff, and permuted.
filter.cutoff :: numeric(1)
Minimum value of filter heuristic for which to keep features. Mutually exclusive with nfeat, frac, and permuted.
filter.permuted :: integer(1)
If this parameter is set, a random permutation of each feature is added to the task before applying the filter. All features selected before the permuted-th permuted feature is selected are kept. This is similar to the approach in Wu (2007) and Thomas (2017). Mutually exclusive with nfeat, frac, and cutoff.
Note that, because these parameters are mutually exclusive, exactly one of filter.nfeat, filter.frac, filter.cutoff, and filter.permuted must be given.
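The selection criteria can be sketched as follows, using the "spam" task and the "auc" filter. The threshold 0.2 and the fraction 0.25 are arbitrary values chosen for illustration, and each PipeOp sets only one criterion.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

task = tsk("spam")

# keep every feature whose filter score reaches the cutoff
po_cut = po("filter", filter = flt("auc"), filter.cutoff = 0.2)
kept_cut = po_cut$train(list(task))[[1]]$feature_names

# all kept features have a score at or above the cutoff
po_cut$state$scores[kept_cut]

# alternatively, keep a fraction of the features
po_frac = po("filter", filter = flt("auc"), filter.frac = 0.25)
kept_frac = po_frac$train(list(task))[[1]]$feature_names
length(kept_frac)
```

A cutoff is convenient when the filter score has a meaningful scale, while nfeat and frac fix the size of the result regardless of the score values.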
Internals
This does not use the $.select_cols feature of PipeOpTaskPreproc to select only features compatible with the Filter;
instead the whole Task is used by private$.get_state() and subset internally.
Fields
Fields inherited from PipeOp, as well as:
filter :: Filter
The Filter used for feature filtering, accessible by-reference.
Methods
Methods inherited from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.
References
Wu Y, Boos DD, Stefanski LA (2007). “Controlling Variable Selection by the Addition of Pseudovariables.” Journal of the American Statistical Association, 102(477), 235–243. doi:10.1198/016214506000000843.
Thomas J, Hepp T, Mayr A, Bischl B (2017). “Probing for Sparse and Fast Variable Selection with Model-Based Boosting.” Computational and Mathematical Methods in Medicine, 2017, 1–8. doi:10.1155/2017/1421409.
See also
https://mlr-org.com/pipeops.html
Other PipeOps:
PipeOp,
PipeOpEncodePL,
PipeOpEnsemble,
PipeOpImpute,
PipeOpTargetTrafo,
PipeOpTaskPreproc,
PipeOpTaskPreprocSimple,
mlr_pipeops,
mlr_pipeops_adas,
mlr_pipeops_blsmote,
mlr_pipeops_boxcox,
mlr_pipeops_branch,
mlr_pipeops_chunk,
mlr_pipeops_classbalancing,
mlr_pipeops_classifavg,
mlr_pipeops_classweights,
mlr_pipeops_colapply,
mlr_pipeops_collapsefactors,
mlr_pipeops_colroles,
mlr_pipeops_copy,
mlr_pipeops_datefeatures,
mlr_pipeops_decode,
mlr_pipeops_encode,
mlr_pipeops_encodeimpact,
mlr_pipeops_encodelmer,
mlr_pipeops_encodeplquantiles,
mlr_pipeops_encodepltree,
mlr_pipeops_featureunion,
mlr_pipeops_fixfactors,
mlr_pipeops_histbin,
mlr_pipeops_ica,
mlr_pipeops_imputeconstant,
mlr_pipeops_imputehist,
mlr_pipeops_imputelearner,
mlr_pipeops_imputemean,
mlr_pipeops_imputemedian,
mlr_pipeops_imputemode,
mlr_pipeops_imputeoor,
mlr_pipeops_imputesample,
mlr_pipeops_kernelpca,
mlr_pipeops_learner,
mlr_pipeops_learner_pi_cvplus,
mlr_pipeops_learner_quantiles,
mlr_pipeops_missind,
mlr_pipeops_modelmatrix,
mlr_pipeops_multiplicityexply,
mlr_pipeops_multiplicityimply,
mlr_pipeops_mutate,
mlr_pipeops_nearmiss,
mlr_pipeops_nmf,
mlr_pipeops_nop,
mlr_pipeops_ovrsplit,
mlr_pipeops_ovrunite,
mlr_pipeops_pca,
mlr_pipeops_proxy,
mlr_pipeops_quantilebin,
mlr_pipeops_randomprojection,
mlr_pipeops_randomresponse,
mlr_pipeops_regravg,
mlr_pipeops_removeconstants,
mlr_pipeops_renamecolumns,
mlr_pipeops_replicate,
mlr_pipeops_rowapply,
mlr_pipeops_scale,
mlr_pipeops_scalemaxabs,
mlr_pipeops_scalerange,
mlr_pipeops_select,
mlr_pipeops_smote,
mlr_pipeops_smotenc,
mlr_pipeops_spatialsign,
mlr_pipeops_subsample,
mlr_pipeops_targetinvert,
mlr_pipeops_targetmutate,
mlr_pipeops_targettrafoscalerange,
mlr_pipeops_textvectorizer,
mlr_pipeops_threshold,
mlr_pipeops_tomek,
mlr_pipeops_tunethreshold,
mlr_pipeops_unbranch,
mlr_pipeops_updatetarget,
mlr_pipeops_vtreat,
mlr_pipeops_yeojohnson
Examples
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")
# setup PipeOpFilter to keep the 5 most important
# features of the spam task w.r.t. their AUC
task = tsk("spam")
filter = flt("auc")
po = po("filter", filter = filter)
po$param_set
#> <ParamSetCollection(5)>
#> id class lower upper nlevels default value
#> <char> <char> <num> <num> <num> <list> <list>
#> 1: filter.nfeat ParamInt 0 Inf Inf <NoDefault[0]> [NULL]
#> 2: filter.frac ParamDbl 0 1 Inf <NoDefault[0]> [NULL]
#> 3: filter.cutoff ParamDbl -Inf Inf Inf <NoDefault[0]> [NULL]
#> 4: filter.permuted ParamInt 1 Inf Inf <NoDefault[0]> [NULL]
#> 5: affect_columns ParamUty NA NA Inf <Selector[1]> [NULL]
po$param_set$values$filter.nfeat = 5
# filter the task
filtered_task = po$train(list(task))[[1]]
# filtered task + extracted AUC scores
filtered_task$feature_names
#> [1] "capitalAve" "capitalLong" "charDollar" "charExclamation"
#> [5] "your"
head(po$state$scores, 10)
#> charExclamation capitalLong capitalAve your charDollar
#> 0.3290461 0.3041626 0.2882004 0.2801659 0.2721394
#> capitalTotal free our you remove
#> 0.2622801 0.2327285 0.2109325 0.2104681 0.2031303
# feature selection embedded in a holdout resampling
# keep 50% of features based on their AUC score
task = tsk("spam")
gr = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
learner = GraphLearner$new(gr)
rr = resample(task, learner, rsmp("holdout"), store_models = TRUE)
rr$learners[[1]]$model$auc$scores
#> charExclamation capitalLong capitalAve your
#> 0.328759080 0.306751518 0.288703808 0.279083484
#> charDollar capitalTotal free you
#> 0.273296589 0.263313380 0.230684960 0.215739152
#> our remove money hp
#> 0.213188551 0.204291119 0.179973626 0.176336364
#> all num000 business over
#> 0.176079754 0.157094167 0.149369524 0.140389728
#> mail george hpl internet
#> 0.136271572 0.135293399 0.133370707 0.131148231
#> receive email address order
#> 0.129122806 0.127931640 0.127513404 0.113377057
#> make num1999 charHash credit
#> 0.106704315 0.102212867 0.101519423 0.096648263
#> people will addresses labs
#> 0.090943952 0.089247316 0.073828362 0.073427396
#> num650 num85 edu lab
#> 0.068441105 0.067097612 0.062289775 0.060257929
#> technology telnet meeting data
#> 0.054653251 0.052352619 0.052065676 0.048215864
#> pm report project num857
#> 0.041177580 0.038103133 0.037171012 0.034819678
#> charSquarebracket num415 original conference
#> 0.033739215 0.032761485 0.029735304 0.024965106
#> charSemicolon cs font re
#> 0.024724216 0.024623477 0.023837706 0.023239245
#> charRoundbracket direct num3d table
#> 0.020505763 0.012178452 0.008840974 0.005753906
#> parts
#> 0.001509991
