Feature filtering using a `mlr3filters::Filter` object, see the mlr3filters package.

If a `Filter` can only operate on a subset of columns based on column type, then only these features are considered and filtered. `nfeat` and `frac` count only the features of the types that the `Filter` can operate on; this means, for example, that setting `nfeat` to 0 will only remove features of the types that the `Filter` can work with.
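The type restriction described above can be illustrated with a small sketch. It assumes the `variance` filter from mlr3filters, which handles only integer and numeric columns, and a hypothetical toy data set; neither is part of the original text.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

# Hypothetical toy data: two numeric features and one factor feature.
dat = data.frame(
  y        = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  x_wide   = c(-10, 4, 0, 12, -7, 9),
  x_narrow = c(0.1, 0.2, 0.1, 0.3, 0.2, 0.1),
  group    = factor(c("a", "b", "a", "b", "a", "b"))
)
task = as_task_regr(dat, target = "y")

# Assumption: the variance filter only scores integer and numeric columns,
# so filter.nfeat = 1 is counted among the two numeric features only.
pof = po("filter", filter = flt("variance"), filter.nfeat = 1)
out = pof$train(list(task))[[1]]

# The numeric feature with the larger variance survives;
# the factor column is kept because the filter cannot operate on it.
out$feature_names
```

Here `x_narrow` is dropped, while `group` passes through unchanged even though only one feature was requested.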

## Format

`R6Class` object inheriting from `PipeOpTaskPreprocSimple`/`PipeOpTaskPreproc`/`PipeOp`.

## Construction

- `filter` :: `Filter`

  `Filter` used for feature filtering. This argument is always cloned; to access the `Filter` inside `PipeOpFilter` by reference, use `$filter`.

- `id` :: `character(1)`

  Identifier of the resulting object, defaulting to the `id` of the `Filter` being used.

- `param_vals` :: named `list`

  List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default `list()`.
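A minimal construction sketch, assuming the `variance` filter from mlr3filters as a stand-in for any `Filter` (the filter choice and variable names are illustrative only). It shows the default `id` and the cloning behaviour described above.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

# Illustrative filter; any Filter object could be passed instead.
flt_var = flt("variance")
pof = PipeOpFilter$new(flt_var, param_vals = list(filter.nfeat = 2))

# The id defaults to the id of the Filter.
pof$id

# The constructor cloned the Filter, so the PipeOp holds a different object.
identical(pof$filter, flt_var)

# $filter gives by-reference access to the clone inside the PipeOp.
class(pof$filter)
```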

## Input and Output Channels

Input and output channels are inherited from `PipeOpTaskPreproc`.

The output is the input `Task` with features removed that were filtered out.

## State

The `$state` is a named `list` with the `$state` elements inherited from `PipeOpTaskPreproc`, as well as:

- `scores` :: named `numeric`

  Scores calculated for all features of the training `Task`, which are used as the cutoff for feature filtering. If `frac` or `nfeat` is given, the underlying `Filter` may choose not to calculate scores for all features that are given. This only includes features on which the `Filter` can operate; e.g. if the `Filter` can only operate on numeric features, then scores for factor features will not be given.

- `features` :: `character`

  Names of the features that are kept. Features of types that the `Filter` cannot operate on are always kept.
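As a hedged illustration of the two state elements, the following sketch trains on the `iris` task with the `variance` filter (both chosen here for illustration, not taken from the original text) and inspects `$state` afterwards.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

task = tsk("iris")
pof = po("filter", filter = flt("variance"), filter.nfeat = 2)
out = pof$train(list(task))[[1]]

# Named numeric vector of filter scores.
pof$state$scores

# Names of the features that were kept.
pof$state$features

# The kept features are the features of the output task.
setequal(pof$state$features, out$feature_names)
```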

## Parameters

The parameters are the parameters inherited from the `PipeOpTaskPreproc`, as well as the parameters of the `Filter` used by this object. In addition, the following parameters are introduced:

- `filter.nfeat` :: `numeric(1)`

  Number of features to select. Mutually exclusive with `frac`, `cutoff`, and `permuted`.

- `filter.frac` :: `numeric(1)`

  Fraction of features to keep. Mutually exclusive with `nfeat`, `cutoff`, and `permuted`.

- `filter.cutoff` :: `numeric(1)`

  Minimum value of the filter heuristic for which to keep features. Mutually exclusive with `nfeat`, `frac`, and `permuted`.

- `filter.permuted` :: `integer(1)`

  If this parameter is set, a random permutation of each feature is added to the task before applying the filter. All features selected before the `permuted`-th permuted feature is selected are kept. This is similar to the approach in Wu (2007) and Thomas (2017). Mutually exclusive with `nfeat`, `frac`, and `cutoff`.

Because these four parameters are mutually exclusive, exactly one of `filter.nfeat`, `filter.frac`, `filter.cutoff`, and `filter.permuted` must be given.
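The selection criteria can be compared in a short sketch. It assumes the `iris` task and the `variance` filter purely for illustration; the cutoff value of 0.5 is likewise an arbitrary choice.

```r
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

task = tsk("iris")

# Keep a fraction of the features: half of the four iris features.
po_frac = po("filter", filter = flt("variance"), filter.frac = 0.5)
n_frac = length(po_frac$train(list(task))[[1]]$feature_names)
n_frac

# Keep every feature whose score (here: variance) reaches the cutoff.
po_cut = po("filter", filter = flt("variance"), filter.cutoff = 0.5)
kept_cut = po_cut$train(list(task))[[1]]$feature_names
kept_cut
```

Each `PipeOpFilter` here sets only one criterion, in line with the mutual exclusivity noted above.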

## Internals

This does *not* use the `$.select_cols` feature of `PipeOpTaskPreproc` to select only features compatible with the `Filter`; instead the whole `Task` is used by `private$.get_state()` and subset internally.

## Fields

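Fields inherited from `PipeOpTaskPreproc`, as well as:

- `filter` :: `Filter`

  The `Filter` used for feature filtering, accessible by reference (see Construction).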

## Methods

Methods inherited from `PipeOpTaskPreprocSimple`/`PipeOpTaskPreproc`/`PipeOp`.

## References

Wu Y, Boos DD, Stefanski LA (2007). “Controlling Variable Selection by the Addition of Pseudovariables.” *Journal of the American Statistical Association*, **102**(477), 235-243. doi:10.1198/016214506000000843.

Thomas J, Hepp T, Mayr A, Bischl B (2017). “Probing for Sparse and Fast Variable Selection with Model-Based Boosting.” *Computational and Mathematical Methods in Medicine*, **2017**, 1-8. doi:10.1155/2017/1421409.

## See also

https://mlr3book.mlr-org.com/list-pipeops.html

Other PipeOps:
`PipeOpEnsemble`, `PipeOpImpute`, `PipeOpTargetTrafo`, `PipeOpTaskPreprocSimple`, `PipeOpTaskPreproc`, `PipeOp`,
`mlr_pipeops_boxcox`, `mlr_pipeops_branch`, `mlr_pipeops_chunk`, `mlr_pipeops_classbalancing`, `mlr_pipeops_classifavg`,
`mlr_pipeops_classweights`, `mlr_pipeops_colapply`, `mlr_pipeops_collapsefactors`, `mlr_pipeops_colroles`, `mlr_pipeops_copy`,
`mlr_pipeops_datefeatures`, `mlr_pipeops_encodeimpact`, `mlr_pipeops_encodelmer`, `mlr_pipeops_encode`, `mlr_pipeops_featureunion`,
`mlr_pipeops_fixfactors`, `mlr_pipeops_histbin`, `mlr_pipeops_ica`, `mlr_pipeops_imputeconstant`, `mlr_pipeops_imputehist`,
`mlr_pipeops_imputelearner`, `mlr_pipeops_imputemean`, `mlr_pipeops_imputemedian`, `mlr_pipeops_imputemode`, `mlr_pipeops_imputeoor`,
`mlr_pipeops_imputesample`, `mlr_pipeops_kernelpca`, `mlr_pipeops_learner`, `mlr_pipeops_missind`, `mlr_pipeops_modelmatrix`,
`mlr_pipeops_multiplicityexply`, `mlr_pipeops_multiplicityimply`, `mlr_pipeops_mutate`, `mlr_pipeops_nmf`, `mlr_pipeops_nop`,
`mlr_pipeops_ovrsplit`, `mlr_pipeops_ovrunite`, `mlr_pipeops_pca`, `mlr_pipeops_proxy`, `mlr_pipeops_quantilebin`,
`mlr_pipeops_randomprojection`, `mlr_pipeops_randomresponse`, `mlr_pipeops_regravg`, `mlr_pipeops_removeconstants`, `mlr_pipeops_renamecolumns`,
`mlr_pipeops_replicate`, `mlr_pipeops_scalemaxabs`, `mlr_pipeops_scalerange`, `mlr_pipeops_scale`, `mlr_pipeops_select`,
`mlr_pipeops_smote`, `mlr_pipeops_spatialsign`, `mlr_pipeops_subsample`, `mlr_pipeops_targetinvert`, `mlr_pipeops_targetmutate`,
`mlr_pipeops_targettrafoscalerange`, `mlr_pipeops_textvectorizer`, `mlr_pipeops_threshold`, `mlr_pipeops_tunethreshold`, `mlr_pipeops_unbranch`,
`mlr_pipeops_updatetarget`, `mlr_pipeops_vtreat`, `mlr_pipeops_yeojohnson`, `mlr_pipeops`

## Examples

```
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")  # provides po(), %>>% and GraphLearner

# set up a PipeOpFilter to keep the 5 most important
# features of the spam task w.r.t. their AUC
task = tsk("spam")
filter = flt("auc")
po_filter = po("filter", filter = filter)
po_filter$param_set
#> <ParamSetCollection:auc>
#>                 id    class lower upper nlevels        default value
#> 1:    filter.nfeat ParamInt     0   Inf     Inf <NoDefault[3]>
#> 2:     filter.frac ParamDbl     0     1     Inf <NoDefault[3]>
#> 3:   filter.cutoff ParamDbl  -Inf   Inf     Inf <NoDefault[3]>
#> 4: filter.permuted ParamInt     1   Inf     Inf <NoDefault[3]>
#> 5:  affect_columns ParamUty    NA    NA     Inf  <Selector[1]>
po_filter$param_set$values$filter.nfeat = 5

# filter the task
filtered_task = po_filter$train(list(task))[[1]]

# filtered task + extracted AUC scores
filtered_task$feature_names
#> [1] "capitalAve"      "capitalLong"     "charDollar"      "charExclamation"
#> [5] "your"
head(po_filter$state$scores, 10)
#> charExclamation     capitalLong      capitalAve            your      charDollar
#>       0.3290461       0.3041626       0.2882004       0.2801659       0.2721394
#>    capitalTotal            free             our             you          remove
#>       0.2622801       0.2327285       0.2109325       0.2104681       0.2031303

# feature selection embedded in a holdout resampling;
# keep 50% of features based on their AUC score
task = tsk("spam")
gr = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
learner = GraphLearner$new(gr)
rr = resample(task, learner, rsmp("holdout"), store_models = TRUE)
rr$learners[[1]]$model$auc$scores
#>   charExclamation       capitalLong        capitalAve              your
#>      3.290018e-01      3.084719e-01      2.924356e-01      2.850997e-01
#>        charDollar      capitalTotal              free               you
#>      2.760477e-01      2.690304e-01      2.328002e-01      2.133331e-01
#>               our            remove             money               all
#>      2.127344e-01      2.049659e-01      1.848303e-01      1.800999e-01
#>                hp            num000          business              over
#>      1.768315e-01      1.592152e-01      1.529875e-01      1.490547e-01
#>              mail          internet               hpl            george
#>      1.395390e-01      1.362281e-01      1.362075e-01      1.341867e-01
#>             email           receive           address             order
#>      1.316039e-01      1.303801e-01      1.246968e-01      1.142778e-01
#>              make           num1999          charHash            credit
#>      1.090133e-01      1.049933e-01      1.024926e-01      9.926152e-02
#>              will            people              labs         addresses
#>      9.423281e-02      9.040350e-02      7.689188e-02      7.541491e-02
#>            num650             num85               edu               lab
#>      6.979414e-02      6.939648e-02      6.787860e-02      6.004967e-02
#>        technology            telnet           meeting              data
#>      5.498094e-02      5.137943e-02      4.946566e-02      4.597672e-02
#>                pm            report           project            num857
#>      3.984151e-02      3.941819e-02      3.742082e-02      3.490039e-02
#> charSquarebracket            num415          original        conference
#>      3.485239e-02      3.285303e-02      2.864972e-02      2.808021e-02
#>                cs                re              font     charSemicolon
#>      2.658932e-02      2.658113e-02      2.309021e-02      2.247249e-02
#>  charRoundbracket            direct             num3d             table
#>      1.810618e-02      1.206585e-02      9.208792e-03      2.783626e-03
#>             parts
#>      5.883081e-05
```