If the Filter can only operate on a subset of columns based on column type, then only these features are considered and filtered. frac counts only the features of the types that the Filter can operate on; this means, for example, that setting nfeat to 0 will only remove features of the types that the Filter can work with.
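The interaction between column types and nfeat can be illustrated with a small base-R sketch. This is not the package's implementation: `select_features`, `scores`, and `types` are hypothetical names introduced here, and the filter is assumed to handle numeric columns only.

```r
# Hypothetical helper, not part of mlr3pipelines: mimics a filter that can
# only score numeric features and keeps the `nfeat` best of them.
select_features = function(scores, types, nfeat) {
  # features the (assumed numeric-only) filter can operate on
  operable = names(types)[types == "numeric"]
  # rank only the operable features, best score first
  ranked = names(sort(scores[operable], decreasing = TRUE))
  kept = head(ranked, nfeat)
  # features of other types are never removed
  untouched = setdiff(names(types), operable)
  # return the surviving features in their original column order
  names(types)[names(types) %in% c(kept, untouched)]
}

types = c(age = "numeric", income = "numeric", height = "numeric", city = "factor")
scores = c(age = 0.3, income = 0.8, height = 0.1)

select_features(scores, types, nfeat = 0)  # only "city" remains
select_features(scores, types, nfeat = 2)  # "age", "income", "city"
```

With nfeat set to 0, every numeric feature is dropped, but the factor column survives because the assumed filter never sees it.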
character(1) Identifier of the resulting object, defaulting to the
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default
Input and output channels are inherited from
The output is the input Task with the filtered-out features removed.
$state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:
Scores calculated for all features of the training Task, which are used as the cutoff for feature filtering. If nfeat is given, the underlying Filter may choose not to calculate scores for all features that are given. This only includes features on which the Filter can operate; e.g. if the Filter can only operate on numeric features, then scores for factor features will not be given.
Names of features that are kept. Features of types that the Filter cannot operate on are always kept.
Number of features to select. Mutually exclusive with
Fraction of features to keep. Mutually exclusive with
Minimum value of filter heuristic for which to keep features. Mutually exclusive with
If this parameter is set, a random permutation of each feature is added to the task before applying the filter. All features selected before the permuted-th permuted feature is selected are kept. This is similar to the approach in Wu et al. (2007) and Thomas et al. (2017). Mutually exclusive with
Note that at least one of
filter.permuted must be given.
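The permutation-based criterion can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's code: the function `select_by_permuted` and the example score vectors are hypothetical, and the scores of the real and permuted features are taken as given rather than computed by a Filter.

```r
# Hypothetical helper: given filter scores for the real features and for
# their permuted copies, keep every real feature that is ranked above the
# `permuted`-th permuted copy.
select_by_permuted = function(real_scores, permuted_scores, permuted = 1) {
  all_scores = c(real_scores, permuted_scores)
  is_permuted = rep(c(FALSE, TRUE),
    times = c(length(real_scores), length(permuted_scores)))
  # rank real and permuted features together, best score first
  ord = order(all_scores, decreasing = TRUE)
  ranked_names = names(all_scores)[ord]
  ranked_is_permuted = is_permuted[ord]
  # rank at which the `permuted`-th permuted copy shows up
  hits = which(ranked_is_permuted)
  stop_at = if (length(hits) >= permuted) hits[permuted] else length(ord) + 1
  # keep the real features that appear before that rank
  before = seq_len(stop_at - 1)
  ranked_names[before][!ranked_is_permuted[before]]
}

real = c(a = 0.9, b = 0.6, c = 0.2, d = 0.05)
perm = c(a_perm = 0.3, b_perm = 0.1, c_perm = 0.02, d_perm = 0.01)

select_by_permuted(real, perm, permuted = 1)  # "a" "b"
select_by_permuted(real, perm, permuted = 2)  # "a" "b" "c"
```

A larger value of permuted is more permissive: selection continues past earlier permuted copies and stops only when the permuted-th one is reached.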
Fields inherited from PipeOpTaskPreproc, as well as:
Wu Y, Boos DD, Stefanski LA (2007). “Controlling Variable Selection by the Addition of Pseudovariables.” Journal of the American Statistical Association, 102(477), 235-243. doi:10.1198/016214506000000843.
Thomas J, Hepp T, Mayr A, Bischl B (2017). “Probing for Sparse and Fast Variable Selection with Model-Based Boosting.” Computational and Mathematical Methods in Medicine, 2017, 1-8. doi:10.1155/2017/1421409.
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")  # provides po(), %>>% and GraphLearner

# setup PipeOpFilter to keep the 5 most important
# features of the spam task w.r.t. their AUC
task = tsk("spam")
filter = flt("auc")
po = po("filter", filter = filter)
po$param_set
#> <ParamSetCollection:auc>
#>                 id    class lower upper nlevels     default value
#> 1:    filter.nfeat ParamInt     0   Inf     Inf <NoDefault>
#> 2:     filter.frac ParamDbl     0     1     Inf <NoDefault>
#> 3:   filter.cutoff ParamDbl  -Inf   Inf     Inf <NoDefault>
#> 4: filter.permuted ParamInt     1   Inf     Inf <NoDefault>
#> 5:  affect_columns ParamUty    NA    NA     Inf  <Selector>
po$param_set$values$filter.nfeat = 5

# filter the task; $train() returns a list, so extract its first element
filtered_task = po$train(list(task))[[1]]

# filtered task + extracted AUC scores
filtered_task$feature_names
#> [1] "capitalAve"      "capitalLong"     "charDollar"      "charExclamation"
#> [5] "your"
head(po$state$scores, 10)
#> charExclamation     capitalLong      capitalAve            your      charDollar
#>       0.3290461       0.3041626       0.2882004       0.2801659       0.2721394
#>    capitalTotal            free             our             you          remove
#>       0.2622801       0.2327285       0.2109325       0.2104681       0.2031303

# feature selection embedded in a holdout resampling
# keep 50% of features based on their AUC score
task = tsk("spam")
gr = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
learner = GraphLearner$new(gr)
rr = resample(task, learner, rsmp("holdout"), store_models = TRUE)

# scores stored in the state of the filter PipeOp of the trained learner
rr$learners[[1]]$model$auc$scores
#>   charExclamation       capitalLong        capitalAve              your
#>      3.290018e-01      3.084719e-01      2.924356e-01      2.850997e-01
#>        charDollar      capitalTotal              free               you
#>      2.760477e-01      2.690304e-01      2.328002e-01      2.133331e-01
#>               our            remove             money               all
#>      2.127344e-01      2.049659e-01      1.848303e-01      1.800999e-01
#>                hp            num000          business              over
#>      1.768315e-01      1.592152e-01      1.529875e-01      1.490547e-01
#>              mail          internet               hpl            george
#>      1.395390e-01      1.362281e-01      1.362075e-01      1.341867e-01
#>             email           receive           address             order
#>      1.316039e-01      1.303801e-01      1.246968e-01      1.142778e-01
#>              make           num1999          charHash            credit
#>      1.090133e-01      1.049933e-01      1.024926e-01      9.926152e-02
#>              will            people              labs         addresses
#>      9.423281e-02      9.040350e-02      7.689188e-02      7.541491e-02
#>            num650             num85               edu               lab
#>      6.979414e-02      6.939648e-02      6.787860e-02      6.004967e-02
#>        technology            telnet           meeting              data
#>      5.498094e-02      5.137943e-02      4.946566e-02      4.597672e-02
#>                pm            report           project            num857
#>      3.984151e-02      3.941819e-02      3.742082e-02      3.490039e-02
#> charSquarebracket            num415          original        conference
#>      3.485239e-02      3.285303e-02      2.864972e-02      2.808021e-02
#>                cs                re              font     charSemicolon
#>      2.658932e-02      2.658113e-02      2.309021e-02      2.247249e-02
#>  charRoundbracket            direct             num3d             table
#>      1.810618e-02      1.206585e-02      9.208792e-03      2.783626e-03
#>             parts
#>      5.883081e-05