If the Filter can only operate on a subset of columns based on column type, then only these features are considered and filtered. nfeat and frac count only the features of the type that the Filter can operate on; this means, e.g., that setting nfeat to 0 will only remove features of the type that the Filter can work with.
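The type restriction described above can be illustrated with a minimal base R sketch. It does not use the package's code: the variance-based score and the helper keep_features() are hypothetical stand-ins for a filter that can only operate on numeric columns.

```r
# Toy data: two numeric features and one factor feature.
dat = data.frame(
  x1 = c(1, 2, 3, 4),
  x2 = c(10, 10, 11, 10),
  f  = factor(c("a", "b", "a", "b"))
)

# Hypothetical numeric-only filter: the score is the column variance.
# Only numeric columns are "operable", so only they receive a score.
operable = names(dat)[vapply(dat, is.numeric, logical(1))]
scores = sort(vapply(dat[operable], var, numeric(1)), decreasing = TRUE)

# Hypothetical helper: keep the top `nfeat` scored features plus every
# feature that the filter could not score (here: the factor column).
keep_features = function(scores, all_names, nfeat) {
  kept_scored = names(scores)[seq_len(min(nfeat, length(scores)))]
  untouched = setdiff(all_names, names(scores))
  all_names[all_names %in% c(kept_scored, untouched)]
}

# nfeat = 0 removes all numeric features but leaves the factor in place.
keep_nfeat0 = keep_features(scores, names(dat), nfeat = 0)
# nfeat = 1 keeps the highest-variance numeric feature and the factor.
keep_nfeat1 = keep_features(scores, names(dat), nfeat = 1)
```

In this sketch, the factor column f is never scored and is therefore never dropped, whatever value nfeat takes.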
PipeOpFilter$new(filter, id = filter$id, param_vals = list())
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with features removed that were filtered out.
$state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:
scores :: named numeric
Scores calculated for all features of the training Task, which are used as the cutoff for feature filtering. If nfeat is given, the underlying Filter may choose not to calculate scores for all features that are given. This only includes features on which the Filter can operate; e.g. if the Filter can only operate on numeric features, then scores for factor features will not be given.
Names of features that are being kept. Features of types that the Filter cannot operate on are always kept.
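filter.nfeat
Number of features to select. Mutually exclusive with filter.frac and filter.cutoff.
filter.frac
Fraction of features to keep. Mutually exclusive with filter.nfeat and filter.cutoff.
filter.cutoff
Minimum value of filter heuristic for which to keep features. Mutually exclusive with filter.nfeat and filter.frac.
Note that at least one of filter.nfeat, filter.frac, or filter.cutoff must be given.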
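How the three selection criteria (number, fraction, score cutoff) could act on a vector of filter scores is sketched below in base R. This is an illustration only, not the package's implementation: select_by_scores() is a hypothetical helper, and both the rounding used for the fraction and the use of >= for the cutoff are assumptions. The score values are the first six AUC scores from the example output on this page.

```r
# Hypothetical helper: choose feature names from a named score vector
# using exactly one of nfeat, frac, or cutoff.
select_by_scores = function(scores, nfeat = NULL, frac = NULL, cutoff = NULL) {
  given = c(!is.null(nfeat), !is.null(frac), !is.null(cutoff))
  if (sum(given) != 1) {
    stop("Exactly one of nfeat, frac, cutoff must be given.")
  }
  scores = sort(scores, decreasing = TRUE)
  if (!is.null(cutoff)) {
    # Assumption: features scoring at least `cutoff` are kept.
    return(names(scores)[scores >= cutoff])
  }
  if (!is.null(frac)) {
    # Assumption: the fraction is converted to a count by rounding.
    nfeat = round(frac * length(scores))
  }
  names(scores)[seq_len(min(nfeat, length(scores)))]
}

scores = c(
  charExclamation = 0.3290461, capitalLong = 0.3041626,
  capitalAve = 0.2882004, your = 0.2801659,
  charDollar = 0.2721394, capitalTotal = 0.2622801
)

by_nfeat  = select_by_scores(scores, nfeat = 2)     # two best features
by_frac   = select_by_scores(scores, frac = 0.5)    # best half (3 of 6)
by_cutoff = select_by_scores(scores, cutoff = 0.28) # score >= 0.28
```

Rejecting calls that set more than one criterion mirrors the mutual exclusivity stated above; calling the helper with none of them set fails as well.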
Fields inherited from PipeOpTaskPreproc, as well as:
library("mlr3")
library("mlr3filters")
library("mlr3pipelines")

# setup PipeOpFilter to keep the 5 most important
# features of the spam task w.r.t. their AUC
task = tsk("spam")
filter = flt("auc")
po = po("filter", filter = filter)
po$param_set
#> ParamSet: auc
#>                id    class lower upper levels     default value
#> 1:   filter.nfeat ParamInt     0   Inf        <NoDefault>
#> 2:    filter.frac ParamDbl     0     1        <NoDefault>
#> 3:  filter.cutoff ParamDbl  -Inf   Inf        <NoDefault>
#> 4: affect_columns ParamUty    NA    NA        <NoDefault>
po$param_set$values$filter.nfeat = 5

# filter the task; $train() returns a list, so extract its first element
filtered_task = po$train(list(task))[[1]]

# filtered task + extracted AUC scores
filtered_task$feature_names
#> [1] "capitalAve"      "capitalLong"     "charDollar"      "charExclamation"
#> [5] "your"
head(po$state$scores, 10)
#> charExclamation     capitalLong      capitalAve            your      charDollar
#>       0.3290461       0.3041626       0.2882004       0.2801659       0.2721394
#>    capitalTotal            free             our             you          remove
#>       0.2622801       0.2327285       0.2109325       0.2104681       0.2031303

# feature selection embedded in a holdout resampling
# keep 50% of features based on their AUC score
task = tsk("spam")
gr = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
learner = GraphLearner$new(gr)
rr = resample(task, learner, rsmp("holdout"), store_models = TRUE)
rr$learners[[1]]$model$auc$scores
#>   charExclamation       capitalLong        capitalAve              your
#>       0.334104675       0.303172295       0.284178501       0.274073473
#>        charDollar      capitalTotal              free               our
#>       0.268829241       0.260534520       0.238240723       0.212845684
#>               you            remove             money               all
#>       0.202178359       0.193070040       0.177594683       0.176178991
#>                hp            num000          business          internet
#>       0.175957873       0.150730517       0.144906996       0.138119186
#>            george              over              mail               hpl
#>       0.135077639       0.133729691       0.133660830       0.131520523
#>           receive             email           address             order
#>       0.126980784       0.122406838       0.118232647       0.115468784
#>           num1999              make          charHash            credit
#>       0.105703146       0.100847045       0.099345723       0.093536735
#>            people              labs              will         addresses
#>       0.085092887       0.073494139       0.071127796       0.068573648
#>             num85            num650               lab               edu
#>       0.067434521       0.067365883       0.062974375       0.059673479
#>        technology           meeting            telnet              data
#>       0.057667319       0.053994836       0.051205037       0.047252189
#>            report                pm charSquarebracket            num857
#>       0.042756495       0.042534929       0.037635231       0.035629518
#>           project            num415          original                re
#>       0.035013562       0.034052179       0.033806915       0.031376629
#>        conference                cs     charSemicolon              font
#>       0.028086464       0.026428190       0.022944631       0.020986316
#>  charRoundbracket             num3d            direct             table
#>       0.014926251       0.011405130       0.006168500       0.002078599
#>             parts
#>       0.001845631