Base class for handling most "preprocessing" operations. These are operations that have exactly one Task input and one Task output, and expect the column layout of these Tasks during input and output to be the same.
The prediction behavior of preprocessing operations should always be independent for each row of the input Task. This means that the prediction operation of preprocessing PipeOps should commute with rbind(): running prediction on an n-row Task should give the same result as rbind()-ing the prediction results from n 1-row Tasks with the same content. In the large majority of cases, the number and order of rows should also not be changed during prediction.
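The row-independence property can be checked empirically. The following is a minimal sketch, assuming the mlr3, mlr3pipelines and data.table packages are installed; it uses the existing po("scale") operator on the first ten rows of the iris Task and is meant as an illustration, not as a test shipped with the package.

```r
library(mlr3)
library(mlr3pipelines)
library(data.table)

task = tsk("iris")
op = po("scale")
op$train(list(task))

# Prediction on a 10-row Task in a single call.
rows = 1:10
small = task$clone(deep = TRUE)$filter(rows)
full = op$predict(list(small))$output$data()

# Prediction on ten 1-row Tasks, stacked afterwards.
rowwise = rbindlist(lapply(rows, function(i) {
  single = task$clone(deep = TRUE)$filter(i)
  op$predict(list(single))$output$data()
}))

# Should be TRUE when prediction commutes with rbind().
all.equal(full, rowwise, check.attributes = FALSE)
```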
Users must implement private$.train_task() and private$.predict_task(), which take a Task as input and should return that Task. The Task should, if possible, be manipulated in-place, and should not be cloned.
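As an illustration of the Task-level interface, the sketch below defines a hypothetical PipeOpDropNAFeatures class (not part of mlr3pipelines) that removes every feature containing a missing value during training and removes the same features during prediction. The class name, the id and the "keep" state element are assumptions made for this example.

```r
library(mlr3)
library(mlr3pipelines)
library(R6)

# Hypothetical PipeOp: drops features that contain NAs in the training data.
PipeOpDropNAFeatures = R6Class("PipeOpDropNAFeatures",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "dropnafeatures") {
      super$initialize(id = id)
    }
  ),
  private = list(
    .train_task = function(task) {
      dat = task$data(cols = task$feature_names)
      keep = names(dat)[!vapply(dat, anyNA, logical(1))]
      self$state = list(keep = keep)  # the state must be a list
      task$select(keep)               # changes the Task in place, returns it
    },
    .predict_task = function(task) {
      task$select(self$state$keep)
    }
  )
)

# iris with a few missing values injected into one feature.
d = iris
d$Sepal.Length[1:5] = NA
task = as_task_classif(d, target = "Species", id = "iris_na")

op = PipeOpDropNAFeatures$new()
out = op$train(list(task))$output
pred = op$predict(list(task))$output
out$feature_names
```

Because only columns are removed and no values are changed, working on the Task directly avoids copying the data into a data.table and back.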
Alternatively, the private$.train_dt() and private$.predict_dt() functions can be implemented, which operate on data.table objects instead. This should generally only be done if all data is altered in some way (e.g. PCA changing all columns to principal components), and not if only a few columns are added or removed (e.g. feature selection), because the latter should be done at the Task level with private$.train_task(). The private$.select_cols() function can be overloaded so that private$.train_dt() and private$.predict_dt() operate only on a subset of the Task's data, e.g. only on numerical columns.
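The data.table-level interface can be sketched as follows. PipeOpCenterNumeric is a hypothetical class written for this example (it is not part of mlr3pipelines); it centers numeric features with the means observed during training and restricts itself to numeric columns by overloading private$.select_cols().

```r
library(mlr3)
library(mlr3pipelines)
library(R6)

# Hypothetical PipeOp: centers numeric features using the training means.
PipeOpCenterNumeric = R6Class("PipeOpCenterNumeric",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "centernumeric") {
      super$initialize(id = id)
    }
  ),
  private = list(
    .select_cols = function(task) {
      ft = task$feature_types
      ft$id[ft$type == "numeric"]
    },
    .train_dt = function(dt, levels, target) {
      m = as.matrix(dt)
      self$state = list(center = colMeans(m))
      sweep(m, 2, self$state$center)  # a matrix is an allowed return type
    },
    .predict_dt = function(dt, levels) {
      # Reuse the training means instead of recomputing them.
      sweep(as.matrix(dt), 2, self$state$center)
    }
  )
)

task = tsk("iris")
op = PipeOpCenterNumeric$new()
trained = op$train(list(task))$output
colMeans(trained$data(cols = trained$feature_names))
```

After training, the column means of the transformed features should be numerically zero, while prediction on new data subtracts the stored training means.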
If the can_subset_cols argument of the constructor is TRUE (the default), then the hyperparameter affect_columns is added, which can limit the columns of the Task that are modified by the PipeOpTaskPreproc using a Selector function. Note that this functionality is entirely independent of the private$.select_cols() functionality.
PipeOpTaskPreproc is useful for operations that behave differently during training and prediction. For operations that perform essentially the same operation in both phases and only need to do extra work to build a $state during training, the PipeOpTaskPreprocSimple class can be used instead.
Construction
PipeOpTaskPreproc$new(id, param_set = ps(), param_vals = list(), can_subset_cols = TRUE,
packages = character(0), task_type = "Task", tags = NULL, feature_types = mlr_reflections$task_feature_types)
id :: character(1)
Identifier of resulting object. See $id slot of PipeOp.
param_set :: ParamSet
Parameter space description. This should be created by the subclass and given to super$initialize().
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().
can_subset_cols :: logical(1)
Whether the affect_columns parameter should be added, which lets the user limit the columns that are modified by the PipeOpTaskPreproc. This should generally be FALSE if the operation adds or removes rows from the Task, and TRUE otherwise. Default is TRUE.
packages :: character
Set of all required packages for the PipeOp's private$.train() and private$.predict() methods. See $packages slot. Default is character(0).
task_type :: character(1)
The class of Task that should be accepted as input and will be returned as output. This should generally be a character(1) identifying a type of Task, e.g. "Task", "TaskClassif" or "TaskRegr" (or another subclass introduced by other packages). Default is "Task".
tags :: character | NULL
Tags of the resulting PipeOp. These are added to the tag "data transform". Default NULL.
feature_types :: character
Feature types affected by the PipeOp. See private$.select_cols() for more information. Defaults to all available feature types.
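A subclass constructor typically builds its own ParamSet and forwards it, together with param_vals, to super$initialize(). The sketch below assumes the paradox helpers ps() and p_dbl() are available; PipeOpShift and its "offset" hyperparameter are hypothetical names introduced only for this example.

```r
library(mlr3)
library(mlr3pipelines)
library(paradox)
library(R6)

# Hypothetical PipeOp: adds a constant offset to all numeric features.
PipeOpShift = R6Class("PipeOpShift",
  inherit = PipeOpTaskPreproc,
  public = list(
    initialize = function(id = "shift", param_vals = list()) {
      param_set = ps(offset = p_dbl(tags = c("train", "predict")))
      param_set$values = list(offset = 1)  # initial value, may be overwritten
      super$initialize(
        id = id,
        param_set = param_set,
        param_vals = param_vals,      # passed on so callers can override
        feature_types = "numeric"     # default column selection by type
      )
    }
  ),
  private = list(
    .train_dt = function(dt, levels, target) {
      self$state = list()
      as.matrix(dt) + self$param_set$values$offset
    },
    .predict_dt = function(dt, levels) {
      as.matrix(dt) + self$param_set$values$offset
    }
  )
)

op = PipeOpShift$new(param_vals = list(offset = 10))
op$param_set$values$offset
out = op$train(list(tsk("iris")))$output
head(out$data())
```

Since this operation does the same thing in training and prediction, PipeOpTaskPreprocSimple would likely be the more natural base class in practice; PipeOpTaskPreproc is used here only to show the constructor arguments.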
Input and Output Channels
PipeOpTaskPreproc has one input channel named "input", taking a Task, or a subclass of Task if the task_type construction argument is given as such; both during training and prediction.
PipeOpTaskPreproc has one output channel named "output", producing a Task, or a subclass; the Task type is the same as for input; both during training and prediction.
The output Task is the input Task, modified according to the overloaded private$.train_task() / private$.predict_task() or private$.train_dt() / private$.predict_dt() functions.
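The channel layout can be inspected on any concrete subclass. A short sketch using the existing po("scale") operator, assuming mlr3 and mlr3pipelines are installed:

```r
library(mlr3)
library(mlr3pipelines)

op = po("scale")
op$input   # table describing the single input channel
op$output  # table describing the single output channel

# Inputs are passed as a list; results come back as a named list.
task = tsk("iris")
res = op$train(list(task))
names(res)
pred = op$predict(list(task))
class(pred$output)
```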
State
The $state is a named list; besides members added by inheriting classes, the members are:
affect_cols :: character
Names of features being selected by the affect_columns parameter, if present; names of all present features otherwise.
intasklayout :: data.table
Copy of the training Task's $feature_types slot. This is used during prediction to ensure that the prediction Task has the same features, feature layout, and feature types as during training.
outtasklayout :: data.table
Copy of the trained Task's $feature_types slot. This is used during prediction to ensure that the Task resulting from the prediction operation has the same features, feature layout, and feature types as after training.
dt_columns :: character
Names of features selected by the private$.select_cols() call during training. This is only present if the private$.train_dt() functionality is used, and not present if the private$.train_task() function is overloaded instead.
feature_types :: character
Feature types affected by the PipeOp. See private$.select_cols() for more information.
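The state members can be examined after training. The sketch below uses po("scale"), which is implemented through the data.table interface and therefore should also carry dt_columns; the exact set of members may differ between package versions.

```r
library(mlr3)
library(mlr3pipelines)

op = po("scale")
op$train(list(tsk("iris")))

names(op$state)        # members of the subclass plus those added by the base class
op$state$affect_cols   # features selected through affect_columns
op$state$intasklayout  # feature names and types of the training Task
op$state$dt_columns    # features handed to private$.train_dt()
```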
Parameters
affect_columns :: function | Selector | NULL
What columns the PipeOpTaskPreproc should operate on. This parameter is only present if the constructor is called with the can_subset_cols argument set to TRUE (the default).
The parameter must be a Selector function, which takes a Task as argument and returns a character vector of features to use.
See Selector for example functions. Defaults to NULL, which selects all features.
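A brief sketch of affect_columns in use, assuming mlr3 and mlr3pipelines are installed. It restricts po("scale") to the two sepal features of the iris Task with selector_grep(), so the petal features should pass through unchanged.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
op = po("scale", affect_columns = selector_grep("^Sepal"))
out = op$train(list(task))$output

op$state$affect_cols  # only the sepal features

# Petal features are not touched by the operator.
all.equal(out$data()$Petal.Length, task$data()$Petal.Length)

# Sepal features are scaled, so their mean is numerically zero.
mean(out$data()$Sepal.Length)
```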
Internals
PipeOpTaskPreproc is an abstract class inheriting from PipeOp. It implements the private$.train() and private$.predict() functions. These functions perform checks and go on to call private$.train_task() and private$.predict_task().
A subclass of PipeOpTaskPreproc may implement these functions, or implement private$.train_dt() and private$.predict_dt() instead.
This works by having the default implementations of private$.train_task() and private$.predict_task() call private$.train_dt() and private$.predict_dt(), respectively.
The affect_columns functionality works by unsetting columns by removing their "col_role" before processing, and adding them afterwards by setting the col_role to "feature".
Fields
Fields inherited from PipeOp.
Methods
Methods inherited from PipeOp, as well as:
.train_task
(Task) -> Task
Called by the PipeOpTaskPreproc's implementation of private$.train(). Takes a single Task as input and modifies it (ideally in-place without cloning) while storing information in the $state slot. Note that unlike private$.train(), the argument is not a list but a single Task, and the return object is also not a list but a single Task. Also, contrary to private$.train(), the $state being generated must be a list, which the PipeOpTaskPreproc will add additional slots to (see Section State). Care should be taken to avoid name collisions between $state elements added by private$.train_task() and PipeOpTaskPreproc.
By default this function calls the private$.train_dt() function, but it can be overloaded to perform operations on the Task directly.
.predict_task
(Task) -> Task
Called by the PipeOpTaskPreproc's implementation of private$.predict(). Takes a single Task as input and modifies it (ideally in-place without cloning) while using information in the $state slot. Works analogously to private$.train_task(). private$.predict_task() should only be overloaded if private$.train_task() is overloaded (i.e. private$.train_dt() is not used).
.train_dt(dt, levels, target)
(data.table, named list, any) -> data.table | data.frame | matrix
Train PipeOpTaskPreproc on dt, transform it and store a state in $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; it is possible and encouraged to change it in-place.
The levels argument is a named list of factor levels for factorial or character features. If the input Task inherits from TaskSupervised, the target argument contains the $truth() information of the training Task; its type depends on the Task type being trained on.
This method can be overloaded when inheriting from PipeOpTaskPreproc, together with private$.predict_dt() and optionally private$.select_cols(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
.predict_dt(dt, levels)
(data.table, named list) -> data.table | data.frame | matrix
Predict on new data in dt, possibly using the stored $state. A transformed object must be returned that can be converted to a data.table using as.data.table. dt does not need to be copied deliberately; it is possible and encouraged to change it in-place.
The levels argument is a named list of factor levels for factorial or character features.
This method can be overloaded when inheriting from PipeOpTaskPreproc, together with private$.train_dt() and optionally private$.select_cols(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
.select_cols(task)
(Task) -> character
Selects which columns the PipeOp operates on, if private$.train_dt() and private$.predict_dt() are overloaded. This function is not called if private$.train_task() and private$.predict_task() are overloaded. In contrast to the affect_columns parameter, private$.select_cols() is for the inheriting class to determine which columns the operator should function on, e.g. based on feature type, while affect_columns is a way for the user to limit the columns that a PipeOpTaskPreproc should operate on.
This method can optionally be overloaded when inheriting from PipeOpTaskPreproc, together with private$.train_dt() and private$.predict_dt(); alternatively, private$.train_task() and private$.predict_task() can be overloaded.
If this method is not overloaded, it defaults to selecting the features of the types indicated by the feature_types construction argument.
See also
https://mlr-org.com/pipeops.html
Other mlr3pipelines backend related: Graph, PipeOp, PipeOpTargetTrafo, PipeOpTaskPreprocSimple, mlr_graphs, mlr_pipeops, mlr_pipeops_updatetarget
Other PipeOps: PipeOp, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson