A PipeOp represents a transformation of a given "input" into a given "output", with two stages: "training" and "prediction". It can be understood as a generalized function that not only has multiple inputs, but also multiple outputs (as well as two stages). The "training" stage is used when training a machine learning pipeline or fitting a statistical model, and the "prediction" stage is then used for making predictions on new data.
To perform training, the $train() function is called, which takes inputs and transforms them while simultaneously storing information in the PipeOp's $state slot. For prediction, the $predict() function is called, where the $state information can be used to influence the transformation of the new data.
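As a brief illustration of the two stages, the following sketch trains and then applies a built-in PipeOp. It assumes that the mlr3 and mlr3pipelines packages are installed; the choice of po("scale") and the iris task is only for illustration.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
op = po("scale")

trained_before = op$is_trained   # not trained yet, so $state is still NULL

# Training transforms the task and stores what was learned in $state.
trained = op$train(list(task))
trained_after = op$is_trained
names(op$state)                  # inspect what was stored

# Prediction reuses $state to transform new data in the same way.
predicted = op$predict(list(task))
names(predicted)
```

Both $train() and $predict() take a list and return a named list, even when the PipeOp has only a single input and output channel.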
A PipeOp is usually used in a Graph object, a representation of a computational graph. It can have multiple input channels (think of these as multiple arguments to a function, for example when averaging different models) and multiple output channels (a transformation may return different objects, for example different subsets of a Task). The purpose of the Graph is to connect different outputs of some PipeOps to inputs of other PipeOps.
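A minimal sketch of how a Graph connects PipeOps, assuming mlr3 and mlr3pipelines are installed. The two PipeOps chosen here are arbitrary examples.

```r
library(mlr3)
library(mlr3pipelines)

# %>>% connects the output channel of the left PipeOp
# to the input channel of the right PipeOp.
graph = po("scale") %>>% po("pca")

# One row per connection: source PipeOp/channel and destination PipeOp/channel.
graph$edges

# Training the Graph trains each PipeOp in order and passes results along the edges.
graph_out = graph$train(tsk("iris"))
names(graph_out)
```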
Input and output channel information of a PipeOp is defined in the $input and $output slots; each channel has a name, a required type during training, and a required type during prediction. The $train() and $predict() functions are called with a list argument that has one entry for each declared channel (with one exception, see next paragraph). The list is automatically type-checked for each channel against $input and then passed on to the private$.train() or private$.predict() function. There the data is processed and a result list is created. This list is again type-checked against the declared output types of each channel. The length and types of the result list are as declared in $output.
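The channel tables can be inspected directly. The sketch below assumes mlr3pipelines is installed and uses po("copy") and po("scale") purely as examples of PipeOps with differing channel declarations.

```r
library(mlr3pipelines)

# A PipeOp with one input channel and two output channels.
op = po("copy", outnum = 2)
op$input    # one row: channel name, train type, predict type
op$output   # two rows, one per output channel

# The result list has one entry per declared output channel.
res = op$train(list(1))
names(res)

# Inputs that do not match the declared type are rejected before
# private$.train() is reached: po("scale") expects a Task, not a number.
type_error = try(po("scale")$train(list(1)), silent = TRUE)
inherits(type_error, "try-error")
```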
A special input channel name is "...", which creates a vararg channel that takes arbitrarily many arguments, all of the same type. If the $input table contains an "..." entry, then the input given to $train() and $predict() may be longer than the number of declared input channels.
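A sketch of a vararg channel, assuming mlr3pipelines is installed. PipeOpSumAll and its id "sumall" are hypothetical names made up for this illustration and are not part of the package.

```r
library(mlr3pipelines)

# Hypothetical PipeOp that adds up however many numeric inputs it receives.
PipeOpSumAll = R6::R6Class("PipeOpSumAll",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "sumall", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        # a single "..." row declares a vararg input channel
        input = data.table::data.table(name = "...",
          train = "numeric", predict = "numeric"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "numeric")
      )
    }
  ),
  private = list(
    .train = function(inputs) {
      self$state = list()  # nothing to learn, but $state must be non-NULL
      list(Reduce(`+`, inputs))
    },
    .predict = function(inputs) {
      list(Reduce(`+`, inputs))
    }
  )
)

op = PipeOpSumAll$new()
op$innum                                   # one declared channel
total = op$train(list(1, 2, 3))$output     # three inputs on that channel
```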
This class is an abstract base class that all PipeOps being used in a Graph should inherit from, and is not intended to be instantiated.
Format
Abstract R6Class.
Construction
PipeOp$new(id, param_set = ps(), param_vals = list(), input, output, packages = character(0), tags = character(0))
id :: character(1)
Identifier of resulting object. See $id slot.
param_set :: ParamSet | list of expression
Parameter space description. This should be created by the subclass and given to super$initialize(). If this is a ParamSet, it is used as the PipeOp's ParamSet directly. Otherwise it must be a list of expressions, e.g. created by alist(), that evaluate to ParamSets. These ParamSets are combined using a ParamSetCollection.
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().
input :: data.table with columns name (character), train (character), predict (character)
Sets the $input slot of the resulting object; see description there.
output :: data.table with columns name (character), train (character), predict (character)
Sets the $output slot of the resulting object; see description there.
packages :: character
Set of all required packages for the PipeOp's $train and $predict methods. See $packages slot. Default is character(0).
tags :: character
A set of tags associated with the PipeOp. Tags describe a PipeOp's purpose. Can be used to filter as.data.table(mlr_pipeops). Default is "abstract", indicating an abstract PipeOp.
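The constructor arguments above are normally supplied by a subclass through super$initialize(). The following is a sketch under the assumption that mlr3pipelines and paradox are installed; PipeOpOffset and its offset parameter are hypothetical names invented for this example.

```r
library(mlr3pipelines)
library(paradox)

# Hypothetical PipeOp that adds a configurable offset to a number.
PipeOpOffset = R6::R6Class("PipeOpOffset",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "offset", param_vals = list()) {
      super$initialize(id,
        # the subclass creates the ParamSet and hands it to the base class
        param_set = ps(offset = p_dbl()),
        # param_vals is passed on unchanged
        param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "numeric", predict = "numeric"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "numeric")
      )
    }
  ),
  private = list(
    .train = function(inputs) {
      self$state = list()
      list(inputs[[1]] + self$param_set$values$offset)
    },
    .predict = function(inputs) {
      list(inputs[[1]] + self$param_set$values$offset)
    }
  )
)

op = PipeOpOffset$new(param_vals = list(offset = 10))
op$param_set$values
shifted = op$train(list(1))$output
```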
Internals
PipeOp is an abstract class with abstract functions private$.train() and private$.predict(). To create a functional PipeOp class, these two methods must be implemented. Each of these functions receives a named list according to the PipeOp's input channels, and must return a list (names are ignored) with values in the order of output channels in $output. The private$.train() and private$.predict() functions should not be called by the user; instead, $train() and $predict() should be used. The most convenient usage is to add the PipeOp to a Graph (possibly as a singleton in that Graph) and use the Graph's $train() / $predict() methods.
private$.train() and private$.predict() should treat their inputs as read-only. If they are R6 objects, they should be cloned before being manipulated in-place. Objects, or parts of objects, that are not changed do not need to be cloned, and it is legal to return the same identical-by-reference objects to multiple outputs.
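The read-only rule can be sketched as follows, assuming mlr3 and mlr3pipelines are installed. PipeOpDropFirst is a hypothetical class written for this illustration; it clones the incoming Task before removing a feature, so the caller's object is left untouched.

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical PipeOp that drops the first feature of a Task.
PipeOpDropFirst = R6::R6Class("PipeOpDropFirst",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "dropfirst", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "Task", predict = "Task"),
        output = data.table::data.table(name = "output",
          train = "Task", predict = "Task")
      )
    }
  ),
  private = list(
    .train = function(inputs) {
      # clone first: Task$select() modifies the object in place
      task = inputs[[1]]$clone(deep = TRUE)
      self$state = list(dropped = task$feature_names[[1]])
      task$select(setdiff(task$feature_names, self$state$dropped))
      list(task)
    },
    .predict = function(inputs) {
      task = inputs[[1]]$clone(deep = TRUE)
      task$select(setdiff(task$feature_names, self$state$dropped))
      list(task)
    }
  )
)

task = tsk("iris")
n_before = length(task$feature_names)
out = PipeOpDropFirst$new()$train(list(task))$output
n_after_original = length(task$feature_names)  # the original task is unchanged
n_after_output = length(out$feature_names)     # the output has one feature fewer
```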
Fields
id :: character
ID of the PipeOp. IDs are user-configurable, and IDs of PipeOps must be unique within a Graph. IDs of PipeOps must not be changed once they are part of a Graph; instead, the Graph's $set_names() method should be used.
packages :: character
Packages required for the PipeOp. Functions that are not in base R should still be called using :: (or explicitly attached using require()) in private$.train() and private$.predict(), but packages declared here are checked before any (possibly expensive) processing has started within a Graph.
param_set :: ParamSet
Parameters and parameter constraints. Parameter values that influence the functioning of $train and / or $predict are in the $param_set$values slot; these are automatically checked against parameter constraints in $param_set.
state :: any | NULL
Method-dependent state obtained during the training step, and usually required for the prediction step. This is NULL if and only if the PipeOp has not been trained. The $state is the only slot that can be reliably modified during $train(), because private$.train() may theoretically be executed in a different R session (e.g. for parallelization). $state should furthermore always be set to something with copy semantics, since it is never cloned. This is a limitation not of PipeOp or mlr3pipelines, but of the way the system as a whole works, together with GraphLearner and mlr3.
input :: data.table with columns name (character), train (character), predict (character)
Input channels of PipeOp. Column name gives the names (and order) of values in the list given to $train() and $predict(). Column train is the (S3) class that an input object must conform to during training, column predict is the (S3) class that an input object must conform to during prediction. Types are checked by the PipeOp itself and do not need to be checked by private$.train() / private$.predict() code.
A special name is "...", which creates a vararg input channel that accepts a variable number of inputs.
If a row has both train and predict values enclosed by square brackets ("[", "]"), then this channel is Multiplicity-aware. If the PipeOp receives a Multiplicity value on these channels, this Multiplicity is given to the .train() and .predict() functions directly. Otherwise, the Multiplicity is transparently unpacked and the .train() and .predict() functions are called multiple times, once for each Multiplicity element. The type enclosed by square brackets indicates that only a Multiplicity containing values of this type is accepted. See Multiplicity for more information.
output :: data.table with columns name (character), train (character), predict (character)
Output channels of PipeOp, in the order in which they will be given in the list returned by the $train and $predict functions. Column train is the (S3) class that an output object must conform to during training, column predict is the (S3) class that an output object must conform to during prediction. The PipeOp checks values returned by private$.train() and private$.predict() against these type specifications.
If a row has both train and predict values enclosed by square brackets ("[", "]"), then this signals that the channel emits a Multiplicity of the indicated type. See Multiplicity for more information.
innum :: numeric(1)
Number of input channels. This equals nrow($input).
outnum :: numeric(1)
Number of output channels. This equals nrow($output).
is_trained :: logical(1)
Indicates whether the PipeOp was already trained and can therefore be used for prediction.
tags :: character
A set of tags associated with the PipeOp. Tags describe a PipeOp's purpose. Can be used to filter as.data.table(mlr_pipeops). PipeOp tags are inherited and child classes can introduce additional tags.
hash :: character(1)
Checksum calculated on the PipeOp, depending on the PipeOp's class and the slots $id and $param_set$values. If a PipeOp's functionality may change depending on more than these values, it should inherit the $hash active binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64").
phash :: character(1)
Checksum calculated on the PipeOp, depending on the PipeOp's class and the slot $id but ignoring $param_set$values. If a PipeOp's functionality may change depending on more than these values, it should inherit the $hash active binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64").
.result :: list
If the Graph's $keep_results flag is set to TRUE, then the intermediate results of $train() and $predict() are saved to this slot, exactly as they are returned by these functions. This is mainly for debugging purposes and done, if requested, by the Graph backend itself; it should not be done explicitly by private$.train() or private$.predict().
man :: character(1)
Identifying string of the help page that shows with help().
properties :: character()
The properties of the PipeOp. Currently supported values are:
"validation": the PipeOp can make use of the $internal_valid_task of an mlr3::Task. This is for example used for PipeOpLearners that wrap a Learner with this property, see mlr3::Learner. PipeOps that have this property also have a $validate field, which controls whether to use the validation task, as well as an $internal_valid_scores field, which gives access to the internal validation scores after training.
"internal_tuning": the PipeOp is able to internally optimize hyperparameters. This works analogously to the internal tuning implementation for mlr3::Learner. PipeOps with that property also implement the standardized accessor $internal_tuned_values and have at least one parameter tagged with "internal_tuning". An example of such a PipeOp is a PipeOpLearner that wraps a Learner with the "internal_tuning" property.
Programmatic access to all available properties is possible via mlr_reflections$pipeops$properties.
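Several of the fields above can be read directly from any PipeOp object. The sketch below assumes mlr3pipelines is installed and uses po("pca") as an arbitrary example; it also shows how $hash and $phash react differently to a hyperparameter change.

```r
library(mlr3pipelines)

op = po("pca")

op$id                 # user-configurable identifier
op$innum              # equals nrow(op$input)
op$outnum             # equals nrow(op$output)
op$is_trained         # stays FALSE until $train() has been called
op$packages           # packages checked before processing starts
op$tags

# $hash takes $param_set$values into account, $phash ignores them.
hash_before = op$hash
phash_before = op$phash
op$param_set$values$center = FALSE
hash_changed = !identical(op$hash, hash_before)
phash_same = identical(op$phash, phash_before)
```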
Methods
train(input)
(list) -> named list
Train PipeOp on inputs, transform it to output and store the learned $state. If the PipeOp is already trained, the already present $state is overwritten. The input list is type-checked against the $input train column. The return value is a list with as many entries as $output has rows, with each entry named after the $output name column and class according to the $output train column. The workhorse function for training each PipeOp is the private .train(input): (named list) -> list function. It is an abstract function that must be implemented by concrete subclasses. private$.train() is called by $train() after type-checking. It must change the $state value to something non-NULL and return a list of transformed data according to the $output train column. Names of the returned list are ignored.
The private$.train() method should not be called by a user; instead, the $train() method should be used, which does some checking and possibly type conversion.
predict(input)
(list) -> named list
Predict on new data in input, possibly using the stored $state. Input and output are specified by $input and $output in the same way as for $train(), except that the predict column is used for type checking. The workhorse function for predicting with each PipeOp is the private .predict(input): (named list) -> list function. It is an abstract function that must be implemented by concrete subclasses. private$.predict() is called by $predict() after type-checking and works analogously to private$.train(). Unlike private$.train(), private$.predict() should not modify the PipeOp in any way.
Just as private$.train(), private$.predict() should not be called by a user; instead, the $predict() method should be used.
print()
() -> NULL
Prints the PipeOp's most salient information: $id, $is_trained, $param_set$values, $input and $output.
help(help_type)
(character(1)) -> help file
Displays the help file of the concrete PipeOp instance. help_type is one of "text", "html", "pdf" and behaves as the help_type argument of R's help().
Inheriting
To create your own PipeOp, you need to overload the private$.train() and private$.predict() functions. It is most likely also necessary to overload the $initialize() function to do additional initialization. The $initialize() method should have at least the arguments id and param_vals, which should be passed on to super$initialize() unchanged. id should have a useful default value, and param_vals should have the default value list(), meaning no initialization of hyperparameters.
If the $initialize() method has more arguments, then it is necessary to also overload the private$.additional_phash_input() function. This function should return either all objects, or a hash of all objects, that can change the function or behavior of the PipeOp and are independent of the class, the id, the $state, and the $param_set$values. The last point is particularly important: changing the $param_set$values should not change the return value of private$.additional_phash_input().
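A sketch of an $initialize() method with an extra argument, assuming mlr3pipelines is installed. PipeOpAddConst, its const argument, and the private field .const are hypothetical names used only for this illustration.

```r
library(mlr3pipelines)

# Hypothetical PipeOp whose behavior depends on a construction argument
# that is not part of $param_set$values.
PipeOpAddConst = R6::R6Class("PipeOpAddConst",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "addconst", param_vals = list(), const = 1) {
      private$.const = const
      super$initialize(id, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "numeric", predict = "numeric"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "numeric")
      )
    }
  ),
  private = list(
    .const = NULL,
    .train = function(inputs) {
      self$state = list()
      list(inputs[[1]] + private$.const)
    },
    .predict = function(inputs) {
      list(inputs[[1]] + private$.const)
    },
    # everything besides class, id, and param values that changes behavior
    .additional_phash_input = function() private$.const
  )
)

a = PipeOpAddConst$new(const = 1)
b = PipeOpAddConst$new(const = 2)
phash_differs = !identical(a$phash, b$phash)  # same class and id, different const
a_out = a$train(list(1))$output
b_out = b$train(list(1))$output
```

Without the .additional_phash_input() override, the two objects would share class and id and could not be told apart by their $phash.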
See also
https://mlr-org.com/pipeops.html
Other mlr3pipelines backend related:
Graph, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_graphs, mlr_pipeops, mlr_pipeops_updatetarget
Other PipeOps:
PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson
Examples
# example (bogus) PipeOp that returns the sum of two numbers during $train()
# as well as a letter of the alphabet corresponding to that sum during $predict().
PipeOpSumLetter = R6::R6Class("sumletter",
  inherit = PipeOp,  # inherit from PipeOp
  public = list(
    initialize = function(id = "posum", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        # declare "input" and "output" during construction here
        # training takes two 'numeric' and returns a 'numeric';
        # prediction takes 'NULL' and returns a 'character'.
        input = data.table::data.table(name = c("input1", "input2"),
          train = "numeric", predict = "NULL"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "character")
      )
    }
  ),
  private = list(
    # PipeOp deriving classes must implement .train and
    # .predict; each taking an input list and returning
    # a list as output.
    .train = function(input) {
      sum = input[[1]] + input[[2]]
      self$state = sum
      list(sum)
    },
    .predict = function(input) {
      list(letters[self$state])
    }
  )
)
posum = PipeOpSumLetter$new()
print(posum)
#> PipeOp: <posum> (not trained)
#> values: <list()>
#> Input channels <name [train type, predict type]>:
#> input1 [numeric,NULL], input2 [numeric,NULL]
#> Output channels <name [train type, predict type]>:
#> output [numeric,character]
posum$train(list(1, 2))
#> $output
#> [1] 3
#>
# note the name 'output' is the name of the output channel specified
# in the $output data.table.
posum$predict(list(NULL, NULL))
#> $output
#> [1] "c"
#>