A PipeOp represents a transformation of a given "input" into a given "output", with two stages: "training"
and "prediction". It can be understood as a generalized function that not only has multiple inputs, but
also multiple outputs (as well as two stages). The "training" stage is used when training a machine learning pipeline or
fitting a statistical model, and the "prediction" stage is then used for making predictions on new data.
To perform training, the $train() function is called, which takes inputs and transforms them while simultaneously storing information
in the $state slot. For prediction, the $predict() function is called, where the $state information can be used to influence the transformation
of the new data.
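The two stages can be sketched with a concrete PipeOp. This is a minimal illustration, assuming the mlr3 and mlr3pipelines packages are installed and using the "scale" PipeOp and the "iris" task purely as examples:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")
po_scale = po("scale")

# Before training, the $state slot is still empty.
state_before = po_scale$state

# $train() takes a list with one entry per input channel, transforms the
# task, and stores what it learned in $state.
train_out = po_scale$train(list(task))

# $predict() reuses the stored $state to transform new data the same way.
predict_out = po_scale$predict(list(task))
```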
A PipeOp is usually used in a Graph object, a representation of a computational graph. It can have
multiple input channels (think of these as multiple arguments to a function, for example when averaging
different models) and multiple output channels (a transformation may
return different objects, for example different subsets of a Task). The purpose of the Graph is to
connect different outputs of some PipeOps to inputs of other PipeOps.
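As a rough sketch of how a Graph connects channels, the following chains two PipeOps with the %>>% operator. The choice of "scale" and "pca" is only for illustration:

```r
library(mlr3)
library(mlr3pipelines)

# Connect the output channel of "scale" to the input channel of "pca".
gr = po("scale") %>>% po("pca")

# The edge table lists which output channel feeds which input channel.
gr$edges

# Training the Graph trains each PipeOp in topological order.
out = gr$train(tsk("iris"))
```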
Input and output channel information of a PipeOp is defined in the $input and $output slots; each channel has a name, a required
type during training, and a required type during prediction. The $train() and $predict() functions are called with a list argument
that has one entry for each declared channel (with one exception, see next paragraph). The list is automatically type-checked
for each channel against $input and then passed on to the private$.train() or private$.predict() functions. There the data is processed and
a result list is created. This list is again type-checked against the declared output type of each channel. The length and types of the result
list must match what is declared in $output.
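The channel tables and the automatic type check can be inspected directly. The sketch below assumes the "pca" PipeOp, which declares a single Task channel on each side; the error message itself is not reproduced here, only the fact that an error is raised:

```r
library(mlr3)
library(mlr3pipelines)

po_pca = po("pca")

# One row per channel: name, required train type, required predict type.
po_pca$input
po_pca$output

# A plain number is not a Task, so the type check in $train() fails
# before private$.train() is ever reached.
res = tryCatch(
  po_pca$train(list(1)),
  error = function(e) "type check failed"
)
```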
A special input channel name is "...", which creates a vararg channel that takes arbitrarily many arguments, all of the same type. If the $input
table contains an "..."-entry, then the input given to $train() and $predict() may be longer than the number of declared input channels.
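A vararg channel can be seen in the "featureunion" PipeOp, which by default declares a single "..." input. The following sketch feeds it three single-feature tasks; splitting "iris" this way is just an assumption made to get inputs with distinct feature names:

```r
library(mlr3)
library(mlr3pipelines)

po_union = po("featureunion")

# The only declared input channel is the vararg channel.
po_union$input$name

# Build three tasks that each keep a single, different feature.
task = tsk("iris")
parts = lapply(task$feature_names[1:3], function(f) task$clone()$select(f))

# The input list is longer than the number of declared channels.
out = po_union$train(parts)
out$output$feature_names
```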
This class is an abstract base class that all PipeOps being used in a Graph should inherit from, and
is not intended to be instantiated.
Format
Abstract R6Class.
Construction
PipeOp$new(id, param_set = ps(), param_vals = list(), input, output, packages = character(0), tags = character(0))
id :: character(1)
Identifier of resulting object. See $id slot.
param_set :: ParamSet | list of expression
Parameter space description. This should be created by the subclass and given to super$initialize(). If this is a ParamSet, it is used as the PipeOp's ParamSet directly. Otherwise it must be a list of expressions, e.g. created by alist(), that evaluate to ParamSets. These ParamSets are combined using a ParamSetCollection.
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings given in param_set. The subclass should have its own param_vals parameter and pass it on to super$initialize(). Default list().
input :: data.table with columns name (character), train (character), predict (character)
Sets the $input slot of the resulting object; see description there.
output :: data.table with columns name (character), train (character), predict (character)
Sets the $output slot of the resulting object; see description there.
packages :: character
Set of all required packages for the PipeOp's $train and $predict methods. See $packages slot. Default is character(0).
tags :: character
A set of tags associated with the PipeOp. Tags describe a PipeOp's purpose. Can be used to filter as.data.table(mlr_pipeops). Default is "abstract", indicating an abstract PipeOp.
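Because PipeOp itself is abstract, the id and param_vals arguments are usually supplied through a concrete subclass. A small sketch, assuming the exported PipeOpScale class and its "center" hyperparameter:

```r
library(mlr3)
library(mlr3pipelines)

# Construct a concrete PipeOp directly; id and param_vals are passed on
# to PipeOp's constructor by the subclass.
po_a = PipeOpScale$new(id = "myscale", param_vals = list(center = FALSE))

# The po() shorthand forwards the same constructor arguments.
po_b = po("scale", id = "myscale", param_vals = list(center = FALSE))

po_a$id
po_a$param_set$values$center
```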
Internals
PipeOp is an abstract class with abstract functions private$.train() and private$.predict(). To create a functional
PipeOp class, these two methods must be implemented. Each of these functions receives a named list according to
the PipeOp's input channels, and must return a list (names are ignored) with values in the order of output
channels in $output. The private$.train() and private$.predict() functions should not be called by the user;
instead, the public $train() and $predict() methods should be used. The most convenient usage is to add the PipeOp
to a Graph (possibly as the only PipeOp in that Graph) and use the Graph's $train() / $predict() methods.
private$.train() and private$.predict() should treat their inputs as read-only. If they are R6 objects,
they should be cloned before being manipulated in-place. Objects, or parts of objects, that are not changed, do
not need to be cloned, and it is legal to return the same identical-by-reference objects to multiple outputs.
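The read-only rule can be illustrated with a hypothetical PipeOp that removes the first feature of a task. The class name PipeOpDropFirst and its behaviour are invented for this sketch; the point is only that the input Task is cloned before Task$select() modifies it in place:

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical PipeOp (not part of mlr3pipelines): drops the first feature.
PipeOpDropFirst = R6::R6Class("PipeOpDropFirst",
  inherit = PipeOp,
  public = list(
    initialize = function(id = "dropfirst", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "Task", predict = "Task"),
        output = data.table::data.table(name = "output",
          train = "Task", predict = "Task")
      )
    }
  ),
  private = list(
    .train = function(inputs) {
      # Clone first: Task$select() changes the task by reference.
      task = inputs[[1]]$clone(deep = TRUE)
      drop = task$feature_names[1]
      self$state = list(dropped = drop)
      task$select(setdiff(task$feature_names, drop))
      list(task)
    },
    .predict = function(inputs) {
      task = inputs[[1]]$clone(deep = TRUE)
      task$select(setdiff(task$feature_names, self$state$dropped))
      list(task)
    }
  )
)

task = tsk("iris")
op = PipeOpDropFirst$new()
out = op$train(list(task))

# The caller's task is untouched; only the returned copy lost a feature.
length(task$feature_names)
length(out$output$feature_names)
```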
Fields
id :: character
ID of the PipeOp. IDs are user-configurable, and IDs of PipeOps must be unique within a Graph. IDs of PipeOps must not be changed once they are part of a Graph; instead, the Graph's $set_names() method should be used.
packages :: character
Packages required for the PipeOp. Functions that are not in base R should still be called using :: (or explicitly attached using require()) in private$.train() and private$.predict(), but packages declared here are checked before any (possibly expensive) processing has started within a Graph.
param_set :: ParamSet
Parameters and parameter constraints. Parameter values that influence the functioning of $train and / or $predict are in the $param_set$values slot; these are automatically checked against parameter constraints in $param_set.
state :: any | NULL
Method-dependent state obtained during training step, and usually required for the prediction step. This is NULL if and only if the PipeOp has not been trained. The $state is the only slot that can be reliably modified during $train(), because private$.train() may theoretically be executed in a different R-session (e.g. for parallelization). $state should furthermore always be set to something with copy-semantics, since it is never cloned. This is a limitation not of PipeOp or mlr3pipelines, but of the way the system as a whole works, together with GraphLearner and mlr3.
input :: data.table with columns name (character), train (character), predict (character)
Input channels of PipeOp. Column name gives the names (and order) of values in the list given to $train() and $predict(). Column train is the (S3) class that an input object must conform to during training, column predict is the (S3) class that an input object must conform to during prediction. Types are checked by the PipeOp itself and do not need to be checked by private$.train() / private$.predict() code.
A special name is "...", which creates a vararg input channel that accepts a variable number of inputs.
If a row has both train and predict values enclosed by square brackets ("[", "]"), then this channel is Multiplicity-aware. If the PipeOp receives a Multiplicity value on these channels, this Multiplicity is given to the .train() and .predict() functions directly. Otherwise, the Multiplicity is transparently unpacked and the .train() and .predict() functions are called multiple times, once for each Multiplicity element. The type enclosed by square brackets indicates that only a Multiplicity containing values of this type is accepted. See Multiplicity for more information.
output :: data.table with columns name (character), train (character), predict (character)
Output channels of PipeOp, in the order in which they will be given in the list returned by $train and $predict functions. Column train is the (S3) class that an output object must conform to during training, column predict is the (S3) class that an output object must conform to during prediction. The PipeOp checks values returned by private$.train() and private$.predict() against these type specifications.
If a row has both train and predict values enclosed by square brackets ("[", "]"), then this signals that the channel emits a Multiplicity of the indicated type. See Multiplicity for more information.
innum :: numeric(1)
Number of input channels. This equals nrow($input).
outnum :: numeric(1)
Number of output channels. This equals nrow($output).
is_trained :: logical(1)
Indicates whether the PipeOp was already trained and can therefore be used for prediction.
tags :: character
A set of tags associated with the PipeOp. Tags describe a PipeOp's purpose. Can be used to filter as.data.table(mlr_pipeops). PipeOp tags are inherited and child classes can introduce additional tags.
hash :: character(1)
Checksum calculated on the PipeOp, depending on the PipeOp's class and the slots $id and $param_set$values. If a PipeOp's functionality may change depending on more than these values, it should inherit the $hash active binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64").
phash :: character(1)
Checksum calculated on the PipeOp, depending on the PipeOp's class and the slot $id, but ignoring $param_set$values. If a PipeOp's functionality may change depending on more than these values, it should inherit the $hash active binding and calculate the hash as digest(list(super$hash, <OTHER THINGS>), algo = "xxhash64").
.result :: list
If the Graph's $keep_results flag is set to TRUE, then the intermediate results of $train() and $predict() are saved to this slot, exactly as they are returned by these functions. This is mainly for debugging purposes and done, if requested, by the Graph backend itself; it should not be done explicitly by private$.train() or private$.predict().
man :: character(1)
Identifying string of the help page that shows with help().
label :: character(1)
Description of the PipeOp's functionality. Derived from the title of its help page.
properties :: character()
The properties of the PipeOp. Currently supported values are:
"validation": the PipeOp can make use of the $internal_valid_task of an mlr3::Task. This is for example used for PipeOpLearners that wrap a Learner with this property, see mlr3::Learner. PipeOps that have this property also have a $validate field, which controls whether to use the validation task, as well as a $internal_valid_scores field, which gives access to the internal validation scores after training.
"internal_tuning": the PipeOp is able to internally optimize hyperparameters. This works analogously to the internal tuning implementation for mlr3::Learner. PipeOps with that property also implement the standardized accessor $internal_tuned_values and have at least one parameter tagged with "internal_tuning". An example of such a PipeOp is a PipeOpLearner that wraps a Learner with the "internal_tuning" property.
Programmatic access to all available properties is possible via mlr_reflections$pipeops$properties.
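A few of these fields can be read off a concrete PipeOp. The sketch below uses "pca" and its "center" hyperparameter as an example, and also shows the documented difference between $hash (depends on $param_set$values) and $phash (ignores them):

```r
library(mlr3)
library(mlr3pipelines)

po_pca = po("pca")

po_pca$id
po_pca$innum       # equals nrow(po_pca$input)
po_pca$outnum      # equals nrow(po_pca$output)
po_pca$is_trained

# Record both checksums, then change a hyperparameter value.
hash_before = po_pca$hash
phash_before = po_pca$phash
po_pca$param_set$values$center = FALSE

# $hash reflects the new value; $phash does not.
hash_changed = !identical(po_pca$hash, hash_before)
phash_same = identical(po_pca$phash, phash_before)
```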
Methods
print()
() -> NULL
Prints the PipeOp's most salient information: $id, $is_trained, $param_set$values, $input and $output.
help(help_type)
(character(1)) -> help file
Displays the help file of the concrete PipeOp instance. help_type is one of "text", "html", "pdf" and behaves as the help_type argument of R's help().
The following public $train() and $predict() methods are the primary user-facing functions intended for direct use:
train(input)
(list) -> named list
Train PipeOp on inputs, transform them to output and store the learned $state. If the PipeOp is already trained, the already present $state is overwritten. The input list is type-checked against the $input train column. The return value is a list with as many entries as $output has rows, with each entry named after the $output name column and of the class given in the $output train column. The workhorse function for training each PipeOp is the private$.train() function.
predict(input)
(list) -> named list
Predict on new data in input, possibly using the stored $state. Input and output are specified by $input and $output in the same way as for $train(), except that the predict column is used for type checking. The workhorse function for predicting in each PipeOp is the private$.predict() function.
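The naming of the returned list can be seen with a PipeOp that has more than one output channel. This sketch assumes the "copy" PipeOp constructed with outnum = 2:

```r
library(mlr3)
library(mlr3pipelines)

po_copy = po("copy", outnum = 2)

# $train() returns one entry per row of $output, named after the channels.
train_out = po_copy$train(list(tsk("iris")))
names(train_out)

# $predict() follows the same layout, checked against the predict column.
predict_out = po_copy$predict(list(tsk("iris")))
names(predict_out)
```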
To implement a PipeOp, the following abstract private functions should be overloaded in the inheriting PipeOp.
Note that these should not be called by a user; instead the public $train() and $predict() methods should be used.
.train(input)
(named list) -> list
Abstract function that must be implemented by concrete subclasses. private$.train() is called by $train() after type checking. It must change the $state value to something non-NULL and return a list of transformed data according to the $output train column. Names of the returned list are ignored.
.predict(input)
(named list) -> list
Abstract function that must be implemented by concrete subclasses. private$.predict() is called by $predict() after type checking and works analogously to private$.train(). Unlike private$.train(), private$.predict() should not modify the PipeOp in any way.
Inheriting
To create your own PipeOp, you need to overload the private$.train() and private$.predict() functions.
It is most likely also necessary to overload the $initialize() function to do additional initialization.
The $initialize() method should have at least the arguments id and param_vals, which should be passed on to super$initialize() unchanged.
id should have a useful default value, and param_vals should have the default value list(), meaning no initialization of hyperparameters.
If the $initialize() method has more arguments, then it is necessary to also overload the private$.additional_phash_input() function.
This function should return either all objects, or a hash of all objects, that can change the function or behavior of the PipeOp and are independent
of the class, the id, the $state, and the $param_set$values. The last point is particularly important: changing the $param_set$values should
not change the return value of private$.additional_phash_input().
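A possible shape for such a PipeOp is sketched below. The class PipeOpAddConst and its offset argument are hypothetical and exist only to show an extra $initialize() argument together with the matching private$.additional_phash_input() overload:

```r
library(mlr3)
library(mlr3pipelines)

# Hypothetical PipeOp (not part of mlr3pipelines): adds a fixed offset.
PipeOpAddConst = R6::R6Class("PipeOpAddConst",
  inherit = PipeOp,
  public = list(
    # 'offset' is an extra constructor argument besides id and param_vals.
    initialize = function(offset = 1, id = "addconst", param_vals = list()) {
      private$.offset = offset
      super$initialize(id, param_vals = param_vals,
        input = data.table::data.table(name = "input",
          train = "numeric", predict = "numeric"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "numeric")
      )
    }
  ),
  private = list(
    .offset = NULL,
    .train = function(inputs) {
      self$state = list()  # must be non-NULL after training
      list(inputs[[1]] + private$.offset)
    },
    .predict = function(inputs) {
      list(inputs[[1]] + private$.offset)
    },
    # Everything that changes behavior but is not class, id, state, or
    # a hyperparameter value goes here.
    .additional_phash_input = function() private$.offset
  )
)

op1 = PipeOpAddConst$new(offset = 1)
op2 = PipeOpAddConst$new(offset = 2)

# Same class and id, but different offsets give different checksums.
identical(op1$phash, op2$phash)
res = op2$train(list(10))
```

Keeping the offset out of $param_set is a design choice made for this sketch; a value that users are expected to tune would normally be a hyperparameter instead.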
When you are implementing a PipeOp that operates on a task (and is not a PipeOpTaskPreproc), you also need to handle the
$internal_valid_task field of the input task, if there is one.
See also
https://mlr-org.com/pipeops.html
Other mlr3pipelines backend related:
Graph,
PipeOpTargetTrafo,
PipeOpTaskPreproc,
PipeOpTaskPreprocSimple,
mlr_graphs,
mlr_pipeops,
mlr_pipeops_updatetarget
Other PipeOps:
PipeOpEncodePL,
PipeOpEnsemble,
PipeOpImpute,
PipeOpTargetTrafo,
PipeOpTaskPreproc,
PipeOpTaskPreprocSimple,
mlr_pipeops,
mlr_pipeops_adas,
mlr_pipeops_blsmote,
mlr_pipeops_boxcox,
mlr_pipeops_branch,
mlr_pipeops_chunk,
mlr_pipeops_classbalancing,
mlr_pipeops_classifavg,
mlr_pipeops_classweights,
mlr_pipeops_colapply,
mlr_pipeops_collapsefactors,
mlr_pipeops_colroles,
mlr_pipeops_copy,
mlr_pipeops_datefeatures,
mlr_pipeops_decode,
mlr_pipeops_encode,
mlr_pipeops_encodeimpact,
mlr_pipeops_encodelmer,
mlr_pipeops_encodeplquantiles,
mlr_pipeops_encodepltree,
mlr_pipeops_featureunion,
mlr_pipeops_filter,
mlr_pipeops_fixfactors,
mlr_pipeops_histbin,
mlr_pipeops_ica,
mlr_pipeops_imputeconstant,
mlr_pipeops_imputehist,
mlr_pipeops_imputelearner,
mlr_pipeops_imputemean,
mlr_pipeops_imputemedian,
mlr_pipeops_imputemode,
mlr_pipeops_imputeoor,
mlr_pipeops_imputesample,
mlr_pipeops_kernelpca,
mlr_pipeops_learner,
mlr_pipeops_learner_pi_cvplus,
mlr_pipeops_learner_quantiles,
mlr_pipeops_missind,
mlr_pipeops_modelmatrix,
mlr_pipeops_multiplicityexply,
mlr_pipeops_multiplicityimply,
mlr_pipeops_mutate,
mlr_pipeops_nearmiss,
mlr_pipeops_nmf,
mlr_pipeops_nop,
mlr_pipeops_ovrsplit,
mlr_pipeops_ovrunite,
mlr_pipeops_pca,
mlr_pipeops_proxy,
mlr_pipeops_quantilebin,
mlr_pipeops_randomprojection,
mlr_pipeops_randomresponse,
mlr_pipeops_regravg,
mlr_pipeops_removeconstants,
mlr_pipeops_renamecolumns,
mlr_pipeops_replicate,
mlr_pipeops_rowapply,
mlr_pipeops_scale,
mlr_pipeops_scalemaxabs,
mlr_pipeops_scalerange,
mlr_pipeops_select,
mlr_pipeops_smote,
mlr_pipeops_smotenc,
mlr_pipeops_spatialsign,
mlr_pipeops_subsample,
mlr_pipeops_targetinvert,
mlr_pipeops_targetmutate,
mlr_pipeops_targettrafoscalerange,
mlr_pipeops_textvectorizer,
mlr_pipeops_threshold,
mlr_pipeops_tomek,
mlr_pipeops_tunethreshold,
mlr_pipeops_unbranch,
mlr_pipeops_updatetarget,
mlr_pipeops_vtreat,
mlr_pipeops_yeojohnson
Examples
library(mlr3pipelines)

# example (bogus) PipeOp that returns the sum of two numbers during $train()
# as well as a letter of the alphabet corresponding to that sum during $predict().
PipeOpSumLetter = R6::R6Class("sumletter",
  inherit = PipeOp,  # inherit from PipeOp
  public = list(
    initialize = function(id = "posum", param_vals = list()) {
      super$initialize(id, param_vals = param_vals,
        # declare "input" and "output" during construction here
        # training takes two 'numeric' and returns a 'numeric';
        # prediction takes 'NULL' and returns a 'character'.
        input = data.table::data.table(name = c("input1", "input2"),
          train = "numeric", predict = "NULL"),
        output = data.table::data.table(name = "output",
          train = "numeric", predict = "character")
      )
    }
  ),
  private = list(
    # PipeOp deriving classes must implement .train and
    # .predict; each taking an input list and returning
    # a list as output.
    .train = function(input) {
      sum = input[[1]] + input[[2]]
      # store the sum so that .predict() can use it later
      self$state = sum
      list(sum)
    },
    .predict = function(input) {
      # look up the letter at the position given by the stored sum
      list(letters[self$state])
    }
  )
)
posum = PipeOpSumLetter$new()
print(posum)
#> PipeOp: <posum> (not trained)
#> values: <list()>
#> Input channels <name [train type, predict type]>:
#> input1 [numeric,NULL], input2 [numeric,NULL]
#> Output channels <name [train type, predict type]>:
#> output [numeric,character]
posum$train(list(1, 2))
#> $output
#> [1] 3
#>
# note the name 'output' is the name of the output channel specified
# in the $output data.table.
posum$predict(list(NULL, NULL))
#> $output
#> [1] "c"
#>
