Simple Pre-processing: preproc (mlr3pipelines)

Function that offers a simple and direct way to train or predict PipeOps and Graphs on Tasks, data.frames or data.tables.

Training happens if predict is set to FALSE and no state is passed to this function. Prediction happens if predict is set to TRUE and if the passed Graph or PipeOp is either trained or a state is explicitly passed to this function.

The passed PipeOp or Graph gets modified by-reference.

Usage

preproc(indata, processor, state = NULL, predict = !is.null(state))

Arguments

indata: (Task | data.frame | data.table)
Data to be pre-processed.
processor: (Graph | PipeOp)
Graph or PipeOp accepting a Task that has one output channel.
Whenever indata is a data.frame or data.table, the output channel must return a Task, which is then converted back into a data.frame or data.table. Additionally, processors which only work on sub-classes of TaskSupervised will not accept a data.frame or data.table, as it would be unclear which column is the target.
Be aware that the processor gets modified by-reference both during training and when a state is passed to this function. In particular, the state of an already trained processor is overwritten when state is passed.
You may want to use dictionary sugar functions to select a processor and to set its hyperparameters, e.g. po() or ppl().
state: (named list | NULL)
Optional state to be used for prediction, if the processor is untrained or if the current state of the processor should be overwritten. Must be a complete and correct state for the respective processor. Default NULL (do not overwrite processor's state).
predict: (logical(1))
Whether to predict (TRUE) or train (FALSE). By default, this is FALSE if state is NULL (state's default), and TRUE otherwise.
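
Because the processor is modified by-reference, one way to keep an existing object untouched is to hand preproc() a deep clone. The following is a minimal sketch, not part of the original documentation; the iris task and the "scale" PipeOp are illustrative assumptions, and `$clone(deep = TRUE)` is the generic R6 cloning method.

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("iris")
pop = po("scale")  # illustrative choice; assumed to behave like any single-output PipeOp

# Train on a deep clone so that `pop` itself stays untrained
pop_copy = pop$clone(deep = TRUE)
scaled = preproc(task, pop_copy)

pop$is_trained       # state of the original object
pop_copy$is_trained  # state of the clone that was passed to preproc()

# The clone's state can then be supplied explicitly for prediction.
# Note that passing `state` writes it into `pop` by reference.
predicted = preproc(task, pop, state = pop_copy$state)
```

Since predict defaults to `!is.null(state)`, the last call predicts without setting predict = TRUE explicitly.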

Value

any | data.frame | data.table: If indata is a Task, whatever the processor's single output channel returns is returned. If indata is a data.frame or data.table, an object of the same class is returned; an error is thrown if the processor's output channel does not return a Task.
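
The relationship between input and return type described above can be checked with a short sketch. The "scale" PipeOp and the iris data are assumptions chosen for illustration only; no output is shown because it depends on the installed package versions.

```r
library("mlr3")
library("mlr3pipelines")

pop = po("scale")  # illustrative choice; its output channel returns a Task

dt = tsk("iris")$data()   # data.table
df = as.data.frame(dt)    # plain data.frame

out_task = preproc(tsk("iris"), pop)  # Task in: the output channel's result is returned
out_df = preproc(df, pop)             # data.frame in
out_dt = preproc(dt, pop)             # data.table in

# Inspect what came back for each kind of input
class(out_task)
class(out_df)
class(out_dt)
```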

Internals

If processor is a PipeOp, the S3 method preproc.PipeOp gets called first, converting the PipeOp into a Graph and wrapping the state appropriately, before calling the S3 method preproc.Graph with the modified objects.

If indata is a data.frame or data.table, a TaskUnsupervised is constructed internally. This implies that processors which only work on sub-classes of TaskSupervised will not work with these input types for indata.
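
When a processor needs a supervised task, a possible workaround is to construct the Task explicitly so that the target is known. The sketch below assumes po("classbalancing") as an example of a PipeOp that requires a classification task and uses as_task_classif() from mlr3; the call with plain tabular input is expected to fail, so it is wrapped in tryCatch().

```r
library("mlr3")
library("mlr3pipelines")

dt = tsk("iris")$data()

# Assumption: "classbalancing" only accepts classification tasks, so passing
# a data.table (wrapped internally as an unsupervised task) should raise an error.
res = tryCatch(
  preproc(dt, po("classbalancing")),
  error = function(e) conditionMessage(e)
)

# Workaround: declare the target column by building the Task yourself
task = as_task_classif(dt, target = "Species")
balanced = preproc(task, po("classbalancing"))
balanced$target_names
```

The result in this case is a Task rather than a data.table; the underlying data can be retrieved with `$data()` if needed.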

Examples

library("mlr3")
library("mlr3pipelines")  # provides preproc(), po() and %>>%
task = tsk("iris")
pop = po("pca")

# Training
preproc(task, pop)
# Note that the PipeOp gets trained through this
pop$is_trained
#> [1] TRUE

# Predicting a trained PipeOp (trained through previous call to preproc)
preproc(task, pop, predict = TRUE)

# Predicting using a given state
# We use the state of the PipeOp from the last example and then reset it
state = pop$state
pop$state = NULL
preproc(task, pop, state)

# Note that the PipeOp's state may get overwritten inadvertently during
# training or if a state is given
pop$state$sdev
preproc(tsk("wine"), pop)
pop$state$sdev

# Piping multiple preproc() calls, using dictionary sugar to set parameters
# tsk("penguins") |>
#   preproc(po("imputemode", affect_columns = selector_name("sex"))) |>
#   preproc(po("imputemean"))

# Use preproc with a Graph
gr = po("pca", rank. = 4) %>>% po("learner", learner = lrn("classif.rpart"))
preproc(tsk("sonar"), gr)  # returns NULL because of the learner
preproc(tsk("sonar"), gr, predict = TRUE)

# Training with a data.table input
# Note that `$data()` drops the information that "Species" is the target.
# It gets handled like an ordinary feature here.
dt = tsk("iris")$data()
preproc(dt, pop)

# Predicting with a data.table input
# (predict = TRUE is needed; without it, the call would train again)
preproc(dt, pop, predict = TRUE)