Apply a Function to each Column of a Task

Applies a function to each column of a task. Use the affect_columns parameter inherited from PipeOpTaskPreprocSimple to limit the columns this function should be applied to. This can be used for simple parameter transformations or type conversions (e.g. as.numeric).

The same function is applied during training and prediction. An important requirement for machine learning preprocessing is that, during the prediction phase, the preprocessing of each data row is independent of all other rows. Therefore, the applicator function should always return a vector / list where each result component depends only on the corresponding input component and not on other components. As a rule of thumb, if the function f generates output different from Vectorize(f), it should not be used as applicator.
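The rule of thumb above can be tried out in base R. The following is a minimal sketch, not part of mlr3pipelines: the helper name is_rowwise and the two toy functions are hypothetical, and agreement on a single test vector is only a heuristic, not a proof of row independence.

```r
# Hypothetical helper (not part of mlr3pipelines): compares f(x) with
# Vectorize(f)(x) on a sample input.
is_rowwise = function(f, x) {
  isTRUE(all.equal(f(x), Vectorize(f)(x), check.attributes = FALSE))
}

x = c(1, 4, 9)

# squaring treats each element independently, so both calls agree
square = function(x) x^2
print(is_rowwise(square, x))

# centering subtracts the mean of the whole vector, so each result depends
# on the other elements and the two calls disagree
center = function(x) x - mean(x)
print(is_rowwise(center, x))
```

A function like center would produce different values for the same row depending on which other rows are present at prediction time, which is why it is unsuitable as applicator.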

Format

R6Class object inheriting from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Construction

PipeOpColApply$new(id = "colapply", param_vals = list())

id :: character(1)
Identifier of resulting object, default "colapply".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output is the input Task with features changed according to the applicator parameter.

State

The $state is a named list with the $state elements inherited from PipeOpTaskPreproc.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

applicator :: function
Function to apply to each column of the task. The return value should be a vector of the same length as the input, i.e., the function vectorizes over the input. A typical example would be as.numeric.
The return value can also be a matrix, data.frame, or data.table. In this case, the number of returned rows must match the length of the input. The names of the resulting features of the output Task are based on the (column) name(s) of the return value of the applicator function, prefixed with the original feature name and separated from it by a dot (.). Use Vectorize to create a vectorizing function from any function that ordinarily only takes a single element as input.
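As an illustration of the naming rule for multi-column return values, the sketch below uses an applicator that returns a data.frame and restricts it to a single feature. The function name f and the column names raw and log are chosen for illustration only; the sketch assumes that mlr3 and mlr3pipelines are installed.

```r
library("mlr3")
library("mlr3pipelines")

# illustrative applicator: returns a data.frame with two named columns,
# one row per input element
f = function(x) data.frame(raw = x, log = log(x))

op = po("colapply", applicator = f,
  affect_columns = selector_name("Sepal.Length"))
out = op$train(list(tsk("iris")))[[1]]

# the new features are named after the original feature plus the
# returned column names, joined by a dot
print(out$feature_names)
```

Features not selected by affect_columns keep their original names.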

Internals

Calls map on the data, using the value of applicator as .f, and coerces the output via as.data.table.
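The following sketch mimics this step outside of a PipeOp. It is an approximation under stated assumptions, not the package's actual code: base lapply() stands in for map, the toy table dt and the applicator are made up for illustration, and the data.table package is assumed to be installed.

```r
library("data.table")

# toy data standing in for the feature columns of a task
dt = data.table(a = c(1.2, 2.7), b = c(3.5, 4.1))

# illustrative applicator returning a two-column matrix per input column
applicator = function(x) cbind(floor = floor(x), ceiling = ceiling(x))

# apply the function column by column (lapply() used in place of map here),
# then coerce the resulting list back into a single table
result = as.data.table(lapply(dt, applicator))

print(names(result))
print(dim(result))
```

Each matrix returned by the applicator is expanded into separate columns during coercion, which is how one input feature can give rise to several output features.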

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Examples

library("mlr3")
library("mlr3pipelines")  # provides po() and the selector_*() functions

task = tsk("iris")
poca = po("colapply", applicator = as.character)
poca$train(list(task))[[1]]  # types are converted
#> 
#> ── <TaskClassif> (150x5): Iris Flowers ─────────────────────────────────────────
#> • Target: Species
#> • Target classes: setosa (33%), versicolor (33%), virginica (33%)
#> • Properties: multiclass
#> • Features (4):
#>   • chr (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width

# function that does not vectorize
f1 = function(x) {
  # we could use `ifelse` here, but that is not the point
  if (x > 1) {
    "a"
  } else {
    "b"
  }
}
poca$param_set$values$applicator = Vectorize(f1)
poca$train(list(task))[[1]]$data()
#>        Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#>         <fctr>       <char>      <char>       <char>      <char>
#>   1:    setosa            a           b            a           a
#>   2:    setosa            a           b            a           a
#>   3:    setosa            a           b            a           a
#>   4:    setosa            a           b            a           a
#>   5:    setosa            a           b            a           a
#>  ---                                                            
#> 146: virginica            a           a            a           a
#> 147: virginica            a           a            a           a
#> 148: virginica            a           a            a           a
#> 149: virginica            a           a            a           a
#> 150: virginica            a           a            a           a

# only affect Petal.* columns
poca$param_set$values$affect_columns = selector_grep("^Petal")
poca$train(list(task))[[1]]$data()
#>        Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#>         <fctr>       <char>      <char>        <num>       <num>
#>   1:    setosa            a           b          5.1         3.5
#>   2:    setosa            a           b          4.9         3.0
#>   3:    setosa            a           b          4.7         3.2
#>   4:    setosa            a           b          4.6         3.1
#>   5:    setosa            a           b          5.0         3.6
#>  ---                                                            
#> 146: virginica            a           a          6.7         3.0
#> 147: virginica            a           a          6.3         2.5
#> 148: virginica            a           a          6.5         3.0
#> 149: virginica            a           a          6.2         3.4
#> 150: virginica            a           a          5.9         3.0

# function returning multiple columns
f2 = function(x) {
  cbind(floor = floor(x), ceiling = ceiling(x))
}
poca$param_set$values$applicator = f2
poca$param_set$values$affect_columns = selector_all()
poca$train(list(task))[[1]]$data()
#>        Species Petal.Length.floor Petal.Length.ceiling Petal.Width.floor
#>         <fctr>              <num>                <num>             <num>
#>   1:    setosa                  1                    2                 0
#>   2:    setosa                  1                    2                 0
#>   3:    setosa                  1                    2                 0
#>   4:    setosa                  1                    2                 0
#>   5:    setosa                  1                    2                 0
#>  ---                                                                    
#> 146: virginica                  5                    6                 2
#> 147: virginica                  5                    5                 1
#> 148: virginica                  5                    6                 2
#> 149: virginica                  5                    6                 2
#> 150: virginica                  5                    6                 1
#>      Petal.Width.ceiling Sepal.Length.floor Sepal.Length.ceiling
#>                    <num>              <num>                <num>
#>   1:                   1                  5                    6
#>   2:                   1                  4                    5
#>   3:                   1                  4                    5
#>   4:                   1                  4                    5
#>   5:                   1                  5                    5
#>  ---                                                            
#> 146:                   3                  6                    7
#> 147:                   2                  6                    7
#> 148:                   2                  6                    7
#> 149:                   3                  6                    7
#> 150:                   2                  5                    6
#>      Sepal.Width.floor Sepal.Width.ceiling
#>                  <num>               <num>
#>   1:                 3                   4
#>   2:                 3                   3
#>   3:                 3                   4
#>   4:                 3                   4
#>   5:                 3                   4
#>  ---                                      
#> 146:                 3                   3
#> 147:                 2                   3
#> 148:                 3                   3
#> 149:                 3                   4
#> 150:                 3                   3