Factor Encoding

Encodes columns of type factor and ordered.

Possible encodings are "one-hot" encoding, as well as encoding according to stats::contr.helmert(), stats::contr.poly(), stats::contr.sum() and stats::contr.treatment(). Newly created columns are named via pattern [column-name].[x] where x is the respective factor level for "one-hot" and "treatment" encoding, and an integer sequence otherwise.

Use the PipeOpTaskPreproc $affect_columns functionality to only encode a subset of columns, or only encode columns of a certain type.
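A minimal sketch of restricting the encoding to a subset of columns, assuming mlr3 and mlr3pipelines are installed. The toy data and the column names (target, f1, f2, o1) are hypothetical and chosen only for illustration; selector_name() and selector_type() are the selector helpers from mlr3pipelines.

```r
library("mlr3")
library("mlr3pipelines")

# Hypothetical toy data: two factor features and one ordered feature
data = data.table::data.table(
  target = factor(c("a", "b", "a")),
  f1 = factor(c("u", "v", "w")),
  f2 = factor(c("p", "q", "p")),
  o1 = factor(c("lo", "mid", "hi"), levels = c("lo", "mid", "hi"), ordered = TRUE)
)
task = TaskClassif$new("toy", data, target = "target")

# Encode only the column named "f1"; "f2" and "o1" are passed through unchanged
po_name = po("encode", affect_columns = selector_name("f1"))
names_by_name = po_name$train(list(task))[[1]]$feature_names
names_by_name

# Encode only columns of type "ordered"; the plain factor columns stay as they are
po_ord = po("encode", affect_columns = selector_type("ordered"))
names_by_type = po_ord$train(list(task))[[1]]$feature_names
names_by_type
```

Selecting by type is convenient when factor and ordered columns should receive different encodings, for example by chaining two PipeOpEncode objects with different ids and method settings.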

character-type features can be encoded by first converting them to factor features, using ppl("convert_types", "character", "factor").

Format

R6Class object inheriting from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Construction

PipeOpEncode$new(id = "encode", param_vals = list())

id :: character(1)
Identifier of resulting object, default "encode".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
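As an illustration of the constructor arguments, the following sketch sets the method hyperparameter at construction time. It assumes mlr3pipelines is installed; the id "encode_treatment" is an arbitrary name picked for this example.

```r
library("mlr3pipelines")

# Direct construction; the id is an arbitrary, illustrative choice
poe = PipeOpEncode$new(id = "encode_treatment", param_vals = list(method = "treatment"))
poe$param_set$values$method

# The po() shorthand accepts hyperparameters as named arguments
poe_short = po("encode", method = "treatment")
poe_short$param_set$values$method
```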

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output is the input Task with all affected factor and ordered columns encoded according to the method parameter.

State

The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:

contrasts :: named list of matrix
List of contrast matrices, one for each affected discrete feature. The rows of each matrix correspond to the levels seen in the training task, the columns to the new columns that replace the old discrete feature. See stats::contrasts.
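To make the shape of these matrices concrete, the sketch below builds contrast matrices for a three-level factor using only base R. It shows the kind of matrix described above, not the stored state itself. Representing "one-hot" as contr.treatment() with contrasts = FALSE is an assumption made here for illustration.

```r
lvls = c("a", "b", "c")

# One indicator column per level (assumed representation of "one-hot")
one_hot = stats::contr.treatment(lvls, contrasts = FALSE)

# n-1 columns each; rows correspond to the levels a, b, c
treatment = stats::contr.treatment(lvls)
helmert = stats::contr.helmert(lvls)
sum_to_zero = stats::contr.sum(lvls)
poly = stats::contr.poly(lvls)

dim(one_hot)      # 3 rows, 3 columns
dim(treatment)    # 3 rows, 2 columns
rownames(treatment)
colSums(sum_to_zero)
```

For a factor with n levels, the one-hot matrix is n by n, while the other four are n by n-1.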

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

method :: character(1)
Initialized to "one-hot". One of:
- "one-hot": create a new column for each factor level.
- "treatment": create n-1 columns for a factor with n levels, leaving out the first factor level of each factor variable (see stats::contr.treatment()).
- "helmert": create columns according to Helmert contrasts (see stats::contr.helmert()).
- "poly": create columns with contrasts based on orthogonal polynomials (see stats::contr.poly()).
- "sum": create columns with contrasts summing to zero (see stats::contr.sum()).

Internals

Uses the stats::contrasts functions. This is relatively inefficient for features with a large number of levels.
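The following base R sketch approximates how a contrast matrix can turn a single factor into encoded columns: one row of the matrix is looked up per observation, and the resulting columns are named after the pattern [column-name].[x]. It is a simplified illustration under these assumptions, not the package's actual implementation; the feature name "y" and its values are made up.

```r
# Toy feature "y" with levels fixed at training time
y = factor(c("b", "a", "c", "b"), levels = c("a", "b", "c"))

# Treatment contrasts: rows are levels, columns are the replacement columns
cm = stats::contr.treatment(levels(y))

# Look up one row of the contrast matrix per observation
encoded = cm[as.character(y), , drop = FALSE]

# Name the new columns by factor level: y.b, y.c
colnames(encoded) = paste("y", colnames(cm), sep = ".")
rownames(encoded) = NULL
encoded

# Helmert contrasts carry no column names, so an integer sequence is used
cm_h = stats::contr.helmert(levels(y))
encoded_h = cm_h[as.character(y), , drop = FALSE]
colnames(encoded_h) = paste("y", seq_len(ncol(cm_h)), sep = ".")
rownames(encoded_h) = NULL
encoded_h
```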

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Examples

library("mlr3")
library("mlr3pipelines")  # provides po(), ppl() and %>>%

data = data.table::data.table(x = factor(letters[1:3]), y = factor(letters[1:3]))
task = TaskClassif$new("task", data, "x")

poe = po("encode")

# poe is initialized with encoding: "one-hot"
poe$train(list(task))[[1]]$data()
#>         x   y.a   y.b   y.c
#>    <fctr> <num> <num> <num>
#> 1:      a     1     0     0
#> 2:      b     0     1     0
#> 3:      c     0     0     1

# other kinds of encoding:
poe$param_set$values$method = "treatment"
poe$train(list(task))[[1]]$data()
#>         x   y.b   y.c
#>    <fctr> <num> <num>
#> 1:      a     0     0
#> 2:      b     1     0
#> 3:      c     0     1

poe$param_set$values$method = "helmert"
poe$train(list(task))[[1]]$data()
#>         x   y.1   y.2
#>    <fctr> <num> <num>
#> 1:      a    -1    -1
#> 2:      b     1    -1
#> 3:      c     0     2

poe$param_set$values$method = "poly"
poe$train(list(task))[[1]]$data()
#>         x           y.1        y.2
#>    <fctr>         <num>      <num>
#> 1:      a -7.071068e-01  0.4082483
#> 2:      b -7.850462e-17 -0.8164966
#> 3:      c  7.071068e-01  0.4082483

poe$param_set$values$method = "sum"
poe$train(list(task))[[1]]$data()
#>         x   y.1   y.2
#>    <fctr> <num> <num>
#> 1:      a     1     0
#> 2:      b     0     1
#> 3:      c    -1    -1

# converting character-columns
data_chr = data.table::data.table(x = factor(letters[1:3]), y = letters[1:3])
task_chr = TaskClassif$new("task_chr", data_chr, "x")

goe = ppl("convert_types", "character", "factor") %>>% po("encode")

goe$train(task_chr)[[1]]$data()
#>         x   y.a   y.b   y.c
#>    <fctr> <num> <num> <num>
#> 1:      a     1     0     0
#> 2:      b     0     1     0
#> 3:      c     0     0     1