Impact Encoding with Random Intercept Models

Encodes columns of type factor, character and ordered.

PipeOpEncodeLmer converts the levels of each factor column to the estimated coefficients of a simple random intercept model. Models are fitted with the glmer function of the lme4 package and are of the type target ~ 1 + (1 | factor). In every case the target variable is the dependent variable and the factor is the grouping variable. If the task is a regression task, the target is numeric. If the task is a classification task, the target is categorical; if it is multiclass, a binary "one vs. rest" model is fitted for each level of the target variable.
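The encoding for a numeric target can be sketched directly with lme4. This is an illustration of the model formula above, not the PipeOp's internal code: the toy data, the variable names, and the use of lmer (the linear counterpart of glmer) are assumptions made for the example.

```r
library(lme4)

# Hypothetical toy data: a numeric target y and a three-level factor g.
set.seed(1)
d = data.frame(
  g = factor(rep(c("a", "b", "c"), each = 20)),
  y = rnorm(60, mean = rep(c(0, 1, 3), each = 20))
)

# Random intercept model of the form target ~ 1 + (1 | factor).
mod = lmer(y ~ 1 + (1 | g), data = d)

# Per-level coefficient: global intercept plus the level's random effect.
enc = coef(mod)$g[, "(Intercept)"]
names(enc) = rownames(coef(mod)$g)

# Replace each factor level by its estimated coefficient.
d$g_encoded = unname(enc[as.character(d$g)])
head(d)
```

Each level's value is its group mean shrunk towards the global intercept, which is what distinguishes this encoding from a plain per-level mean.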

For training, multiple models can be estimated in a cross-validation scheme to ensure that the same factor level does not always result in identical values in the converted numerical feature. For prediction, a global model (which was fitted on all observations during training) is used for each factor. New factor levels are converted to the value of the intercept coefficient of the global model for prediction. NAs are ignored by the PipeOp.
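The handling of new factor levels at prediction time can be sketched as a lookup with a fallback. The helper encode_levels and the toy data are hypothetical, and leaving NAs as NA is an assumption about what "ignored" means here; the PipeOp's own implementation may differ.

```r
library(lme4)

# Hypothetical toy data and global model fitted on all observations.
set.seed(1)
d = data.frame(
  g = factor(rep(c("a", "b", "c"), each = 20)),
  y = rnorm(60, mean = rep(c(0, 1, 3), each = 20))
)
mod = lmer(y ~ 1 + (1 | g), data = d)

# Lookup table of per-level coefficients, plus the global intercept.
lookup = setNames(coef(mod)$g[, "(Intercept)"], rownames(coef(mod)$g))
intercept = fixef(mod)[["(Intercept)"]]

# Hypothetical helper: known levels use the lookup table, unseen levels
# fall back to the intercept, and NAs are passed through unchanged.
encode_levels = function(x, lookup, intercept) {
  x = as.character(x)
  out = unname(lookup[x])
  out[!is.na(x) & !(x %in% names(lookup))] = intercept
  out
}

res = encode_levels(c("a", "zzz", NA), lookup, intercept)
res
```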

Use the PipeOpTaskPreproc $affect_columns functionality to only encode a subset of columns, or only encode columns of a certain type.
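A minimal sketch of restricting the encoding, assuming the selector_type() and selector_name() helpers from mlr3pipelines; the column name "y" is a placeholder for a feature in your own task.

```r
library("mlr3")
library("mlr3pipelines")

# Encode only columns of type factor; character and ordered
# columns are left untouched.
poe_by_type = po("encodelmer", affect_columns = selector_type("factor"))

# Encode only the column named "y" (placeholder name).
poe_by_name = po("encodelmer", affect_columns = selector_name("y"))

poe_by_type$param_set$values$affect_columns
```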

Format

R6Class object inheriting from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Construction

PipeOpEncodeLmer$new(id = "encodelmer", param_vals = list())

id :: character(1)
Identifier of resulting object, default "encodelmer".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc. Instead of a Task, a TaskSupervised is used as input and output during training and prediction.

The output is the input Task with all affected factor, character or ordered columns converted to numeric columns that hold the estimated coefficients.

State

The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:

target_levels :: character
Levels of the target column.
control :: a named list
List of coefficients learned via glmer.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

fast_optim :: logical(1)
If TRUE, a faster optimizer from the nloptr package (up to 50 percent faster) is used when fitting the models. It uses additional stopping criteria, which can give suboptimal results. Initialized to TRUE.
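A short sketch of switching the optimizer setting, either at construction or afterwards through the parameter set. The id "encodelmer_exact" is an arbitrary name chosen for the example.

```r
library("mlr3")
library("mlr3pipelines")

# Disable the faster optimizer at construction time.
poe = PipeOpEncodeLmer$new(
  id = "encodelmer_exact",
  param_vals = list(fast_optim = FALSE)
)
poe$param_set$values$fast_optim

# The same setting can be changed later on the existing object.
poe$param_set$values$fast_optim = TRUE
poe$param_set$values$fast_optim
```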

Internals

Uses lme4::glmer. This is relatively inefficient for features with a large number of levels.

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Examples

library("mlr3")
library("mlr3pipelines")  # provides po() and PipeOpEncodeLmer
poe = po("encodelmer")

task = TaskClassif$new("task",
  data.table::data.table(
    x = factor(c("a", "a", "a", "b", "b")),
    y = factor(c("a", "a", "b", "b", "b"))),
  "x")

poe$train(list(task))[[1]]$data()
#>         x          y
#>    <fctr>      <num>
#> 1:      a -0.5525584
#> 2:      a -0.5525584
#> 3:      a -0.3310264
#> 4:      b -0.3310264
#> 5:      b -0.3310264

poe$state
#> $target_levels
#> [1] "a" "b"
#> 
#> $control
#> $control$y
#>              a              b ..new..level.. 
#>     -0.5525584     -0.3310264     -0.4429541 
#> 
#> 
#> $dt_columns
#> [1] "y"
#> 
#> $affected_cols
#> [1] "y"
#> 
#> $intasklayout
#> Key: <id>
#>        id   type
#>    <char> <char>
#> 1:      y factor
#> 
#> $outtasklayout
#> Key: <id>
#>        id    type
#>    <char>  <char>
#> 1:      y numeric
#> 
#> $outtaskshell
#> Empty data.table (0 rows and 2 cols): x,y
#>