Piecewise Linear Encoding using Decision Trees

Encodes numeric and integer feature columns using piecewise linear encoding. For details, see the documentation of PipeOpEncodePL or Gorishniy et al. (2022).

Bins are constructed by training one decision tree Learner per feature column, taking the target column into account, and using the tree's decision boundaries as bin boundaries.

Format

R6Class object inheriting from PipeOpEncodePL/PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Construction

PipeOpEncodePLTree$new(task_type, id = "encodepltree", param_vals = list())

task_type :: character(1)
The class of Task that should be accepted as input. This is used to construct the appropriate Learner for obtaining the bins for piecewise linear encoding. Supported options are "TaskClassif" for LearnerClassifRpart or "TaskRegr" for LearnerRegrRpart.
id :: character(1)
Identifier of resulting object, default "encodepltree".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
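
The constructor arguments above can be combined as in the following minimal sketch. It assumes an installed mlr3pipelines version that exports PipeOpEncodePLTree; the id "pl_tree" is an arbitrary illustrative choice, and minsplit is the tree learner hyperparameter that also appears in the Examples section.

```r
library(mlr3pipelines)

# Construct the PipeOp for regression tasks with a custom id and one
# hyperparameter of the underlying tree learner set at construction time.
pop = PipeOpEncodePLTree$new(
  task_type = "TaskRegr",
  id = "pl_tree",                  # illustrative id, not a default
  param_vals = list(minsplit = 5)
)

pop$id
pop$param_set$values$minsplit
```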

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc. Instead of a Task, a TaskClassif or TaskRegr is used as input and output during training and prediction, depending on the task_type construction argument.

The output is the input Task with all affected numeric and integer columns encoded using piecewise linear encoding with bins being derived from a decision tree Learner trained on the respective feature column.
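
As a rough illustration of the encoding itself, the following base R sketch computes the per-bin values for a numeric vector, given a vector of bin boundaries. The helper name encode_pl is hypothetical and not part of mlr3pipelines. Clamping every bin to the interval from 0 to 1 is an assumption, so the package may treat values outside the outermost boundaries differently. For values inside the boundaries, the sketch reproduces the numbers shown in the Examples section.

```r
# Hypothetical helper (not part of mlr3pipelines): piecewise linear encoding
# of a numeric vector `x` for strictly increasing bin boundaries `bins`.
encode_pl = function(x, bins) {
  stopifnot(length(bins) >= 2L, !is.unsorted(bins, strictly = TRUE))
  lower = bins[-length(bins)]  # left boundary of each bin
  upper = bins[-1L]            # right boundary of each bin
  out = vapply(seq_along(lower), function(t) {
    # relative position of x within bin t, clamped to [0, 1] (assumption)
    pmin(pmax((x - lower[t]) / (upper[t] - lower[t]), 0), 1)
  }, numeric(length(x)))
  out = matrix(out, nrow = length(x))
  colnames(out) = paste0("bin", seq_along(lower))
  out
}

# Petal.Length = 1.4 with boundaries 1.00, 2.45, 4.75, 6.90:
# bin1 is (1.4 - 1.00) / (2.45 - 1.00), roughly 0.2758621; bin2 and bin3 are 0
encode_pl(1.4, c(1.00, 2.45, 4.75, 6.90))
```

A value lying above a bin's upper boundary yields 1 for that bin, and a value below its lower boundary yields 0, so each row reads as a cumulative "how far along the bins" representation of the original value.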

State

The $state is a named list with the $state elements inherited from PipeOpEncodePL/PipeOpTaskPreproc.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as the parameters of the Learner used for obtaining the bins for piecewise linear encoding.

Internals

This overloads the private$.get_bins() method of PipeOpEncodePL. To derive the bins for each feature, the Task is split into smaller Tasks, each containing only the target and one feature column. On each of these Tasks, either a LearnerClassifRpart or a LearnerRegrRpart is trained, and its split points are extracted as the bin boundaries used for piecewise linear encoding.
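
The per-feature procedure can be approximated outside of mlr3pipelines by calling rpart directly. This is a sketch under assumptions, not the package's implementation: the helper name tree_bins is hypothetical, and adding the feature's minimum and maximum as outer boundaries is inferred from the bin vectors shown in the Examples section.

```r
library(rpart)

# Hypothetical helper: fit a tree on target ~ feature and use its split
# points, together with the feature's range, as bin boundaries.
tree_bins = function(data, feature, target, ...) {
  fit = rpart(reformulate(feature, response = target), data = data, ...)
  s = fit$splits
  # a tree without any split has no usable split matrix
  splits = if (is.null(s) || nrow(s) == 0L) numeric(0) else unname(s[, "index"])
  # assumption: the feature's minimum and maximum serve as outer boundaries
  sort(unique(c(min(data[[feature]]), splits, max(data[[feature]]))))
}

# One tree per feature, each seeing only that feature and the target
tree_bins(iris, "Petal.Length", "Species")
tree_bins(iris, "Petal.Width", "Species")
```

Arguments passed through `...` (for example minsplit) are forwarded to rpart, which mirrors how the PipeOp exposes the tree learner's hyperparameters.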

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpEncodePL/PipeOpTaskPreproc/PipeOp.

References

Gorishniy Y, Rubachev I, Babenko A (2022). “On Embeddings for Numerical Features in Tabular Deep Learning.” In Advances in Neural Information Processing Systems, volume 35, 24991–25004. https://proceedings.neurips.cc/paper_files/paper/2022/hash/9e9f0ffc3d836836ca96cbf8fe14b105-Abstract-Conference.html.

Other PipeOps: PipeOp, PipeOpEncodePL, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encodeplquantiles, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson

Other Piecewise Linear Encoding PipeOps: PipeOpEncodePL, mlr_pipeops_encodeplquantiles

Examples

library(mlr3)
library(mlr3pipelines)

# For classification task
task = tsk("iris")$select(c("Petal.Width", "Petal.Length"))
pop = po("encodepltree", task_type = "TaskClassif")
train_out = pop$train(list(task))[[1L]]

# Calculated bin boundaries per feature
pop$state$bins
#> $Petal.Length
#> [1] 1.00 2.45 4.75 6.90
#> 
#> $Petal.Width
#> [1] 0.10 0.80 1.75 2.50
#> 
# Each feature was split into three encoded features using piecewise linear encoding
train_out$head()
#>    Species Petal.Length.bin1 Petal.Length.bin2 Petal.Length.bin3
#>     <fctr>             <num>             <num>             <num>
#> 1:  setosa         0.2758621                 0                 0
#> 2:  setosa         0.2758621                 0                 0
#> 3:  setosa         0.2068966                 0                 0
#> 4:  setosa         0.3448276                 0                 0
#> 5:  setosa         0.2758621                 0                 0
#> 6:  setosa         0.4827586                 0                 0
#>    Petal.Width.bin1 Petal.Width.bin2 Petal.Width.bin3
#>               <num>            <num>            <num>
#> 1:        0.1428571                0                0
#> 2:        0.1428571                0                0
#> 3:        0.1428571                0                0
#> 4:        0.1428571                0                0
#> 5:        0.1428571                0                0
#> 6:        0.4285714                0                0

# Prediction works the same as training, using the bins learned during training
predict_out = pop$predict(list(task))[[1L]]
predict_out$head()
#>    Species Petal.Length.bin1 Petal.Length.bin2 Petal.Length.bin3
#>     <fctr>             <num>             <num>             <num>
#> 1:  setosa         0.2758621                 0                 0
#> 2:  setosa         0.2758621                 0                 0
#> 3:  setosa         0.2068966                 0                 0
#> 4:  setosa         0.3448276                 0                 0
#> 5:  setosa         0.2758621                 0                 0
#> 6:  setosa         0.4827586                 0                 0
#>    Petal.Width.bin1 Petal.Width.bin2 Petal.Width.bin3
#>               <num>            <num>            <num>
#> 1:        0.1428571                0                0
#> 2:        0.1428571                0                0
#> 3:        0.1428571                0                0
#> 4:        0.1428571                0                0
#> 5:        0.1428571                0                0
#> 6:        0.4285714                0                0

# Controlling behavior of the tree learner, here: setting minimum number of
# observations per node for a split to be attempted
pop$param_set$set_values(minsplit = 5)

train_out = pop$train(list(task))[[1L]]
# for this task, the bin boundaries shown below are the same as before
pop$state$bins
#> $Petal.Length
#> [1] 1.00 2.45 4.75 6.90
#> 
#> $Petal.Width
#> [1] 0.10 0.80 1.75 2.50
#> 
train_out$head()
#>    Species Petal.Length.bin1 Petal.Length.bin2 Petal.Length.bin3
#>     <fctr>             <num>             <num>             <num>
#> 1:  setosa         0.2758621                 0                 0
#> 2:  setosa         0.2758621                 0                 0
#> 3:  setosa         0.2068966                 0                 0
#> 4:  setosa         0.3448276                 0                 0
#> 5:  setosa         0.2758621                 0                 0
#> 6:  setosa         0.4827586                 0                 0
#>    Petal.Width.bin1 Petal.Width.bin2 Petal.Width.bin3
#>               <num>            <num>            <num>
#> 1:        0.1428571                0                0
#> 2:        0.1428571                0                0
#> 3:        0.1428571                0                0
#> 4:        0.1428571                0                0
#> 5:        0.1428571                0                0
#> 6:        0.4285714                0                0

# For regression task
task = tsk("mtcars")$select(c("cyl", "hp"))
pop = po("encodepltree", task_type = "TaskRegr")
train_out = pop$train(list(task))[[1L]]

# Calculated bin boundaries per feature
pop$state$bins
#> $cyl
#> [1] 4 5 7 8
#> 
#> $hp
#> [1]  52 118 335
#> 
# First feature was split into three encoded features,
# second into two, using piecewise linear encoding
train_out$head()
#>      mpg cyl.bin1 cyl.bin2 cyl.bin3   hp.bin1   hp.bin2
#>    <num>    <num>    <num>    <num>     <num>     <num>
#> 1:  21.0        1      0.5        0 0.8787879 0.0000000
#> 2:  21.0        1      0.5        0 0.8787879 0.0000000
#> 3:  22.8        0      0.0        0 0.6212121 0.0000000
#> 4:  21.4        1      0.5        0 0.8787879 0.0000000
#> 5:  18.7        1      1.0        1 1.0000000 0.2626728
#> 6:  18.1        1      0.5        0 0.8030303 0.0000000