
Piecewise Linear Encoding using Decision Trees
Source: R/PipeOpEncodePL.R
mlr_pipeops_encodepltree.Rd
Encodes numeric and integer feature columns using piecewise linear encoding. For details, see documentation of PipeOpEncodePL or Gorishniy et al. (2022).
Bins are constructed by training one decision tree Learner per feature column, taking the target column into account, and using the tree's decision boundaries as bin boundaries.
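As a rough illustration of what the encoding does once bin boundaries are known, the following base R sketch maps each value to one component per bin: 1 for bins lying entirely below the value, 0 for bins lying entirely above it, and the fractional position within the bin that contains it. The helper name encode_pl is hypothetical and not part of the package, and clamping every component to [0, 1] for values outside the boundaries is an assumption of this sketch that may differ from the package's behavior.

```r
# Hypothetical helper, not part of mlr3pipelines: piecewise linear encoding of
# a numeric vector `x` given sorted bin boundaries `bins` (k + 1 values for k bins).
encode_pl = function(x, bins) {
  lower = head(bins, -1L)
  upper = tail(bins, -1L)
  # One column per bin: relative position of x within the bin.
  # Assumption of this sketch: components are clamped to [0, 1].
  out = vapply(seq_along(lower), function(j) {
    pmin(pmax((x - lower[j]) / (upper[j] - lower[j]), 0), 1)
  }, numeric(length(x)))
  out = matrix(out, nrow = length(x))
  colnames(out) = paste0("bin", seq_along(lower))
  out
}

# Bin boundaries taken from the Petal.Length output in the Examples section
res = encode_pl(c(1.4, 5), bins = c(1, 2.45, 4.75, 6.9))
res
```

For the value 1.4, only the first component is non-zero, (1.4 - 1) / (2.45 - 1), which agrees with the first row of the Petal.Length columns in the Examples.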
Format
R6Class object inheriting from PipeOpEncodePL/PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.
Construction
task_type :: character(1)
The class of Task that should be accepted as input, given as a character(1). This is used to construct the appropriate Learner to be used for obtaining the bins for piecewise linear encoding. Supported options are "TaskClassif" for LearnerClassifRpart or "TaskRegr" for LearnerRegrRpart.
id :: character(1)
Identifier of resulting object, default "encodepltree".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with all affected numeric and integer columns encoded using piecewise linear encoding, with bins derived from a decision tree Learner trained on the respective feature column.
State
The $state is a named list with the $state elements inherited from PipeOpEncodePL/PipeOpTaskPreproc.
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as the parameters of the Learner used for obtaining the bins for piecewise linear encoding.
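A brief sketch of how this looks in practice, assuming mlr3 and mlr3pipelines are installed: the tree learner's hyperparameters (for example minsplit, which is also used in the Examples) appear in the same parameter set as the inherited preprocessing parameters such as affect_columns.

```r
library(mlr3)
library(mlr3pipelines)

pop = po("encodepltree", task_type = "TaskRegr")

# Inherited PipeOpTaskPreproc parameters and tree learner parameters
# are listed side by side
pop$param_set$ids()

# Tree learner parameters are set like any other PipeOp parameter
pop$param_set$set_values(minsplit = 5)
pop$param_set$values$minsplit
```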
Internals
This overloads the private$.get_bins() method of PipeOpEncodePL. To derive the bins for each feature, the Task is split into smaller Tasks with only the target and respective feature as columns. On these Tasks either a LearnerClassifRpart or LearnerRegrRpart gets trained and the respective splits extracted as bin boundaries used for piecewise linear encodings.
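The idea can be sketched outside the package by calling rpart directly on a single feature. This is an approximation under stated assumptions, not the package's internal code: the helper name tree_bins is hypothetical, and the minimum and maximum of the feature are used as outer boundaries because the bins shown in the Examples start and end at the observed range.

```r
library(rpart)

# Hypothetical helper, not part of mlr3pipelines: bin boundaries for one
# feature, taken from the split points of a tree fitted on that feature alone.
tree_bins = function(data, feature, target, ...) {
  fit = rpart(reformulate(feature, response = target), data = data, ...)
  # fit$splits is NULL if the tree has no splits; with a single predictor,
  # its "index" column holds the split points
  boundaries = if (is.null(fit$splits)) numeric(0) else unname(fit$splits[, "index"])
  x = data[[feature]]
  # Assumption: observed minimum and maximum serve as the outer boundaries
  c(min(x, na.rm = TRUE), sort(unique(boundaries)), max(x, na.rm = TRUE))
}

bins = tree_bins(iris, "Petal.Length", "Species")
bins
```

With rpart's default settings, this should give the same Petal.Length boundaries as pop$state$bins in the Examples. Additional arguments such as control = rpart.control(minsplit = 5) are passed on to rpart and play the role of the Learner parameters.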
Fields
Only fields inherited from PipeOp.
Methods
Only methods inherited from PipeOpEncodePL/PipeOpTaskPreproc/PipeOp.
References
Gorishniy Y, Rubachev I, Babenko A (2022). “On Embeddings for Numerical Features in Tabular Deep Learning.” In Advances in Neural Information Processing Systems, volume 35, 24991–25004. https://proceedings.neurips.cc/paper_files/paper/2022/hash/9e9f0ffc3d836836ca96cbf8fe14b105-Abstract-Conference.html.
See also
https://mlr-org.com/pipeops.html
Other PipeOps: PipeOp, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encodepl, mlr_pipeops_encodeplquantiles, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson
Other Piecewise Linear Encoding PipeOps: mlr_pipeops_encodepl, mlr_pipeops_encodeplquantiles
Examples
library(mlr3)
library(mlr3pipelines)
# For classification task
task = tsk("iris")$select(c("Petal.Width", "Petal.Length"))
pop = po("encodepltree", task_type = "TaskClassif")
train_out = pop$train(list(task))[[1L]]
# Calculated bin boundaries per feature
pop$state$bins
#> $Petal.Length
#> [1] 1.00 2.45 4.75 6.90
#>
#> $Petal.Width
#> [1] 0.10 0.80 1.75 2.50
#>
# Each feature was split into three encoded features using piecewise linear encoding
train_out$head()
#> Species Petal.Length.bin1 Petal.Length.bin2 Petal.Length.bin3
#> <fctr> <num> <num> <num>
#> 1: setosa 0.2758621 0 0
#> 2: setosa 0.2758621 0 0
#> 3: setosa 0.2068966 0 0
#> 4: setosa 0.3448276 0 0
#> 5: setosa 0.2758621 0 0
#> 6: setosa 0.4827586 0 0
#> Petal.Width.bin1 Petal.Width.bin2 Petal.Width.bin3
#> <num> <num> <num>
#> 1: 0.1428571 0 0
#> 2: 0.1428571 0 0
#> 3: 0.1428571 0 0
#> 4: 0.1428571 0 0
#> 5: 0.1428571 0 0
#> 6: 0.4285714 0 0
# Prediction works the same as training, using the bins learned during training
predict_out = pop$predict(list(task))[[1L]]
predict_out$head()
#> Species Petal.Length.bin1 Petal.Length.bin2 Petal.Length.bin3
#> <fctr> <num> <num> <num>
#> 1: setosa 0.2758621 0 0
#> 2: setosa 0.2758621 0 0
#> 3: setosa 0.2068966 0 0
#> 4: setosa 0.3448276 0 0
#> 5: setosa 0.2758621 0 0
#> 6: setosa 0.4827586 0 0
#> Petal.Width.bin1 Petal.Width.bin2 Petal.Width.bin3
#> <num> <num> <num>
#> 1: 0.1428571 0 0
#> 2: 0.1428571 0 0
#> 3: 0.1428571 0 0
#> 4: 0.1428571 0 0
#> 5: 0.1428571 0 0
#> 6: 0.4285714 0 0
# Controlling behavior of the tree learner, here: setting minimum number of
# observations per node for a split to be attempted
pop$param_set$set_values(minsplit = 5)
train_out = pop$train(list(task))[[1L]]
# For this task, the resulting bin boundaries are the same as before
pop$state$bins
#> $Petal.Length
#> [1] 1.00 2.45 4.75 6.90
#>
#> $Petal.Width
#> [1] 0.10 0.80 1.75 2.50
#>
train_out$head()
#> Species Petal.Length.bin1 Petal.Length.bin2 Petal.Length.bin3
#> <fctr> <num> <num> <num>
#> 1: setosa 0.2758621 0 0
#> 2: setosa 0.2758621 0 0
#> 3: setosa 0.2068966 0 0
#> 4: setosa 0.3448276 0 0
#> 5: setosa 0.2758621 0 0
#> 6: setosa 0.4827586 0 0
#> Petal.Width.bin1 Petal.Width.bin2 Petal.Width.bin3
#> <num> <num> <num>
#> 1: 0.1428571 0 0
#> 2: 0.1428571 0 0
#> 3: 0.1428571 0 0
#> 4: 0.1428571 0 0
#> 5: 0.1428571 0 0
#> 6: 0.4285714 0 0
# For regression task
task = tsk("mtcars")$select(c("cyl", "hp"))
pop = po("encodepltree", task_type = "TaskRegr")
train_out = pop$train(list(task))[[1L]]
# Calculated bin boundaries per feature
pop$state$bins
#> $cyl
#> [1] 4 5 7 8
#>
#> $hp
#> [1] 52 118 335
#>
# First feature was split into three encoded features, second into two, using piecewise linear encoding
train_out$head()
#> mpg cyl.bin1 cyl.bin2 cyl.bin3 hp.bin1 hp.bin2
#> <num> <num> <num> <num> <num> <num>
#> 1: 21.0 1 0.5 0 0.8787879 0.0000000
#> 2: 21.0 1 0.5 0 0.8787879 0.0000000
#> 3: 22.8 0 0.0 0 0.6212121 0.0000000
#> 4: 21.4 1 0.5 0 0.8787879 0.0000000
#> 5: 18.7 1 1.0 1 1.0000000 0.2626728
#> 6: 18.1 1 0.5 0 0.8030303 0.0000000