Piecewise Linear Encoding using Quantiles

Encodes numeric and integer feature columns using piecewise linear encoding. For details, see the documentation of PipeOpEncodePL or Gorishniy et al. (2022).

Bins are constructed by taking the quantiles of the respective feature column as bin boundaries. The first and last boundaries are set to the minimum and maximum value of the feature, respectively. The number of bins can be controlled with the numsplits hyperparameter. Affected feature columns may contain NAs. These are ignored when calculating quantiles.
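The boundary construction described above can be approximated with base R. This is a minimal sketch, not the package's internal code. It assumes the boundaries are the quantiles at numsplits + 1 evenly spaced probabilities between 0 and 1, which is consistent with the boundaries shown in the Examples for the iris data.

```r
# Sketch only: reproduce the bin boundaries for one feature with base R.
# Assumption: boundaries are the quantiles at numsplits + 1 evenly spaced
# probabilities, so the first is the minimum and the last is the maximum.
x = iris$Petal.Length
numsplits = 2

probs = seq(0, 1, length.out = numsplits + 1)
# na.rm = TRUE mirrors the documented behavior of ignoring NAs
bins = unname(stats::quantile(x, probs = probs, na.rm = TRUE, type = 7))
bins
#> [1] 1.00 4.35 6.90
```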

Format

R6Class object inheriting from PipeOpEncodePL/PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.

Construction

PipeOpEncodePLQuantiles$new(id = "encodeplquantiles", param_vals = list())

id :: character(1)
Identifier of resulting object, default "encodeplquantiles".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output is the input Task with all affected numeric and integer columns encoded using piecewise linear encoding with bins being derived from the quantiles of the respective original feature column.
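To illustrate what the encoded columns contain, the following sketch encodes a numeric vector given a set of bin boundaries. The helper encode_pl is hypothetical and not part of mlr3pipelines. It assumes strictly increasing boundaries and values inside the boundary range; how the package treats values outside that range may differ from the clamping used here.

```r
# Hypothetical helper (not part of mlr3pipelines): piecewise linear encoding
# of a numeric vector x, given increasing bin boundaries.
encode_pl = function(x, bins) {
  lower = head(bins, -1)  # lower boundary of each bin
  upper = tail(bins, -1)  # upper boundary of each bin
  out = vapply(seq_along(lower), function(t) {
    # Share of bin t that lies below x, clamped to [0, 1]:
    # 0 if x is below the bin, 1 if x is above it, linear in between.
    pmin(pmax((x - lower[t]) / (upper[t] - lower[t]), 0), 1)
  }, numeric(length(x)))
  out = matrix(out, nrow = length(x))
  colnames(out) = paste0("bin", seq_along(lower))
  out
}

# Boundaries taken from the second example (Petal.Length, numsplits = 4)
res = encode_pl(c(1.4, 1.7), c(1.0, 1.6, 4.3, 5.1, 6.9))
res
```

For these two inputs the result should agree with rows 1 and 6 of the Petal.Length columns in the second example: 1.4 lies inside the first bin, while 1.7 fills the first bin completely and a small part of the second.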

State

The $state is a named list with the $state elements inherited from PipeOpEncodePL/PipeOpTaskPreproc.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

numsplits :: integer(1)
Number of bins to create. Initialized to 2.
type :: integer(1)
Method used to calculate sample quantiles. See help of stats::quantile. Default is 7.
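The effect of type can be seen by calling stats::quantile directly. This sketch only uses base R and compares the default type 7 (linear interpolation) with type 3 (nearest even order statistic, which returns an observed value) for the median of one iris feature.

```r
# Compare two quantile definitions on the same data
x = iris$Petal.Length

q7 = unname(stats::quantile(x, probs = 0.5, type = 7))  # interpolates between observations
q3 = unname(stats::quantile(x, probs = 0.5, type = 3))  # picks an order statistic

c(type7 = q7, type3 = q3)
#> type7 type3
#>  4.35  4.30
```

These values correspond to the middle boundary of Petal.Length in the two runs shown in the Examples.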

Internals

This overloads the private$.get_bins() method of PipeOpEncodePL and uses the stats::quantile function to derive the bins used for piecewise linear encoding.

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpEncodePL/PipeOpTaskPreproc/PipeOp.

References

Gorishniy Y, Rubachev I, Babenko A (2022). “On Embeddings for Numerical Features in Tabular Deep Learning.” In Advances in Neural Information Processing Systems, volume 35, 24991–25004. https://proceedings.neurips.cc/paper_files/paper/2022/hash/9e9f0ffc3d836836ca96cbf8fe14b105-Abstract-Conference.html.

Other PipeOps: PipeOp, PipeOpEncodePL, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encodepltree, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_info, mlr_pipeops_isomap, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson

Other Piecewise Linear Encoding PipeOps: PipeOpEncodePL, mlr_pipeops_encodepltree

Examples

library(mlr3)
library(mlr3pipelines)

task = tsk("iris")$select(c("Petal.Width", "Petal.Length"))
pop = po("encodeplquantiles")

train_out = pop$train(list(task))[[1L]]
# Calculated bin boundaries per feature
pop$state$bins
#> $Petal.Length
#> [1] 1.00 4.35 6.90
#> 
#> $Petal.Width
#> [1] 0.1 1.3 2.5
#> 
# Each feature was split into two encoded features using piecewise linear encoding
train_out$head()
#>    Species Petal.Length.bin1 Petal.Length.bin2 Petal.Width.bin1
#>     <fctr>             <num>             <num>            <num>
#> 1:  setosa        0.11940299                 0       0.08333333
#> 2:  setosa        0.11940299                 0       0.08333333
#> 3:  setosa        0.08955224                 0       0.08333333
#> 4:  setosa        0.14925373                 0       0.08333333
#> 5:  setosa        0.11940299                 0       0.08333333
#> 6:  setosa        0.20895522                 0       0.25000000
#>    Petal.Width.bin2
#>               <num>
#> 1:                0
#> 2:                0
#> 3:                0
#> 4:                0
#> 5:                0
#> 6:                0

# Prediction works the same as training, using the bins learned during training
predict_out = pop$predict(list(task))[[1L]]
predict_out$head()
#>    Species Petal.Length.bin1 Petal.Length.bin2 Petal.Width.bin1
#>     <fctr>             <num>             <num>            <num>
#> 1:  setosa        0.11940299                 0       0.08333333
#> 2:  setosa        0.11940299                 0       0.08333333
#> 3:  setosa        0.08955224                 0       0.08333333
#> 4:  setosa        0.14925373                 0       0.08333333
#> 5:  setosa        0.11940299                 0       0.08333333
#> 6:  setosa        0.20895522                 0       0.25000000
#>    Petal.Width.bin2
#>               <num>
#> 1:                0
#> 2:                0
#> 3:                0
#> 4:                0
#> 5:                0
#> 6:                0

# Binning into four bins per feature
# Using the nearest even order statistic for calculating quantiles
pop$param_set$set_values(numsplits = 4, type = 3)

train_out = pop$train(list(task))[[1L]]
# Calculated bin boundaries per feature
pop$state$bins
#> $Petal.Length
#> [1] 1.0 1.6 4.3 5.1 6.9
#> 
#> $Petal.Width
#> [1] 0.1 0.3 1.3 1.8 2.5
#> 
# Each feature was split into four encoded features using
# piecewise linear encoding
train_out$head()
#>    Species Petal.Length.bin1 Petal.Length.bin2 Petal.Length.bin3
#>     <fctr>             <num>             <num>             <num>
#> 1:  setosa         0.6666667        0.00000000                 0
#> 2:  setosa         0.6666667        0.00000000                 0
#> 3:  setosa         0.5000000        0.00000000                 0
#> 4:  setosa         0.8333333        0.00000000                 0
#> 5:  setosa         0.6666667        0.00000000                 0
#> 6:  setosa         1.0000000        0.03703704                 0
#>    Petal.Length.bin4 Petal.Width.bin1 Petal.Width.bin2 Petal.Width.bin3
#>                <num>            <num>            <num>            <num>
#> 1:                 0              0.5              0.0                0
#> 2:                 0              0.5              0.0                0
#> 3:                 0              0.5              0.0                0
#> 4:                 0              0.5              0.0                0
#> 5:                 0              0.5              0.0                0
#> 6:                 0              1.0              0.1                0
#>    Petal.Width.bin4
#>               <num>
#> 1:                0
#> 2:                0
#> 3:                0
#> 4:                0
#> 5:                0
#> 6:                0