Encodes columns of type factor and ordered.
Possible encodings are "one-hot" encoding, as well as encoding according to stats::contr.helmert(), stats::contr.poly(), stats::contr.sum() and stats::contr.treatment().
Newly created columns are named via pattern [column-name].[x] where x is the respective factor level for "one-hot" and "treatment" encoding, and an integer sequence otherwise.
Use the PipeOpTaskPreproc $affect_columns functionality to encode only a subset of columns, or only columns of a certain type.
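As a minimal sketch of restricting the encoding to a subset of columns, the following encodes one of two factor features and leaves the other untouched. The toy data and the feature name z are made up for illustration; selector_name() is one of the selector functions shipped with mlr3pipelines, and other selectors could be used in the same position.

```r
library("mlr3")
library("mlr3pipelines")

# Hypothetical toy task: "x" is the target, "y" and "z" are factor features.
data = data.table::data.table(
  x = factor(letters[1:3]),
  y = factor(letters[1:3]),
  z = factor(c("u", "v", "u"))
)
task = TaskClassif$new("task", data, "x")

# Only the feature named "y" is encoded; "z" is passed through unchanged.
poe = po("encode", affect_columns = selector_name("y"))
out = poe$train(list(task))[[1]]
out$feature_names
```

With the default "one-hot" method, y should be replaced by one column per level while z remains a factor.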
character-type features can be encoded by converting them to factor features first, using ppl("convert_types", "character", "factor").
Format
R6Class object inheriting from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.
Construction
id :: character(1)
Identifier of resulting object, default "encode".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with all affected factor and ordered features encoded according to the method parameter.
State
The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:
contrasts :: named list of matrix
List of contrast matrices, one for each affected discrete feature. The rows of each matrix correspond to (training task) levels, the columns to the new columns that replace the old discrete feature. See stats::contrasts.
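To illustrate how such a contrast matrix maps levels to new columns, here is a base R sketch. It is not the PipeOp's actual implementation, only the row-lookup idea described above; the feature name "y" and the example values are hypothetical.

```r
# Levels seen during training, and the "one-hot" contrast matrix for them.
lvls = c("a", "b", "c")
contr = stats::contr.treatment(lvls, contrasts = FALSE)

# Rows are named after the levels, columns after the new columns.
dimnames(contr)

# Encoding a feature then amounts to looking up one matrix row per observation.
feature = factor(c("b", "a", "c", "b"), levels = lvls)
encoded = contr[as.character(feature), , drop = FALSE]

# New columns follow the [column-name].[x] pattern (feature name "y" assumed).
colnames(encoded) = paste("y", colnames(contr), sep = ".")
encoded
```

Because the matrix is fixed at training time, the same lookup can be reused for new data at prediction time.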
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:
method :: character(1)
Initialized to "one-hot". One of:
"one-hot": create a new column for each factor level.
"treatment": create \(n-1\) columns leaving out the first factor level of each factor variable (see stats::contr.treatment()).
"helmert": create columns according to Helmert contrasts (see stats::contr.helmert()).
"poly": create columns with contrasts based on orthogonal polynomials (see stats::contr.poly()).
"sum": create columns with contrasts summing to zero (see stats::contr.sum()).
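The methods listed above can be compared directly by looking at the underlying contrast matrices in base R. This sketch assumes a hypothetical three-level factor. Pairing "one-hot" with contr.treatment(contrasts = FALSE) is one plausible way to obtain a full indicator matrix and is not necessarily how the PipeOp builds it.

```r
lvls = c("a", "b", "c")

mats = list(
  "one-hot" = stats::contr.treatment(lvls, contrasts = FALSE), # one column per level
  treatment = stats::contr.treatment(lvls),                    # first level left out
  helmert   = stats::contr.helmert(lvls),
  poly      = stats::contr.poly(length(lvls)),
  sum       = stats::contr.sum(lvls)
)

# Number of new columns each method would create for a 3-level factor.
sapply(mats, ncol)

# Inspect a single matrix, e.g. the sum-to-zero contrasts.
mats$sum
```

Only contr.treatment() names its columns after factor levels, which matches the naming rule that the other methods use an integer sequence instead.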
Internals
Uses the stats::contrasts functions. This is relatively inefficient for features with a large number of levels.
Methods
Only methods inherited from PipeOpTaskPreprocSimple/PipeOpTaskPreproc/PipeOp.
See also
https://mlr-org.com/pipeops.html
Other PipeOps: PipeOp, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson
Examples
library("mlr3")
library("mlr3pipelines")  # provides po(), ppl() and %>>%
data = data.table::data.table(x = factor(letters[1:3]), y = factor(letters[1:3]))
task = TaskClassif$new("task", data, "x")
poe = po("encode")
# poe is initialized with encoding: "one-hot"
poe$train(list(task))[[1]]$data()
#> x y.a y.b y.c
#> <fctr> <num> <num> <num>
#> 1: a 1 0 0
#> 2: b 0 1 0
#> 3: c 0 0 1
# other kinds of encoding:
poe$param_set$values$method = "treatment"
poe$train(list(task))[[1]]$data()
#> x y.b y.c
#> <fctr> <num> <num>
#> 1: a 0 0
#> 2: b 1 0
#> 3: c 0 1
poe$param_set$values$method = "helmert"
poe$train(list(task))[[1]]$data()
#> x y.1 y.2
#> <fctr> <num> <num>
#> 1: a -1 -1
#> 2: b 1 -1
#> 3: c 0 2
poe$param_set$values$method = "poly"
poe$train(list(task))[[1]]$data()
#> x y.1 y.2
#> <fctr> <num> <num>
#> 1: a -7.071068e-01 0.4082483
#> 2: b -7.850462e-17 -0.8164966
#> 3: c 7.071068e-01 0.4082483
poe$param_set$values$method = "sum"
poe$train(list(task))[[1]]$data()
#> x y.1 y.2
#> <fctr> <num> <num>
#> 1: a 1 0
#> 2: b 0 1
#> 3: c -1 -1
# converting character-columns
data_chr = data.table::data.table(x = factor(letters[1:3]), y = letters[1:3])
task_chr = TaskClassif$new("task_chr", data_chr, "x")
goe = ppl("convert_types", "character", "factor") %>>% po("encode")
goe$train(task_chr)[[1]]$data()
#> x y.a y.b y.c
#> <fctr> <num> <num> <num>
#> 1: a 1 0 0
#> 2: b 0 1 0
#> 3: c 0 0 1