Impute factorial features by adding a new level ".MISSING".
Impute numerical features by constant values shifted below the minimum or above the maximum by using \(min(x) - offset - multiplier * diff(range(x))\) or \(max(x) + offset + multiplier * diff(range(x))\).
This type of imputation is especially sensible in the context of tree-based methods, see also Ding & Simonoff (2010).
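As a worked illustration of the two formulas above, the following base R sketch computes both constants for a small made-up vector. The vector `x` and the variable names `width`, `below` and `above` are illustrative assumptions. `offset` and `multiplier` are set to 1, and missing values are dropped before the range is taken.

```r
# Illustrative data; not taken from any real task.
x = c(2, 5, NA, 10)
offset = 1
multiplier = 1

# Width of the observed range, ignoring missing values: 10 - 2 = 8.
width = diff(range(x, na.rm = TRUE))

# Constant below the minimum: 2 - 1 - 1 * 8 = -7.
below = min(x, na.rm = TRUE) - offset - multiplier * width

# Constant above the maximum: 10 + 1 + 1 * 8 = 19.
above = max(x, na.rm = TRUE) + offset + multiplier * width

print(c(below = below, above = above))
```

Either constant lies strictly outside the observed values, so a tree can separate the imputed observations from the rest with a single split.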
Format
R6Class object inheriting from PipeOpImpute/PipeOp.
Construction
id :: character(1)
Identifier of resulting object, default "imputeoor".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
Input and Output Channels
Input and output channels are inherited from PipeOpImpute.
The output is the input Task with all affected features having missing values imputed as described above.
State
The $state is a named list with the $state elements inherited from PipeOpImpute.
The $state$model contains either ".MISSING" used for character and factor (also ordered) features or numeric(1) indicating the constant value used for imputation of integer and numeric features.
Parameters
The parameters are the parameters inherited from PipeOpImpute, as well as:
min :: logical(1)
Should integer and numeric features be shifted below the minimum? Initialized to TRUE. If FALSE they are shifted above the maximum. See also the description above.
offset :: numeric(1)
Numerical non-negative offset as used in the description above for integer and numeric features. Initialized to 1.
multiplier :: numeric(1)
Numerical non-negative multiplier as used in the description above for integer and numeric features. Initialized to 1.
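To make the effect of the three parameters concrete, here is a minimal base R sketch. The helper `oor_value()` is hypothetical and not part of mlr3pipelines; it only mirrors the formula from the description. The input values 14 and 846 are assumptions, chosen so that the default settings give -819, the constant shown for insulin in the Examples output.

```r
# Hypothetical helper mirroring the documented formula; not the package's code.
oor_value = function(x, min = TRUE, offset = 1, multiplier = 1) {
  stopifnot(offset >= 0, multiplier >= 0)
  width = diff(range(x, na.rm = TRUE))
  if (min) {
    # Shift below the observed minimum.
    base::min(x, na.rm = TRUE) - offset - multiplier * width
  } else {
    # Shift above the observed maximum.
    base::max(x, na.rm = TRUE) + offset + multiplier * width
  }
}

# Assumed feature with observed range 14 to 846 and one missing value.
x = c(14, 846, NA)

default_value = oor_value(x)                  # 14 - 1 - 832
above_value = oor_value(x, min = FALSE)       # 846 + 1 + 832
half_value = oor_value(x, multiplier = 0.5)   # 14 - 1 - 416
zero_value = oor_value(x, offset = 0, multiplier = 0)

print(c(default_value, above_value, half_value, zero_value))
```

Note that with both `offset` and `multiplier` set to 0 the formula returns the observed minimum itself, so the imputed value is no longer out of range.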
Internals
Adds an explicit new level() to factor and ordered features, but not to character features.
For integer and numeric features uses the min, max, diff and range functions. integer and numeric features that are entirely NA are imputed as 0.
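The behaviour described above can be sketched in base R as follows. This illustrates the idea and is not the package's actual implementation; the variable names are assumptions, and so is the choice to append ".MISSING" as the last level.

```r
# Factor feature: append the new level, then fill the missing entries.
f = factor(c("a", NA, "b"))
levels(f) = c(levels(f), ".MISSING")
f[is.na(f)] = ".MISSING"
print(f)

# Character feature: there are no levels to extend, so the string is filled in directly.
s = c("a", NA, "b")
s[is.na(s)] = ".MISSING"
print(s)

# Entirely missing numeric feature: no observed minimum or maximum exists,
# so the constant falls back to 0 instead of applying the formula.
z = c(NA_real_, NA_real_)
value = if (all(is.na(z))) 0 else min(z, na.rm = TRUE) - 1 - diff(range(z, na.rm = TRUE))
z[is.na(z)] = value
print(z)
```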
Methods
Only methods inherited from PipeOpImpute/PipeOp.
References
Ding Y, Simonoff JS (2010). “An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data.” Journal of Machine Learning Research, 11(6), 131-170. https://jmlr.org/papers/v11/ding10a.html.
See also
https://mlr-org.com/pipeops.html
Other PipeOps:
PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreprocSimple, PipeOpTaskPreproc, PipeOp, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encode, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_scale, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson, mlr_pipeops
Other Imputation PipeOps:
PipeOpImpute, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputesample
Examples
library("mlr3")
library("mlr3pipelines")  # provides po()
set.seed(2409)
data = tsk("pima")$data()
data$y = factor(c(NA, sample(letters, size = 766, replace = TRUE), NA))
data$z = ordered(c(NA, sample(1:10, size = 767, replace = TRUE)))
task = TaskClassif$new("task", backend = data, target = "diabetes")
task$missings()
#> diabetes      age  glucose  insulin     mass pedigree pregnant pressure
#>        0        0        5      374       11        0        0       35
#>  triceps        y        z
#>      227        2        1
po = po("imputeoor")
new_task = po$train(list(task = task))[[1]]
new_task$missings()
#> diabetes      age pedigree pregnant  glucose  insulin     mass pressure
#>        0        0        0        0        0        0        0        0
#>  triceps        y        z
#>        0        0        0
new_task$data()
#>      diabetes   age pedigree pregnant glucose insulin  mass pressure triceps
#>        <fctr> <num>    <num>    <num>   <num>   <num> <num>    <num>   <num>
#>   1:      pos    50    0.627        6     148    -819  33.6       72      35
#>   2:      neg    31    0.351        1      85    -819  26.6       66      29
#>   3:      pos    32    0.672        8     183    -819  23.3       64     -86
#>   4:      neg    21    0.167        1      89      94  28.1       66      23
#>   5:      pos    33    2.288        0     137     168  43.1       40      35
#>  ---
#> 764:      neg    63    0.171       10     101     180  32.9       76      48
#> 765:      neg    27    0.340        2     122    -819  36.8       70      27
#> 766:      neg    30    0.245        5     121     112  26.2       72      23
#> 767:      pos    47    0.349        1     126    -819  30.1       60     -86
#> 768:      neg    23    0.315        1      93    -819  30.4       70      31
#>             y        z
#>        <fctr>    <ord>
#>   1: .MISSING .MISSING
#>   2:        l        9
#>   3:        q        6
#>   4:        f        3
#>   5:        l        3
#>  ---
#> 764:        o        7
#> 765:        n        5
#> 766:        e        6
#> 767:        c        8
#> 768: .MISSING        9