Impute factorial features by adding a new level `".MISSING"`.

Impute numerical features by constant values shifted below the minimum or above the maximum, computed as \(min(x) - offset - multiplier * diff(range(x))\) or \(max(x) + offset + multiplier * diff(range(x))\), respectively.
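As a quick illustration, the out-of-range constant can be computed directly in base R (a standalone sketch of the formula above, not using the `PipeOp` itself), here with the default settings `offset = 1`, `multiplier = 1` and shifting below the minimum:

```r
# Out-of-range constant for a numeric feature x with offset = 1,
# multiplier = 1, shifted below the observed minimum:
x <- c(2, 5, 9, NA)
offset <- 1
multiplier <- 1
oor <- min(x, na.rm = TRUE) - offset - multiplier * diff(range(x, na.rm = TRUE))
oor  # 2 - 1 - 1 * (9 - 2) = -6
```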

This type of imputation is especially sensible in the context of tree-based methods; see also Ding &amp; Simonoff (2010).

If a factor is missing during prediction but not during training, this adds an unseen level `".MISSING"`, which would be a problem for most models. It is therefore recommended to use `po("fixfactors")` and `po("imputesample", affect_columns = selector_type(types = c("factor", "ordered")))` (or some other imputation method) after this imputation method if missing values are expected during prediction in factor columns that had no missing values during training.

## Format

`R6Class` object inheriting from `PipeOpImpute`/`PipeOp`.

## Construction

`id` :: `character(1)`
Identifier of resulting object, default `"imputeoor"`.

`param_vals` :: named `list`
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default `list()`.

## Input and Output Channels

Input and output channels are inherited from `PipeOpImpute`.

The output is the input `Task` with all affected features having missing values imputed as described above.

## State

The `$state` is a named `list` with the `$state` elements inherited from `PipeOpImpute`.

The `$state$model` contains either `".MISSING"`, used for `character` and `factor` (also `ordered`) features, or a `numeric(1)` indicating the constant value used for imputation of `integer` and `numeric` features.

## Parameters

The parameters are the parameters inherited from `PipeOpImpute`, as well as:

`min` :: `logical(1)`
Should `integer` and `numeric` features be shifted below the minimum? Initialized to `TRUE`. If `FALSE`, they are shifted above the maximum. See also the description above.

`offset` :: `numeric(1)`
Non-negative numeric offset as used in the description above for `integer` and `numeric` features. Initialized to 1.

`multiplier` :: `numeric(1)`
Non-negative numeric multiplier as used in the description above for `integer` and `numeric` features. Initialized to 1.
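These hyperparameters can be set at construction. The following sketch (assuming `mlr3` and `mlr3pipelines` are loaded) shifts imputed values above the maximum instead of below the minimum:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("pima")
# shift above the maximum instead of below the minimum
po_oor = po("imputeoor", min = FALSE, offset = 2, multiplier = 1)
imputed = po_oor$train(list(task))[[1]]$data()
# imputed glucose values now lie above the originally observed maximum
max(imputed$glucose) > max(task$data()$glucose, na.rm = TRUE)
```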

## Internals

Adds an explicit new `level()` to `factor` and `ordered` features, but not to `character` features. For `integer` and `numeric` features, the `min`, `max`, `diff` and `range` functions are used. `integer` and `numeric` features that are entirely `NA` are imputed as `0`.
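The all-`NA` special case can be seen with a tiny hypothetical task (a sketch assuming `mlr3` and `mlr3pipelines` are available; the column names are made up for illustration):

```r
library(mlr3)
library(mlr3pipelines)

d = data.frame(
  y = factor(c("a", "b", "a", "b")),
  x = rep(NA_integer_, 4)  # feature that is entirely missing
)
task = TaskClassif$new("toy", backend = d, target = "y")
out = po("imputeoor")$train(list(task))[[1]]$data()
out$x  # entirely-NA integer feature is imputed as 0
```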

## Methods

Only methods inherited from `PipeOpImpute`/`PipeOp`.

## References

Ding Y, Simonoff JS (2010).
“An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data.”
*Journal of Machine Learning Research*, **11**(6), 131-170.
https://jmlr.org/papers/v11/ding10a.html.

## See also

https://mlr-org.com/pipeops.html

Other PipeOps:
`PipeOp`, `PipeOpEnsemble`, `PipeOpImpute`, `PipeOpTargetTrafo`, `PipeOpTaskPreproc`, `PipeOpTaskPreprocSimple`, `mlr_pipeops`, `mlr_pipeops_adas`, `mlr_pipeops_blsmote`, `mlr_pipeops_boxcox`, `mlr_pipeops_branch`, `mlr_pipeops_chunk`, `mlr_pipeops_classbalancing`, `mlr_pipeops_classifavg`, `mlr_pipeops_classweights`, `mlr_pipeops_colapply`, `mlr_pipeops_collapsefactors`, `mlr_pipeops_colroles`, `mlr_pipeops_copy`, `mlr_pipeops_datefeatures`, `mlr_pipeops_encode`, `mlr_pipeops_encodeimpact`, `mlr_pipeops_encodelmer`, `mlr_pipeops_featureunion`, `mlr_pipeops_filter`, `mlr_pipeops_fixfactors`, `mlr_pipeops_histbin`, `mlr_pipeops_ica`, `mlr_pipeops_imputeconstant`, `mlr_pipeops_imputehist`, `mlr_pipeops_imputelearner`, `mlr_pipeops_imputemean`, `mlr_pipeops_imputemedian`, `mlr_pipeops_imputemode`, `mlr_pipeops_imputesample`, `mlr_pipeops_kernelpca`, `mlr_pipeops_learner`, `mlr_pipeops_missind`, `mlr_pipeops_modelmatrix`, `mlr_pipeops_multiplicityexply`, `mlr_pipeops_multiplicityimply`, `mlr_pipeops_mutate`, `mlr_pipeops_nmf`, `mlr_pipeops_nop`, `mlr_pipeops_ovrsplit`, `mlr_pipeops_ovrunite`, `mlr_pipeops_pca`, `mlr_pipeops_proxy`, `mlr_pipeops_quantilebin`, `mlr_pipeops_randomprojection`, `mlr_pipeops_randomresponse`, `mlr_pipeops_regravg`, `mlr_pipeops_removeconstants`, `mlr_pipeops_renamecolumns`, `mlr_pipeops_replicate`, `mlr_pipeops_rowapply`, `mlr_pipeops_scale`, `mlr_pipeops_scalemaxabs`, `mlr_pipeops_scalerange`, `mlr_pipeops_select`, `mlr_pipeops_smote`, `mlr_pipeops_smotenc`, `mlr_pipeops_spatialsign`, `mlr_pipeops_subsample`, `mlr_pipeops_targetinvert`, `mlr_pipeops_targetmutate`, `mlr_pipeops_targettrafoscalerange`, `mlr_pipeops_textvectorizer`, `mlr_pipeops_threshold`, `mlr_pipeops_tunethreshold`, `mlr_pipeops_unbranch`, `mlr_pipeops_updatetarget`, `mlr_pipeops_vtreat`, `mlr_pipeops_yeojohnson`

Other Imputation PipeOps:
`PipeOpImpute`, `mlr_pipeops_imputeconstant`, `mlr_pipeops_imputehist`, `mlr_pipeops_imputelearner`, `mlr_pipeops_imputemean`, `mlr_pipeops_imputemedian`, `mlr_pipeops_imputemode`, `mlr_pipeops_imputesample`
## Examples

```
library("mlr3")
set.seed(2409)
data = tsk("pima")$data()
data$y = factor(c(NA, sample(letters, size = 766, replace = TRUE), NA))
data$z = ordered(c(NA, sample(1:10, size = 767, replace = TRUE)))
task = TaskClassif$new("task", backend = data, target = "diabetes")
task$missings()
#> diabetes age glucose insulin mass pedigree pregnant pressure
#> 0 0 5 374 11 0 0 35
#> triceps y z
#> 227 2 1
po = po("imputeoor")
new_task = po$train(list(task = task))[[1]]
new_task$missings()
#> diabetes age pedigree pregnant glucose insulin mass pressure
#> 0 0 0 0 0 0 0 0
#> triceps y z
#> 0 0 0
new_task$data()
#> diabetes age pedigree pregnant glucose insulin mass pressure triceps
#> <fctr> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: pos 50 0.627 6 148 -819 33.6 72 35
#> 2: neg 31 0.351 1 85 -819 26.6 66 29
#> 3: pos 32 0.672 8 183 -819 23.3 64 -86
#> 4: neg 21 0.167 1 89 94 28.1 66 23
#> 5: pos 33 2.288 0 137 168 43.1 40 35
#> ---
#> 764: neg 63 0.171 10 101 180 32.9 76 48
#> 765: neg 27 0.340 2 122 -819 36.8 70 27
#> 766: neg 30 0.245 5 121 112 26.2 72 23
#> 767: pos 47 0.349 1 126 -819 30.1 60 -86
#> 768: neg 23 0.315 1 93 -819 30.4 70 31
#> y z
#> <fctr> <ord>
#> 1: .MISSING .MISSING
#> 2: l 9
#> 3: q 6
#> 4: f 3
#> 5: l 3
#> ---
#> 764: o 7
#> 765: n 5
#> 766: e 6
#> 767: c 8
#> 768: .MISSING 9
# recommended use when missing values are expected during prediction on
# factor columns that had no missing values during training
gr = po("imputeoor") %>>%
po("fixfactors") %>>%
po("imputesample", affect_columns = selector_type(types = c("factor", "ordered")))
t1 = as_task_classif(data.frame(l = as.ordered(letters[1:3]), t = letters[1:3]), target = "t")
t2 = as_task_classif(data.frame(l = as.ordered(c("a", NA, NA)), t = letters[1:3]), target = "t")
gr$train(t1)[[1]]$data()
#> t l
#> <fctr> <ord>
#> 1: a a
#> 2: b b
#> 3: c c
# missing values during prediction are sampled randomly
gr$predict(t2)[[1]]$data()
#> t l
#> <fctr> <ord>
#> 1: a a
#> 2: b c
#> 3: c c
```