Out of Range Imputation

Impute factor and ordered features by adding a new level ".MISSING".

Impute numeric, integer, POSIXct or Date features with a constant value shifted below the minimum or above the maximum of the feature, computed as $min(x) - offset - multiplier * diff(range(x))$ or $max(x) + offset + multiplier * diff(range(x))$, respectively.
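As a worked illustration of the two formulas, the following base R sketch computes both replacement values for a small vector. The helper `oor_value` and its argument `below` (standing in for the `min` hyperparameter) are hypothetical names used only here; they mirror the expressions above and are not the package's implementation.

```r
# Hypothetical helper mirroring the formulas above (not the package's code).
# `below = TRUE` corresponds to shifting below the minimum.
oor_value = function(x, below = TRUE, offset = 1, multiplier = 1) {
  spread = diff(range(x, na.rm = TRUE))
  if (below) {
    min(x, na.rm = TRUE) - offset - multiplier * spread
  } else {
    max(x, na.rm = TRUE) + offset + multiplier * spread
  }
}

x = c(2, NA, 5, 10)
oor_value(x)                 # 2 - 1 - 1 * 8 = -7
oor_value(x, below = FALSE)  # 10 + 1 + 1 * 8 = 19

# Replace the missing entry with the value below the minimum
x[is.na(x)] = oor_value(x)
x
```

Because the replacement lies outside the observed range, a single split in a tree-based model can separate the imputed observations from all observed ones.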

This type of imputation is especially sensible in the context of tree-based methods, see also Ding & Simonoff (2010).

Learners expect input Tasks to have the same factor (or ordered) levels during training and prediction. This PipeOp modifies the levels of factor and ordered features. Because a factor or ordered feature may contain missing values only during prediction, but not during training, the output Task could have different levels in the two stages.

To avoid problems with the Learners' expectation, it is necessary to control how the PipeOp handles this edge case. For this, use the create_empty_level hyperparameter inherited from PipeOpImpute.
If create_empty_level is set to TRUE, then an unseen level ".MISSING" is added to the feature during training and missing values are imputed as ".MISSING" during prediction. However, empty factor levels during training can be a problem for many Learners.
If create_empty_level is set to FALSE, then no empty level is introduced during training, but columns that have missing values only during prediction will not be imputed. This is why it may still be necessary to use po("imputesample", affect_columns = selector_type(types = c("factor", "ordered"))) (or another imputation method) after this imputation method. Note that setting create_empty_level to FALSE is the same as setting it to TRUE and using PipeOpFixFactors after this PipeOp.
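The trade-off between the two settings can be sketched in base R, without mlr3pipelines. The helper `add_missing_level` is a hypothetical stand-in for what the PipeOp does to a single factor column; it only illustrates the level bookkeeping and is not the package's code.

```r
# Training data without missing values; prediction data with one NA
train = factor(c("a", "b", "c"))
pred = factor(c("a", NA, "c"), levels = levels(train))

# Hypothetical helper: append ".MISSING" and assign it to NA entries
add_missing_level = function(f) {
  out = factor(f, levels = c(levels(f), ".MISSING"))
  out[is.na(out)] = ".MISSING"
  out
}

# Analogue of create_empty_level = TRUE: both stages share the same levels,
# but ".MISSING" has zero observations in the training data
train_true = add_missing_level(train)
pred_true = add_missing_level(pred)
table(train_true)
identical(levels(train_true), levels(pred_true))

# Analogue of create_empty_level = FALSE: levels stay unchanged,
# so the NA in the prediction data is left for a later imputation step
anyNA(pred)
```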

Format

R6Class object inheriting from PipeOpImpute/PipeOp.

Construction

PipeOpImputeOOR$new(id = "imputeoor", param_vals = list())

id :: character(1)
Identifier of resulting object, default "imputeoor".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
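
A minimal construction sketch, assuming mlr3pipelines is installed. The id "imputeoor_max" and the chosen hyperparameter values are illustrative only.

```r
library("mlr3pipelines")

# Direct construction with the documented arguments
po_oor = PipeOpImputeOOR$new(
  id = "imputeoor_max",
  param_vals = list(min = FALSE, offset = 2, multiplier = 0.5)
)

# Short form via the po() accessor; hyperparameters are passed by name
po_oor_short = po("imputeoor", id = "imputeoor_max",
  min = FALSE, offset = 2, multiplier = 0.5)

# Inspect the identifier and the hyperparameter values that were set
po_oor$id
po_oor$param_set$values[c("min", "offset", "multiplier")]
```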

Input and Output Channels

Input and output channels are inherited from PipeOpImpute.

The output is the input Task with all affected features having missing values imputed as described above.

State

The $state is a named list with the $state elements inherited from PipeOpImpute.

The $state$model contains either ".MISSING", used for character, factor and ordered features, or a numeric(1) giving the constant value used for imputation of integer, numeric, POSIXct or Date features.
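
The state can be inspected after training, as in the following sketch. The toy data frame and its column names are made up for illustration, and the sketch assumes that $state$model is a list named by feature, as described for PipeOpImpute.

```r
library("mlr3")
library("mlr3pipelines")

# Toy data (hypothetical): one numeric and one factor feature with NAs
df = data.frame(
  num = c(1, NA, 3),
  fct = factor(c("a", NA, "b")),
  target = factor(c("x", "y", "x"))
)
task = as_task_classif(df, target = "target")

po_oor = po("imputeoor")
imputed = po_oor$train(list(task))[[1]]

# Imputation values stored during training
po_oor$state$model

# The trained task should no longer contain missing values
imputed$missings()
```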

Parameters

The parameters are the parameters inherited from PipeOpImpute, as well as:

min :: logical(1)
Should integer and numeric features be shifted below the minimum? Initialized to TRUE. If FALSE they are shifted above the maximum. See also the description above.
offset :: numeric(1)
Numerical non-negative offset as used in the description above for integer, numeric, POSIXct and Date features. Initialized to 1.
multiplier :: numeric(1)
Numerical non-negative multiplier as used in the description above for integer, numeric, POSIXct and Date features. Initialized to 1.

Internals

Adds an explicit new level to factor and ordered features, but not to character features. For integer and numeric features, the min, max, diff and range functions are used. Features of type integer or numeric that are entirely NA are imputed as 0; factor and ordered features that are entirely NA are imputed as ".MISSING". For POSIXct and Date features, the value 0 is transformed into the respective data type.
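
The handling of date-like features can be sketched in base R. This assumes the shift is computed on the underlying numeric representation (days for Date, seconds for POSIXct) and converted back afterwards; whether the package proceeds exactly this way is an assumption.

```r
dates = as.Date(c("2020-01-01", NA, "2020-01-11"))

# Work on the underlying day counts, then convert back to Date
num = as.numeric(dates)
value = min(num, na.rm = TRUE) - 1 - 1 * diff(range(num, na.rm = TRUE))
imputed_date = as.Date(value, origin = "1970-01-01")
imputed_date  # 11 days before 2020-01-01

# Fallback for a feature that is entirely NA: 0 in the respective type
zero_date = as.Date(0, origin = "1970-01-01")
zero_time = as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
```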

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpImpute/PipeOp.

References

Ding Y, Simonoff JS (2010). “An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data.” Journal of Machine Learning Research, 11(6), 131-170. https://jmlr.org/papers/v11/ding10a.html.

Other PipeOps: PipeOp, PipeOpEncodePL, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_classweightsex, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encodeplquantiles, mlr_pipeops_encodepltree, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputesample, mlr_pipeops_info, mlr_pipeops_isomap, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_splines, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson

Other Imputation PipeOps: PipeOpImpute, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputesample

Examples

library("mlr3")
library("mlr3pipelines")  # provides po(), %>>% and selector_type()
set.seed(2409)
data = tsk("pima")$data()
data$y = factor(c(NA, sample(letters, size = 766, replace = TRUE), NA))
data$z = ordered(c(NA, sample(1:10, size = 767, replace = TRUE)))
task = TaskClassif$new("task", backend = data, target = "diabetes")
task$missings()
#> diabetes      age  glucose  insulin     mass pedigree pregnant pressure 
#>        0        0        5      374       11        0        0       35 
#>  triceps        y        z 
#>      227        2        1 
po_oor = po("imputeoor")  # separate name avoids shadowing the po() function
new_task = po_oor$train(list(task = task))[[1]]
new_task$missings()
#> diabetes      age pedigree pregnant  glucose  insulin     mass pressure 
#>        0        0        0        0        0        0        0        0 
#>  triceps        y        z 
#>        0        0        0 
new_task$data()
#>      diabetes   age pedigree pregnant glucose insulin  mass pressure triceps
#>        <fctr> <num>    <num>    <num>   <num>   <num> <num>    <num>   <num>
#>   1:      pos    50    0.627        6     148    -819  33.6       72      35
#>   2:      neg    31    0.351        1      85    -819  26.6       66      29
#>   3:      pos    32    0.672        8     183    -819  23.3       64     -86
#>   4:      neg    21    0.167        1      89      94  28.1       66      23
#>   5:      pos    33    2.288        0     137     168  43.1       40      35
#>  ---                                                                        
#> 764:      neg    63    0.171       10     101     180  32.9       76      48
#> 765:      neg    27    0.340        2     122    -819  36.8       70      27
#> 766:      neg    30    0.245        5     121     112  26.2       72      23
#> 767:      pos    47    0.349        1     126    -819  30.1       60     -86
#> 768:      neg    23    0.315        1      93    -819  30.4       70      31
#>             y        z
#>        <fctr>    <ord>
#>   1: .MISSING .MISSING
#>   2:        l        9
#>   3:        q        6
#>   4:        f        3
#>   5:        l        3
#>  ---                  
#> 764:        o        7
#> 765:        n        5
#> 766:        e        6
#> 767:        c        8
#> 768: .MISSING        9

# recommended use when missing values are expected during prediction on
# factor columns that had no missing values during training
gr = po("imputeoor", create_empty_level = FALSE) %>>%
  po("imputesample", affect_columns = selector_type(types = c("factor", "ordered")))
t1 = as_task_classif(data.frame(l = as.ordered(letters[1:3]), t = letters[1:3]), target = "t")
t2 = as_task_classif(data.frame(l = as.ordered(c("a", NA, NA)), t = letters[1:3]), target = "t")
gr$train(t1)[[1]]$data()
#>         t     l
#>    <fctr> <ord>
#> 1:      a     a
#> 2:      b     b
#> 3:      c     c

# missing values during prediction are sampled randomly
gr$predict(t2)[[1]]$data()
#>         t     l
#>    <fctr> <ord>
#> 1:      a     a
#> 2:      b     c
#> 3:      c     c