Impute Numeric, Integer, POSIXct or Date Features by Histogram

Impute numeric, integer, POSIXct or Date features by histogram.

During training, a histogram is fitted on each column using R's hist() function. Imputed values are then sampled from the fitted histogram in a two-step process: first, a bin is sampled from the histogram with probability proportional to its count; then a value is sampled uniformly from within that bin. This approximates sampling from the empirical training data distribution (i.e. sampling from the training data with replacement), but is much more memory efficient for large datasets, since the $state only needs to store the bin counts and breaks rather than the training data.
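The two-step sampling described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper names fit_hist() and sample_hist() are hypothetical, and the real PipeOp may differ in details such as how integer, POSIXct or Date columns are handled.

```r
# Hypothetical helpers for illustration only; not part of mlr3pipelines.
set.seed(1)

# A training column with some missing values.
x = c(2.1, 3.4, NA, 5.0, 5.2, NA, 7.9, 8.3, 4.4, 6.1)

# "Train": keep only the bin counts and break points, not the data itself.
fit_hist = function(x) {
  h = graphics::hist(x, plot = FALSE)  # hist() ignores non-finite values
  list(counts = h$counts, breaks = h$breaks)
}

# "Impute": draw a bin with probability proportional to its count,
# then draw a value uniformly between that bin's lower and upper break.
sample_hist = function(model, n) {
  n_bins = length(model$counts)
  bin = sample.int(n_bins, size = n, replace = TRUE, prob = model$counts)
  stats::runif(n, min = model$breaks[bin], max = model$breaks[bin + 1])
}

model = fit_hist(x)
missing = is.na(x)
x_imputed = x
x_imputed[missing] = sample_hist(model, sum(missing))
```

Only model (the counts and breaks) is needed at imputation time, which is why this representation stays small regardless of how many training rows there are.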

Format

R6Class object inheriting from PipeOpImpute/PipeOp.

Construction

PipeOpImputeHist$new(id = "imputehist", param_vals = list())

id :: character(1)
Identifier of resulting object, default "imputehist".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
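As a sketch of how param_vals might be used at construction, the following restricts imputation to a single column. It assumes that affect_columns is among the hyperparameters inherited from PipeOpImpute and that selector_name() is available from mlr3pipelines; the id "imputehist_glucose" is an arbitrary illustrative choice.

```r
library("mlr3")
library("mlr3pipelines")

# Assumption: `affect_columns` is a hyperparameter inherited from PipeOpImpute.
op = PipeOpImputeHist$new(
  id = "imputehist_glucose",
  param_vals = list(affect_columns = selector_name("glucose"))
)

op$id
op$param_set$values
```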

Input and Output Channels

Input and output channels are inherited from PipeOpImpute.

The output is the input Task with missing values in all affected numeric, integer, POSIXct or Date features imputed by (column-wise) histogram; see Description for details.

State

The $state is a named list with the $state elements inherited from PipeOpImpute.

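The $state$model is a named list with one entry per affected feature; each entry is a list containing the elements $counts (the number of training observations in each bin) and $breaks (the bin boundaries).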

Parameters

The parameters are the parameters inherited from PipeOpImpute.

Internals

Uses the graphics::hist() function. Features that are entirely NA are imputed as 0.

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpImpute/PipeOp.

Other PipeOps: PipeOp, PipeOpEncodePL, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_classweightsex, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encodeplquantiles, mlr_pipeops_encodepltree, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_info, mlr_pipeops_isomap, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_splines, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson

Other Imputation PipeOps: PipeOpImpute, mlr_pipeops_imputeconstant, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample

Examples

library("mlr3")
library("mlr3pipelines")

task = tsk("pima")
task$missings()
#> diabetes      age  glucose  insulin     mass pedigree pregnant pressure 
#>        0        0        5      374       11        0        0       35 
#>  triceps 
#>      227 

po = po("imputehist")
new_task = po$train(list(task = task))[[1]]
new_task$missings()
#> diabetes      age pedigree pregnant  glucose  insulin     mass pressure 
#>        0        0        0        0        0        0        0        0 
#>  triceps 
#>        0 

po$state$model
#> $age
#> $age$counts
#>  [1] 267 150  81  76  76  37  31  23  14  11   1   0   1
#> 
#> $age$breaks
#>  [1] 20 25 30 35 40 45 50 55 60 65 70 75 80 85
#> 
#> 
#> $glucose
#> $glucose$counts
#> [1]   4  38 167 205 157  91  60  41
#> 
#> $glucose$breaks
#> [1]  40  60  80 100 120 140 160 180 200
#> 
#> 
#> $insulin
#> $insulin$counts
#> [1] 151 158  48  17  11   6   1   1   1
#> 
#> $insulin$breaks
#>  [1]   0 100 200 300 400 500 600 700 800 900
#> 
#> 
#> $mass
#> $mass$counts
#>  [1]  14  98 180 221 148  61  27   5   2   0   1
#> 
#> $mass$breaks
#>  [1] 15 20 25 30 35 40 45 50 55 60 65 70
#> 
#> 
#> $pedigree
#> $pedigree$counts
#>  [1] 128 282 154  99  54  22  16   4   4   1   1   2   1
#> 
#> $pedigree$breaks
#>  [1] 0.0 0.2 0.4 0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6
#> 
#> 
#> $pregnant
#> $pregnant$counts
#> [1] 349 143 107  83  52  20  12   1   1
#> 
#> $pregnant$breaks
#>  [1]  0  2  4  6  8 10 12 14 16 18
#> 
#> 
#> $pressure
#> $pressure$counts
#>  [1]   3   2  24  94 217 228 127  25  11   1   1
#> 
#> $pressure$breaks
#>  [1]  20  30  40  50  60  70  80  90 100 110 120 130
#> 
#> 
#> $triceps
#> $triceps$counts
#>  [1]   9 115 179 164  65   7   1   0   0   1
#> 
#> $triceps$breaks
#>  [1]   0  10  20  30  40  50  60  70  80  90 100
#> 
#>