
Generates a more balanced data set by down-sampling the instances of non-minority classes using the NEARMISS algorithm.

The algorithm down-samples by keeping those instances of the non-minority classes that have the smallest mean distance to their k nearest neighbors of different classes. Only numeric and integer features are taken into account for this, and they must not contain missing values.
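
The selection criterion described above can be illustrated with a small base R sketch. This is not the implementation used by themis::nearmiss; the helper `mean_knn_dist` and the toy data are hypothetical and only meant to show the idea for a two-class problem with Euclidean distances.

```r
# Hypothetical helper (not part of themis or mlr3pipelines): for each row of
# `x_from`, compute the mean Euclidean distance to its k nearest rows in `x_to`.
mean_knn_dist = function(x_from, x_to, k) {
  apply(x_from, 1, function(row) {
    d = sqrt(colSums((t(x_to) - row)^2))
    mean(sort(d)[seq_len(min(k, length(d)))])
  })
}

# Toy data: 6 instances of the majority class "a", 2 of the minority class "b"
x = matrix(
  c(0, 0,  1, 0,  0, 1,  5, 5,  6, 5,  5, 6,  1, 1,  0.5, 0.5),
  ncol = 2, byrow = TRUE
)
y = factor(c(rep("a", 6), rep("b", 2)))

is_major = y == "a"
score = mean_knn_dist(x[is_major, , drop = FALSE], x[!is_major, , drop = FALSE], k = 2)

# Keep as many majority rows as there are minority rows (a ratio of 1),
# choosing those with the smallest mean distance to the other class
n_keep = sum(!is_major)
keep_major = which(is_major)[order(score)[seq_len(n_keep)]]
keep = sort(c(keep_major, which(!is_major)))
table(y[keep])
```

In this sketch the majority instances far from the minority class (around (5, 5)) are dropped, while those closest to it are kept.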

This can only be applied to classification tasks. Multiclass classification is supported.

See themis::nearmiss for details.

Format

R6Class object inheriting from PipeOpTaskPreproc/PipeOp.

Construction

PipeOpNearmiss$new(id = "nearmiss", param_vals = list())

  • id :: character(1)
    Identifier of resulting object, default "nearmiss".

  • param_vals :: named list
    List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
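
As a brief illustration of the constructor arguments, the following sketch builds the PipeOp once through the `po()` shortcut and once directly. The id `"nm_k3"` and the value `k = 3` are arbitrary choices made for this example.

```r
library("mlr3pipelines")

# Default construction via the dictionary shortcut
pop_default = po("nearmiss")

# Direct construction with a custom id and a hyperparameter override
pop_custom = PipeOpNearmiss$new(id = "nm_k3", param_vals = list(k = 3))

pop_custom$id
pop_custom$param_set$values$k
```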

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output during training is the input Task with rows removed from the non-minority classes. The output during prediction is the unchanged input.
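
The difference between the two phases can be checked with a short sketch, assuming the themis package is installed. Training should return a task with fewer rows, while prediction should leave the row count unchanged.

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("wine")
pop = po("nearmiss")

# Training removes rows from the non-minority classes
train_out = pop$train(list(task))[[1]]

# Prediction passes the task through unchanged
predict_out = pop$predict(list(task))[[1]]

c(input = task$nrow, train = train_out$nrow, predict = predict_out$nrow)
```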

State

The $state is a named list with the $state elements inherited from PipeOpTaskPreproc.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as

  • k :: integer(1)
    Number of nearest neighbors used for calculating the mean distances. Default is 5.

  • under_ratio :: numeric(1)
    Ratio to which the non-minority classes are down-sampled, expressed as the number of instances kept in each non-minority class relative to the number of instances of the minority class. Default is 1, meaning every non-minority class is reduced to the size of the minority class. For details, see themis::nearmiss.
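
A minimal sketch of setting both hyperparameters, assuming the themis package is installed. The values `k = 3` and `under_ratio = 1.25` are arbitrary and chosen only for illustration; with the wine task (minority class size 48), a ratio of 1.25 is expected to cap the other classes at roughly 60 instances, although the exact handling is up to themis::nearmiss.

```r
library("mlr3")
library("mlr3pipelines")

task = tsk("wine")

# Hyperparameters can be passed directly to po()
pop = po("nearmiss", k = 3, under_ratio = 1.25)
result = pop$train(list(task))[[1]]$data()

# Class frequencies after down-sampling
counts = table(result$type)
counts
```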

Fields

Only fields inherited from PipeOpTaskPreproc/PipeOp.

Methods

Only methods inherited from PipeOpTaskPreproc/PipeOp.

References

Zhang, J., Mani, I. (2003). “KNN Approach to Unbalanced Data Distributions: A Case Study Involving Information Extraction.” In Proceedings of Workshop on Learning from Imbalanced Datasets (ICML).

See also

https://mlr-org.com/pipeops.html

Other PipeOps: PipeOp, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_textvectorizer, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson

Examples

library("mlr3")
library("mlr3pipelines")  # provides po() and PipeOpNearmiss

# Create example task
task = tsk("wine")
task$head()
#>      type alcalinity alcohol   ash color dilution flavanoids   hue magnesium
#>    <fctr>      <num>   <num> <num> <num>    <num>      <num> <num>     <int>
#> 1:      1       15.6   14.23  2.43  5.64     3.92       3.06  1.04       127
#> 2:      1       11.2   13.20  2.14  4.38     3.40       2.76  1.05       100
#> 3:      1       18.6   13.16  2.67  5.68     3.17       3.24  1.03       101
#> 4:      1       16.8   14.37  2.50  7.80     3.45       3.49  0.86       113
#> 5:      1       21.0   13.24  2.87  4.32     2.93       2.69  1.04       118
#> 6:      1       15.2   14.20  2.45  6.75     2.85       3.39  1.05       112
#>    malic nonflavanoids phenols proanthocyanins proline
#>    <num>         <num>   <num>           <num>   <int>
#> 1:  1.71          0.28    2.80            2.29    1065
#> 2:  1.78          0.26    2.65            1.28    1050
#> 3:  2.36          0.30    2.80            2.81    1185
#> 4:  1.95          0.24    3.85            2.18    1480
#> 5:  2.59          0.39    2.80            1.82     735
#> 6:  1.76          0.34    3.27            1.97    1450
table(task$data(cols = "type"))
#> type
#>  1  2  3 
#> 59 71 48 

# Down-sample and balance data
pop = po("nearmiss")
nearmiss_result = pop$train(list(task))[[1]]$data()
nrow(nearmiss_result)
#> [1] 144
table(nearmiss_result$type)
#> 
#>  1  2  3 
#> 48 48 48