Can both undersample a Task, keeping only a fraction of the rows of the majority class,
and oversample it, repeating rows (data points) of the minority class.
Sampling happens only during the training phase. Class-balancing a Task by sampling may be
beneficial for classification with imbalanced training data.
PipeOpClassBalancing$new(id = "classbalancing", param_vals = list())
id :: character(1)
Identifier of the resulting object, default "classbalancing".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
The output during training is the input
Task with added or removed rows to balance target classes.
The output during prediction is the unchanged input.
The $state is a named list with the $state elements inherited from PipeOpTaskPreproc.
The parameters are the parameters inherited from
PipeOpTaskPreproc; however, the
affect_columns parameter is not present. Further parameters are:
ratio :: numeric(1)
Ratio of number of rows of classes to keep, relative to the $reference value. Initialized to 1.
reference :: character(1)
What the $ratio value is measured against. Can be
"all" (mean instance count of all classes),
"major" (instance count of class with most instances),
"minor" (instance count of class with fewest instances),
"nonmajor" (average instance count of all classes except the major one),
"nonminor" (average instance count of all classes except the minor one), and
"one" ($ratio determines the number of instances to have, per class). Initialized to "all".
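The reference options can be made concrete with a small base R sketch. The helper `reference_count()` is hypothetical and not part of mlr3pipelines; it only mirrors the descriptions above, and it assumes ties for the largest or smallest class are resolved by taking the first such class.

```r
# Hypothetical helper (not an mlr3pipelines function): returns the value
# that `ratio` is multiplied with, given per-class counts and `reference`.
reference_count = function(counts, reference) {
  major = names(counts)[which.max(counts)]  # first class with most rows
  minor = names(counts)[which.min(counts)]  # first class with fewest rows
  switch(reference,
    all = mean(counts),
    major = max(counts),
    minor = min(counts),
    nonmajor = mean(counts[names(counts) != major]),
    nonminor = mean(counts[names(counts) != minor]),
    one = 1,
    stop("unknown reference: ", reference)
  )
}

# Class counts of the "spam" task used in the example on this page
counts = c(spam = 1813, nonspam = 2788)
ratio = 2
target = ratio * reference_count(counts, "minor")
target
```

With `ratio = 2` and `reference = "minor"`, the result is twice the spam count, which agrees with the 3626 spam rows shown in the example on this page.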
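adjust :: character(1)
Which classes to up / downsample. Can be
"all" (up and downsample all to match required instance count),
"major", "minor", "nonmajor", "nonminor" (see respective values for $reference),
"upsample" (only upsample), and
"downsample" (only downsample). Initialized to "all".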
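As a rough illustration of how an adjust value selects classes, the following base R sketch maps each option to a set of class names. `adjusted_classes()` is a hypothetical helper, not part of the package. Treating "upsample" as "classes below the target count" and "downsample" as "classes above it" is an assumption based on the short descriptions above.

```r
# Hypothetical helper (not an mlr3pipelines function): which classes would
# be touched for a given `adjust` value and target class count.
adjusted_classes = function(counts, adjust, target_count) {
  major = names(counts)[which.max(counts)]
  minor = names(counts)[which.min(counts)]
  switch(adjust,
    all = names(counts),
    major = major,
    minor = minor,
    nonmajor = setdiff(names(counts), major),
    nonminor = setdiff(names(counts), minor),
    # assumption: only classes that would need rows added / removed
    upsample = names(counts)[counts < target_count],
    downsample = names(counts)[counts > target_count],
    stop("unknown adjust: ", adjust)
  )
}

counts = c(a = 10, b = 4, c = 1)
nonminor_classes = adjusted_classes(counts, "nonminor", target_count = 5)
upsampled_classes = adjusted_classes(counts, "upsample", target_count = 5)
```

Here `nonminor_classes` contains "a" and "b", and `upsampled_classes` contains "b" and "c".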
shuffle :: logical(1)
Whether to shuffle the rows of the resulting task. In case the data is upsampled and
shuffle = FALSE, the resulting task will have the original
rows (which were not removed in downsampling) in the original order, followed by all newly added rows
ordered by target class. Initialized to TRUE.
Up / downsampling happens as follows: At first, a "target class count" is calculated by taking the mean
class count of all classes indicated by the reference parameter (e.g. if reference is "nonmajor":
the mean class count of all classes that are not the "major" class, i.e. the class with the most samples)
and multiplying this with the value of the ratio parameter. If reference is "one", then the "target
class count" is just the value of ratio (i.e. 1 * ratio).
Then for each class that is referenced by the adjust parameter (e.g. if adjust is "nonminor":
each class that is not the class with the fewest samples), PipeOpClassBalancing either throws out
samples (downsampling), or adds additional rows that are equal to randomly chosen samples (upsampling),
until the number of samples for these classes equals the "target class count".
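The procedure described above can be sketched in base R on a plain vector of class labels. This is an illustration under stated assumptions, not the package's implementation: `balance_rows()` is a hypothetical function, upsampling is assumed to draw with replacement, every adjusted class is assumed to have at least one row, and the output order follows the shuffle = FALSE behaviour (kept rows first, then added rows grouped by class).

```r
# Hypothetical illustration (not an mlr3pipelines function): returns row
# indices after bringing each class in `adjust_classes` to `target_count`.
balance_rows = function(truth, target_count, adjust_classes) {
  keep = seq_along(truth)
  added = integer(0)
  for (cl in adjust_classes) {
    rows = which(truth == cl)
    n = length(rows)
    if (n > target_count) {
      # downsampling: drop randomly chosen surplus rows of this class
      drop = rows[sample.int(n, n - target_count)]
      keep = setdiff(keep, drop)
    } else if (n < target_count) {
      # upsampling: repeat randomly chosen rows of this class
      # (assumption: sampled with replacement)
      added = c(added, rows[sample.int(n, target_count - n, replace = TRUE)])
    }
  }
  # kept original rows in original order, followed by the duplicated rows
  c(keep, added)
}

set.seed(1)
truth = factor(rep(c("a", "b", "c"), times = c(10, 4, 1)))
idx = balance_rows(truth, target_count = 5, adjust_classes = levels(truth))
balanced_counts = table(truth[idx])
```

After the call, every class in `balanced_counts` has exactly 5 rows: class "a" lost 5 rows, class "b" gained 1 duplicate, and class "c" gained 4 duplicates of its single row.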
Uses task$filter() to remove rows. When identical rows are added during upsampling,
task$row_roles$use cannot be used
to duplicate rows because of [inaudible]; instead, the
task$rbind() function is used, and a new
data.table is attached that contains all rows that are being duplicated, exactly as many times as they are being added.
library("mlr3")
library("mlr3pipelines")  # provides po()

task = tsk("spam")
opb = po("classbalancing")

# target class counts
table(task$truth())
#>
#>    spam nonspam
#>    1813    2788

# double the instances in the minority class (spam)
opb$param_set$values = list(ratio = 2, reference = "minor",
  adjust = "minor", shuffle = FALSE)
result = opb$train(list(task))[[1L]]
table(result$truth())
#>
#>    spam nonspam
#>    3626    2788

# up or downsample all classes until exactly 20 per class remain
opb$param_set$values = list(ratio = 20, reference = "one",
  adjust = "all", shuffle = FALSE)
result = opb$train(list(task))[[1L]]
table(result$truth())
#>
#>    spam nonspam
#>      20      20