Interface to the vtreat Package

Provides an interface to the vtreat package.

PipeOpVtreat works for classification tasks and regression tasks. Internally, PipeOpVtreat follows the fit/prepare interface of vtreat: it first creates a data treatment transform object via vtreat::NumericOutcomeTreatment(), vtreat::BinomialOutcomeTreatment(), or vtreat::MultinomialOutcomeTreatment(), then calls vtreat::fit_prepare() on the training data and vtreat::prepare() during prediction.

Format

R6Class object inheriting from PipeOpTaskPreproc/PipeOp.

Construction

PipeOpVtreat$new(id = "vtreat", param_vals = list())

id :: character(1)
Identifier of resulting object, default "vtreat".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
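
As a minimal sketch of the constructor arguments described above (assuming the mlr3pipelines and vtreat packages are installed; the id "vtreat_all" is a made-up name for illustration), the following constructs the PipeOp with defaults, with overridden settings, and via the po() shorthand:

```r
library("mlr3pipelines")

# Default construction: the id is "vtreat" and recommended is initialized to TRUE.
pop_default <- PipeOpVtreat$new()
pop_default$id
pop_default$param_set$values$recommended

# Override the id and one hyperparameter at construction time.
# "vtreat_all" is a hypothetical id chosen for this sketch.
pop_all <- PipeOpVtreat$new(
  id = "vtreat_all",
  param_vals = list(recommended = FALSE)
)
pop_all$param_set$values$recommended

# Construction through the po() dictionary shorthand with the same setting.
pop_short <- po("vtreat", recommended = FALSE)
pop_short$param_set$values$recommended
```

Values passed through param_vals replace only the named settings; everything else keeps the value set during construction.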

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc. Instead of a Task, a TaskSupervised is used as input and output during training and prediction.

The output is the input Task with all affected features "prepared" by vtreat. If vtreat found "no usable vars", the input Task is returned unaltered.
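
To illustrate the channel behaviour described above, here is a small sketch that trains on one part of a task and predicts on another. The toy data, the 150/50 row split, and the task id "toy" are assumptions made up for this example:

```r
library("mlr3")
library("mlr3pipelines")

set.seed(1)
n <- 200
# Toy data (hypothetical): one numeric and one categorical feature.
d <- data.frame(
  x = rnorm(n),
  xc = sample(c("a", "b", "c"), n, replace = TRUE)
)
d$y <- 2 * d$x + (d$xc == "a") + 0.1 * rnorm(n)

task <- TaskRegr$new("toy", backend = d, target = "y")
train_task <- task$clone()$filter(1:150)
test_task <- task$clone()$filter(151:200)

pop <- PipeOpVtreat$new()

# Training takes a list with one supervised task and returns a list with one task.
train_out <- pop$train(list(train_task))[[1]]

# Prediction applies the treatment learned during training to new rows.
test_out <- pop$predict(list(test_task))[[1]]

train_out$feature_names
test_out$feature_names
```

The target column is carried through unchanged; only the features are replaced by the columns vtreat prepares.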

State

The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:

treatment_plan :: object of class vtreat_pipe_step | NULL
The treatment plan as constructed by vtreat based on the training data, i.e., an object of class treatment_plan. If vtreat found "no usable vars" and designing the treatment would have failed, this is NULL.
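
A short sketch of how the state might be inspected after training, with toy data invented for this example. Because treatment_plan can be NULL when vtreat finds no usable variables, the code checks for that case before using it:

```r
library("mlr3")
library("mlr3pipelines")

set.seed(1)
n <- 200
# Toy data (hypothetical) with a clear signal in both features.
d <- data.frame(
  x = rnorm(n),
  xc = sample(c("a", "b", "c"), n, replace = TRUE)
)
d$y <- 2 * d$x + (d$xc == "a") + 0.1 * rnorm(n)
task <- TaskRegr$new("toy", backend = d, target = "y")

pop <- PipeOpVtreat$new()
trained_before <- pop$is_trained  # FALSE: no state has been stored yet

invisible(pop$train(list(task)))

# The treatment plan is stored in the state after training.
plan <- pop$state$treatment_plan
if (is.null(plan)) {
  message("vtreat found no usable vars; the task was returned unaltered")
} else {
  print(class(plan))
}
```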

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

recommended :: logical(1)
Whether only the "recommended" prepared features should be returned, i.e., non-constant variables with a significance value smaller than vtreat's threshold. Initialized to TRUE.
cols_to_copy :: function | Selector
Selector function, takes a Task as argument and returns a character() of features to copy.
See Selector for example functions. Initialized to selector_none().
minFraction :: numeric(1)
Minimum frequency a categorical level must have to be converted to an indicator column.
smFactor :: numeric(1)
Smoothing factor for impact coding models.
rareCount :: integer(1)
Allow levels with this count or below to be pooled into a shared rare-level.
rareSig :: numeric(1)
Suppress levels from pooling at significance values greater than this.
collarProb :: numeric(1)
What fraction of the data (pseudo-probability) to collar data at if doCollar = TRUE.
doCollar :: logical(1)
If TRUE collar numeric variables by cutting off after a tail-probability specified by collarProb during treatment design.
codeRestriction :: character()
What types of variables to produce.
customCoders :: named list
Map from code names to custom categorical variable encoding functions.
splitFunction :: function
Function taking arguments nSplits, nRows, dframe, and y; returning a user desired split.
ncross :: integer(1)
Integer larger than one, number of cross-validation rounds to design.
forceSplit :: logical(1)
If TRUE force cross-validated significance calculations on all variables.
catScaling :: logical(1)
If TRUE use stats::glm() linkspace, if FALSE use stats::lm() for scaling.
verbose :: logical(1)
If TRUE print progress.
use_parallel :: logical(1)
If TRUE use parallel methods.
missingness_imputation :: function
Function of signature f(values: numeric, weights: numeric), simple missing value imputer.
Typically, an imputation via a PipeOp should be preferred, see PipeOpImpute.
pruneSig :: numeric(1)
Suppress variables with significance above this level. Only affects regression tasks (mlr3::TaskRegr) and binary classification tasks.
scale :: logical(1)
If TRUE replace numeric variables with single variable model regressions ("move to outcome-scale"). These have mean zero and (for variables with significance less than 1) slope 1 when regressed (lm for regression problems/glm for classification problems) against outcome.
varRestriction :: list()
List of treated variable names to restrict to. Only affects regression tasks (mlr3::TaskRegr) and binary classification tasks.
trackedValues :: named list()
Named list mapping variables to known values, allows warnings upon novel level appearances (see vtreat::track_values()). Only affects regression tasks (mlr3::TaskRegr) and binary classification tasks.
y_dependent_treatments :: character()
Character vector specifying which treatment types to build per outcome level. Only affects multiclass classification tasks.
imputation_map :: named list
Map from column names to functions of signature f(values: numeric, weights: numeric), simple missing value imputers.
Typically, an imputation via a PipeOp is to be preferred, see PipeOpImpute.

For more information, see vtreat::regression_parameters(), vtreat::classification_parameters(), or vtreat::multinomial_parameters().
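
The parameters listed above are exposed through the PipeOp's ParamSet. The sketch below shows one way to list and change them after construction; the concrete values (0.1 for minFraction and the column name "x2") are arbitrary choices for illustration:

```r
library("mlr3pipelines")

pop <- PipeOpVtreat$new()

# All parameter ids, including those inherited from PipeOpTaskPreproc.
param_ids <- pop$param_set$ids()
param_ids

# Return all prepared features, not only the "recommended" ones.
pop$param_set$values$recommended <- FALSE

# Copy the column "x2" (a hypothetical feature name) through to the output.
pop$param_set$values$cols_to_copy <- selector_name("x2")

# Require a higher level frequency before an indicator column is created.
pop$param_set$values$minFraction <- 0.1

# Inspect the values that are currently set.
pop$param_set$values
```

Parameters that are left unset fall back to the defaults vtreat itself uses.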

Internals

Follows vtreat's fit/prepare interface. See vtreat::NumericOutcomeTreatment(), vtreat::BinomialOutcomeTreatment(), vtreat::MultinomialOutcomeTreatment(), vtreat::fit_prepare() and vtreat::prepare().
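
For orientation, the fit/prepare sequence can be sketched directly against vtreat, outside of any PipeOp. This is an illustration under assumptions rather than the PipeOp's actual code: the toy data is invented, and the sketch assumes that vtreat::fit_prepare() returns a list with the elements treatments and cross_frame.

```r
library("vtreat")

set.seed(1)
n <- 200
# Toy data (hypothetical): one numeric and one categorical input.
d <- data.frame(
  x = rnorm(n),
  xc = sample(c("a", "b", "c"), n, replace = TRUE)
)
d$y <- 2 * d$x + (d$xc == "a") + 0.1 * rnorm(n)
d_train <- d[1:150, ]
d_test <- d[151:200, ]

# Step 1: create the data treatment transform object for a numeric outcome.
design <- vtreat::NumericOutcomeTreatment(
  var_list = c("x", "xc"),
  outcome_name = "y"
)

# Step 2: fit the treatment and prepare the training data in one
# cross-validated step (assumed to return treatments and cross_frame).
res <- vtreat::fit_prepare(design, d_train)
train_prepared <- res$cross_frame

# Step 3: apply the fitted treatment to new data.
test_prepared <- vtreat::prepare(res$treatments, d_test)

colnames(train_prepared)
```

For classification, vtreat::BinomialOutcomeTreatment() or vtreat::MultinomialOutcomeTreatment() would take the place of the numeric outcome treatment in step 1.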

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpTaskPreproc/PipeOp.

Examples

library("mlr3")
library("mlr3pipelines")

set.seed(2020)

make_data <- function(nrows) {
    d <- data.frame(x = 5 * rnorm(nrows))
    d["y"] = sin(d[["x"]]) + 0.01 * d[["x"]] + 0.1 * rnorm(nrows)
    d[4:10, "x"] = NA  # introduce NAs
    d["xc"] = paste0("level_", 5 * round(d$y / 5, 1))
    d["x2"] = rnorm(nrows)
    d[d["xc"] == "level_-1", "xc"] = NA  # introduce a NA level
    return(d)
}

task = TaskRegr$new("vtreat_regr", backend = make_data(100), target = "y")

pop = PipeOpVtreat$new()
pop$train(list(task))
#> $output
#> 
#> ── <TaskRegr> (100x8) ──────────────────────────────────────────────────────────
#> • Target: y
#> • Properties: -
#> • Features (7):
#>   • dbl (7): xc_catD, xc_catN, xc_catP, xc_lev_NA, xc_lev_x_level_0_5,
#>   xc_lev_x_level_1, xc_lev_x_level_minus_0_5
#>