Computes a bag-of-words representation from one or more columns. Columns of type character are split into words. Uses quanteda::dfm() and quanteda::dfm_trim() from the 'quanteda' package. The TF-IDF computation works similarly to quanteda::dfm_tfidf() but has been adjusted for a train/test data split using quanteda::docfreq() and quanteda::dfm_weight().

In short:

  • By default, produces a bag-of-words representation.

  • If n is set to values > 1, ngrams are computed.

  • If df_trim parameters are set, the bag-of-words is trimmed.

  • The scheme_tf parameter controls term-frequency (per-document, i.e. per-row) weighting

  • The scheme_df parameter controls the document-frequency (per token, i.e. per-column) weighting.

Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight. Which parameter belongs to which function can be read from each parameter's tags; for example, parameters tagged tokenizer are arguments passed on to quanteda::dfm(). Defaults to a bag-of-words representation with token counts as matrix entries.
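The tags mentioned above can be inspected directly on the parameter set. This is a minimal sketch that assumes mlr3pipelines (together with quanteda and stopwords) is installed; the object name pop is chosen here for illustration, and the tag strings that are printed depend on the installed package version.

```r
library("mlr3pipelines")

# Construct the PipeOp; no data is needed to inspect its parameters.
pop = po("textvectorizer")

# Named list with one character vector of tags per parameter.
tags = pop$param_set$tags

# Print the tags of a few parameters from different processing stages.
print(tags[c("what", "n", "min_termfreq", "scheme_df", "scheme_tf")])
```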

In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse". The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
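To make the train/test adjustment concrete, the following base R sketch computes inverse document frequencies on training counts only and reuses them for a new document. It assumes the common definition idf = log10(N / df), which is what quanteda::docfreq documents for scheme = "inverse" with base 10 and no smoothing; the matrices and variable names are invented for illustration and this is not the package's own code.

```r
# Toy token-count matrices (rows = documents, columns = tokens).
train_counts = matrix(
  c(1, 0, 2,
    0, 1, 1,
    1, 1, 0,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("apple", "pear", "plum"))
)

# Document frequency: number of training documents containing each token.
df_train = colSums(train_counts > 0)

# Inverse document frequency, estimated on the training data only.
idf_train = log10(nrow(train_counts) / df_train)

# Training features: term counts scaled column-wise by the training idf.
train_tfidf = sweep(train_counts, 2, idf_train, `*`)

# A new document at prediction time reuses idf_train instead of
# recomputing document frequencies on the new data.
test_counts = matrix(c(1, 0, 1), nrow = 1,
  dimnames = list(NULL, c("apple", "pear", "plum")))
test_tfidf = sweep(test_counts, 2, idf_train, `*`)
```

Reusing the training weights keeps the feature scale identical between training and prediction, which recomputing the weights on each new batch would not.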

The pipeop works as follows:

  1. Text is tokenized into words using quanteda::tokens.

  2. Ngrams are computed using quanteda::tokens_ngrams.

  3. A document-feature matrix is computed using quanteda::dfm.

  4. The document-feature matrix is trimmed using quanteda::dfm_trim during train-time.

  5. The document-feature matrix is re-weighted (similar to quanteda::dfm_tfidf) if scheme_df is not set to "unary".
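The five steps can be sketched with plain quanteda calls. This is an illustration under assumptions, not the PipeOp's actual implementation: the toy texts, the variable names, the use of sweep() to apply the weights, and the use of quanteda::dfm_match() to align new documents with the training features are choices made here.

```r
library("quanteda")

train_txt = c(d1 = "the cat sat", d2 = "the dog sat", d3 = "a cat ran")
new_txt = c(n1 = "the cat barked")

# 1. Tokenize into words.
toks = tokens(train_txt, what = "word")

# 2. Build unigrams and bigrams.
toks = tokens_ngrams(toks, n = 1:2)

# 3. Document-feature matrix of token counts.
dfm_train = dfm(toks, tolower = TRUE)

# 4. Trim rare features (training only): keep features occurring at least twice.
dfm_train = dfm_trim(dfm_train, min_termfreq = 2, termfreq_type = "count")

# 5. Re-weight: term-frequency weighting times inverse document frequency.
idf = docfreq(dfm_train, scheme = "inverse", base = 10)
tf = dfm_weight(dfm_train, scheme = "count")
train_weighted = sweep(as.matrix(tf), 2, idf, `*`)

# Prediction: same tokenization, then align columns with the training features
# and reuse the idf weights estimated during training.
dfm_new = dfm(tokens_ngrams(tokens(new_txt, what = "word"), n = 1:2), tolower = TRUE)
dfm_new = dfm_match(dfm_new, features = featnames(dfm_train))
new_weighted = sweep(as.matrix(dfm_weight(dfm_new, scheme = "count")), 2, idf, `*`)
```

In this sketch the token "barked" does not appear among the training features, so it has no column in new_weighted.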

Format

R6Class object inheriting from PipeOpTaskPreproc/PipeOp.

Construction

PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())

  • id :: character(1)
    Identifier of resulting object, default "textvectorizer".

  • param_vals :: named list
    List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
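As a brief illustration of the constructor arguments, the sketch below builds the PipeOp once through the R6 generator and once through the po() helper used in the Examples section. The id "tv" and the chosen hyperparameter values are arbitrary, and the code assumes mlr3pipelines (with quanteda and stopwords) is installed.

```r
library("mlr3pipelines")

# Direct construction, overriding two hyperparameters at construction time.
pop = PipeOpTextVectorizer$new(
  id = "tv",
  param_vals = list(n = 1:2, scheme_df = "inverse")
)

# The same object requested through the po() dictionary helper.
pop2 = po("textvectorizer", id = "tv",
  param_vals = list(n = 1:2, scheme_df = "inverse"))

# Values given in param_vals are stored in the parameter set.
pop$id
pop$param_set$values$scheme_df
```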

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output is the input Task with all affected features converted to a bag-of-words representation.

State

The $state is a list with element 'cols': A vector of extracted columns.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

  • return_type :: character(1)
    Whether to return an integer representation ("integer_sequence"), a factor representation ("factor_sequence") or a bag-of-words ("bow"). If set to "integer_sequence", tokens are replaced by an integer and padded/truncated to sequence_length. If set to "factor_sequence", tokens are replaced by a factor and padded/truncated to sequence_length. If set to "bow", a possibly weighted bag-of-words matrix is returned. Defaults to "bow".

  • stopwords_language :: character(1)
    Language to use for stopword filtering. Needs to be either "none", a language identifier listed in stopwords::stopwords_getlanguages("snowball") ("de", "en", ...) or "smart". "none" disables language-specific stopwords. "smart" corresponds to stopwords::stopwords(source = "smart"), which contains English stopwords and also removes one-character strings. Initialized to "smart".

  • extra_stopwords :: character
    Extra stopwords to remove. Must be a character vector containing individual tokens to remove. Initialized to character(0). When n is set to values greater than 1, this can also contain stop-ngrams.

  • tolower :: logical(1)
    Convert to lower case? See quanteda::dfm. Default: TRUE.

  • stem :: logical(1)
    Perform stemming? See quanteda::dfm. Default: FALSE.

  • what :: character(1)
    Tokenization splitter. See quanteda::tokens. Default: word.

  • remove_punct :: logical(1)
    See quanteda::tokens. Default: FALSE.

  • remove_url :: logical(1)
    See quanteda::tokens. Default: FALSE.

  • remove_symbols :: logical(1)
    See quanteda::tokens. Default: FALSE.

  • remove_numbers :: logical(1)
    See quanteda::tokens. Default: FALSE.

  • remove_separators :: logical(1)
    See quanteda::tokens. Default: TRUE.

  • split_hyphens :: logical(1)
    See quanteda::tokens. Default: FALSE.

  • n :: integer
    Vector of ngram lengths. See quanteda::tokens_ngrams. Initialized to 1, deviating from the base function's default. Note that this can be a vector of multiple values, to construct ngrams of multiple orders.

  • skip :: integer
    Vector of skips. See quanteda::tokens_ngrams. Default: 0. Note that this can be a vector of multiple values.

  • sparsity :: numeric(1)
    Desired sparsity of the 'tfm' matrix. See quanteda::dfm_trim. Default: NULL.

  • max_termfreq :: numeric(1)
    Maximum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default: NULL.

  • min_termfreq :: numeric(1)
    Minimum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default: NULL.

  • termfreq_type :: character(1)
    How to assess term frequency. See quanteda::dfm_trim. Default: "count".

  • scheme_df :: character(1)
    Weighting scheme for document frequency: See quanteda::docfreq. Initialized to "unary" (1 for each document, deviating from base function default).

  • smoothing_df :: numeric(1)
    See quanteda::docfreq. Default: 0.

  • k_df :: numeric(1)
    k parameter given to quanteda::docfreq (see there). Default is 0.

  • threshold_df :: numeric(1)
    See quanteda::docfreq. Default: 0. Only considered for scheme_df = "count".

  • base_df :: numeric(1)
    The base for logarithms in quanteda::docfreq (see there). Default: 10.

  • scheme_tf :: character(1)
    Weighting scheme for term frequency: See quanteda::dfm_weight. Default: "count".

  • k_tf :: numeric(1)
    k parameter given to quanteda::dfm_weight (see there). Default: 0.5.

  • base_tf :: numeric(1)
    The base for logarithms in quanteda::dfm_weight (see there). Default: 10.

  • sequence_length :: integer(1)
    The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant for return_type = "integer_sequence".
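The "integer_sequence" return type together with sequence_length can be pictured with a small base R sketch. It is a conceptual illustration only: the helper encode_sequence is hypothetical, and padding with 0 and numbering tokens from 1 in order of first appearance are assumptions made here rather than documented package behaviour.

```r
# Toy documents that are already split into tokens.
train_tokens = list(c("red", "apple"), c("green", "apple", "pie"))
new_tokens = list(c("red", "pie", "crust", "recipe"))

# Vocabulary: each distinct training token gets an integer index starting at 1.
vocab = unique(unlist(train_tokens))
vocab_index = setNames(seq_along(vocab), vocab)

# Hypothetical helper: map tokens to indices, then pad with 0 or truncate.
encode_sequence = function(tokens, vocab_index, sequence_length) {
  ids = unname(vocab_index[tokens])
  ids = ids[!is.na(ids)]                   # drop tokens not seen in training
  ids = c(ids, rep(0L, sequence_length))   # pad on the right with 0
  ids[seq_len(sequence_length)]            # truncate to the requested length
}

# One row per document, one column per sequence position.
train_seq = do.call(rbind, lapply(train_tokens, encode_sequence,
  vocab_index = vocab_index, sequence_length = 3))
new_seq = do.call(rbind, lapply(new_tokens, encode_sequence,
  vocab_index = vocab_index, sequence_length = 3))
```

Every document ends up with exactly sequence_length entries, which is the fixed-width shape that sequence models typically expect.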

Internals

See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training, quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.

Methods

Only methods inherited from PipeOpTaskPreproc/PipeOp.

See also

https://mlr-org.com/pipeops.html

Other PipeOps: PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreprocSimple, PipeOpTaskPreproc, PipeOp, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encode, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_scale, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_threshold, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson, mlr_pipeops

Examples

library("mlr3")
library("mlr3pipelines")
library("data.table")
# create some text data
dt = data.table(
  txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)

pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))

pos$train(list(task))[[1]]$data()
#> 'as(<dgCMatrix>, "dgTMatrix")' is deprecated.
#> Use 'as(., "TsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
#>        Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.d txt.n
#>         <fctr>        <num>       <num>        <num>       <num> <num> <num>
#>   1:    setosa          1.4         0.2          5.1         3.5     1     1
#>   2:    setosa          1.4         0.2          4.9         3.0     1     0
#>   3:    setosa          1.3         0.2          4.7         3.2     0     0
#>   4:    setosa          1.5         0.2          4.6         3.1     0     1
#>   5:    setosa          1.4         0.2          5.0         3.6     0     0
#>  ---                                                                        
#> 146: virginica          5.2         2.3          6.7         3.0     0     1
#> 147: virginica          5.0         1.9          6.3         2.5     1     1
#> 148: virginica          5.2         2.0          6.5         3.0     0     0
#> 149: virginica          5.4         2.3          6.2         3.4     0     1
#> 150: virginica          5.1         1.8          5.9         3.0     0     0
#>      txt.e txt.x txt.y txt.p txt.j txt.l txt.h txt.v txt.m txt.o txt.r txt.f
#>      <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>   1:     1     0     0     0     0     0     0     0     0     0     0     0
#>   2:     1     1     0     0     0     0     0     0     0     0     0     0
#>   3:     1     0     1     1     0     0     0     0     0     0     0     0
#>   4:     1     0     0     0     1     0     0     0     0     0     0     0
#>   5:     0     1     0     0     0     1     1     0     0     0     0     0
#>  ---                                                                        
#> 146:     0     0     0     0     0     0     1     0     0     0     1     0
#> 147:     0     0     0     0     0     0     0     0     0     1     0     0
#> 148:     0     1     0     0     0     0     0     0     0     0     0     0
#> 149:     0     0     0     0     0     0     0     0     0     1     0     0
#> 150:     0     0     0     0     1     0     0     0     0     0     1     0
#>      txt.k txt.t txt.s txt.u txt.b txt.z txt.c txt.q txt.g txt.w
#>      <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>   1:     0     0     0     0     0     0     0     0     0     0
#>   2:     0     0     0     0     0     0     0     0     0     0
#>   3:     0     0     0     0     0     0     0     0     0     0
#>   4:     0     0     0     0     0     0     0     0     0     0
#>   5:     0     0     0     0     0     0     0     0     0     0
#>  ---                                                            
#> 146:     0     0     0     0     0     0     0     0     0     0
#> 147:     0     0     0     0     0     0     0     0     0     0
#> 148:     0     0     0     0     1     0     0     0     0     1
#> 149:     0     0     0     1     0     0     0     0     0     0
#> 150:     0     0     0     0     0     0     1     0     0     0

one_line_of_iris = task$filter(13)

one_line_of_iris$data()
#>    Species Petal.Length Petal.Width Sepal.Length Sepal.Width    txt
#>     <fctr>        <num>       <num>        <num>       <num> <char>
#> 1:  setosa          1.4         0.1          4.8           3  i k f

pos$predict(list(one_line_of_iris))[[1]]$data()
#>    Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.d txt.n txt.e
#>     <fctr>        <num>       <num>        <num>       <num> <num> <num> <num>
#> 1:  setosa          1.4         0.1          4.8           3     0     0     0
#>    txt.x txt.y txt.p txt.j txt.l txt.h txt.v txt.m txt.o txt.r txt.f txt.k
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:     0     0     0     0     0     0     0     0     0     0     1     1
#>    txt.t txt.s txt.u txt.b txt.z txt.c txt.q txt.g txt.w
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:     0     0     0     0     0     0     0     0     0