Computes a bag-of-words representation from one or more columns. Columns of type character are split into words. Uses quanteda::dfm() and quanteda::dfm_trim() from the 'quanteda' package. The TF-IDF computation works similarly to quanteda::dfm_tfidf(), but has been adjusted for the train/test data split using quanteda::docfreq() and quanteda::dfm_weight().

In short:

  • By default, a bag-of-words representation is produced.

  • If n is set to values > 1, ngrams are computed.

  • If df_trim parameters are set, the bag-of-words is trimmed.

  • The scheme_tf parameter controls the term-frequency (per-document, i.e. per-row) weighting.

  • The scheme_df parameter controls the document-frequency (per-token, i.e. per-column) weighting.

The parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight. Which function a parameter belongs to can be read from its tags; parameters tagged tokenizer are arguments passed on to quanteda::dfm(). By default, the result is a bag-of-words representation with token counts as matrix entries.

In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse". The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
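As a minimal sketch of the setting described above, the following switches the document-frequency weighting from "unary" to "inverse". The text data and the name po_tfidf are made up for illustration; the feature names are assumed to follow the txt.<token> pattern seen in the Examples section.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Illustrative text column (not from the original documentation)
dt = data.table(
  txt = rep(c("apple banana", "banana cherry", "cherry apple apple"), 50)
)
task = tsk("iris")$cbind(dt)

# scheme_df = "inverse" requests dfm_tfidf-like weighting instead of raw counts
po_tfidf = po("textvectorizer",
  param_vals = list(stopwords_language = "en", scheme_df = "inverse"))

out = po_tfidf$train(list(task))[[1]]$data()
head(out[, c("txt.apple", "txt.banana", "txt.cherry")])
```

With the initial "unary" setting the same columns would hold plain token counts.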

The pipeop works as follows:

  1. Words are tokenized using quanteda::tokens().

  2. Ngrams are computed using quanteda::tokens_ngrams().

  3. A document-feature matrix is computed using quanteda::dfm().

  4. The document-feature matrix is trimmed using quanteda::dfm_trim() during train-time.

  5. The document-feature matrix is re-weighted (similar to quanteda::dfm_tfidf()) if scheme_df is not set to "unary".
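The steps above can be sketched directly with quanteda. This is an illustration under stated assumptions, not the pipeop's actual source: the toy texts, the trimming threshold, and the use of quanteda::dfm_match() to align prediction data with the training features are choices made for this example.

```r
library("quanteda")

# Toy data (hypothetical)
train_txt = c(d1 = "apple banana apple", d2 = "banana cherry", d3 = "cherry apple")
test_txt = c(t1 = "apple durian")

# Steps 1-3: tokenize, build uni- and bigrams, count into a document-feature matrix
toks = tokens(train_txt)
toks = tokens_ngrams(toks, n = 1:2)
dfm_train = dfm(toks)

# Step 4: trim rare features (train-time only); here every bigram occurs once
dfm_train = dfm_trim(dfm_train, min_termfreq = 2)

# Step 5: inverse document frequencies are estimated on the training data ...
idf = docfreq(dfm_train, scheme = "inverse")
tfidf_train = dfm_weight(dfm_train, weights = idf)

# ... and re-used at prediction time, so test documents do not influence the weights.
# dfm_match() (an assumption of this sketch) keeps only features seen in training.
dfm_test = dfm(tokens_ngrams(tokens(test_txt), n = 1:2))
dfm_test = dfm_match(dfm_test, features = featnames(dfm_train))
tfidf_test = dfm_weight(dfm_test, weights = idf)

as.matrix(tfidf_test)
```

Computing docfreq() once on the training matrix and passing it to dfm_weight() is what separates this from calling dfm_tfidf() on each data set independently.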

Format

R6Class object inheriting from PipeOpTaskPreproc/PipeOp.

Construction

PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())
  • id :: character(1)
    Identifier of resulting object, default "textvectorizer".

  • param_vals :: named list
    List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
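
A short sketch of the two usual ways to construct the object, assuming the mlr3pipelines package is attached. The id "tv" and the variable names op_a and op_b are arbitrary choices for this example.

```r
library("mlr3pipelines")

# Direct construction via the R6 generator
op_a = PipeOpTextVectorizer$new(id = "tv",
  param_vals = list(scheme_df = "inverse"))

# Equivalent construction via the po() shortcut
op_b = po("textvectorizer", id = "tv",
  param_vals = list(scheme_df = "inverse"))

op_a$id
op_a$param_set$values$scheme_df
```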

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output is the input Task with all affected features converted to a bag-of-words representation.

State

The $state is a list with element 'cols': A vector of extracted columns.

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

Internals

See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training, quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.
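
The following sketch illustrates the last point: a token that only occurs in the prediction data does not produce a new column. The text data and the train/test split are invented for this example, and the txt.<token> column naming is assumed from the Examples section.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# "durian" only appears in rows 101-150 (hypothetical data)
dt = data.table(
  txt = c(rep("apple banana", 100), rep("apple durian", 50))
)
task = tsk("iris")$cbind(dt)

# Clone before filtering, because $filter() modifies a task in place
train_task = task$clone()$filter(1:100)
test_task = task$clone()$filter(101:150)

op = po("textvectorizer", param_vals = list(stopwords_language = "en"))
train_out = op$train(list(train_task))[[1]]$data()
pred_out = op$predict(list(test_task))[[1]]$data()

# Prediction output has the training columns only; "durian" is dropped
setdiff(names(pred_out), names(train_out))
"txt.durian" %in% names(pred_out)
```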

Methods

Only methods inherited from PipeOpTaskPreproc/PipeOp.

See also

Examples

library("mlr3")
library("mlr3pipelines")  # provides po()
library("data.table")

# create some text data
dt = data.table(
  txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)

pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))

pos$train(list(task))[[1]]$data()
#>        Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.c txt.q
#>   1:    setosa          1.4         0.2          5.1         3.5     1     1
#>   2:    setosa          1.4         0.2          4.9         3.0     0     0
#>   3:    setosa          1.3         0.2          4.7         3.2     0     0
#>   4:    setosa          1.5         0.2          4.6         3.1     0     0
#>   5:    setosa          1.4         0.2          5.0         3.6     0     0
#>  ---
#> 146: virginica          5.2         2.3          6.7         3.0     0     0
#> 147: virginica          5.0         1.9          6.3         2.5     0     0
#> 148: virginica          5.2         2.0          6.5         3.0     0     0
#> 149: virginica          5.4         2.3          6.2         3.4     0     0
#> 150: virginica          5.1         1.8          5.9         3.0     0     0
#>      txt.f txt.g txt.b txt.v txt.e txt.u txt.y txt.j txt.w txt.z txt.t txt.d
#>   1:     1     0     0     0     0     0     0     0     0     0     0     0
#>   2:     0     1     1     0     0     0     0     0     0     0     0     0
#>   3:     0     0     0     1     1     1     0     0     0     0     0     0
#>   4:     0     0     0     0     1     0     1     1     0     0     0     0
#>   5:     0     0     0     0     0     0     0     0     1     0     0     0
#>  ---
#> 146:     0     0     0     0     0     0     0     0     0     0     0     1
#> 147:     0     0     0     0     0     0     1     0     0     0     0     0
#> 148:     0     1     0     0     0     0     0     0     0     0     0     0
#> 149:     0     0     0     1     0     0     0     0     0     0     0     1
#> 150:     0     0     1     0     0     0     0     0     0     0     0     0
#>      txt.h txt.p txt.n txt.o txt.r txt.k txt.s txt.l txt.x txt.m
#>   1:     0     0     0     0     0     0     0     0     0     0
#>   2:     0     0     0     0     0     0     0     0     0     0
#>   3:     0     0     0     0     0     0     0     0     0     0
#>   4:     0     0     0     0     0     0     0     0     0     0
#>   5:     0     0     0     0     0     0     0     0     0     0
#>  ---
#> 146:     0     1     0     0     0     0     0     0     1     0
#> 147:     0     0     0     0     1     0     0     0     0     0
#> 148:     0     1     0     0     0     0     0     0     0     0
#> 149:     0     0     1     0     0     0     0     0     0     0
#> 150:     1     0     0     0     0     0     1     0     0     0

one_line_of_iris = task$filter(13)

one_line_of_iris$data()
#>    Species Petal.Length Petal.Width Sepal.Length Sepal.Width   txt
#> 1:  setosa          1.4         0.1          4.8           3 n o i

pos$predict(list(one_line_of_iris))[[1]]$data()
#>    Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.c txt.q txt.f
#> 1:  setosa          1.4         0.1          4.8           3     0     0     0
#>    txt.g txt.b txt.v txt.e txt.u txt.y txt.j txt.w txt.z txt.t txt.d txt.h
#> 1:     0     0     0     0     0     0     0     0     0     0     0     0
#>    txt.p txt.n txt.o txt.r txt.k txt.s txt.l txt.x txt.m
#> 1:     0     1     1     0     0     0     0     0     0