Bag-of-Words Representation of Character Features

Computes a bag-of-words representation from one or more columns. Columns of type character are split into words. Uses the quanteda::dfm() and quanteda::dfm_trim() functions. The TF-IDF computation works similarly to quanteda::dfm_tfidf(), but has been adjusted for the train/test data split by using quanteda::docfreq() and quanteda::dfm_weight().

In short:

By default, a bag-of-words representation is produced.
If n is set to values > 1, ngrams are computed
If df_trim parameters are set, the bag-of-words is trimmed.
The scheme_tf parameter controls term-frequency (per-document, i.e. per-row) weighting
The scheme_df parameter controls the document-frequency (per token, i.e. per-column) weighting.

Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight. Which parameter belongs to which function can be read from each parameter's tags; parameters tagged tokenizer are arguments passed on to quanteda::dfm(). The default is a bag-of-words representation with token counts as matrix entries.

In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse". The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
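
As a minimal sketch of the difference between the two settings, the following trains one PipeOp with the initial "unary" scheme and one with "inverse". The toy texts and the object names (dt, po_count, po_tfidf) are made up for illustration; stopword filtering is turned off so that the short example words are kept.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical toy corpus: 150 short texts, one per row of iris
dt = data.table(
  txt = rep(c("red apple", "green apple", "red pear"), length.out = 150)
)
task = tsk("iris")$cbind(dt)

# Plain token counts: scheme_df stays at its initial value "unary"
po_count = po("textvectorizer", param_vals = list(stopwords_language = "none"))

# Counts re-weighted by inverse document frequency
po_tfidf = po("textvectorizer", param_vals = list(
  stopwords_language = "none",
  scheme_df = "inverse"
))

out_count = po_count$train(list(task))[[1]]$data()
out_tfidf = po_tfidf$train(list(task))[[1]]$data()

# Both outputs should contain the same token columns; only the weights differ
setequal(names(out_count), names(out_tfidf))
```

Comparing a token column such as txt.apple between out_count and out_tfidf shows the effect of the weighting.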

The PipeOp works as follows:

Words are tokenized using quanteda::tokens.
Ngrams are computed using quanteda::tokens_ngrams.
A document-feature matrix is computed using quanteda::dfm.
The document-feature matrix is trimmed using quanteda::dfm_trim during train-time.
The document-feature matrix is re-weighted (similar to quanteda::dfm_tfidf) if scheme_df is not set to "unary".
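
The steps above can be sketched directly with quanteda. This is an illustration under assumptions, not the PipeOp's actual implementation: the texts and variable names are invented, and the use of quanteda::dfm_match() and sweep() to reuse training-time features and document frequencies at prediction time is one possible way to get the train/test behaviour described here.

```r
library("quanteda")

# Hypothetical training and prediction texts
train_txt = c(d1 = "the cat sat", d2 = "the dog sat", d3 = "the cat ran")
test_txt = c(d4 = "the bird sat")

# Steps 1-3: tokenize, build n-grams (n = 1 keeps single words), count tokens
toks = tokens(train_txt, what = "word")
toks = tokens_ngrams(toks, n = 1L)
dfm_train = dfm(toks, tolower = TRUE)

# Step 4: trimming is applied on the training data only
dfm_train = dfm_trim(dfm_train, min_termfreq = 1, termfreq_type = "count")

# Step 5: document frequencies are estimated once, on the training data
idf = docfreq(dfm_train, scheme = "inverse", base = 10)
tf_train = dfm_weight(dfm_train, scheme = "count")
tfidf_train = sweep(as.matrix(tf_train), 2, idf, "*")

# Prediction: align features with the training vocabulary (unseen tokens such
# as "bird" are dropped), then reuse the stored document frequencies
dfm_test = dfm(tokens(test_txt, what = "word"), tolower = TRUE)
dfm_test = dfm_match(dfm_test, features = featnames(dfm_train))
tfidf_test = sweep(as.matrix(dfm_test), 2, idf, "*")

colnames(tfidf_test)
```

Computing the document frequencies on the training data and reusing them for new data is what distinguishes this from calling quanteda::dfm_tfidf() separately on each data set.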

Format

R6Class object inheriting from PipeOpTaskPreproc/PipeOp.

Construction

PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())

id :: character(1)
Identifier of resulting object, default "textvectorizer".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().

Input and Output Channels

Input and output channels are inherited from PipeOpTaskPreproc.

The output is the input Task with all affected features converted to a bag-of-words representation.

State

The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:

colmodels :: named list
Named list with one entry per extracted column. Each entry has two further elements:
- tdm: sparse document-feature matrix resulting from quanteda::dfm()
- docfreq: (weighted) document frequency resulting from quanteda::docfreq()
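
A short sketch of how the state could be inspected after training. The toy column txt and the object names are assumptions made for this example.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical text column attached to iris
dt = data.table(
  txt = rep(c("red apple", "green apple", "red pear"), length.out = 150)
)
task = tsk("iris")$cbind(dt)

pos = po("textvectorizer", param_vals = list(stopwords_language = "none"))
pos$train(list(task))

# One entry per text column that was converted
names(pos$state$colmodels)

# The elements stored for the column "txt"
names(pos$state$colmodels$txt)
```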

Parameters

The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

return_type :: character(1)
Whether to return an integer representation ("integer_sequence"), a factor representation ("factor_sequence"), or a bag-of-words ("bow"). If set to "integer_sequence", tokens are replaced by an integer and padded/truncated to sequence_length. If set to "factor_sequence", tokens are replaced by a factor and padded/truncated to sequence_length. If set to "bow", a possibly weighted bag-of-words matrix is returned. Defaults to "bow".
stopwords_language :: character(1)
Language to use for stopword filtering. Needs to be either "none", a language identifier listed in stopwords::stopwords_getlanguages("snowball") ("de", "en", ...) or "smart". "none" disables language-specific stopwords. "smart" corresponds to stopwords::stopwords(source = "smart"), which contains English stopwords and also removes one-character strings. Initialized to "smart".
extra_stopwords :: character
Extra stopwords to remove. Must be a character vector containing individual tokens to remove. When n is set to values greater than 1, this can also contain stop-ngrams. Initialized to character(0).
tolower :: logical(1)
Whether to convert to lower case. See quanteda::dfm. Default is TRUE.
stem :: logical(1)
Whether to perform stemming. See quanteda::dfm. Default is FALSE.
what :: character(1)
Tokenization splitter. See quanteda::tokens. Default is "word".
remove_punct :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_url :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_symbols :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_numbers :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_separators :: logical(1)
See quanteda::tokens. Default is TRUE.
split_hyphens :: logical(1)
See quanteda::tokens. Default is FALSE.
n :: integer
Vector of ngram lengths. See quanteda::tokens_ngrams. Initialized to 1, deviating from the base function's default. Note that this can be a vector of multiple values, to construct ngrams of multiple orders.
skip :: integer
Vector of skips. See quanteda::tokens_ngrams. Default is 0. Note that this can be a vector of multiple values.
sparsity :: numeric(1)
Desired sparsity of the 'tfm' matrix. See quanteda::dfm_trim. Default is NULL.
max_termfreq :: numeric(1)
Maximum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default is NULL.
min_termfreq :: numeric(1)
Minimum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default is NULL.
termfreq_type :: character(1)
How to assess term frequency. See quanteda::dfm_trim. Default is "count".
scheme_df :: character(1)
Weighting scheme for document frequency: See quanteda::docfreq. Initialized to "unary" (1 for each document, deviating from base function default).
smoothing_df :: numeric(1)
See quanteda::docfreq. Default is 0.
k_df :: numeric(1)
k parameter given to quanteda::docfreq (see there). Default is 0.
threshold_df :: numeric(1)
See quanteda::docfreq. Default is 0. Only considered if scheme_df is set to "count".
base_df :: numeric(1)
The base for logarithms in quanteda::docfreq (see there). Default is 10.
scheme_tf :: character(1)
Weighting scheme for term frequency: See quanteda::dfm_weight. Default is "count".
k_tf :: numeric(1)
k parameter given to quanteda::dfm_weight (see there). Default is 0.5.
base_tf :: numeric(1)
The base for logarithms in quanteda::dfm_weight (see there). Default is 10.
sequence_length :: integer(1)
The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant if return_type is set to "integer_sequence".
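
As an illustration of the n parameter, the sketch below compares unigrams only against unigrams plus bigrams. The toy texts and object names are invented for this example, and stopword filtering is disabled to keep the vocabulary easy to follow.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical text column attached to iris
dt = data.table(
  txt = rep(c("red apple", "green apple", "red pear"), length.out = 150)
)
task = tsk("iris")$cbind(dt)

# Unigrams only (n is initialized to 1)
po_uni = po("textvectorizer", param_vals = list(stopwords_language = "none"))

# Unigrams and bigrams: n may be a vector of several n-gram lengths
po_ngram = po("textvectorizer", param_vals = list(
  stopwords_language = "none",
  n = c(1L, 2L)
))

out_uni = po_uni$train(list(task))[[1]]$data()
out_ngram = po_ngram$train(list(task))[[1]]$data()

# Names of the generated text feature columns
grep("^txt\\.", names(out_uni), value = TRUE)
grep("^txt\\.", names(out_ngram), value = TRUE)
```

Adding bigrams increases the number of generated columns, which is where the dfm_trim parameters (sparsity, min_termfreq, max_termfreq) become useful.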

Internals

See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training, quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.

Fields

Only fields inherited from PipeOp.

Methods

Only methods inherited from PipeOpTaskPreproc/PipeOp.

Examples

library("mlr3")
library("data.table")
# create some text data
dt = data.table(
  txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)

pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))

pos$train(list(task))[[1]]$data()
#>        Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.n txt.r
#>         <fctr>        <num>       <num>        <num>       <num> <num> <num>
#>   1:    setosa          1.4         0.2          5.1         3.5     1     1
#>   2:    setosa          1.4         0.2          4.9         3.0     0     0
#>   3:    setosa          1.3         0.2          4.7         3.2     0     0
#>   4:    setosa          1.5         0.2          4.6         3.1     1     0
#>   5:    setosa          1.4         0.2          5.0         3.6     0     0
#>  ---                                                                        
#> 146: virginica          5.2         2.3          6.7         3.0     0     1
#> 147: virginica          5.0         1.9          6.3         2.5     0     1
#> 148: virginica          5.2         2.0          6.5         3.0     0     0
#> 149: virginica          5.4         2.3          6.2         3.4     0     0
#> 150: virginica          5.1         1.8          5.9         3.0     0     0
#>      txt.w txt.o txt.z txt.c txt.m txt.s txt.k txt.p txt.l txt.b txt.t txt.f
#>      <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>   1:     1     0     0     0     0     0     0     0     0     0     0     0
#>   2:     0     1     1     1     0     0     0     0     0     0     0     0
#>   3:     0     0     0     0     1     1     1     0     0     0     0     0
#>   4:     0     0     1     0     0     0     0     1     0     0     0     0
#>   5:     0     0     0     1     0     0     0     0     1     1     0     0
#>  ---                                                                        
#> 146:     0     0     0     0     0     0     0     0     0     0     0     1
#> 147:     0     0     0     0     0     0     0     0     0     0     0     0
#> 148:     0     0     0     1     0     0     0     1     0     0     0     0
#> 149:     1     0     0     0     0     0     0     0     0     0     0     0
#> 150:     0     0     0     0     0     0     0     1     0     0     0     0
#>      txt.y txt.g txt.u txt.j txt.d txt.x txt.e txt.v txt.q txt.h
#>      <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#>   1:     0     0     0     0     0     0     0     0     0     0
#>   2:     0     0     0     0     0     0     0     0     0     0
#>   3:     0     0     0     0     0     0     0     0     0     0
#>   4:     0     0     0     0     0     0     0     0     0     0
#>   5:     0     0     0     0     0     0     0     0     0     0
#>  ---                                                            
#> 146:     0     0     0     0     0     0     1     0     0     0
#> 147:     0     0     0     0     0     0     1     0     0     1
#> 148:     0     0     0     0     0     0     0     0     0     0
#> 149:     0     0     0     1     0     0     0     0     1     0
#> 150:     0     0     0     0     1     0     0     0     0     0

one_line_of_iris = task$filter(13)

one_line_of_iris$data()
#>    Species Petal.Length Petal.Width Sepal.Length Sepal.Width    txt
#>     <fctr>        <num>       <num>        <num>       <num> <char>
#> 1:  setosa          1.4         0.1          4.8           3  d f u

pos$predict(list(one_line_of_iris))[[1]]$data()
#>    Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.n txt.r txt.w
#>     <fctr>        <num>       <num>        <num>       <num> <num> <num> <num>
#> 1:  setosa          1.4         0.1          4.8           3     0     0     0
#>    txt.o txt.z txt.c txt.m txt.s txt.k txt.p txt.l txt.b txt.t txt.f txt.y
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:     0     0     0     0     0     0     0     0     0     0     1     0
#>    txt.g txt.u txt.j txt.d txt.x txt.e txt.v txt.q txt.h
#>    <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1:     0     1     0     1     0     0     0     0     0