
Bag-of-Words Representation of Character Features
Source: R/PipeOpTextVectorizer.R, mlr_pipeops_textvectorizer.Rd
Computes a bag-of-words representation from a (set of) columns. Columns of type character are split up into words. Uses the quanteda::dfm() and quanteda::dfm_trim() functions. TF-IDF computation works similarly to quanteda::dfm_tfidf(), but has been adjusted for the train/test data split, using quanteda::docfreq() and quanteda::dfm_weight().
In short:
- Per default, produces a bag-of-words representation.
- If n is set to values > 1, ngrams are computed.
- If the dfm_trim parameters are set, the bag-of-words is trimmed.
- The scheme_tf parameter controls the term-frequency (per-document, i.e. per-row) weighting.
- The scheme_df parameter controls the document-frequency (per-token, i.e. per-column) weighting.
Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight. Which function a parameter belongs to can be obtained from that parameter's tags; parameters tagged tokenizer are passed on to quanteda::dfm().
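As a minimal sketch, the tags can be inspected on a constructed PipeOp; this assumes mlr3pipelines is attached and relies on the usual paradox ParamSet $tags field:

library("mlr3pipelines")
pos = po("textvectorizer")
# each parameter's tags indicate the target quanteda function; e.g.
# parameters tagged "tokenizer" are passed on to quanteda::dfm()
pos$param_set$tags[c("n", "remove_punct", "scheme_df", "scheme_tf")]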
Defaults to a bag-of-words representation with token counts as matrix entries. In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse". The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
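A minimal sketch of such a tf-idf configuration, assuming mlr3pipelines is attached; the object name pos_tfidf is illustrative only:

library("mlr3pipelines")
# term counts per document, weighted by inverse document frequency,
# mirroring the default behaviour of quanteda::dfm_tfidf()
pos_tfidf = po("textvectorizer", param_vals = list(
  scheme_df = "inverse",  # enable idf document-frequency weighting
  scheme_tf = "count"     # raw term counts (the default)
))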
The PipeOp works as follows (a sketch of the corresponding quanteda calls follows the list):
1. Words are tokenized using quanteda::tokens.
2. Ngrams are computed using quanteda::tokens_ngrams.
3. A document-frequency matrix is computed using quanteda::dfm.
4. The document-frequency matrix is trimmed using quanteda::dfm_trim during train-time.
5. The document-frequency matrix is re-weighted (similar to quanteda::dfm_tfidf) if scheme_df is not set to "unary".
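The following sketch traces these steps with plain quanteda calls on a made-up toy corpus; the re-weighting is shown in its dfm_tfidf-like form, not as the exact internal code (which stores the training docfreq for reuse at predict time):

library("quanteda")

txt  = c(d1 = "the quick brown fox", d2 = "the lazy dog")
toks = tokens(txt, what = "word")           # 1. tokenize into words
toks = tokens_ngrams(toks, n = 1)           # 2. build ngrams (unigrams here)
tdm  = dfm(toks)                            # 3. document-feature matrix
tdm  = dfm_trim(tdm, min_termfreq = 1)      # 4. trim (train-time only)
# 5. re-weight: term-frequency weighting combined with document-frequency
#    weights, similar to quanteda::dfm_tfidf()
df    = docfreq(tdm, scheme = "inverse")
tfidf = sweep(as.matrix(dfm_weight(tdm, scheme = "count")), 2, df, "*")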
Format
R6Class object inheriting from PipeOpTaskPreproc/PipeOp.
Construction
PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())

id :: character(1)
Identifier of resulting object, default "textvectorizer".

param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
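A short construction sketch; both forms are equivalent, with po() being the usual shorthand from mlr3pipelines:

library("mlr3pipelines")
# direct construction via the R6 generator
op1 = PipeOpTextVectorizer$new(id = "textvectorizer")
# shorthand construction with hyperparameter overrides
op2 = po("textvectorizer", param_vals = list(stopwords_language = "en"))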
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc. The output is the input Task with all affected features converted to a bag-of-words representation.
State
The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:

colmodels :: named list
Named list with one entry per extracted column. Each entry has two further elements:
- tdm: sparse document-feature matrix resulting from quanteda::dfm()
- docfreq: (weighted) document frequency resulting from quanteda::docfreq()
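As a hedged sketch, after training on a task with a character column named txt (a hypothetical column name), the stored models can be inspected like this:

# `pos` is a trained PipeOpTextVectorizer, one colmodels entry per column
pos$state$colmodels$txt$tdm      # sparse matrix from quanteda::dfm()
pos$state$colmodels$txt$docfreq  # weights from quanteda::docfreq()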
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

return_type :: character(1)
Whether to return an integer representation ("integer_sequence") or a bag-of-words representation ("bow"). If set to "integer_sequence", tokens are replaced by an integer and padded/truncated to sequence_length. If set to "factor_sequence", tokens are replaced by a factor and padded/truncated to sequence_length. If set to "bow", a possibly weighted bag-of-words matrix is returned. Defaults to "bow".

stopwords_language :: character(1)
Language to use for stopword filtering. Needs to be either "none", a language identifier listed in stopwords::stopwords_getlanguages("snowball") ("de", "en", ...) or "smart". "none" disables language-specific stopwords. "smart" corresponds to stopwords::stopwords(source = "smart"), which contains English stopwords and also removes one-character strings. Initialized to "smart".

extra_stopwords :: character
Extra stopwords to remove. Must be a character vector containing individual tokens to remove. When n is set to values greater than 1, this can also contain stop-ngrams. Initialized to character(0).

tolower :: logical(1)
Whether to convert to lower case. See quanteda::dfm. Default is TRUE.

stem :: logical(1)
Whether to perform stemming. See quanteda::dfm. Default is FALSE.

what :: character(1)
Tokenization splitter. See quanteda::tokens. Default is "word".

remove_punct :: logical(1)
See quanteda::tokens. Default is FALSE.

remove_url :: logical(1)
See quanteda::tokens. Default is FALSE.

remove_symbols :: logical(1)
See quanteda::tokens. Default is FALSE.

remove_numbers :: logical(1)
See quanteda::tokens. Default is FALSE.

remove_separators :: logical(1)
See quanteda::tokens. Default is TRUE.

split_hypens :: logical(1)
See quanteda::tokens. Default is FALSE.

n :: integer
Vector of ngram lengths. See quanteda::tokens_ngrams. Initialized to 1, deviating from the base function's default. Note that this can be a vector of multiple values, to construct ngrams of multiple orders.

skip :: integer
Vector of skips. See quanteda::tokens_ngrams. Default is 0. Note that this can be a vector of multiple values.

sparsity :: numeric(1)
Desired sparsity of the document-feature matrix. See quanteda::dfm_trim. Default is NULL.

max_termfreq :: numeric(1)
Maximum term frequency in the document-feature matrix. See quanteda::dfm_trim. Default is NULL.

min_termfreq :: numeric(1)
Minimum term frequency in the document-feature matrix. See quanteda::dfm_trim. Default is NULL.

termfreq_type :: character(1)
How to assess term frequency. See quanteda::dfm_trim. Default is "count".

scheme_df :: character(1)
Weighting scheme for document frequency. See quanteda::docfreq. Initialized to "unary" (1 for each document, deviating from the base function's default).

smoothing_df :: numeric(1)
See quanteda::docfreq. Default is 0.

k_df :: numeric(1)
k parameter given to quanteda::docfreq (see there). Default is 0.

threshold_df :: numeric(1)
See quanteda::docfreq. Default is 0. Only considered if scheme_df is set to "count".

base_df :: numeric(1)
The base for logarithms in quanteda::docfreq (see there). Default is 10.

scheme_tf :: character(1)
Weighting scheme for term frequency. See quanteda::dfm_weight. Default is "count".

k_tf :: numeric(1)
k parameter given to quanteda::dfm_weight (see there). Default is 0.5.

base_tf :: numeric(1)
The base for logarithms in quanteda::dfm_weight (see there). Default is 10.

sequence_length :: integer(1)
The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant if return_type is set to "integer_sequence".
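A sketch combining several of these parameters into a bigram configuration with frequency trimming; the object name and the chosen values are illustrative only:

library("mlr3pipelines")
pos_ngram = po("textvectorizer", param_vals = list(
  n = 1:2,                # unigrams and bigrams (quanteda::tokens_ngrams)
  remove_punct = TRUE,    # passed on to quanteda::tokens
  min_termfreq = 2,       # dfm_trim: drop terms occurring fewer than 2 times
  termfreq_type = "count"
))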
Internals
See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training, quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.
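To illustrate the last point, a small self-contained sketch (task names and toy data are made up): the training vocabulary fixes the feature space, so unseen tokens at predict time produce no new columns.

library("mlr3")
library("mlr3pipelines")
library("data.table")

train = TaskClassif$new("train", data.table(
  txt = c("alpha beta", "beta gamma"), y = factor(c("a", "b"))), target = "y")
op = po("textvectorizer", param_vals = list(stopwords_language = "none"))
op$train(list(train))

# "delta" was never seen during training and is silently dropped;
# only the columns for "alpha", "beta" and "gamma" remain
test = TaskClassif$new("test", data.table(
  txt = "alpha delta", y = factor("a", levels = c("a", "b"))), target = "y")
op$predict(list(test))[[1]]$data()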
Fields
Only fields inherited from PipeOp.
Methods
Only methods inherited from PipeOpTaskPreproc/PipeOp.
See also
https://mlr-org.com/pipeops.html
Other PipeOps: PipeOp, PipeOpEncodePL, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_encodeplquantiles, mlr_pipeops_encodepltree, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson
Examples
library("mlr3")
library("data.table")
# create some text data
dt = data.table(
txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)
pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))
pos$train(list(task))[[1]]$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.n txt.r
#> <fctr> <num> <num> <num> <num> <num> <num>
#> 1: setosa 1.4 0.2 5.1 3.5 1 1
#> 2: setosa 1.4 0.2 4.9 3.0 0 0
#> 3: setosa 1.3 0.2 4.7 3.2 0 0
#> 4: setosa 1.5 0.2 4.6 3.1 1 0
#> 5: setosa 1.4 0.2 5.0 3.6 0 0
#> ---
#> 146: virginica 5.2 2.3 6.7 3.0 0 1
#> 147: virginica 5.0 1.9 6.3 2.5 0 1
#> 148: virginica 5.2 2.0 6.5 3.0 0 0
#> 149: virginica 5.4 2.3 6.2 3.4 0 0
#> 150: virginica 5.1 1.8 5.9 3.0 0 0
#> txt.w txt.o txt.z txt.c txt.m txt.s txt.k txt.p txt.l txt.b txt.t txt.f
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 1 0 0 0 0 0 0 0 0 0 0 0
#> 2: 0 1 1 1 0 0 0 0 0 0 0 0
#> 3: 0 0 0 0 1 1 1 0 0 0 0 0
#> 4: 0 0 1 0 0 0 0 1 0 0 0 0
#> 5: 0 0 0 1 0 0 0 0 1 1 0 0
#> ---
#> 146: 0 0 0 0 0 0 0 0 0 0 0 1
#> 147: 0 0 0 0 0 0 0 0 0 0 0 0
#> 148: 0 0 0 1 0 0 0 1 0 0 0 0
#> 149: 1 0 0 0 0 0 0 0 0 0 0 0
#> 150: 0 0 0 0 0 0 0 1 0 0 0 0
#> txt.y txt.g txt.u txt.j txt.d txt.x txt.e txt.v txt.q txt.h
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 0 0 0 0 0 0 0
#> 2: 0 0 0 0 0 0 0 0 0 0
#> 3: 0 0 0 0 0 0 0 0 0 0
#> 4: 0 0 0 0 0 0 0 0 0 0
#> 5: 0 0 0 0 0 0 0 0 0 0
#> ---
#> 146: 0 0 0 0 0 0 1 0 0 0
#> 147: 0 0 0 0 0 0 1 0 0 1
#> 148: 0 0 0 0 0 0 0 0 0 0
#> 149: 0 0 0 1 0 0 0 0 1 0
#> 150: 0 0 0 0 1 0 0 0 0 0
# reduce the task to a single row for prediction
one_line_of_iris = task$filter(13)
one_line_of_iris$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt
#> <fctr> <num> <num> <num> <num> <char>
#> 1: setosa 1.4 0.1 4.8 3 d f u
# predict: the tokens "d", "f" and "u" are mapped onto the training vocabulary
pos$predict(list(one_line_of_iris))[[1]]$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.n txt.r txt.w
#> <fctr> <num> <num> <num> <num> <num> <num> <num>
#> 1: setosa 1.4 0.1 4.8 3 0 0 0
#> txt.o txt.z txt.c txt.m txt.s txt.k txt.p txt.l txt.b txt.t txt.f txt.y
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 0 0 0 0 0 0 0 1 0
#> txt.g txt.u txt.j txt.d txt.x txt.e txt.v txt.q txt.h
#> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 1 0 1 0 0 0 0 0