
Bag-of-word Representation of Character Features
Source: R/PipeOpTextVectorizer.R
mlr_pipeops_textvectorizer.Rd
Computes a bag-of-words representation from a (set of) columns.
Columns of type character are split up into words.
Uses the quanteda::dfm() and quanteda::dfm_trim() functions.
TF-IDF computation works similarly to quanteda::dfm_tfidf()
but has been adjusted for train/test data split using quanteda::docfreq()
and quanteda::dfm_weight().
In short:
Per default, produces a bag-of-words representation.
If n is set to values > 1, ngrams are computed.
If df_trim parameters are set, the bag-of-words is trimmed.
The scheme_tf parameter controls term-frequency (per-document, i.e. per-row) weighting.
The scheme_df parameter controls the document-frequency (per token, i.e. per-column) weighting.
Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight.
Which parameter is passed to which function can be read off each parameter's tags; for example, parameters tagged tokenizer are
arguments passed on to quanteda::dfm().
Defaults to a bag-of-words representation with token counts as matrix entries.
In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse".
The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
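As a minimal sketch of switching from plain counts to TF-IDF-style weighting, the snippet below trains the PipeOp twice on a toy corpus: once with the initial scheme_df = "unary" and once with scheme_df = "inverse". The corpus, the variable names and the use of the iris task as a carrier are illustrative assumptions, and the quanteda and stopwords packages are assumed to be installed.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical toy corpus: "apple" occurs in every document, "date" in only one.
txt = rep(c("apple banana", "apple cherry", "apple banana cherry"), length.out = 150)
txt[150] = "apple date"
task = tsk("iris")$cbind(data.table(txt = txt))

# Initial setting: scheme_df = "unary", i.e. plain token counts.
po_counts = po("textvectorizer", param_vals = list(stopwords_language = "none"))
# Inverse document-frequency weighting, analogous to quanteda::dfm_tfidf().
po_tfidf = po("textvectorizer",
  param_vals = list(stopwords_language = "none", scheme_df = "inverse"))

counts = po_counts$train(list(task))[[1]]$data()
tfidf = po_tfidf$train(list(task))[[1]]$data()

# Compare the weights of a rare and a ubiquitous token under both settings.
range(counts$txt.date)
range(tfidf$txt.date)
range(tfidf$txt.apple)
```

With smoothing_df and k_df left at 0, a token that occurs in every training document gets an inverse document frequency of log(N / N) = 0, while rare tokens are up-weighted.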
The PipeOp works as follows:
Words are tokenized using quanteda::tokens.
Ngrams are computed using quanteda::tokens_ngrams.
A document-feature matrix is computed using quanteda::dfm.
The document-feature matrix is trimmed using quanteda::dfm_trim during train-time.
The document-feature matrix is re-weighted (similar to quanteda::dfm_tfidf) if scheme_df is not set to "unary".
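The steps above can be sketched with quanteda directly. This is not the PipeOp's internal code, only an illustration of the same sequence of calls on a hypothetical three-document corpus; scaling the columns with sweep() is an assumption made here for readability.

```r
library("quanteda")

# Hypothetical training corpus.
train_txt = c(d1 = "the cat sat", d2 = "the dog sat", d3 = "the cat ran")

# 1. Tokenize into words.
toks = tokens(train_txt, what = "word")
# 2. Build ngrams (n = 1 leaves the tokens unchanged).
toks = tokens_ngrams(toks, n = 1L)
# 3. Compute the document-feature matrix.
tdm = dfm(toks, tolower = TRUE)
# 4. Trim rare features (train-time only).
tdm = dfm_trim(tdm, min_termfreq = 1)
# 5. Re-weight: term frequency times document-frequency weight.
idf = docfreq(tdm, scheme = "inverse", base = 10)
tf = dfm_weight(tdm, scheme = "count")
weighted = sweep(as.matrix(tf), 2, idf, `*`)

# On the training data this agrees with quanteda's one-step helper.
all.equal(unname(weighted), unname(as.matrix(dfm_tfidf(tdm))))
```

Keeping idf as a separate vector is what makes a train/test split possible: the weights are estimated on the training documents once and can then be applied to the term counts of new documents.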
Format
R6Class object inheriting from PipeOpTaskPreproc/PipeOp.
Construction
id :: character(1)
Identifier of resulting object, default "textvectorizer".
param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
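A short construction sketch, assuming mlr3pipelines (with quanteda and stopwords) is installed. The id "tv" and the chosen parameter values are arbitrary illustrations.

```r
library("mlr3pipelines")

# Construction via the po() shorthand; id and param_vals are the
# constructor arguments described above.
pos = po("textvectorizer", id = "tv",
  param_vals = list(stopwords_language = "none", n = 1:2))

# Equivalent direct R6 construction.
pos2 = PipeOpTextVectorizer$new(id = "tv",
  param_vals = list(stopwords_language = "none", n = 1:2))

pos$id
pos$param_set$values$stopwords_language
```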
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc.
The output is the input Task with all affected features converted to a bag-of-words
representation.
State
The $state is a named list with the $state elements inherited from PipeOpTaskPreproc, as well as:
colmodels :: named list
Named list with one entry per extracted column. Each entry has two further elements:
tdm: sparse document-feature matrix resulting from quanteda::dfm()
docfreq: (weighted) document frequency resulting from quanteda::docfreq()
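The state can be inspected after training. The following is a sketch under the assumption that the entries of colmodels are named after the text columns (here the hypothetical column txt); the toy corpus is made up for illustration.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical toy corpus attached to the iris task.
txt = rep(c("red green", "green blue", "blue red"), length.out = 150)
task = tsk("iris")$cbind(data.table(txt = txt))

pos = po("textvectorizer",
  param_vals = list(stopwords_language = "none", scheme_df = "inverse"))
invisible(pos$train(list(task)))

cm = pos$state$colmodels
names(cm)                        # one entry per vectorized column
quanteda::featnames(cm$txt$tdm)  # vocabulary learned during training
cm$txt$docfreq                   # document-frequency weights per token
```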
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:
return_type :: character(1)
Whether to return an integer representation ("integer_sequence"), a factor representation ("factor_sequence") or a bag-of-words ("bow"). If set to "integer_sequence", tokens are replaced by an integer and padded/truncated to sequence_length. If set to "factor_sequence", tokens are replaced by a factor and padded/truncated to sequence_length. If set to "bow", a possibly weighted bag-of-words matrix is returned. Defaults to "bow".
stopwords_language :: character(1)
Language to use for stopword filtering. Needs to be either "none", a language identifier listed in stopwords::stopwords_getlanguages("snowball") ("de", "en", ...) or "smart". "none" disables language-specific stopwords. "smart" corresponds to stopwords::stopwords(source = "smart"), which contains English stopwords and also removes one-character strings. Initialized to "smart".
extra_stopwords :: character
Extra stopwords to remove. Must be a character vector containing individual tokens to remove. When n is set to values greater than 1, this can also contain stop-ngrams. Initialized to character(0).
tolower :: logical(1)
Whether to convert to lower case. See quanteda::dfm. Default is TRUE.
stem :: logical(1)
Whether to perform stemming. See quanteda::dfm. Default is FALSE.
what :: character(1)
Tokenization splitter. See quanteda::tokens. Default is "word".
remove_punct :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_url :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_symbols :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_numbers :: logical(1)
See quanteda::tokens. Default is FALSE.
remove_separators :: logical(1)
See quanteda::tokens. Default is TRUE.
split_hyphens :: logical(1)
See quanteda::tokens. Default is FALSE.
n :: integer
Vector of ngram lengths. See quanteda::tokens_ngrams. Initialized to 1, deviating from the base function's default. Note that this can be a vector of multiple values, to construct ngrams of multiple orders.
skip :: integer
Vector of skips. See quanteda::tokens_ngrams. Default is 0. Note that this can be a vector of multiple values.
sparsity :: numeric(1)
Desired sparsity of the 'tfm' matrix. See quanteda::dfm_trim. Default is NULL.
max_termfreq :: numeric(1)
Maximum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default is NULL.
min_termfreq :: numeric(1)
Minimum term frequency in the 'tfm' matrix. See quanteda::dfm_trim. Default is NULL.
termfreq_type :: character(1)
How to assess term frequency. See quanteda::dfm_trim. Default is "count".
scheme_df :: character(1)
Weighting scheme for document frequency: See quanteda::docfreq. Initialized to "unary" (1 for each document, deviating from base function default).
smoothing_df :: numeric(1)
See quanteda::docfreq. Default is 0.
k_df :: numeric(1)
k parameter given to quanteda::docfreq (see there). Default is 0.
threshold_df :: numeric(1)
See quanteda::docfreq. Default is 0. Only considered if scheme_df is set to "count".
base_df :: numeric(1)
The base for logarithms in quanteda::docfreq (see there). Default is 10.
scheme_tf :: character(1)
Weighting scheme for term frequency: See quanteda::dfm_weight. Default is "count".
k_tf :: numeric(1)
k parameter given to quanteda::dfm_weight (see there). Default is 0.5.
base_tf :: numeric(1)
The base for logarithms in quanteda::dfm_weight (see there). Default is 10.
sequence_length :: integer(1)
The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant if return_type is set to "integer_sequence".
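To illustrate the n parameter, the sketch below compares the features produced with unigrams only against those produced with unigrams and bigrams. The corpus and variable names are hypothetical, and the iris task merely serves as a carrier for the text column.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

# Hypothetical toy corpus attached to the iris task.
txt = rep(c("red green blue", "green blue red", "blue red green"), length.out = 150)
task = tsk("iris")$cbind(data.table(txt = txt))

po_uni = po("textvectorizer", param_vals = list(stopwords_language = "none"))
po_bi = po("textvectorizer",
  param_vals = list(stopwords_language = "none", n = 1:2))

uni = po_uni$train(list(task))[[1]]
bi = po_bi$train(list(task))[[1]]

# Text-derived features: unigrams only vs. unigrams plus bigrams.
grep("^txt\\.", uni$feature_names, value = TRUE)
grep("^txt\\.", bi$feature_names, value = TRUE)
```

Because n accepts a vector, a single PipeOp can emit ngrams of several orders side by side. On a real vocabulary this can increase the number of columns considerably, which is where the dfm_trim parameters become useful.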
Internals
See Description. Internally uses the quanteda package. Calls quanteda::tokens, quanteda::tokens_ngrams and quanteda::dfm. During training,
quanteda::dfm_trim is also called. Tokens not seen during training are dropped during prediction.
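The handling of unseen tokens can be checked with a small sketch: the PipeOp is trained on one hypothetical corpus and then applied to texts containing the token "purple", which never occurs during training. All data and names here are illustrative assumptions.

```r
library("mlr3")
library("mlr3pipelines")
library("data.table")

train_txt = rep(c("red green", "green blue", "blue red"), length.out = 150)
new_txt = rep("red purple", 150)  # "purple" is not part of the training vocabulary

train_task = tsk("iris")$cbind(data.table(txt = train_txt))
new_task = tsk("iris")$cbind(data.table(txt = new_txt))

pos = po("textvectorizer", param_vals = list(stopwords_language = "none"))
train_out = pos$train(list(train_task))[[1]]
pred_out = pos$predict(list(new_task))[[1]]

# The prediction output has the same features as the training output;
# no column is created for the unseen token.
setdiff(pred_out$feature_names, train_out$feature_names)
"txt.purple" %in% pred_out$feature_names
```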
Fields
Only fields inherited from PipeOp.
Methods
Only methods inherited from PipeOpTaskPreproc/PipeOp.
See also
https://mlr-org.com/pipeops.html
Other PipeOps:
PipeOp,
PipeOpEncodePL,
PipeOpEnsemble,
PipeOpImpute,
PipeOpTargetTrafo,
PipeOpTaskPreproc,
PipeOpTaskPreprocSimple,
mlr_pipeops,
mlr_pipeops_adas,
mlr_pipeops_blsmote,
mlr_pipeops_boxcox,
mlr_pipeops_branch,
mlr_pipeops_chunk,
mlr_pipeops_classbalancing,
mlr_pipeops_classifavg,
mlr_pipeops_classweights,
mlr_pipeops_colapply,
mlr_pipeops_collapsefactors,
mlr_pipeops_colroles,
mlr_pipeops_copy,
mlr_pipeops_datefeatures,
mlr_pipeops_decode,
mlr_pipeops_encode,
mlr_pipeops_encodeimpact,
mlr_pipeops_encodelmer,
mlr_pipeops_encodeplquantiles,
mlr_pipeops_encodepltree,
mlr_pipeops_featureunion,
mlr_pipeops_filter,
mlr_pipeops_fixfactors,
mlr_pipeops_histbin,
mlr_pipeops_ica,
mlr_pipeops_imputeconstant,
mlr_pipeops_imputehist,
mlr_pipeops_imputelearner,
mlr_pipeops_imputemean,
mlr_pipeops_imputemedian,
mlr_pipeops_imputemode,
mlr_pipeops_imputeoor,
mlr_pipeops_imputesample,
mlr_pipeops_kernelpca,
mlr_pipeops_learner,
mlr_pipeops_learner_pi_cvplus,
mlr_pipeops_learner_quantiles,
mlr_pipeops_missind,
mlr_pipeops_modelmatrix,
mlr_pipeops_multiplicityexply,
mlr_pipeops_multiplicityimply,
mlr_pipeops_mutate,
mlr_pipeops_nearmiss,
mlr_pipeops_nmf,
mlr_pipeops_nop,
mlr_pipeops_ovrsplit,
mlr_pipeops_ovrunite,
mlr_pipeops_pca,
mlr_pipeops_proxy,
mlr_pipeops_quantilebin,
mlr_pipeops_randomprojection,
mlr_pipeops_randomresponse,
mlr_pipeops_regravg,
mlr_pipeops_removeconstants,
mlr_pipeops_renamecolumns,
mlr_pipeops_replicate,
mlr_pipeops_rowapply,
mlr_pipeops_scale,
mlr_pipeops_scalemaxabs,
mlr_pipeops_scalerange,
mlr_pipeops_select,
mlr_pipeops_smote,
mlr_pipeops_smotenc,
mlr_pipeops_spatialsign,
mlr_pipeops_subsample,
mlr_pipeops_targetinvert,
mlr_pipeops_targetmutate,
mlr_pipeops_targettrafoscalerange,
mlr_pipeops_threshold,
mlr_pipeops_tomek,
mlr_pipeops_tunethreshold,
mlr_pipeops_unbranch,
mlr_pipeops_updatetarget,
mlr_pipeops_vtreat,
mlr_pipeops_yeojohnson
Examples
library("mlr3")
library("mlr3pipelines")
library("data.table")
# create some text data
dt = data.table(
txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)
pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))
pos$train(list(task))[[1]]$data()
#> 'as(<dgCMatrix>, "dgTMatrix")' is deprecated.
#> Use 'as(., "TsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.p txt.t
#> <fctr> <num> <num> <num> <num> <num> <num>
#> 1: setosa 1.4 0.2 5.1 3.5 1 1
#> 2: setosa 1.4 0.2 4.9 3.0 0 0
#> 3: setosa 1.3 0.2 4.7 3.2 0 0
#> 4: setosa 1.5 0.2 4.6 3.1 0 1
#> 5: setosa 1.4 0.2 5.0 3.6 0 0
#> ---
#> 146: virginica 5.2 2.3 6.7 3.0 0 1
#> 147: virginica 5.0 1.9 6.3 2.5 0 0
#> 148: virginica 5.2 2.0 6.5 3.0 0 0
#> 149: virginica 5.4 2.3 6.2 3.4 0 0
#> 150: virginica 5.1 1.8 5.9 3.0 1 0
#> txt.v txt.m txt.j txt.g txt.x txt.l txt.s txt.h txt.b txt.f txt.z txt.q
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 1 0 0 0 0 0 0 0 0 0 0 0
#> 2: 0 1 1 0 0 0 0 0 0 0 0 0
#> 3: 0 0 0 1 1 1 0 0 0 0 0 0
#> 4: 0 1 0 0 0 0 0 0 0 0 0 0
#> 5: 0 0 1 0 0 0 1 1 0 0 0 0
#> ---
#> 146: 0 0 0 0 0 0 0 0 0 0 0 0
#> 147: 0 0 0 0 0 0 1 0 0 0 0 0
#> 148: 0 0 1 0 0 0 0 0 0 0 0 1
#> 149: 0 0 0 0 0 0 0 0 0 0 0 0
#> 150: 0 0 0 0 0 0 0 0 0 0 0 0
#> txt.u txt.y txt.n txt.e txt.r txt.c txt.k txt.d txt.w txt.o
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 0 0 0 0 0 0 0
#> 2: 0 0 0 0 0 0 0 0 0 0
#> 3: 0 0 0 0 0 0 0 0 0 0
#> 4: 0 0 0 0 0 0 0 0 0 0
#> 5: 0 0 0 0 0 0 0 0 0 0
#> ---
#> 146: 0 0 0 1 0 0 0 0 0 0
#> 147: 0 0 0 0 0 0 0 0 1 0
#> 148: 1 0 0 0 0 0 0 0 0 0
#> 149: 1 0 0 0 1 1 0 0 0 0
#> 150: 1 0 0 1 0 0 0 0 0 0
one_line_of_iris = task$filter(13)
one_line_of_iris$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt
#> <fctr> <num> <num> <num> <num> <char>
#> 1: setosa 1.4 0.1 4.8 3 r c k
pos$predict(list(one_line_of_iris))[[1]]$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.p txt.t txt.v
#> <fctr> <num> <num> <num> <num> <num> <num> <num>
#> 1: setosa 1.4 0.1 4.8 3 0 0 0
#> txt.m txt.j txt.g txt.x txt.l txt.s txt.h txt.b txt.f txt.z txt.q txt.u
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 0 0 0 0 0 0 0 0 0
#> txt.y txt.n txt.e txt.r txt.c txt.k txt.d txt.w txt.o
#> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 1 1 1 0 0 0