Bag-of-Words Representation of Character Features
Source: R/PipeOpTextVectorizer.R
Computes a bag-of-words representation from a (set of) columns. Columns of type character are split up into words. Uses quanteda::dfm() and quanteda::dfm_trim() from the 'quanteda' package. TF-IDF computation works similarly to quanteda::dfm_tfidf(), but has been adjusted for the train/test data split using quanteda::docfreq() and quanteda::dfm_weight().
In short:
* Per default, produces a bag-of-words representation.
* If n is set to values > 1, ngrams are computed.
* If dfm_trim parameters are set, the bag-of-words is trimmed (see the example after this list).
* The scheme_tf parameter controls term-frequency (per-document, i.e. per-row) weighting.
* The scheme_df parameter controls document-frequency (per-token, i.e. per-column) weighting.
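For instance, ngram computation and trimming can be combined in a single configuration. The following is a minimal sketch; the parameter values are illustrative, not recommendations:

library(mlr3pipelines)
# Unigrams and bigrams (n = 1:2 is passed to quanteda::tokens_ngrams()),
# keeping only terms that occur at least twice (passed to quanteda::dfm_trim()).
pos = po("textvectorizer", param_vals = list(
  n = 1:2,
  min_termfreq = 2,
  termfreq_type = "count"
))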
Parameters specify arguments to quanteda's dfm, dfm_trim, docfreq and dfm_weight. Which function a given parameter is passed to can be obtained from that parameter's tags; parameters tagged "tokenizer" are arguments passed on to quanteda::dfm().
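These tags can be inspected through the PipeOp's ParamSet; a short sketch, assuming the ParamSet's $tags field as provided by paradox:

library(mlr3pipelines)
pos = po("textvectorizer")
# Named list mapping each hyperparameter to its tags, indicating which
# quanteda function the value is forwarded to.
pos$param_set$tags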
Per default, this produces a bag-of-words representation with token counts as matrix entries. In order to perform the default dfm_tfidf weighting, set the scheme_df parameter to "inverse". The scheme_df parameter is initialized to "unary", which disables document frequency weighting.
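To reproduce the default dfm_tfidf() weighting, the configuration would therefore look like this (scheme_tf = "count" is the default and is included only for clarity):

library(mlr3pipelines)
# "inverse" document-frequency weighting on top of raw term counts
# matches the default behaviour of quanteda::dfm_tfidf().
pos = po("textvectorizer", param_vals = list(
  scheme_df = "inverse",
  scheme_tf = "count"
))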
The PipeOp works as follows (see the sketch after this list):
1. Words are tokenized using quanteda::tokens().
2. Ngrams are computed using quanteda::tokens_ngrams().
3. A document-frequency matrix is computed using quanteda::dfm().
4. The document-frequency matrix is trimmed using quanteda::dfm_trim() during train-time.
5. The document-frequency matrix is re-weighted (similar to quanteda::dfm_tfidf()) if scheme_df is not set to "unary".
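The following sketch traces these steps with quanteda directly. It is a rough illustration with made-up example text and parameter values, not the PipeOp's exact internals:

library(quanteda)
txt = c(doc1 = "The quick brown fox", doc2 = "A lazy dog")
toks = tokens(txt, what = "word")             # 1. tokenize
toks = tokens_ngrams(toks, n = 1, skip = 0)   # 2. ngrams (unigrams here)
m = dfm(toks, tolower = TRUE)                 # 3. document-frequency matrix
m = dfm_trim(m, min_termfreq = 1)             # 4. trimming (train-time only)
# 5. re-weighting: document frequencies are computed on the training data
#    and applied as multiplicative weights, mirroring dfm_tfidf() while
#    remaining applicable to new documents at predict-time.
df = docfreq(m, scheme = "inverse")
weighted = dfm_weight(m, weights = df)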
Format
R6Class object inheriting from PipeOpTaskPreproc/PipeOp.
Construction
PipeOpTextVectorizer$new(id = "textvectorizer", param_vals = list())

id :: character(1)
Identifier of resulting object, default "textvectorizer".

param_vals :: named list
List of hyperparameter settings, overwriting the hyperparameter settings that would otherwise be set during construction. Default list().
Input and Output Channels
Input and output channels are inherited from PipeOpTaskPreproc. The output is the input Task with all affected features converted to a bag-of-words representation.
Parameters
The parameters are the parameters inherited from PipeOpTaskPreproc, as well as:

return_type :: character(1)
Whether to return an integer representation ("integer_sequence"), a factor representation ("factor_sequence"), or a bag-of-words ("bow"). If set to "integer_sequence", tokens are replaced by an integer and padded/truncated to sequence_length. If set to "factor_sequence", tokens are replaced by a factor and padded/truncated to sequence_length. If set to "bow", a possibly weighted bag-of-words matrix is returned. Defaults to "bow".

stopwords_language :: character(1)
Language to use for stopword filtering. Needs to be either "none", a language identifier listed in stopwords::stopwords_getlanguages("snowball") ("de", "en", ...) or "smart". "none" disables language-specific stopwords. "smart" corresponds to stopwords::stopwords(source = "smart"), which contains English stopwords and also removes one-character strings. Initialized to "smart".

extra_stopwords :: character
Extra stopwords to remove. Must be a character vector containing individual tokens to remove. Initialized to character(0). When n is set to values greater than 1, this can also contain stop-ngrams.

tolower :: logical(1)
Convert to lower case? See quanteda::dfm(). Default: TRUE.

stem :: logical(1)
Perform stemming? See quanteda::dfm(). Default: FALSE.

what :: character(1)
Tokenization splitter. See quanteda::tokens(). Default: "word".

remove_punct :: logical(1)
See quanteda::tokens(). Default: FALSE.

remove_url :: logical(1)
See quanteda::tokens(). Default: FALSE.

remove_symbols :: logical(1)
See quanteda::tokens(). Default: FALSE.

remove_numbers :: logical(1)
See quanteda::tokens(). Default: FALSE.

remove_separators :: logical(1)
See quanteda::tokens(). Default: TRUE.

split_hypens :: logical(1)
See quanteda::tokens(). Default: FALSE.

n :: integer
Vector of ngram lengths. See quanteda::tokens_ngrams(). Initialized to 1, deviating from the base function's default. Note that this can be a vector of multiple values, to construct ngrams of multiple orders.

skip :: integer
Vector of skips. See quanteda::tokens_ngrams(). Default: 0. Note that this can be a vector of multiple values.

sparsity :: numeric(1)
Desired sparsity of the 'tfm' matrix. See quanteda::dfm_trim(). Default: NULL.

max_termfreq :: numeric(1)
Maximum term frequency in the 'tfm' matrix. See quanteda::dfm_trim(). Default: NULL.

min_termfreq :: numeric(1)
Minimum term frequency in the 'tfm' matrix. See quanteda::dfm_trim(). Default: NULL.

termfreq_type :: character(1)
How to assess term frequency. See quanteda::dfm_trim(). Default: "count".

scheme_df :: character(1)
Weighting scheme for document frequency. See quanteda::docfreq(). Initialized to "unary" (1 for each document, deviating from the base function's default).

smoothing_df :: numeric(1)
See quanteda::docfreq(). Default: 0.

k_df :: numeric(1)
k parameter given to quanteda::docfreq() (see there). Default: 0.

threshold_df :: numeric(1)
See quanteda::docfreq(). Default: 0. Only considered for scheme_df = "count".

base_df :: numeric(1)
The base for logarithms in quanteda::docfreq() (see there). Default: 10.

scheme_tf :: character(1)
Weighting scheme for term frequency. See quanteda::dfm_weight(). Default: "count".

k_tf :: numeric(1)
k parameter given to quanteda::dfm_weight() (see there). Default: 0.5.

base_tf :: numeric(1)
The base for logarithms in quanteda::dfm_weight() (see there). Default: 10.

sequence_length :: integer(1)
The length of the integer sequence. Defaults to Inf, i.e. all texts are padded to the length of the longest text. Only relevant for return_type = "integer_sequence" (see the example after this list).
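For example, a configuration producing padded integer sequences could look like this; the sequence length of 20 is an arbitrary illustrative choice:

library(mlr3pipelines)
# Replace tokens by integer ids and pad/truncate every text to 20 entries.
pos = po("textvectorizer", param_vals = list(
  return_type = "integer_sequence",
  sequence_length = 20
))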
Internals
See Description. Internally uses the quanteda package. Calls quanteda::tokens(), quanteda::tokens_ngrams() and quanteda::dfm(). During training, quanteda::dfm_trim() is also called. Tokens not seen during training are dropped during prediction (see the sketch below).
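A small sketch of this train/predict behaviour on toy data (task names and tokens are made up; output omitted):

library(mlr3)
library(mlr3pipelines)
library(data.table)
train_task = TaskClassif$new("train", data.table(
  txt = c("alpha beta", "beta gamma"),
  y = factor(c("a", "b"))
), target = "y")
test_task = TaskClassif$new("test", data.table(
  txt = "beta delta",
  y = factor("a", levels = c("a", "b"))
), target = "y")
pos = po("textvectorizer")
pos$train(list(train_task))
# "delta" was never seen during training, so it is dropped: only the
# column for "beta" is non-zero in the predicted row.
pos$predict(list(test_task))[[1]]$data()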
Methods
Only methods inherited from PipeOpTaskPreproc/PipeOp.
See also
https://mlr-org.com/pipeops.html
Other PipeOps: PipeOp, PipeOpEnsemble, PipeOpImpute, PipeOpTargetTrafo, PipeOpTaskPreproc, PipeOpTaskPreprocSimple, mlr_pipeops, mlr_pipeops_adas, mlr_pipeops_blsmote, mlr_pipeops_boxcox, mlr_pipeops_branch, mlr_pipeops_chunk, mlr_pipeops_classbalancing, mlr_pipeops_classifavg, mlr_pipeops_classweights, mlr_pipeops_colapply, mlr_pipeops_collapsefactors, mlr_pipeops_colroles, mlr_pipeops_copy, mlr_pipeops_datefeatures, mlr_pipeops_decode, mlr_pipeops_encode, mlr_pipeops_encodeimpact, mlr_pipeops_encodelmer, mlr_pipeops_featureunion, mlr_pipeops_filter, mlr_pipeops_fixfactors, mlr_pipeops_histbin, mlr_pipeops_ica, mlr_pipeops_imputeconstant, mlr_pipeops_imputehist, mlr_pipeops_imputelearner, mlr_pipeops_imputemean, mlr_pipeops_imputemedian, mlr_pipeops_imputemode, mlr_pipeops_imputeoor, mlr_pipeops_imputesample, mlr_pipeops_kernelpca, mlr_pipeops_learner, mlr_pipeops_learner_pi_cvplus, mlr_pipeops_learner_quantiles, mlr_pipeops_missind, mlr_pipeops_modelmatrix, mlr_pipeops_multiplicityexply, mlr_pipeops_multiplicityimply, mlr_pipeops_mutate, mlr_pipeops_nearmiss, mlr_pipeops_nmf, mlr_pipeops_nop, mlr_pipeops_ovrsplit, mlr_pipeops_ovrunite, mlr_pipeops_pca, mlr_pipeops_proxy, mlr_pipeops_quantilebin, mlr_pipeops_randomprojection, mlr_pipeops_randomresponse, mlr_pipeops_regravg, mlr_pipeops_removeconstants, mlr_pipeops_renamecolumns, mlr_pipeops_replicate, mlr_pipeops_rowapply, mlr_pipeops_scale, mlr_pipeops_scalemaxabs, mlr_pipeops_scalerange, mlr_pipeops_select, mlr_pipeops_smote, mlr_pipeops_smotenc, mlr_pipeops_spatialsign, mlr_pipeops_subsample, mlr_pipeops_targetinvert, mlr_pipeops_targetmutate, mlr_pipeops_targettrafoscalerange, mlr_pipeops_threshold, mlr_pipeops_tomek, mlr_pipeops_tunethreshold, mlr_pipeops_unbranch, mlr_pipeops_updatetarget, mlr_pipeops_vtreat, mlr_pipeops_yeojohnson
Examples
library("mlr3")
library("data.table")
# create some text data
dt = data.table(
txt = replicate(150, paste0(sample(letters, 3), collapse = " "))
)
task = tsk("iris")$cbind(dt)
pos = po("textvectorizer", param_vals = list(stopwords_language = "en"))
pos$train(list(task))[[1]]$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.n txt.r
#> <fctr> <num> <num> <num> <num> <num> <num>
#> 1: setosa 1.4 0.2 5.1 3.5 1 1
#> 2: setosa 1.4 0.2 4.9 3.0 0 0
#> 3: setosa 1.3 0.2 4.7 3.2 0 0
#> 4: setosa 1.5 0.2 4.6 3.1 1 0
#> 5: setosa 1.4 0.2 5.0 3.6 0 0
#> ---
#> 146: virginica 5.2 2.3 6.7 3.0 0 1
#> 147: virginica 5.0 1.9 6.3 2.5 0 1
#> 148: virginica 5.2 2.0 6.5 3.0 0 0
#> 149: virginica 5.4 2.3 6.2 3.4 0 0
#> 150: virginica 5.1 1.8 5.9 3.0 0 0
#> txt.w txt.o txt.z txt.c txt.m txt.s txt.k txt.p txt.l txt.b txt.t txt.f
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 1 0 0 0 0 0 0 0 0 0 0 0
#> 2: 0 1 1 1 0 0 0 0 0 0 0 0
#> 3: 0 0 0 0 1 1 1 0 0 0 0 0
#> 4: 0 0 1 0 0 0 0 1 0 0 0 0
#> 5: 0 0 0 1 0 0 0 0 1 1 0 0
#> ---
#> 146: 0 0 0 0 0 0 0 0 0 0 0 1
#> 147: 0 0 0 0 0 0 0 0 0 0 0 0
#> 148: 0 0 0 1 0 0 0 1 0 0 0 0
#> 149: 1 0 0 0 0 0 0 0 0 0 0 0
#> 150: 0 0 0 0 0 0 0 1 0 0 0 0
#> txt.y txt.g txt.u txt.j txt.d txt.x txt.e txt.v txt.q txt.h
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 0 0 0 0 0 0 0
#> 2: 0 0 0 0 0 0 0 0 0 0
#> 3: 0 0 0 0 0 0 0 0 0 0
#> 4: 0 0 0 0 0 0 0 0 0 0
#> 5: 0 0 0 0 0 0 0 0 0 0
#> ---
#> 146: 0 0 0 0 0 0 1 0 0 0
#> 147: 0 0 0 0 0 0 1 0 0 1
#> 148: 0 0 0 0 0 0 0 0 0 0
#> 149: 0 0 0 1 0 0 0 0 1 0
#> 150: 0 0 0 0 1 0 0 0 0 0
one_line_of_iris = task$filter(13)
one_line_of_iris$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt
#> <fctr> <num> <num> <num> <num> <char>
#> 1: setosa 1.4 0.1 4.8 3 d f u
pos$predict(list(one_line_of_iris))[[1]]$data()
#> Species Petal.Length Petal.Width Sepal.Length Sepal.Width txt.n txt.r txt.w
#> <fctr> <num> <num> <num> <num> <num> <num> <num>
#> 1: setosa 1.4 0.1 4.8 3 0 0 0
#> txt.o txt.z txt.c txt.m txt.s txt.k txt.p txt.l txt.b txt.t txt.f txt.y
#> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 0 0 0 0 0 0 0 0 0 1 0
#> txt.g txt.u txt.j txt.d txt.x txt.e txt.v txt.q txt.h
#> <num> <num> <num> <num> <num> <num> <num> <num> <num>
#> 1: 0 1 0 1 0 0 0 0 0