Text vectorization layer
This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one sample = one string) into either a list of token indices (one sample = 1D tensor of integer token indices) or a dense representation (one sample = 1D tensor of float values representing data about the sample's tokens).
layer_text_vectorization(
  object,
  max_tokens = NULL,
  standardize = "lower_and_strip_punctuation",
  split = "whitespace",
  ngrams = NULL,
  output_mode = c("int", "binary", "count", "tf-idf"),
  output_sequence_length = NULL,
  pad_to_max_tokens = TRUE,
  ...
)
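A minimal sketch of standalone use (the corpus, vocabulary size, and sequence length below are illustrative): the layer is created, its vocabulary is learned from example strings with adapt(), and it is then called on a batch of samples.

library(keras)

# One sample = one string.
samples <- c("The quick brown fox", "jumped over the lazy dog")

# "int" mode: each sample becomes a 1D tensor of token indices,
# padded or truncated to output_sequence_length.
vectorizer <- layer_text_vectorization(
  max_tokens = 1000,
  output_mode = "int",
  output_sequence_length = 6
)

# Learn the vocabulary from the corpus before using the layer.
adapt(vectorizer, samples)

# Call the layer on a batch of strings (one row per sample).
vectorizer(matrix(samples, ncol = 1))   # integer tensor of shape (2, 6)
get_vocabulary(vectorizer)              # learned tokens (padding/OOV entries first)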
Arguments

object
    Model or layer object.

max_tokens
    The maximum size of the vocabulary for this layer. If NULL, there is no
    cap on the size of the vocabulary.

standardize
    Optional specification for standardization to apply to the input text.
    Values can be NULL (no standardization), "lower_and_strip_punctuation"
    (lowercase and remove punctuation), or a callable that standardizes a
    string tensor.

split
    Optional specification for splitting the input text. Values can be NULL
    (no splitting), "whitespace" (split on whitespace), or a callable that
    splits a string tensor.

ngrams
    Optional specification for ngrams to create from the possibly-split
    input text. Values can be NULL, an integer, or a vector of integers;
    an integer creates ngrams up to that order, while a vector of integers
    creates ngrams only for the specified orders.

output_mode
    Optional specification for the output of the layer. Values can be
    "int", "binary", "count", or "tf-idf", configuring the layer as follows:
    "int" outputs integer indices, one index per split string token;
    "binary" outputs a single vector per sample, of either vocabulary size
    or max_tokens size, containing 1s wherever the token mapped to that
    index appears at least once in the sample; "count" is like "binary" but
    each slot holds the number of times the token appeared in the sample;
    "tf-idf" is like "binary" but applies the TF-IDF algorithm to fill each
    token slot.

output_sequence_length
    Only valid in "int" mode. If set, the output will have its time
    dimension padded or truncated to exactly output_sequence_length values,
    resulting in a tensor of shape (batch_size, output_sequence_length)
    regardless of how many tokens resulted from the splitting step.

pad_to_max_tokens
    Only valid in "binary", "count", and "tf-idf" modes. If TRUE, the output
    will have its feature axis padded to max_tokens even if the number of
    unique tokens in the vocabulary is smaller, resulting in a tensor of
    shape (batch_size, max_tokens) regardless of vocabulary size.

...
    Not used.
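For the dense output modes, a short sketch (corpus and vocabulary size are illustrative) of how "count" differs from "int": each sample becomes a single vector over the vocabulary rather than a sequence of indices; "binary" and "tf-idf" change only how each slot is filled.

library(keras)

samples <- c("the cat sat on the mat", "the dog ate my homework")

# "count" mode: one sample = one dense vector whose length is the
# vocabulary size (padded to max_tokens here), holding per-token counts.
counter <- layer_text_vectorization(
  max_tokens = 20,
  output_mode = "count",
  pad_to_max_tokens = TRUE
)
adapt(counter, samples)

counter(matrix(samples, ncol = 1))   # tensor of shape (2, 20)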
The processing of each sample contains the following steps (a usage sketch follows this list):

1. Standardize each sample (usually lowercasing + punctuation stripping).
2. Split each sample into substrings (usually words).
3. Recombine substrings into tokens (usually ngrams).
4. Index tokens (associate a unique int value with each token).
5. Transform each sample using this index, either into a vector of ints or a dense float vector.
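Putting the steps together, a hedged end-to-end sketch (the corpus, vocabulary size, and network sizes are all illustrative): raw strings enter a string input layer, the vectorization layer performs the standardize/split/ngram/index steps above, and an embedding consumes the resulting token indices.

library(keras)

# Illustrative corpus; in practice this would be a training set of strings.
train_texts <- c("good movie", "bad movie", "great film", "terrible film")

vectorizer <- layer_text_vectorization(
  max_tokens = 10000,
  output_mode = "int",
  output_sequence_length = 20
)
adapt(vectorizer, train_texts)

# Functional API: the model accepts raw strings and vectorizes them itself.
input <- layer_input(shape = c(1), dtype = "string")

output <- input %>%
  vectorizer() %>%
  layer_embedding(input_dim = 10000 + 1, output_dim = 16) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(input, output)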