Text tokenization utility
Vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.
text_tokenizer(
  num_words = NULL,
  filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
  lower = TRUE,
  split = " ",
  char_level = FALSE,
  oov_token = NULL
)
Arguments

num_words
the maximum number of words to keep, based on word frequency. Only the most common num_words - 1 words will be kept.

filters
a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.

lower
boolean. Whether to convert the texts to lowercase.

split
character or string to use for token splitting.

char_level
if TRUE, every character will be treated as a token.

oov_token
NULL or string. If given, it will be added to word_index and used to replace out-of-vocabulary words when converting texts to sequences.
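As a minimal sketch of constructing a tokenizer with non-default arguments (the vocabulary size and out-of-vocabulary token below are arbitrary illustrative choices, not defaults):

library(keras)

# keep only the 1,000 most frequent words; words outside that vocabulary
# will be replaced by "<unk>" when texts are converted later
tokenizer <- text_tokenizer(num_words = 1000, oov_token = "<unk>")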
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized. 0 is a reserved index that won't be assigned to any word.
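As a sketch of the fit-then-convert workflow (the two-text corpus below is made up for illustration), a tokenizer is first fit on the texts and can then turn them into integer sequences or into a document-term matrix:

library(keras)

texts <- c("The cat sat on the mat.",
           "The dog ate my homework.")

tokenizer <- text_tokenizer(num_words = 100)
tokenizer %>% fit_text_tokenizer(texts)

# each text becomes a sequence of word indices; index 0 is never assigned
sequences <- texts_to_sequences(tokenizer, texts)

# one row per text; mode can be "binary", "count", "tfidf" or "freq"
mat <- texts_to_matrix(tokenizer, texts, mode = "binary")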
The tokenizer object has the following attributes:
word_counts
— named list mapping words to the number of times they appeared during fit. Only set after fit_text_tokenizer() is called on the tokenizer.

word_docs
— named list mapping words to the number of documents/texts they appeared in during fit. Only set after fit_text_tokenizer() is called on the tokenizer.

word_index
— named list mapping words to their rank/index (integer). Only set after fit_text_tokenizer() is called on the tokenizer.

document_count
— integer. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_text_tokenizer() is called on the tokenizer.
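Continuing the hypothetical corpus above, these attributes can be read directly off a fitted tokenizer:

tokenizer$document_count    # number of texts seen during fit
tokenizer$word_counts       # named list: word -> total number of occurrences
tokenizer$word_docs         # named list: word -> number of texts containing it
tokenizer$word_index        # named list: word -> integer rank/index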
Other text tokenization: fit_text_tokenizer(), save_text_tokenizer(), sequences_to_matrix(), texts_to_matrix(), texts_to_sequences_generator(), texts_to_sequences()