text_tokenizer

Text tokenization utility


Description

Vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector where the coefficient for each token can be binary, based on word count, or based on tf-idf.

Usage

text_tokenizer(
  num_words = NULL,
  filters = "!\"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n",
  lower = TRUE,
  split = " ",
  char_level = FALSE,
  oov_token = NULL
)
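
For example, a typical workflow constructs a tokenizer and then fits it on a corpus. The sketch below is illustrative: it assumes the companion functions fit_text_tokenizer() and texts_to_sequences() from this package, and the corpus and argument values are made up.

library(keras)

# Keep the 10,000 most frequent words; map everything else to an explicit
# out-of-vocabulary token (illustrative settings).
tokenizer <- text_tokenizer(num_words = 10000, oov_token = "<unk>") %>%
  fit_text_tokenizer(c("The quick brown fox.", "The lazy dog!"))

# Each text becomes a sequence of integer word indices.
texts_to_sequences(tokenizer, "The quick dog")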

Arguments

num_words

the maximum number of words to keep, based on word frequency. Only the most common num_words words will be kept.

filters

a string containing every character that should be filtered out of the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.

lower

boolean. Whether to convert the texts to lowercase.

split

character or string to use for token splitting.

char_level

if TRUE, every character will be treated as a token.

oov_token

NULL or string. If given, it will be added to word_index and used to replace out-of-vocabulary words in texts_to_sequences() calls.
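
The filters, split and char_level arguments control how raw text is broken into tokens. A minimal sketch of the contrast between word-level and character-level tokenization, with an illustrative sentence (again assuming fit_text_tokenizer()):

library(keras)

txt <- "Hello, world"

# Word-level (default): characters in `filters` are stripped, the text is
# lowercased, and it is split on `split`, giving the tokens "hello", "world".
word_tok <- text_tokenizer() %>% fit_text_tokenizer(txt)
word_tok$word_index

# Character-level: every character is treated as a token instead.
char_tok <- text_tokenizer(char_level = TRUE) %>% fit_text_tokenizer(txt)
char_tok$word_index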

Details

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized. 0 is a reserved index that won't be assigned to any word.
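
To vectorize rather than index, the fitted tokenizer can be passed to texts_to_matrix(), whose mode argument corresponds to the binary / count / tf-idf options mentioned in the description. A brief sketch with an illustrative corpus:

library(keras)

texts <- c("the cat sat", "the dog sat on the mat")
tokenizer <- text_tokenizer() %>% fit_text_tokenizer(texts)

# Rows are texts; columns correspond to word indices (index 0 is reserved).
texts_to_matrix(tokenizer, texts, mode = "binary")
texts_to_matrix(tokenizer, texts, mode = "count")
texts_to_matrix(tokenizer, texts, mode = "tfidf")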

Attributes

The tokenizer object has the following attributes:

  • word_counts — named list mapping words to the number of times they appeared during fit. Only set after fit_text_tokenizer() is called on the tokenizer.

  • word_docs — named list mapping words to the number of documents/texts they appeared in during fit. Only set after fit_text_tokenizer() is called on the tokenizer.

  • word_index — named list mapping words to their rank/index (int). Only set after fit_text_tokenizer() is called on the tokenizer.

  • document_count — int. Number of documents (texts/sequences) the tokenizer was trained on. Only set after fit_text_tokenizer() is called on the tokenizer.
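
After fit_text_tokenizer() has been called, these attributes can be read directly from the tokenizer object. A minimal sketch; the corpus is illustrative and the exact index order depends on word frequency:

library(keras)

texts <- c("the cat sat", "the dog sat")
tokenizer <- text_tokenizer() %>% fit_text_tokenizer(texts)

tokenizer$word_index      # e.g. the = 1, sat = 2, cat = 3, dog = 4
tokenizer$word_counts     # how often each word appeared across all texts
tokenizer$word_docs       # how many texts each word appeared in
tokenizer$document_count  # 2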

See Also

fit_text_tokenizer(), which trains the tokenizer on a corpus of texts.