Get Lexicon Size.
Get the total number of unique tokens/ids of a positional attribute. Note
that token ids are zero-based: when iterating through tokens, start at 0;
the largest valid id is cl_lexicon_size() minus 1.
cl_lexicon_size(corpus, p_attribute, registry = Sys.getenv("CORPUS_REGISTRY"))
corpus: name of a CWB corpus (upper case)
p_attribute: name of positional attribute
registry: path to the registry directory, defaults to the value of the environment variable CORPUS_REGISTRY
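# Use the registry shipped with the package, or a temporary copy of it
registry <- if (!check_pkg_registry_files()) use_tmp_registry() else get_pkg_registry()
Sys.setenv(CORPUS_REGISTRY = registry)

# Number of unique tokens of the p-attribute "word"
lexicon_size <- cl_lexicon_size("REUTERS", p_attribute = "word")

# Token ids are zero-based, so they run from 0 to lexicon_size - 1
token_ids <- seq.int(from = 0, to = lexicon_size - 1)
cl_id2str("REUTERS", p_attribute = "word", id = token_ids)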
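The zero-based id range can also be built so that it stays correct for an empty lexicon. The following is a minimal sketch in base R; `lexicon_ids()` is a hypothetical helper introduced here for illustration and is not part of the package.

```r
# Hypothetical helper (not part of the package): all valid zero-based ids
# for a lexicon of size n. seq_len(0) is empty, so n = 0 yields integer(0),
# whereas seq.int(from = 0, to = n - 1) would count down from 0 to -1.
lexicon_ids <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1L, n >= 0)
  seq_len(n) - 1L
}

lexicon_ids(3)  # 0 1 2
lexicon_ids(0)  # integer(0)

# Possible use with the functions documented here (requires a CWB corpus):
# n <- cl_lexicon_size("REUTERS", p_attribute = "word")
# cl_id2str("REUTERS", p_attribute = "word", id = lexicon_ids(n))
```

For a non-empty lexicon this yields the same ids as seq.int(from = 0, to = n - 1); the two differ only when n is 0.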