Get types and tokens of a given text

These methods return character vectors containing all types or tokens of a given text, where the text can either be a character vector itself, a previously tokenized/tagged koRpus object, or an object of class kRp.TTR.
Usage

types(txt, ...)

tokens(txt, ...)

## S4 method for signature 'kRp.TTR'
types(txt, stats = FALSE)

## S4 method for signature 'kRp.TTR'
tokens(txt)

## S4 method for signature 'kRp.text'
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE
)

## S4 method for signature 'kRp.text'
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c()
)

## S4 method for signature 'character'
types(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  stats = FALSE,
  lang = NULL
)

## S4 method for signature 'character'
tokens(
  txt,
  case.sens = FALSE,
  lemmatize = FALSE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  lang = NULL
)
Arguments

txt: An object of class kRp.text or kRp.TTR, or a character vector.

...: Only used for the method generic.

stats: Logical, whether statistics on the length in characters and frequency of types in the text should also be returned.

case.sens: Logical, whether types should be counted case sensitively. This option is available for tagged text and character input only.

lemmatize: Logical, whether the analysis should be carried out on the lemmatized tokens rather than all running word forms. This option is available for tagged text and character input only.

corp.rm.class: A character vector with word classes which should be dropped. The default value "nonpunct" drops punctuation and sentence-ending classes. This option is available for tagged text and character input only.

corp.rm.tag: A character vector with POS tags which should be dropped. This option is available for tagged text and character input only.

lang: Sets the language of the text. This option is only relevant for character input, where no language information is available from the object itself.
Value

A character vector. For types with stats=TRUE, a data.frame containing all types, their length (in characters), and their frequency. The types result is always sorted by frequency, with more frequent types coming first.
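As a sketch of the two return forms, assuming koRpus and the English language package koRpus.lang.en are installed (the sample string is made up for illustration):

```r
library(koRpus)
library(koRpus.lang.en)  # assumed to be installed

# a short made-up sample text
txt <- "The cat sat on the mat. The cat purred."

# all running word forms (punctuation dropped by the default corp.rm.class)
tokens(txt, lang = "en")

# unique types only, sorted by frequency ("the" should come first)
types(txt, lang = "en")

# with stats=TRUE, a data.frame with each type's length and frequency
types(txt, lang = "en", stats = TRUE)
```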
Details

If the input is of class kRp.TTR, the result will only be useful if lex.div or the respective wrapper function was called with keep.tokens=TRUE. Similarly, lemmatize can only work properly if the input is a tagged text object with lemmata, or if you have properly set up the environment via set.kRp.env. Calling these methods on kRp.TTR objects simply returns the respective part of their tt slot.
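To illustrate the kRp.TTR case, a minimal sketch (assuming koRpus.lang.en is installed; the text and object names are invented for the example):

```r
library(koRpus)
library(koRpus.lang.en)  # assumed to be installed

# tokenize a short string directly (format="obj" treats txt as the text itself)
tokenized.obj <- tokenize("The cat sat on the mat.", format = "obj", lang = "en")

# keep.tokens=TRUE stores the types/tokens in the result's tt slot;
# char=c() skips the (slow) characteristics computation
ttr.obj <- lex.div(tokenized.obj, keep.tokens = TRUE, char = c())

# these calls now simply read from the tt slot of the kRp.TTR object
types(ttr.obj)
tokens(ttr.obj)
```

Without keep.tokens=TRUE, the tt slot stays empty and the calls above would return nothing useful.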
Examples

# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  types(tokenized.obj)
  tokens(tokenized.obj)
} else {}