Import already tagged texts
This method can be used on text files or matrices containing already tagged text material, e.g. the results of TreeTagger[1].
readTagged(file, ...)

## S4 method for signature 'matrix'
readTagged(
  file,
  lang = "kRp.env",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env",
  mtx_cols = c(token = "token", tag = "tag", lemma = "lemma")
)

## S4 method for signature 'data.frame'
readTagged(
  file,
  lang = "kRp.env",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env",
  mtx_cols = c(token = "token", tag = "tag", lemma = "lemma")
)

## S4 method for signature 'kRp.connection'
readTagged(
  file,
  lang = "kRp.env",
  encoding = "unknown",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env"
)

## S4 method for signature 'character'
readTagged(
  file,
  lang = "kRp.env",
  encoding = getOption("encoding"),
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env"
)
file |
Either a matrix, a data frame, a connection, or a character vector. If the latter, it must be a valid path to a file containing the previously analyzed text. If it is a matrix, it must contain three columns named "token", "tag", and "lemma"; all other columns are ignored. |
... |
Additional options, currently unused. |
lang |
A character string naming the language of the analyzed corpus. See |
tagger |
The software which was used to tokenize and tag the text. Currently, "TreeTagger" and "manual" are the only supported values. If "manual", you must also adjust the values of |
apply.sentc.end |
Logical, whether the tokens defined in |
sentc.end |
A character vector with tokens indicating a sentence ending. These tokens are added to the given results; they do not replace them. |
stopwords |
A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set |
stemmer |
A function or method to perform stemming. For instance, you can set |
rm.sgml |
Logical, whether SGML tags should be ignored and removed from output. |
doc_id |
Character string, optional identifier of the particular document. Will be added to the |
add.desc |
Logical. If |
mtx_cols |
Character vector with exactly three elements named "token", "tag", and "lemma", the values of which must match the respective column names of the matrix provided via |
encoding |
A character string defining the character encoding of the input file, like |
Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also get stored in the resulting object and might be used by other functions at a later point.
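The descriptions of sentc.end and stopwords above can be pictured with a few lines of base R. This is only a sketch of the documented behaviour, not the package's internal code: the token vector and the flags in tagger_sentc_end are hypothetical sample data, and lower-casing only the tokens (not the stopword list) is an assumption.

```r
# Illustration only: this is NOT how koRpus implements it internally.
tokens <- c("This", "is", "short", ";", "It", "ends", "here", ".")

# Hypothetical sentence-end flags, as a tagger might have reported them
tagger_sentc_end <- c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)

# Default value of the sentc.end argument
sentc.end <- c(".", "!", "?", ";", ":")

# The extra markers are added to the existing flags (logical OR),
# so flags already set by the tagger are never removed
combined_sentc_end <- tagger_sentc_end | tokens %in% sentc.end

# Stopword lookup after lower-casing the tokens (assumption: the
# stopword list itself is supplied in lower case)
stopwords <- c("this", "is", "it")
is_stopword <- tolower(tokens) %in% stopwords
```

Here the semicolon becomes an additional sentence ending while the flag on the final period is kept, and "This" and "It" count as stopwords despite their capitalization.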
An object of class kRp.text. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.
[1] Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44–49.
## Not run:
# call method on a connection
text_con <- file("~/my.data/tagged_speech.txt", "r")
tagged_results <- readTagged(text_con, lang="en")
close(text_con)

# call it on the file directly
tagged_results <- readTagged("~/my.data/tagged_speech.txt", lang="en")

# import the results of RDRPOSTagger, using the "manual" tagger feature
sample_text <- c("Dies ist ein kurzes Beispiel. Es ergibt wenig Sinn.")
tagger <- RDRPOSTagger::rdr_model(language="German", annotation="POS")
tagged_rdr <- RDRPOSTagger::rdr_pos(tagger, x=sample_text)
tagged_results <- readTagged(
  tagged_rdr,
  lang="de",
  tagger="manual",
  mtx_cols=c(token="token", tag="pos", lemma=NA)
)
## End(Not run)
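As a supplement to the examples above, the following sketch shows how a matrix with non-default column names might be prepared and mapped through mtx_cols. The matrix contents, the column names "word", "pos" and "base", and the wrapper function import_tagged are all hypothetical. The wrapper is only defined here, not called, because calling it assumes that koRpus and a matching language support package are installed and loaded, and that the tags are valid for the chosen language.

```r
# Hypothetical tagged data whose columns do not use the default names
tagged_mtx <- matrix(
  c(
    "This",  "DT",   "this",
    "is",    "VBZ",  "be",
    "short", "JJ",   "short",
    ".",     "SENT", "."
  ),
  ncol = 3,
  byrow = TRUE,
  dimnames = list(NULL, c("word", "pos", "base"))
)

# Map the expected names (token, tag, lemma) to the actual column names
mtx_cols <- c(token = "word", tag = "pos", lemma = "base")

# Sanity check before importing: every mapped column must exist
stopifnot(all(mtx_cols %in% colnames(tagged_mtx)))

# Hypothetical wrapper around the matrix method; koRpus is only
# needed once this function is actually called
import_tagged <- function(mtx) {
  koRpus::readTagged(
    mtx,
    lang = "en",
    tagger = "TreeTagger",
    mtx_cols = mtx_cols
  )
}
```

With the required packages available, import_tagged(tagged_mtx) would be expected to return an object of class kRp.text.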