koRpus: readTagged-methods – R documentation

readTagged-methods

Import already tagged texts

Description

This method imports text material that has already been tagged, e.g. the results of TreeTagger[1]. The input can be a path to a text file, a connection, a matrix, or a data frame.

Usage

readTagged(file, ...)

## S4 method for signature 'matrix'
readTagged(
  file,
  lang = "kRp.env",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env",
  mtx_cols = c(token = "token", tag = "tag", lemma = "lemma")
)

## S4 method for signature 'data.frame'
readTagged(
  file,
  lang = "kRp.env",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env",
  mtx_cols = c(token = "token", tag = "tag", lemma = "lemma")
)

## S4 method for signature 'kRp.connection'
readTagged(
  file,
  lang = "kRp.env",
  encoding = "unknown",
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env"
)

## S4 method for signature 'character'
readTagged(
  file,
  lang = "kRp.env",
  encoding = getOption("encoding"),
  tagger = "TreeTagger",
  apply.sentc.end = TRUE,
  sentc.end = c(".", "!", "?", ";", ":"),
  stopwords = NULL,
  stemmer = NULL,
  rm.sgml = TRUE,
  doc_id = NA,
  add.desc = "kRp.env"
)

Arguments

`file`	Either a matrix, a data frame, a connection or a character vector. If the latter, it must be a valid path to a file containing the previously analyzed text. If it is a matrix or data frame, it must contain three columns named "token", "tag", and "lemma" (or the columns defined by `mtx_cols` if `tagger="manual"`); all other columns are ignored.
`...`	Additional options, currently unused.
`lang`	A character string naming the language of the analyzed corpus. See `kRp.POS.tags` for all supported languages. If set to `"kRp.env"` this is fetched from `get.kRp.env`.
`tagger`	The software which was used to tokenize and tag the text. Currently, "TreeTagger" and "manual" are the only supported values. If "manual", you must also adjust the values of `mtx_cols` to define the columns to be imported.
`apply.sentc.end`	Logical, whether the tokens defined in `sentc.end` should be searched for and set to a sentence ending tag. You could call this a compatibility mode to make sure you get the results you would get if you called `treetag` on the original file. If set to `FALSE`, the tags will be imported as they are.
`sentc.end`	A character vector with tokens indicating a sentence ending. This adds to the given results; it doesn't replace them.
`stopwords`	A character vector to be used for stopword detection. Comparison is done in lower case. You can also simply set `stopwords=tm::stopwords("en")` to use the English stopwords provided by the `tm` package.
`stemmer`	A function or method to perform stemming. For instance, you can set `stemmer=Snowball::SnowballStemmer` if you have the `Snowball` package installed (or `SnowballC::wordStem`). As of now, you cannot provide further arguments to this function.
`rm.sgml`	Logical, whether SGML tags should be ignored and removed from output.
`doc_id`	Character string, optional identifier of the particular document. Will be added to the `desc` slot.
`add.desc`	Logical. If `TRUE`, the tag description (column `"desc"` of the data.frame) will be added directly to the resulting object. If set to `"kRp.env"` this is fetched from `get.kRp.env`. Only needed if `tag=TRUE`.
`mtx_cols`	Character vector with exactly three elements named "token", "tag", and "lemma", the values of which must match the respective column names of the matrix provided via `file`. It is possible to set `lemma=NA` if the tagged results only provide token and tag. This argument is ignored unless `tagger="manual"` and data is provided as either a matrix or data frame.
`encoding`	A character string defining the character encoding of the input file, like `"Latin1"` or `"UTF-8"`.
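
The effect of `sentc.end` and `stopwords` can be illustrated with a few lines of base R. This is only a conceptual sketch on toy vectors, not the code koRpus runs internally. The tag name `"SENT"`, the example tokens and tags, and the lower-case stopword list are assumptions made for the illustration.

```r
# Toy data: tokens and tags as a tagger might have produced them.
# Using "SENT" as the sentence ending tag is an assumption for this sketch.
tokens <- c("Wait", ";", "The", "test", "works", ".")
tags   <- c("VB", ":", "DT", "NN", "VBZ", "SENT")

sentc.end <- c(".", "!", "?", ";", ":")

# Additive behaviour: tokens listed in sentc.end get the sentence ending tag,
# all other tags (including already present "SENT" tags) stay untouched.
tags_adjusted <- ifelse(tokens %in% sentc.end, "SENT", tags)

# Stopword lookup on lower-cased tokens, assuming a lower-case stopword list.
stopwords <- c("the", "a", "is")
is_stopword <- tolower(tokens) %in% stopwords

print(data.frame(token = tokens, tag = tags_adjusted, stop = is_stopword))
```

With `apply.sentc.end=FALSE`, the equivalent of `tags_adjusted` would simply be the unchanged `tags` vector.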

Details

Note that the value of lang must match a valid language supported by kRp.POS.tags. It will also get stored in the resulting object and might be used by other functions at a later point.

Value

An object of class kRp.text. If debug=TRUE, prints internal variable settings and attempts to return the original output of the TreeTagger system call in a matrix.

References

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 44–49.

[1] https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Examples

## Not run: 
  # call method on a connection
  text_con <- file("~/my.data/tagged_speech.txt", "r")
  tagged_results <- readTagged(text_con, lang="en")
  close(text_con)

  # call it on the file directly
  tagged_results <- readTagged("~/my.data/tagged_speech.txt", lang="en")
  
  # import the results of RDRPOSTagger, using the "manual" tagger feature
  sample_text <- c("Dies ist ein kurzes Beispiel. Es ergibt wenig Sinn.")
  tagger <- RDRPOSTagger::rdr_model(language="German", annotation="POS")
  tagged_rdr <- RDRPOSTagger::rdr_pos(tagger, x=sample_text)
  tagged_results <- readTagged(
    tagged_rdr,
    lang="de",
    tagger="manual",
    mtx_cols=c(token="token", tag="pos", lemma=NA)
  )

## End(Not run)
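
The "manual" tagger feature can also be tried without an external tagger, by building a small data frame by hand. The following is a minimal sketch under some assumptions: the column names `word` and `pos`, the variable names, and the sample tokens and tags are made up for illustration, and the import itself is only attempted if both `koRpus` and the `koRpus.lang.en` language package are installed.

```r
# Hypothetical tagged output of some external tool; the column names
# "word" and "pos" are made up for this illustration.
tagged_df <- data.frame(
  word = c("This", "is", "a", "short", "example", "."),
  pos  = c("DT", "VBZ", "DT", "JJ", "NN", "SENT"),
  stringsAsFactors = FALSE
)

# Map the expected names to the actual column names; there is no lemma column.
col_map <- c(token = "word", tag = "pos", lemma = NA)

# Sanity check: every mapped column must exist in the data frame.
mapped <- col_map[!is.na(col_map)]
stopifnot(all(mapped %in% colnames(tagged_df)))

# Only attempt the import if koRpus and English language support are installed.
if (requireNamespace("koRpus", quietly = TRUE) &&
    requireNamespace("koRpus.lang.en", quietly = TRUE)) {
  library(koRpus.lang.en)
  tagged_results <- koRpus::readTagged(
    tagged_df,
    lang = "en",
    tagger = "manual",
    mtx_cols = col_map
  )
}
```

Checking the column mapping before the call is a design choice of this sketch; it turns a misspelled column name into an early, readable error.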

koRpus

Text Analysis with Emphasis on POS Tagging, Readability, and Lexical Diversity

v0.13-6

GPL (>= 3)

Authors

Meik Michalke [aut, cre], Earl Brown [ctb], Alberto Mirisola [ctb], Alexandre Brulet [ctb], Laura Hauser [ctb]

Initial release

2021-05-08