koRpus: hyphen-methods – R documentation

hyphen-methods

Automatic hyphenation

Description

These methods implement word hyphenation, based on Liang's algorithm. For details, please refer to the documentation for the generic hyphen method in the sylly package.

Usage

## S4 method for signature 'kRp.text'
hyphen(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  corp.rm.class = "nonpunct",
  corp.rm.tag = c(),
  quiet = FALSE,
  cache = TRUE,
  as = "kRp.hyphen",
  as.feature = FALSE
)

## S4 method for signature 'kRp.text'
hyphen_df(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  quiet = FALSE,
  cache = TRUE
)

## S4 method for signature 'kRp.text'
hyphen_c(
  words,
  hyph.pattern = NULL,
  min.length = 4,
  rm.hyph = TRUE,
  quiet = FALSE,
  cache = TRUE
)

Arguments

`words`	Either an object of class `kRp.text`, or a character vector with words to be hyphenated.
`hyph.pattern`	Either an object of class `kRp.hyph.pat`, or a valid character string naming the language of the patterns to be used. See details.
`min.length`	Integer, the minimum number of letters a word must have to be considered for hyphenation. `hyphen` will not split words after the first or before the last letter, so values smaller than 4 are not useful.
`rm.hyph`	Logical, whether hyphens already present in words should be removed before pattern matching.
`corp.rm.class`	A character vector with word classes which should be ignored. The default value `"nonpunct"` has special meaning and will cause the result of `kRp.POS.tags(lang, tags=c("punct","sentc"), list.classes=TRUE)` to be used. Relevant only if `words` is a valid koRpus object.
`corp.rm.tag`	A character vector with POS tags which should be ignored. Relevant only if `words` is a valid koRpus object.
`quiet`	Logical. If `FALSE`, short status messages will be shown.
`cache`	Logical. `hyphen()` can cache results to speed up the process. If this option is set to `TRUE`, the current cache will be queried and new tokens will be added to it. Caches are language-specific and reside in an environment, i.e., they are discarded at the end of a session. If you want to save them for later use, see the option `hyph.cache.file` in `set.kRp.env`.
`as`	A character string defining the class of the object to be returned. Defaults to `"kRp.hyphen"`, but can also be set to `"data.frame"` or `"numeric"`, returning only the central `data.frame` or the numeric vector of counted syllables, respectively. For the latter two options, you can alternatively use the shortcut methods `hyphen_df` or `hyphen_c`. Ignored if `as.feature=TRUE`.
`as.feature`	Logical, whether the output should be just the analysis results or the input object with the results added as a feature. Use `corpusHyphen` to get the results from such an aggregated object. If set to `TRUE`, `as` is automatically set to `"kRp.hyphen"`, overriding any other setting of `as` with a warning.
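
The lower bound of 4 for `min.length` follows from the rule that no split may occur after the first or before the last letter. The following plain R sketch only illustrates that arithmetic; `candidate_breaks` is a hypothetical helper written for this illustration and is not part of koRpus or sylly, nor does it reflect how the pattern matching is actually implemented.

```r
# Hypothetical helper (not part of koRpus/sylly): returns the positions i
# after which a word could be split at all, given that splitting after the
# first letter (i = 1) or before the last letter (i = n - 1) is not allowed.
candidate_breaks <- function(word) {
  n <- nchar(word)
  if (n < 4) {
    # with fewer than four letters, every position is excluded
    return(integer(0))
  }
  seq.int(2L, n - 2L)
}

candidate_breaks("the")     # three letters: no position left
candidate_breaks("word")    # four letters: only the middle position
candidate_breaks("hyphen")  # six letters: positions 2, 3 and 4
```

Whether a word is actually split at any of these positions still depends on the hyphenation patterns for the given language.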

Value

An object of class `kRp.text`, `kRp.hyphen`, `data.frame`, or a numeric vector, depending on the values of the `as` and `as.feature` arguments.
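
As a minimal sketch of how the return type varies, the calls below hyphenate the same character vector three ways. It assumes that the koRpus.lang.en package is installed and that `"en"` is a valid `hyph.pattern` once it is loaded, as in the examples on this page; the variable names are illustrative only.

```r
# only run when the English language package can be loaded
if (require("koRpus.lang.en", quietly = TRUE)) {
  words <- c("interference", "hyphenation")

  # default: a full object of class kRp.hyphen
  res_obj <- hyphen(words, hyph.pattern = "en", quiet = TRUE)
  # shortcut for as = "data.frame": only the central data.frame
  res_df <- hyphen_df(words, hyph.pattern = "en", quiet = TRUE)
  # shortcut for as = "numeric": only the counted syllables
  res_num <- hyphen_c(words, hyph.pattern = "en", quiet = TRUE)

  # inspect what was returned in each case
  print(class(res_obj))
  print(is.data.frame(res_df))
  print(is.numeric(res_num))
}
```

With `as.feature=TRUE` and a `kRp.text` object as input, the tagged text object itself is returned instead, and the hyphenation results can be extracted with `corpusHyphen`.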

References

Liang, F.M. (1983). Word Hy-phen-a-tion by Com-put-er. Dissertation, Stanford University, Dept. of Computer Science.

[1] http://tug.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/

[2] http://www.ctan.org/tex-archive/macros/latex/base/lppl.txt

Examples

# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # call hyphen() on a single English word
  # "quiet=TRUE" suppresses the progress bar
  hyphen(
    "interference",
    hyph.pattern="en",
    quiet=TRUE
  )

  # call hyphen() on a tokenized text
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # the language is already defined in the object;
  # if you call hyphen() without further arguments,
  # you will get its results directly
  hyphen(tokenized.obj)

  # alternatively, you can also store those results as a
  # feature in the object itself
  tokenized.obj <- hyphen(
    tokenized.obj,
    as.feature=TRUE
  )
  # results are now part of the object
  hasFeature(tokenized.obj)
  corpusHyphen(tokenized.obj)
} else {}

koRpus

Text Analysis with Emphasis on POS Tagging, Readability, and Lexical Diversity

v0.13-6

GPL (>= 3)

Authors

Meik Michalke [aut, cre], Earl Brown [ctb], Alberto Mirisola [ctb], Alexandre Brulet [ctb], Laura Hauser [ctb]

Initial release

2021-05-08