koRpus: guess.lang – R documentation

koRpus

guess.lang

Guess language a text is written in

Description

This function tries to guess the language a text is written in.

Usage

guess.lang(
  txt.file,
  udhr.path,
  comp.length = 300,
  keep.udhr = FALSE,
  quiet = TRUE,
  in.mem = TRUE,
  format = "file"
)

Arguments

`txt.file`	A character vector pointing to the file with the text to be analyzed.
`udhr.path`	A character string, either pointing to the directory where you unzipped the translations of the Universal Declaration of Human Rights, or to the ZIP file containing them.
`comp.length`	Numeric value, giving the number of characters of the text that are used to estimate the language.
`keep.udhr`	Logical, whether all the UDHR translations should be kept in the resulting object.
`quiet`	Logical. If `FALSE`, short status messages will be shown.
`in.mem`	Logical. If `TRUE`, the gzip compression will remain in memory (using `memCompress`), which is probably the faster method. Otherwise temporary files are created and automatically removed on exit.
`format`	Either "file" or "obj". If the latter, `txt.file` is not interpreted as a file path but the text to analyze itself.
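
As a brief illustration of the `format` argument, the following sketch passes the text itself rather than a file path. The archive location `udhr_zip` and the sample sentence are assumptions made for this example only; the call is skipped unless koRpus is installed and the archive exists at that path.

```r
# Hypothetical location of the UDHR archive; adjust to your own setup.
udhr_zip <- file.path("~", "data", "udhr_txt.zip")

# Illustrative sample text, passed directly instead of a file path.
sample_text <- "All human beings are born free and equal in dignity and rights."

# Only run when the package and the archive are actually available.
if (requireNamespace("koRpus", quietly = TRUE) && file.exists(udhr_zip)) {
  guessed <- koRpus::guess.lang(
    sample_text,
    udhr.path = udhr_zip,
    comp.length = 300,  # documented default
    quiet = FALSE,      # show short status messages
    format = "obj"      # treat the first argument as text, not as a path
  )
  print(guessed)
}
```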

Details

To accomplish this, the function uses the method described by Benedetto, Caglioti & Loreto (2002), which combines gzip compression with translations of the Universal Declaration of Human Rights[1]. The latter holds the world record as the most translated document and is publicly available.
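
The general idea behind the compression approach can be sketched in a few lines of base R. This is a simplified illustration, not the code used by koRpus: the helper names `gz_size` and `gz_delta`, the two-language reference set, and the sample snippet are all assumptions made for this example. For each reference text, the sketch measures how many additional bytes gzip needs once the unknown snippet is appended; the language whose reference text yields the smallest increase is taken as the guess.

```r
# Compressed size (in bytes) of a string, using in-memory gzip compression.
gz_size <- function(txt) {
  length(memCompress(charToRaw(txt), type = "gzip"))
}

# Extra bytes needed to compress `snippet` when it is appended to `reference`.
# Only the first `comp_length` characters of the snippet are used.
gz_delta <- function(reference, snippet, comp_length = 300) {
  snippet <- substr(snippet, 1, comp_length)
  gz_size(paste0(reference, snippet)) - gz_size(reference)
}

# Hypothetical miniature reference set standing in for the UDHR translations.
# Umlauts are written as Unicode escapes to keep the source ASCII-only.
refs <- c(
  english = paste(
    "All human beings are born free and equal in dignity and rights.",
    "They are endowed with reason and conscience and should act towards",
    "one another in a spirit of brotherhood."
  ),
  german = paste(
    "Alle Menschen sind frei und gleich an W\u00fcrde und Rechten geboren.",
    "Sie sind mit Vernunft und Gewissen begabt und sollen einander",
    "im Geiste der Br\u00fcderlichkeit begegnen."
  )
)

# Illustrative text whose language is to be guessed.
snippet <- "Everyone has the right to life, liberty and security of person."

# One size increase per reference language; the smallest one is the guess.
deltas <- vapply(refs, gz_delta, numeric(1), snippet = snippet)
guess <- names(which.min(deltas))
print(deltas)
print(guess)
```

With reference texts this short the result is not reliable; longer references, such as the full UDHR translations, give the compressor far more material to match against.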

Value

An object of class kRp.lang.

Note

This implementation relies on the documents provided by the "UDHR in Unicode" project[2]. These translations are not part of this package and must be downloaded separately to use guess.lang! You need the ZIP archive containing all the plain text files from https://unicode.org/udhr/downloads.html.

References

Benedetto, D., Caglioti, E. & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.

[1] https://www.ohchr.org/EN/UDHR/Pages/UDHRIndex.aspx

[2] https://unicode.org/udhr/

Examples

## Not run: 
  # using the still zipped bulk file
  guess.lang(
    file.path("~","data","some.txt"),
    udhr.path=file.path("~","data","udhr_txt.zip")
  )
  # using the unzipped UDHR archive
  guess.lang(
    file.path("~","data","some.txt"),
    udhr.path=file.path("~","data","udhr_txt")
  )

## End(Not run)

koRpus

Text Analysis with Emphasis on POS Tagging, Readability, and Lexical Diversity

v0.13-6

GPL (>= 3)

Authors

Meik Michalke [aut, cre], Earl Brown [ctb], Alberto Mirisola [ctb], Alexandre Brulet [ctb], Laura Hauser [ctb]

Initial release

2021-05-08
