Guess language a text is written in
This function tries to guess the language a text is written in.
guess.lang( txt.file, udhr.path, comp.length = 300, keep.udhr = FALSE, quiet = TRUE, in.mem = TRUE, format = "file" )
txt.file |
A character vector pointing to the file with the text to be analyzed. |
udhr.path |
A character string, either pointing to the directory where you unzipped the translations of the Universal Declaration of Human Rights, or to the ZIP file containing them. |
comp.length |
Numeric value, giving the number of characters to be used of |
keep.udhr |
Logical, whether all the UDHR translations should be kept in the resulting object. |
quiet |
Logical. If |
in.mem |
Logical. If |
format |
Either "file" or "obj". If the latter,
|
To accomplish the task, the method described by Benedetto, Caglioti & Loreto (2002) is used, utilizing both gzip compression and translations of the Universal Declaration of Human Rights[1]. The latter holds the world record for being translated into the most languages and is publicly available.
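The idea behind the compression-based approach can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the code guess.lang actually runs: the helper names gz.size and gz.diff are hypothetical, and the two short reference strings stand in for the full UDHR translations. A sample is appended to each reference text, and the language whose reference needs the fewest additional compressed bytes is taken as the best guess.

```r
# Compressed size (in bytes) of a single character string, using in-memory gzip.
# gz.size is a hypothetical helper, not part of the package.
gz.size <- function(txt) {
  length(memCompress(charToRaw(txt), type = "gzip"))
}

# Additional bytes needed to compress `sample` when it is appended to `reference`.
# gz.diff is a hypothetical helper as well.
gz.diff <- function(reference, sample) {
  gz.size(paste0(reference, sample)) - gz.size(reference)
}

# Toy reference texts (assumption: stand-ins for the full UDHR translations).
references <- c(
  english = paste(
    "All human beings are born free and equal in dignity and rights.",
    "They are endowed with reason and conscience and should act towards",
    "one another in a spirit of brotherhood."
  ),
  dutch = paste(
    "Alle mensen worden vrij en gelijk in waardigheid en rechten geboren.",
    "Zij zijn begiftigd met verstand en geweten, en behoren zich jegens",
    "elkander in een geest van broederschap te gedragen."
  )
)

# The text whose language should be guessed.
sample.txt <- "Everyone has the right to life, liberty and security of person."

# One difference per reference language; smaller means "compresses better together".
diffs <- sapply(references, gz.diff, sample = sample.txt)

# The language with the smallest difference is the guess.
best <- names(which.min(diffs))
print(diffs)
print(best)
```

With texts this short the result is not reliable; the comparison becomes meaningful only with longer reference texts such as the full translations.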
An object of class kRp.lang.
For this implementation the documents provided by the "UDHR in Unicode" project[2] have been used.
Their translations are not part of this package and must be downloaded separately to use guess.lang!
You need the ZIP archive containing all the plain text files from https://unicode.org/udhr/downloads.html.
Benedetto, D., Caglioti, E. & Loreto, V. (2002). Language trees and zipping. Physical Review Letters, 88(4), 048702.
## Not run:
# using the still zipped bulk file
guess.lang(
  file.path("~","data","some.txt"),
  udhr.path=file.path("~","data","udhr_txt.zip")
)

# using the unzipped UDHR archive
guess.lang(
  file.path("~","data","some.txt"),
  udhr.path=file.path("~","data","udhr_txt")
)
## End(Not run)