stri_enc_detect2

[DEPRECATED] Detect Locale-Sensitive Character Encoding


Description

This function tries to detect the character encoding of a text when the language of that text is known.

Usage

stri_enc_detect2(str, locale = NULL)

Arguments

str

a character vector, a raw vector, or a list of raw vectors

locale

NULL or '' for the default locale, NA to check only the UTF-* family, or a single string with a locale identifier.
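A call with these arguments might look like the sketch below. Because the function is deprecated, it may be absent from the installed stringi release, so the call is guarded. The sample text, the locale identifier "pl_PL", and the iconv encoding name "CP1250" are assumptions chosen for illustration only.

```r
# Check that stringi is installed and still provides stri_enc_detect2();
# the function is deprecated and may be missing from newer releases.
have_fun <- requireNamespace("stringi", quietly = TRUE) &&
  exists("stri_enc_detect2", envir = asNamespace("stringi"), inherits = FALSE)

# A Polish sample sentence, converted to a raw vector of windows-1250 bytes.
# "CP1250" is the iconv name assumed here; availability is platform-dependent.
txt <- "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
bytes <- iconv(txt, from = "UTF-8", to = "CP1250", toRaw = TRUE)[[1]]

if (have_fun) {
  # str is a raw vector, locale is a single locale identifier
  res <- stringi::stri_enc_detect2(bytes, locale = "pl_PL")
  print(res[[1]])
} else {
  message("stri_enc_detect2() is not available in this stringi installation")
}
```

A sample this short is well below the few hundred bytes recommended in the Details section, so any guess obtained from it should be treated with caution.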

Details

Vectorized over str.

First, the text is checked for being valid UTF-32BE, UTF-32LE, UTF-16BE, UTF-16LE, UTF-8 (as in stri_enc_detect, this check is roughly inspired by ICU's i18n/csrucode.cpp), or ASCII.
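The idea of this first stage can be illustrated with a minimal base R sketch. It is not stringi's implementation: it covers only the ASCII and UTF-8 checks and leaves out the UTF-16 and UTF-32 variants. The helper name check_ascii_utf8 is hypothetical.

```r
# Hypothetical helper (not part of stringi): classify a raw vector as
# "ASCII" or "UTF-8", or return NA when neither check passes.
check_ascii_utf8 <- function(bytes) {
  stopifnot(is.raw(bytes))
  codes <- as.integer(bytes)

  # rawToChar() cannot handle embedded NUL bytes, so give up on them here
  if (any(codes == 0L)) return(NA_character_)

  # Pure ASCII: every byte is below 0x80
  if (all(codes < 128L)) return("ASCII")

  # validUTF8() (base R) tests whether the bytes form valid UTF-8
  if (validUTF8(rawToChar(bytes))) return("UTF-8")

  NA_character_
}

check_ascii_utf8(charToRaw("abc"))         # plain ASCII bytes
check_ascii_utf8(as.raw(c(0xc5, 0xbc)))    # a two-byte UTF-8 sequence
check_ascii_utf8(as.raw(c(0xbf, 0xf3)))    # not valid UTF-8
```

Only inputs that fail this kind of check would proceed to the locale-specific stage described next.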

If locale is not NA and the above fails, the text is checked for the number of occurrences of language-specific code points (data provided by the ICU library) converted to all possible 8-bit encodings that fully cover the indicated language. The encoding is selected based on the greatest number of total byte hits.
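The byte-hit idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the function's actual code: the real function takes its language-specific code points from ICU and considers every 8-bit encoding that covers the language, whereas the sketch uses a hand-picked set of Polish letters and only two candidate encodings. The helper count_hits and the iconv names "ISO-8859-2" and "CP1250" are assumptions.

```r
# Hypothetical helper: count how many bytes of `bytes` coincide with the
# bytes that the language-specific characters map to in `encoding`.
count_hits <- function(bytes, chars, encoding) {
  targets <- iconv(chars, from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
  sum(as.integer(bytes) %in% as.integer(targets))
}

# Hand-picked lowercase Polish letters with diacritics (an assumption;
# stringi obtains this kind of data from the ICU library)
pl_chars <- "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"

# Two candidate 8-bit encodings that cover Polish
candidates <- c("ISO-8859-2", "CP1250")

# Sample input: a Polish sentence stored as windows-1250 bytes
txt <- "Za\u017c\u00f3\u0142\u0107 g\u0119\u015bl\u0105 ja\u017a\u0144"
sample_bytes <- iconv(txt, from = "UTF-8", to = "CP1250", toRaw = TRUE)[[1]]

# Total byte hits per candidate; the candidate with the most hits wins
hits <- vapply(candidates,
               function(enc) count_hits(sample_bytes, pl_chars, enc),
               integer(1))
best <- names(hits)[which.max(hits)]
print(hits)
print(best)
```

The two candidates share most Polish letters and differ only in a few byte positions, which is why longer inputs give the heuristic more evidence to separate them.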

The guess is of course imprecise, as it is obtained using statistics and heuristics. Because of this, detection works best if you supply at least a few hundred bytes of character data that is in a single language.

If you have no initial guess about the language and encoding, try stri_enc_detect instead (it uses ICU facilities).

Value

Just like stri_enc_detect, this function returns a list of length equal to the length of str. Each list element is a data frame with the following three named components:

  • Encoding – string; guessed encodings; NA on failure (iff encodings is empty),

  • Language – always NA,

  • Confidence – numeric in [0,1]; the higher the value, the more confidence there is in the match; NA on failure.

The guesses are ordered by decreasing confidence.
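Since the guesses are sorted by decreasing confidence, the first row of each data frame holds the best candidate. The sketch below extracts it from a hand-built list that only mimics the structure described above; the encoding names and confidence values are invented placeholders, not output of the function.

```r
# A mock result with the documented shape: one data frame per input string.
# All values are invented for illustration.
res <- list(
  data.frame(Encoding = c("windows-1250", "ISO-8859-2"),
             Language = NA_character_,
             Confidence = c(0.9, 0.6),
             stringsAsFactors = FALSE),
  # A failed detection: a single row of NAs
  data.frame(Encoding = NA_character_,
             Language = NA_character_,
             Confidence = NA_real_,
             stringsAsFactors = FALSE)
)

# Take the top-ranked encoding for every input; NA marks a failure
top_guess <- vapply(res, function(d) d$Encoding[1], character(1))
print(top_guess)
```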


stringi

Character String Processing Facilities

v1.6.1
file LICENSE
Authors
Marek Gagolewski [aut, cre, cph] (<https://orcid.org/0000-0003-0637-6028>), Bartek Tartanus [ctb], and others (stringi source code); IBM, Unicode, Inc. and others (ICU4C source code, Unicode Character Database)
Initial release
2021-05-05
