ocr

Tesseract OCR


Description

Extract text from an image. Requires training data for the language you are reading. Works best for images with high contrast, little noise, and horizontal text. See the tesseract wiki and the package vignette for image preprocessing tips.

Usage

ocr(image, engine = tesseract("eng"), HOCR = FALSE)

ocr_data(image, engine = tesseract("eng"))

Arguments

image

file path, url, or raw vector to image (png, tiff, jpeg, etc)

engine

a tesseract engine created with tesseract(). Alternatively a language string which will be passed to tesseract().

HOCR

if TRUE return results as HOCR xml instead of plain text
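
To illustrate what the HOCR output can be used for, here is a minimal sketch in base R that pulls words, confidences, and bounding boxes out of an hOCR string. The sample fragment is hand-written, not real output; it assumes words appear as ocrx_word spans with a title attribute carrying bbox and x_wconf, which may differ between Tesseract versions. A regular expression is used only to keep the sketch dependency-free; an XML parser such as the xml2 package would likely be more robust on real output.

```r
# Hypothetical hOCR fragment, shaped like what ocr(image, HOCR = TRUE) may
# return. The attribute layout (bbox, x_wconf) is an assumption.
hocr <- paste0(
  "<span class='ocrx_word' id='word_1_1' ",
  "title='bbox 36 92 96 116; x_wconf 96'>This</span> ",
  "<span class='ocrx_word' id='word_1_2' ",
  "title='bbox 109 92 129 116; x_wconf 95'>is</span>"
)

# One capture group each for the bbox, the confidence, and the word text
pattern <- paste0(
  "<span class='ocrx_word'[^>]*",
  "title='bbox ([0-9 ]+); x_wconf ([0-9.]+)'>([^<]*)</span>"
)

# Find every word span, then extract the capture groups from each match
matches <- regmatches(hocr, gregexpr(pattern, hocr))[[1]]
words <- sub(pattern, "\\3", matches)
conf <- as.numeric(sub(pattern, "\\2", matches))
bbox <- sub(pattern, "\\1", matches)

print(data.frame(word = words, confidence = conf, bbox = bbox))
```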

Details

The ocr() function returns plain text by default, or hOCR XML if HOCR is set to TRUE. The ocr_data() function returns a data frame with a confidence rate and bounding box for each word in the text.
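
As a sketch of how the per-word data frame might be used, the following base R snippet filters out low-confidence words and splits the bounding box into numeric columns. The data frame here is a hand-built stand-in, not real OCR output. The column names (word, confidence, bbox), the comma-separated "x1,y1,x2,y2" bbox format, and the threshold of 80 are all assumptions made for illustration; check them against the result of ocr_data() on your own image.

```r
# Hypothetical stand-in for the result of ocr_data(); column names and the
# "x1,y1,x2,y2" bbox format are assumptions, and the values are made up.
df <- data.frame(
  word = c("This", "is", "a", "l0t", "of", "text"),
  confidence = c(96.5, 95.8, 97.1, 41.2, 96.0, 94.3),
  bbox = c("36,92,96,116", "109,92,129,116", "141,98,156,116",
           "169,92,201,116", "212,92,236,116", "246,92,300,116"),
  stringsAsFactors = FALSE
)

# Split each bbox string into four numeric coordinates
coords <- do.call(rbind, lapply(strsplit(df$bbox, ","), as.numeric))
colnames(coords) <- c("x1", "y1", "x2", "y2")
df <- cbind(df[, c("word", "confidence")], coords)
df$width <- df$x2 - df$x1

# Separate reliable words from ones that probably need manual review
min_conf <- 80  # illustrative threshold, not a recommended value
reliable <- df[df$confidence >= min_conf, ]
low <- df$word[df$confidence < min_conf]

print(reliable)
cat("Needs review:", low, "\n")
```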

See Also

Other tesseract: tesseract_download(), tesseract()

Examples

# Simple example
text <- ocr("https://jeroen.github.io/images/testocr.png")
cat(text)

xml <- ocr("https://jeroen.github.io/images/testocr.png", HOCR = TRUE)
cat(xml)

df <- ocr_data("https://jeroen.github.io/images/testocr.png")
print(df)


# Full roundtrip test: render PDF to image and OCR it back to text
curl::curl_download("https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf", "R-intro.pdf")
orig <- pdftools::pdf_text("R-intro.pdf")[1]

# Render the first page of the PDF to a TIFF image
img_file <- pdftools::pdf_convert("R-intro.pdf", format = 'tiff', pages = 1, dpi = 400)
unlink("R-intro.pdf")

# Extract text from the rendered image
text <- ocr(img_file)
unlink(img_file)
cat(text)


# Create an engine that only recognizes digits, using a character whitelist
engine <- tesseract(options = list(tessedit_char_whitelist = "0123456789"))

tesseract

Open Source OCR Engine

v4.1.1
Apache License 2.0
Authors
Jeroen Ooms [aut, cre] (<https://orcid.org/0000-0002-4035-0289>)
Initial release
