
Alteryx Designer Desktop Discussions


How to read Cyrillic, Chinese, Japanese, Turkish symbols

Marcel_Gavrila

Hello,

I am new to R. I have an OCR batch macro that uses R to read PDFs and convert them to tabular format. My problem is that it does not read Cyrillic, Chinese, Japanese, or Turkish characters correctly.

Could someone help me amend the code so that it reads all of these scripts correctly?

Would reading the text as Unicode be a solution?

# Install a package if it is missing, then load it either way.
cond.install <- function(package.name) {
  options(repos = "https://cran.rstudio.com")  # set the CRAN mirror
  if (!package.name %in% rownames(installed.packages())) {
    install.packages(package.name)
  }
  library(package.name, character.only = TRUE)
}

cond.install("pdftools")
cond.install("tesseract")

file <- "C:\\Users\\PDF\\First file.pdf"
# Render each PDF page to a PNG image at 200 dpi, then OCR the images.
pngfile <- pdftools::pdf_convert(file, dpi = 200)
# With no engine argument, ocr() uses the English ("eng") model.
text <- tesseract::ocr(pngfile)
write.Alteryx(text, 1)
write.Alteryx(file, 2)
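
One direction I am considering, as a minimal sketch and not a confirmed fix: since ocr() defaults to the English model, the engine could be built with the Tesseract language data for the scripts in my PDFs. The helper names lang_string and ocr_multilang, the default language list, and the 300 dpi setting are my own assumptions. The sketch also assumes that tesseract_download() can fetch the language files on this machine (on Linux the language data usually comes from system packages instead).

```r
# Hypothetical helper: join Tesseract language codes into one engine string,
# e.g. c("rus", "tur") becomes "rus+tur".
lang_string <- function(langs) {
  paste(unique(langs), collapse = "+")
}

# Hypothetical helper: OCR a PDF using the given Tesseract language models.
# The default language list is an assumption; trim it to the scripts you need.
ocr_multilang <- function(pdf_path,
                          langs = c("eng", "rus", "chi_sim", "jpn", "tur"),
                          dpi = 300) {
  # Download any requested language data that is not installed yet.
  installed <- tesseract::tesseract_info()$available
  for (lang in setdiff(langs, installed)) {
    tesseract::tesseract_download(lang)
  }
  # Build one engine that recognises all requested languages.
  engine <- tesseract::tesseract(language = lang_string(langs))
  # Render the PDF pages to PNG files, then OCR them with that engine.
  png_files <- pdftools::pdf_convert(pdf_path, dpi = dpi)
  tesseract::ocr(png_files, engine = engine)
}

# Usage inside the macro (not run here):
# text <- ocr_multilang("C:\\Users\\PDF\\First file.pdf")
# write.Alteryx(text, 1)
```

Loading many languages at once may slow recognition and could reduce accuracy, so passing only the languages a given file actually contains is probably safer. If the OCR result looks right in R but wrong in the Alteryx output, checking Encoding(text) might show whether the problem lies in the string encoding and not in the OCR step.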

Thank you in advance,
