Alteryx Designer Desktop Discussions

Using tesseract package in R with Alteryx Designer

NeilFisk
9 - Comet

Hello Community,

 

I have been searching for an existing use case for the tesseractOCR_pdf package to extract data from a scanned PDF within Alteryx Designer, so that I can cleanse the data downstream with Alteryx Designer's built-in tools.  Has anyone had any luck using the R Tool and loading the packages to work with a scanned PDF?

 

Thanks,
Neil

1 REPLY
NeilFisk
9 - Comet

I may have answered my own question.  After installing the tesseract package, I placed the following code in the R Tool:

 

# read in the PDF file location which must
# be in a field called FullPath
File <- read.Alteryx("#1", mode="data.frame")

# Use tesseract::ocr() to run OCR on the file and return
# a character vector containing the recognized text
Data <- tesseract::ocr(file.path(File$FullPath))

# convert the character vector to a data frame
df_Data <- data.frame(Data)

# output the data frame to stream 1
write.Alteryx(df_Data, 1)
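
A possible variation on the code above, offered as a sketch rather than a tested workflow: if ocr() does not accept a PDF directly in a given setup, each page can first be rendered to a PNG with pdftools::pdf_convert() and the images then passed to tesseract::ocr(). This assumes the pdftools package is installed alongside tesseract in the R library that Alteryx Designer uses. The helper names pages_to_df and ocr_pdf, the Page and Text output fields, and the 300 dpi setting are illustrative choices, not part of the original post.

```r
# Sketch: OCR a scanned PDF page by page and return one row per page.
# Assumes the 'pdftools' and 'tesseract' packages are installed.

# Hypothetical helper: turn a vector of per-page text into a data frame.
pages_to_df <- function(path, page_text) {
  data.frame(
    FullPath = rep(path, length(page_text)),
    Page     = seq_along(page_text),
    Text     = page_text,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: render each page to a PNG, then OCR the images.
ocr_pdf <- function(path, dpi = 300) {
  # Work in a temporary folder that is removed when the function exits
  out_dir <- tempfile("ocr_pages_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE))

  # One output file name per page of the PDF
  n_pages   <- pdftools::pdf_info(path)$pages
  png_files <- file.path(out_dir, sprintf("page_%03d.png", seq_len(n_pages)))
  pdftools::pdf_convert(path, format = "png", dpi = dpi, filenames = png_files)

  # ocr() returns one character string per image
  page_text <- tesseract::ocr(png_files, engine = tesseract::tesseract("eng"))
  pages_to_df(path, page_text)
}

# Only run the Alteryx part when the R Tool functions are available,
# so the helpers above can also be tried in a plain R session.
if (exists("read.Alteryx")) {
  # The PDF location must be in a field called FullPath
  files  <- read.Alteryx("#1", mode = "data.frame")
  result <- do.call(rbind, lapply(as.character(files$FullPath), ocr_pdf))
  write.Alteryx(result, 1)
}
```

With one row per page, the Page field can be used downstream to sort or filter the text before cleansing it with the regular Alteryx tools.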
