Hello Community,
I have been searching for an existing use case that uses the tesseractOCR_pdf package to extract data from a scanned PDF within Alteryx Designer, so that I can cleanse the data downstream with Designer's built-in tools. Has anyone had any luck using the R Tool and loading the packages to work with a scanned PDF?
Thanks,
Neil
I may have answered my own question. After installing the tesseract package, I placed the following code in the R Tool:
# read in the PDF file location which must
# be in a field called FullPath
File <- read.Alteryx("#1", mode="data.frame")
# Use tesseract::ocr() to run OCR on the file and return
# a character vector containing the recognized text
Data <- tesseract::ocr(file.path(File$FullPath))
# convert the character vector to a data frame
df_Data <- data.frame(Data)
# write the data frame to output 1
write.Alteryx(df_Data, 1)
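
For the downstream cleansing step, it may help to split the OCR text into one row per line before it leaves the R Tool. Below is a minimal sketch in base R only, assuming the OCR result is a character vector with one string per page and lines separated by newlines. The pages vector is made-up sample text standing in for the real output, and df_lines, Page and Text are names chosen here for illustration.

```r
# Made-up sample text standing in for the character vector returned by OCR
# (assumption: one string per page, lines separated by "\n").
pages <- c("Invoice 1001\nTotal: 42.00\n",
           "Invoice 1002\nTotal: 17.50\n")

# Split each page into its individual lines.
page_lines <- strsplit(pages, "\n", fixed = TRUE)

# Build one row per line, keeping the page number each line came from.
df_lines <- data.frame(
  Page = rep(seq_along(page_lines), lengths(page_lines)),
  Text = trimws(unlist(page_lines)),
  stringsAsFactors = FALSE
)

# Drop lines that are empty after trimming whitespace.
df_lines <- df_lines[nzchar(df_lines$Text), , drop = FALSE]

# Inside the R Tool this could then replace the last step:
# write.Alteryx(df_lines, 1)
```

With one row per line, the built-in parsing and cleansing tools have less work to do than with a single large text field per page.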