Hi,
I use the PDF Parser at http://silvercoders.com/en/products/doctotext/ to convert PDF files to text files. Most of the time it works really well.
However, it doesn't do such a good job on one particular file that I receive every month. The file has 4 pages with 6 tables on each page, and every table uses the same fields for its row labels, column labels & amounts.
I think (not 100% sure) the file is generated by Cognos/TM1. The doctotext converter runs as normal, but it is impossible to use the usual Alteryx tools (I mostly use RegEx / Multi-Row Formula / Filter) to extract the data from its output. The row/column labels & amounts are spread all over the place and there are no repeated patterns to work with.
I can export the PDF to XLSX using the converter within Adobe Reader (I have paid for Adobe Export PDF), but I don't know how to make that happen within Alteryx, and I am trying to avoid a manual step outside Alteryx.
I have asked many times for the file to be sent as XLSX or CSV and eventually gave up.
Do you have any ideas?
Hey @mb1824
I used this great guide the other day: https://oliverpower.wordpress.com/2018/02/08/parsing-pdfs-using-alteryx-and-a-little-r/
It worked perfectly.
Neil
Thanks, I will try that out
Hi,
Did you find a solution for this? I am new to Alteryx and am having the same difficulty.
I haven't got back to this to try the suggestion from @LordNeilLord.
@mb1824 There are some tools on the Gallery that you can use to parse PDFs.
https://gallery.alteryx.com/#!app/PDF-Input--Text-and-Image-/5be5ec8d0462d71ffce6deaa
https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b
Give them a try 🙂
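If whichever tool you end up with returns each word together with its position on the page, one possible way to deal with labels and amounts being "spread all over the place" is to regroup the words into rows by their vertical position. The Python below is only a rough sketch of that idea, for example for a Python tool inside the workflow. The function name, the (x, y, text) input format, the tolerance value and the sample values are all made up for illustration; they are not taken from doctotext, the guide or the Gallery tools mentioned above.

```python
def group_words_into_rows(words, y_tolerance=2.0):
    """Group (x, y, text) tuples into table rows.

    Words whose y value is within y_tolerance of the first word in a row
    are treated as the same row; each row is returned ordered left to right.
    The input format is an assumption, not the output of any specific tool.
    """
    rows = []  # each entry: [y of the row's first word, [(x, text), ...]]
    # Walk the words top to bottom so rows are built in page order
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Inside each row, order the cells by their x position
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical sample: two table rows whose words arrive out of order
sample_words = [
    (300.0, 100.4, "1,250.00"),
    (50.0, 100.0, "Salaries"),
    (50.0, 115.2, "Travel"),
    (300.0, 115.0, "310.50"),
]
rows = group_words_into_rows(sample_words)
# rows == [["Salaries", "1,250.00"], ["Travel", "310.50"]]
```

Once the rows are rebuilt like this, the usual RegEx / Multi-Row Formula / Filter steps have a repeating pattern to work with again. The tolerance would need tuning for the actual file, and this does nothing for tables that sit side by side on the page; those would also need splitting by x position.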