Alteryx Designer Desktop Discussions

briannet · ‎11-18-2021

Hello!

I have a use case where I want to extract information from PDF files. The PDFs all share the same format, but depending on how much information each one contains, the fields I need to extract can sit in slightly different locations within the PDF. I have access to Intelligence Suite. Does anyone have suggestions for accommodating these differences?

JoeHerbert · ‎11-18-2021

Hi @briannet, could you upload a safe example so we can answer in more detail? Without seeing one, my initial thought is to use the Computer Vision toolset, select every possible field location that could be populated, and then filter out nulls and cleanse the data until you're happy with the result.

Here's a good guide for getting to grips with the Computer Vision tools: https://community.alteryx.com/t5/Data-Science/Unlocking-Insights-from-Images-using-Computer-Vision/b...7
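
To illustrate the "select every possible field, then drop the nulls" idea outside of Designer, here is a minimal Python sketch. The region names (`total_pos_a` and so on) and the sample values are hypothetical; it only assumes that each candidate location yields either text or nothing.

```python
from typing import Dict, List, Optional


def first_non_null(candidates: Dict[str, Optional[str]], names: List[str]) -> Optional[str]:
    """Return the first candidate region that actually contains text."""
    for name in names:
        value = candidates.get(name)
        # Treat missing, None, and whitespace-only values as "null".
        if value is not None and value.strip():
            return value.strip()
    return None


# Hypothetical output for one PDF: the same field captured at three possible positions.
extracted = {"total_pos_a": None, "total_pos_b": " 1,250.00 ", "total_pos_c": ""}
total = first_non_null(extracted, ["total_pos_a", "total_pos_b", "total_pos_c"])
```

Checking the candidates in a fixed order keeps the result predictable if more than one region happens to be populated.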

Happy Solving,

Joe

briannet · ‎12-01-2021

Thank you for your reply! Unfortunately, I cannot upload a safe example. I will review the link you provided. Thank you again!

mceleavey · ‎12-01-2021

Hi @briannet,

This could be done in different ways, depending on your use case.

For example, if you want to parse invoices (or purchase orders, forms, etc.) that arrive as PDFs in the same format, you can use the following method, which relies on the Image Template tool from Intelligence Suite (IS):

Once you have the image, you can drag a box around the section you need and give it a name:

You can then reuse this template to load in other documents of this format.

Alternatively, you can load in the PDF, convert it to text, and then split the text using the Text to Columns tool on the text field, configured like this:

This splits each line of the text into its own row, which is needed because converting to text puts all the text for each page of your PDF into a single cell.

This will give the following:
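
For anyone who wants to see the same split outside of Designer, here is a rough Python sketch. It assumes the page text is one string with lines separated by newline characters (the actual delimiter used in the tool configuration is not shown here), and the sample text is made up.

```python
from typing import List


def split_page_into_rows(page_text: str) -> List[str]:
    """Split one page's text (a single cell) into one row per non-empty line."""
    rows = []
    for line in page_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines left over from the PDF layout
            rows.append(line)
    return rows


# Made-up page text standing in for one converted PDF page.
page_text = "Patient report\n\nID 123456 Days 4\nID 654321 Days 12\n"
rows = split_page_into_rows(page_text)
```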

You will then probably need regex to parse out the sections you need. In my example, I need to pull the six-digit string from the text, which represents the ID, followed by the remaining digit, which represents the duration of a hospital stay in days:

Which gives me the following:

And so on. Regex is going to be useful to parse out the bits you need.
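
As a rough illustration of that kind of pattern, here is a small Python sketch. The row layout ("ID 123456 Days 4") is invented, and the pattern assumes the duration may run to more than one digit, so adjust it to your own text. The two capture groups would presumably correspond to the two output columns you would get when parsing with the RegEx tool.

```python
import re
from typing import Optional, Tuple

# Group 1: exactly six digits (the ID), not embedded in a longer number.
# Group 2: the next run of digits after some non-digit text (the stay in days).
ID_AND_DAYS = re.compile(r"(?<!\d)(\d{6})(?!\d)\D+(\d+)")


def parse_row(row: str) -> Optional[Tuple[str, int]]:
    """Return (id, days) for a row that matches, or None otherwise."""
    match = ID_AND_DAYS.search(row)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Invented rows standing in for the split PDF text.
rows = ["Patient report", "ID 123456 Days 4", "ID 654321 Days 12"]
parsed = [result for result in map(parse_row, rows) if result is not None]
```

Keeping the ID as a string preserves any leading zeros, while the duration is converted to an integer so it can be summed or filtered later.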

I hope this helps,

M.

Extract Information from PDF