Alteryx Designer Desktop Discussions

briannet · ‎11-18-2021

Hello!

I have a use case where I want to extract information from PDF files. The PDFs all share the same format; however, depending on how much information is included, the information I need to extract could be in slightly different locations within the PDF. I have access to Intelligence Suite. Does anyone have suggestions for accommodating these differences?

JoeHerbert · ‎11-18-2021

Hi @briannet, could you upload a safe example so we can answer in more detail? Without seeing one, my initial thought is to use the Computer Vision toolset: select all the possible fields that could be populated, then filter out nulls and cleanse the data until you're happy with the result.

Here's a good guide for getting to grips with the Computer Vision tools: https://community.alteryx.com/t5/Data-Science/Unlocking-Insights-from-Images-using-Computer-Vision/b...7

Happy Solving,

Joe

briannet · ‎12-01-2021

Thank you for your reply! Unfortunately, I cannot upload a safe example. I will review the link you provided. Thank you again!

mceleavey · ‎12-01-2021

Hi @briannet ,

This could be done in different ways, depending on your use case.

For example, if you want to parse invoices (or purchase orders, forms, etc.) that you receive as PDFs in the same format, you can use the following method, which uses the Image Template tool from Intelligence Suite (IS):

Once you have the image, you can drag a box around the section you need and give it a name:

You can then use this to load in templates of this format.

Alternatively, you can load in the PDF, convert it to text, and then split the text using the Text to Columns tool on the text field, configured like this:

This splits the lines of text into separate rows in the data. That step is needed because converting to text puts all the text for each page of your PDF into a single cell.

This will give the following:
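As a rough equivalent outside Designer, the same split can be sketched in Python. This is only an illustration: `page_text` is made-up sample data standing in for the single cell of text produced per page, and splitting on newlines is an assumption about how the Text to Columns tool was configured.

```python
# Hypothetical sample: all the text of one PDF page held in a single string,
# as it would be after a PDF-to-text conversion.
page_text = "Patient report\nID 123456 7\nID 654321 3\n"


def split_page_into_rows(text: str) -> list[str]:
    """Split one page of text into trimmed, non-empty lines (one per row)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


rows = split_page_into_rows(page_text)
for row in rows:
    print(row)
```

Each element of `rows` then plays the role of one record in the workflow, ready for parsing.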

You will then probably need regex to parse out the sections you need. In my example, I need to pull the six-digit string from the text, which represents the ID, followed by the remaining digit, which represents the duration of a hospital stay in days:

Which gives me the following:

And so on. Regex is going to be useful to parse out the bits you need.
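For anyone who wants to prototype the pattern before building it into a RegEx tool, here is a minimal Python sketch. The exact expression used above is not shown, so the pattern below is an assumption: a six-digit ID, some whitespace, then one or more digits for the length of stay. The sample rows and the `parse_row` helper are hypothetical.

```python
import re

# Assumed layout: six-digit ID, whitespace, then the stay duration in days.
# \b keeps the pattern from matching inside a longer run of digits.
ID_AND_DAYS = re.compile(r"\b(\d{6})\s+(\d+)\b")


def parse_row(row: str):
    """Return (patient_id, days) if the row contains the pattern, else None."""
    match = ID_AND_DAYS.search(row)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Hypothetical rows, one per line of the converted PDF text.
sample_rows = ["Patient report", "ID 123456 7", "ID 654321 3"]
parsed = [result for result in map(parse_row, sample_rows) if result is not None]
print(parsed)
```

Rows that do not contain the pattern return `None` and are dropped, which mirrors filtering out unmatched records after the RegEx tool.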

I hope this helps,

M.

Anasalter · ‎10-11-2024

@mceleavey I have invoices from different hotels, so the format is not the same, but I want to pull the relevant information out of them, such as invoice ID, amount, invoice date, etc. How should I automate extracting this information from different PDFs whose structure is not the same? (For example, in some PDFs the invoice date is labelled "Date" and in others it is "Invoice Date".) So regex is not helping.

gawa · ‎10-13-2024

One way of parsing PDFs is to use spatial analysis. By converting the text boxes to spatial objects, you may be able to parse the data: for example, use Find Nearest to locate the text closest to the target labels ("Date", "Invoice Date").

It requires somewhat advanced skills, but it can be helpful.

For your reference, this is my blog post describing how to create spatial objects from a PDF (sorry, it's in Japanese, but you can use Google Translate):

https://community.alteryx.com/t5/%E3%83%96%E3%83%AD%E3%82%B0/Python%E3%83%84%E3%83%AB%E3%81%A8%E7%A9...
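To make the idea concrete, here is a small Python sketch of the nearest-neighbour lookup. It is not taken from the blog post: the `TextBox` class, the coordinates, and the alias list are all hypothetical, and it assumes you already have each text box's text and centre point from the PDF.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class TextBox:
    """A piece of text with the (x, y) centre of its bounding box (hypothetical)."""
    text: str
    x: float
    y: float


# Label spellings that should all be treated as the invoice date (assumed list).
DATE_ALIASES = {"date", "invoice date"}


def nearest_value(boxes, aliases):
    """Find the first box whose text is a known label, then return the text of
    the closest other box by straight-line distance. Returns None if no label."""
    labels = [b for b in boxes if b.text.lower().strip(" :") in aliases]
    if not labels:
        return None
    label = labels[0]
    candidates = [b for b in boxes if b is not label]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda b: hypot(b.x - label.x, b.y - label.y))
    return nearest.text


# Two made-up invoices that label the same field differently.
hotel_a = [TextBox("Invoice Date:", 10, 10), TextBox("2024-10-11", 60, 10), TextBox("Total", 10, 200)]
hotel_b = [TextBox("Total", 10, 300), TextBox("Date", 200, 40), TextBox("11/10/2024", 200, 55)]

print(nearest_value(hotel_a, DATE_ALIASES))
print(nearest_value(hotel_b, DATE_ALIASES))
```

Because the lookup depends on position rather than on a fixed text pattern, it tolerates both different label wording and different placement, although a real invoice would likely need extra rules (for example, only accepting boxes to the right of or below the label).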
