Extract Information from PDF

briannet
6 - Meteoroid

Hello!

 

I have a use case where I want to extract information from PDF files. The PDFs are all in the same format; however, depending on how much information each one contains, the information I need to extract could be in slightly different locations within the PDF. I have access to Intelligence Suite. Does anyone have any suggestions on ways to accommodate these differences?

5 REPLIES
JoeHerbert
Alteryx Alumni (Retired)

Hi @briannet, could you upload a safe example so we can answer in more detail? Without seeing one, my initial thought is to use the Computer Vision toolset: select all the possible fields that could be filled in, then filter out nulls and cleanse the data until you're happy with the result.

Here's a good guide for getting to grips with the Computer Vision tools: https://community.alteryx.com/t5/Data-Science/Unlocking-Insights-from-Images-using-Computer-Vision/b...7

Happy Solving, 

 

Joe 

briannet
6 - Meteoroid

Thank you for your reply! Unfortunately, I cannot upload a safe example. I will review the link you provided. Thank you again!

mceleavey
17 - Castor

Hi @briannet ,

 

This could be done in different ways depending on your use case.

For example, if you want to parse invoices (or purchase orders, forms, etc.) that you receive as PDFs in the same format, you can use the following method, which uses the Image Template tool from Intelligence Suite:

 

mceleavey_0-1638385448147.png

Once you have the image, you can drag a box around the section you need and give it a name:

 

mceleavey_1-1638385513324.png

 

mceleavey_2-1638385563242.png

 

You can then use this template to process other documents in the same format.

 

Alternatively, you can load in the PDF and convert it to text, then split the text out using the Text to Columns tool on the text field, configured like this:

 

mceleavey_3-1638385873373.png

 

This splits each line of the text into its own row. It is needed because converting to text puts all the text for each page of your PDF into a single cell.

This will give the following:

mceleavey_4-1638385929766.png
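Outside Designer, the same split-to-rows step can be sketched in Python. This is only an illustration of the idea, not the Text to Columns configuration from the screenshot: the function name `split_pages_into_rows` and the sample page text are made up for the example, and it assumes the delimiter is a line break.

```python
def split_pages_into_rows(pages):
    """Turn a list of per-page text blobs into (page, line, text) rows."""
    rows = []
    for page_number, page_text in enumerate(pages, start=1):
        # splitlines() copes with \n, \r\n and \r line endings
        for line_number, line in enumerate(page_text.splitlines(), start=1):
            stripped = line.strip()
            if stripped:  # skip blank lines, but keep the original line numbering
                rows.append((page_number, line_number, stripped))
    return rows


# Hypothetical sample: one cell of text per page, as after the PDF-to-text step
sample_pages = [
    "Patient Report\nID 123456 4\n\nID 654321 2",
    "Summary\r\nTotal patients: 2",
]

for row in split_pages_into_rows(sample_pages):
    print(row)
```

Keeping the page and line numbers on each row makes it easier to trace a parsed value back to where it came from in the PDF.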

 

You will then probably need regex to parse out the sections you need. In my example I need to pull the six-digit string from the text, which represents the ID, followed by the remaining digit, which represents the duration of a hospital stay in days:

 

mceleavey_5-1638386016843.png

 

Which gives me the following:

mceleavey_6-1638386047527.png
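For anyone who wants to test a pattern like this outside Designer, here is a rough Python sketch. The actual expression is in the screenshot above, so the pattern below is an assumption: it supposes the six-digit ID and the duration are separated by whitespace. The names `STAY_PATTERN` and `parse_stay` are hypothetical.

```python
import re

# Assumed layout: a six-digit ID, whitespace, then the stay duration in days.
# \b stops the pattern from matching six digits inside a longer number.
STAY_PATTERN = re.compile(r"\b(\d{6})\s+(\d+)\b")


def parse_stay(line):
    """Return the ID and duration found in a line, or None if there is no match."""
    match = STAY_PATTERN.search(line)
    if match is None:
        return None
    return {"id": match.group(1), "days": int(match.group(2))}


# Hypothetical rows, one per line of text after the split step
for line in ["ID 123456 4", "Patient Report", "Ref 1234567 4"]:
    print(line, "->", parse_stay(line))
```

The ID is kept as a string so that any leading zeros survive, while the duration is converted to an integer.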

 

And so on. Regex is going to be useful to parse out the bits you need.

 

I hope this helps,

 

M.

 

 

 

 

 

 



Anasalter
7 - Meteor

@mceleavey I have invoices from different hotels, so the format is not the same, but I want to extract the relevant information from them, such as invoice ID, amount, invoice date, etc. How should I automate extracting this information from different PDFs whose structures are not the same? (For example, in some PDFs the invoice date is labelled "Date" and in others "Invoice Date", so regex is not helping.)

gawa
16 - Nebula

One way to parse PDFs is to use spatial analysis. By converting the text boxes to spatial objects, you may be able to parse the data: for example, by finding the object nearest to the target label text (Date, Invoice Date).

It requires somewhat advanced skills, but it can be helpful.

For your reference, this is my blog post describing how to create spatial objects from a PDF (sorry, it's in Japanese, but you can use Google Translate):

https://community.alteryx.com/t5/%E3%83%96%E3%83%AD%E3%82%B0/Python%E3%83%84%E3%83%AB%E3%81%A8%E7%A9...
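To make the nearest-object idea concrete, here is a minimal Python sketch. It is not the method from the blog post: it assumes the text boxes have already been extracted with centre coordinates, and the `TextBox` class, the label list and the sample coordinates are all hypothetical. In Designer the equivalent step would use the spatial tools instead.

```python
import math
from dataclasses import dataclass


@dataclass
class TextBox:
    text: str
    x: float  # centre of the box, horizontal position
    y: float  # centre of the box, vertical position


# Label spellings that should all be treated as the invoice date (assumed list)
LABEL_VARIANTS = {"date", "invoice date"}


def normalise(text):
    """Lower-case a label and drop surrounding spaces and a trailing colon."""
    return text.strip().rstrip(":").strip().lower()


def value_nearest_to_label(boxes, label_variants=LABEL_VARIANTS):
    """Return the text of the box closest to the first matching label box."""
    labels = [b for b in boxes if normalise(b.text) in label_variants]
    if not labels:
        return None
    label = labels[0]
    candidates = [b for b in boxes if b is not label]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda b: math.hypot(b.x - label.x, b.y - label.y))
    return nearest.text


# Two made-up layouts: the value sits to the right of the label in one, below it in the other
layout_a = [TextBox("Invoice Date:", 50, 100), TextBox("2021-12-01", 130, 100),
            TextBox("Amount", 50, 220), TextBox("120.00", 130, 220)]
layout_b = [TextBox("Hotel Example", 60, 40), TextBox("Date", 300, 40),
            TextBox("01/12/2021", 300, 60)]

print(value_nearest_to_label(layout_a))
print(value_nearest_to_label(layout_b))
```

Plain straight-line distance is the simplest choice; on crowded pages it may pick a neighbouring label instead of the value, so restricting candidates to boxes to the right of or below the label could be a sensible refinement.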

 
