Hello!
I have a use case where I want to extract information from PDF files. The PDFs are all in the same format; however, depending on how much information is included, the information I need to extract could be in slightly different locations in each PDF. I have access to Intelligence Suite. Does anyone have any suggestions on ways to accommodate these differences?
Hi @briannet, could you upload a safe example so we can answer in more detail? Without seeing one, my initial thought is to use the Computer Vision toolset, select all possible fields that could be entered, and then filter out nulls and cleanse the data until you're happy with the result.
Here's a good guide for getting to grips with the Computer Vision tools: https://community.alteryx.com/t5/Data-Science/Unlocking-Insights-from-Images-using-Computer-Vision/b...7
Happy Solving,
Joe
Thank you for your reply! Unfortunately, I cannot upload a safe example. I will review the link you provided. Thank you again!
Hi @briannet ,
This could be done in different ways depending on your use case.
For example, if you want to parse invoices (or purchase orders, forms, etc.) that you receive as PDFs in the same format, you can use the following method, which uses the Image Template tool from Intelligence Suite:
Once you have the image, you can drag a box around the section you need and give it a name:
You can then use this to load in templates of this format.
Alternatively, you can load in the PDF and convert it to text, then split the text out using the Text to Columns tool on the text field, configured like this:
This splits each line of text into its own row, which is needed because converting to text puts the text for each page of your PDF into a single cell.
This will give the following:
You will then probably need regex to parse out the sections you need. In my example I need to pull the six-digit string from the text, which represents the ID, followed by the remaining digit, which represents the duration of a hospital stay in days:
Which gives me the following:
And so on. Regex is going to be useful to parse out the bits you need.
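For anyone who wants to prototype the same split-then-regex idea outside the workflow (for example in a Python tool), here is a minimal sketch. The sample text, the whitespace between ID and duration, and the names `page_text`, `PATTERN` and `parse_page` are all assumptions for illustration; the real page layout is not shown in this thread, so the pattern will need adjusting.

```python
import re

# Hypothetical page text: the actual layout is not shown in the thread.
page_text = "Ward A\n123456 3\nWard B\n654321 7\nno match here"

# Assumed layout: a six-digit ID, whitespace, then the stay length in days.
PATTERN = re.compile(r"\b(\d{6})\s+(\d+)\b")

def parse_page(text):
    """Split one page of text into lines and extract (id, days) records."""
    records = []
    for line in text.splitlines():
        match = PATTERN.search(line)
        if match:
            records.append({"id": match.group(1), "days": int(match.group(2))})
    return records

records = parse_page(page_text)
print(records)
```

Lines that do not match the pattern are simply skipped, which plays the same role as filtering out the rows you do not need after the split.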
I hope this helps,
M.
@mceleavey I have invoices from different hotels, so the format is not the same, but I want to pull the relevant information out of them, such as invoice ID, amount, invoice date, etc. How should I automate extracting this information from different PDFs whose structure is not the same? For example, in some PDFs the invoice date is labelled "Date" and in others "Invoice Date", so regex is not helping.
One way to parse PDFs is to use spatial analysis. By converting the text boxes to spatial objects, you may be able to parse the data: for example, use Find Nearest to locate the text closest to the target labels ("Date", "Invoice Date").
It requires somewhat advanced skills but can be helpful.
For your reference, this is my blog post describing how to create spatial objects from a PDF (sorry, it's in Japanese, but you can use Google Translate).
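To illustrate the nearest-text idea in plain Python (this is not the Find Nearest tool itself, just a rough analogue), here is a small sketch. It assumes you already have each text box as `(text, x, y)` with the coordinates of its centre; the sample boxes, the `LABELS` set and the `nearest_value` helper are hypothetical.

```python
import math

# Hypothetical text boxes: (text, x, y) for the centre of each box.
boxes = [
    ("Invoice Date", 100.0, 50.0),
    ("2023-04-01", 190.0, 50.0),
    ("Amount", 100.0, 200.0),
    ("150.00", 190.0, 200.0),
]

# Label variants that different hotels might use for the same field.
LABELS = {"date", "invoice date"}

def nearest_value(boxes, labels):
    """Return the text of the box closest to the first box whose text is a known label."""
    for text, x, y in boxes:
        if text.strip().lower() in labels:
            candidates = [b for b in boxes if b[0] != text]
            if not candidates:
                return None
            nearest = min(candidates, key=lambda b: math.hypot(b[1] - x, b[2] - y))
            return nearest[0]
    return None

invoice_date = nearest_value(boxes, LABELS)
print(invoice_date)
```

Plain Euclidean distance is the simplest choice here. On real invoices the closest box may be another label rather than the value, so you may want to restrict candidates to boxes to the right of or below the label.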