

Extracting, Cleansing, Normalizing, Parsing Unstructured PDF Data

Atom

Hi, I'm an Alteryx beginner and need some serious help with a specific, very large dataset. This project is the primary reason for the license purchase. Can someone help build a workflow that converts the attached job costing data from PDF to columnar (or is it tabular?) format for further analysis? The data is from the COINS ERP system. Attached is a sample of the data for ONE project in ONE year for ONE company, along with the output format I would like. I need to build a template workflow that will let me convert this same type of data for thousands of projects spanning seven years and 40+ entities. The PDFs are currently separated by year and by entity (so roughly 250-300 separate, large PDF files). Once the data is properly converted, I will need to apply various lookups and blend it with 2-3 other datasets for various financial/computational analyses and reporting. I'm much more comfortable with those tasks; I just need this core data in a workable format.

 

From my research, it looks like I'll need to use another tool such as DocToText, R code, etc., none of which I have any experience with. I will be spinning my wheels for days. Please help.

 

Thanks in advance to the brave soul who takes this one. I'm at your disposal to get this solved!!!

 

Thanks,

Gisele

 

 

Alteryx Certified Partner

A colleague of mine has recently published a 'PDF Input' connector which, as you stated, makes use of the R tool.

 

You will then have to parse the extracted text yourself (take a look at the RegEx and Text To Columns tools for this). My colleague also included a sample workflow in the documentation, so it's worth looking at how he converted the PDF into a structured table.
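To make the parsing step more concrete, here is a minimal sketch of what regex-based parsing of extracted PDF text lines can look like. It is written in Python purely for illustration, not as the connector's actual method; in Alteryx the same idea would go into the RegEx tool. The line layout (cost code, description, amount) and the sample lines are assumptions, since the real COINS report layout isn't shown in this thread.

```python
import re

# Hypothetical layout: cost code, free-text description, amount.
# The real COINS report columns are not shown in this thread, so adjust the pattern.
LINE_PATTERN = re.compile(
    r"^\s*(?P<code>\d{2}-\d{3})\s+(?P<description>.+?)\s{2,}(?P<amount>-?[\d,]+\.\d{2})\s*$"
)

def parse_line(line):
    """Return a dict of fields for a detail line, or None for headers and noise."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    row = match.groupdict()
    # Strip thousands separators so the amount can be used numerically.
    row["amount"] = float(row["amount"].replace(",", ""))
    return row

# Made-up sample lines standing in for text extracted from one PDF page.
sample_lines = [
    "Job Cost Report            Page 1",
    "01-100  Site Preparation        12,450.00",
    "01-200  Concrete Footings       -1,200.50",
]

# Keep only the lines that match the detail-line pattern.
rows = [r for r in (parse_line(l) for l in sample_lines) if r is not None]
```

Lines that don't match the pattern (page headers, blank lines, totals) are simply dropped, which is usually where most of the cleanup effort goes with report-style PDFs.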

 

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

 

Ben

Is there any way to attach the file to this thread? My work computer says I can't download it from an 'unsecured site', but this appears to be exactly what I need.

Atom

Thank you @BenMoss. This is very helpful. Any idea where (or if) I can get one-on-one assistance from an Alteryx representative to help with my particular dataset? I've hit a few roadblocks.

 

Thanks,

Gisele

Atom

@Christine1, see attached.

Atom

I am unable to download the tool. Can someone help, please? Thanks.

Please find the solution attached. It uses two different R packages, Tabulizer (tabulizer) and PDF Tools (pdftools).

With Tabulizer, I have extracted only the data that is present in tabular format, and only from one page. You can add a loop to read all the pages using the same logic.

 

With PDF Tools, I have filtered the data to read only the first page. You can later add logic to pick out the data you want.
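As a rough illustration of extending the one-page logic to every page, the sketch below loops over a list of page texts and applies the same per-page filter to each. It is plain Python rather than the R code in the attached workflow, and it assumes the text has already been extracted into one string per page (the shape that pdftools' pdf_text returns in R). The "Job Cost Report" header marker and the sample pages are made up for the example.

```python
def extract_page_rows(page_text, page_number):
    """Per-page logic: keep non-blank lines that are not the repeated page header."""
    rows = []
    for line in page_text.splitlines():
        stripped = line.strip()
        # "Job Cost Report" is a hypothetical header marker; use your report's own.
        if not stripped or stripped.startswith("Job Cost Report"):
            continue
        rows.append({"page": page_number, "text": stripped})
    return rows

def extract_all_pages(pages):
    """Apply the same per-page logic to every page and stack the results."""
    all_rows = []
    for page_number, page_text in enumerate(pages, start=1):
        all_rows.extend(extract_page_rows(page_text, page_number))
    return all_rows

# Made-up page texts, one string per page.
pages = [
    "Job Cost Report  Page 1\n01-100  Site Preparation   12,450.00\n",
    "Job Cost Report  Page 2\n\n01-200  Concrete Footings  -1,200.50\n",
]
all_rows = extract_all_pages(pages)
```

Keeping the page number on each row makes it easier to trace a bad value back to its source page once hundreds of files are running through the same workflow.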
