Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Extracting, Cleansing, Normalizing, Parsing Unstructured PDF Data

5 - Atom

Hi, I'm an Alteryx beginner and need some serious help with a specific, very large dataset. This project is the primary reason for the license purchase. Can someone help me build a workflow that converts the attached job costing data from PDF into columnar (or is it tabular?) format for further analysis? The data comes from the COINS ERP system. Attached is a sample of the data for ONE project in ONE year for ONE company, along with the output format I would like. I need to build a template workflow that will let me convert this same type of data for thousands of projects spanning seven years and 40+ entities. The PDFs are currently separated by year and by entity (so roughly 250-300 separate, large PDF files). Once the data is properly converted, I will need to apply various lookups and blend it with 2-3 other datasets for various financial/computational analyses and reporting. I'm much more comfortable with those tasks; I just need this core data in a workable format.

From my research, it looks like I'll need to use another tool such as DocToText, R code, etc., which I have no experience with. I will be spinning my wheels for days. Please help.

Thanks in advance to the brave soul who takes this one. I'm at your disposal to get this solved!

Thanks,

Gisele

Alteryx Certified Partner

A colleague of mine recently published a 'PDF Input' connector which, as you stated, makes use of the R tool.

You will then have to parse the output (take a look at the RegEx and Text To Columns tools for this). My colleague also included a sample workflow in the documentation, so it's worth looking at how he converted the PDF into a structured table.
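
To give a feel for the parsing step, here is a minimal sketch in Python (not the RegEx tool itself, and not taken from the sample workflow). The line layout, the pattern, and the field names `cost_code`, `description`, `budget`, and `actual` are all hypothetical; the real COINS report layout will need its own pattern.

```python
import re

# Hypothetical layout: cost code, free-text description, then two amounts.
LINE_PATTERN = re.compile(
    r"^(?P<cost_code>\d{2}-\d{3})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<budget>-?[\d,]+\.\d{2})\s+"
    r"(?P<actual>-?[\d,]+\.\d{2})$"
)


def parse_line(line):
    """Return a dict of columns for a matching line, or None otherwise."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None  # headers, totals, and blank lines fall through here
    row = match.groupdict()
    for field in ("budget", "actual"):
        # Drop thousands separators before converting to a number.
        row[field] = float(row[field].replace(",", ""))
    return row


def parse_lines(lines):
    """Keep only the lines that match the pattern, as a list of rows."""
    rows = []
    for line in lines:
        row = parse_line(line)
        if row is not None:
            rows.append(row)
    return rows
```

Lines that do not fit the pattern are skipped rather than raising an error, which is usually what you want for page headers and footers, but it also means a slightly wrong pattern drops data silently, so it is worth counting the unmatched lines.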

 

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

Ben

5 - Atom

Is there any way to attach the file to this thread? My work computer says I can't get it from an 'unsecured site', but this appears to be exactly what I need.

5 - Atom

Thank you @BenMoss. This is very helpful. Any idea where (or if) I can get one-on-one assistance from an Alteryx representative with my particular dataset? I've hit a few roadblocks.

Thanks,

Gisele

5 - Atom

@Christine1, see attached.

5 - Atom

I am unable to download the tool. Can someone help, please? Thanks.

5 - Atom

Please find attached the solution. It uses two different R packages, namely tabulizer and pdftools.

With tabulizer, I have extracted only the data that is in tabular format, and only from one page. You can add a loop to read all the pages using the same logic.

With pdftools, I have filtered the data to read only the first page. You can later add logic to pick out the data you want.
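
The "loop over all the pages" idea can be sketched as follows. This is an illustration in Python rather than the R code in the attached workflow, and it assumes the text of each page has already been extracted into a list with one string per page. The helper names `rows_from_pages` and `split_on_wide_gaps` are hypothetical.

```python
import re


def split_on_wide_gaps(page_text):
    """Hypothetical per-page parser: split each non-blank line into cells
    wherever two or more whitespace characters occur in a row."""
    rows = []
    for line in page_text.splitlines():
        if line.strip():
            rows.append({"cells": re.split(r"\s{2,}", line.strip())})
    return rows


def rows_from_pages(pages, parse_page):
    """Apply the same per-page logic to every page and stack the results,
    recording the 1-based page number each row came from."""
    all_rows = []
    for page_number, page_text in enumerate(pages, start=1):
        for row in parse_page(page_text):
            all_rows.append({"page": page_number, **row})
    return all_rows


# Usage with two made-up pages of already-extracted text.
pages = ["Labor  100.00\nMaterial  250.00", "Equipment  75.00"]
table = rows_from_pages(pages, split_on_wide_gaps)
```

Keeping the page number on every row makes it easier to trace a bad value back to the source PDF later, and the per-page parser can be swapped out without touching the loop.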

5 - Atom

Hello,

I tried running this workflow but the R tool gives me this error: 

R (2) Error in loadNamespace(name) : there is no package called 'tabulizer'

R (2) The R.exe exit code (1) indicated an error.

Am I missing something?

5 - Atom

Hi,

You need to install the package explicitly from the R console first, and then try running the workflow again.
