
Alteryx Designer Discussions

SOLVED

PDF Parsing help

Meteor

Hello -

I am pulling in PDF data using the PDF Input tool from the Alteryx Gallery. The PDF has an unusual structure: each record is broken into four stacked lines. It also has repeating headers and various summary totals throughout the file. I've attached a snippet below. Would anyone be able to help me parse this detail effectively?

My thinking was to use the Text to Columns tool with the dashes in the PDF report as the column delimiter, remove the repeating column headers, use the Sample tool to take every 4th line and "unstack" the information into different streams, and then join everything back together to get usable Excel data.

 

Thanks!

 

 

Alteryx Screenshot.PNG

Quasar

Hi @jannis005

 

You won't be able to use the dashes to parse the data with a Text to Columns tool: it will only split the rows that actually contain dashes and leave every other row untouched.
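To illustrate the point, here is a small Python sketch of what a delimiter split on dashes does. The rows are made up for the example (the real report layout is only visible in the screenshot), and `split_on_dashes` is a hypothetical helper, not an Alteryx function.

```python
def split_on_dashes(row):
    """Split a row on the dash character, like a delimiter-based split would."""
    return row.split("-")

# Hypothetical rows: a separator line, a data line containing a date, a plain data line.
rows = [
    "-----  ----------  ------",
    "A100   2020-01-05  250.00",
    "B7     OPEN",
]

for row in rows:
    print(split_on_dashes(row))
```

The row without dashes comes back as a single unsplit piece, and the row with a date is cut in the wrong places, so the dashes are not a usable delimiter for the data rows.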

 

I'd do something like this:

1. Filter out all the unnecessary headers, and split the headers I want to keep into a separate stream.

2. Use a Multi-Row Formula tool to create a group identifier for each set of 4 rows. First create a counter that cycles from 1 to 4, then create a group ID that increments each time the counter resets to 1.

3. Cleanse the data to collapse duplicate whitespace.

4. Then do a Text to Columns on \s (whitespace). This might cause problems with your first column, though, so you may find it easier to pull the data out with RegEx.

5. Union back with the headers.
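For anyone who wants to see the logic outside of Alteryx, steps 1 to 4 can be sketched roughly in Python as below. This is only an illustration under assumptions: the sample lines and the `NOISE_PREFIXES` markers are invented, since the real report is only in the screenshot, and `parse_report` is a hypothetical helper name.

```python
import re

# Assumed markers for header, separator, and summary lines (adjust to the real report).
NOISE_PREFIXES = ("ID ", "---", "SUBTOTAL")

def parse_report(lines, rows_per_record=4):
    """Turn stacked report lines into one list of fields per record."""
    # Step 1: drop blank lines, repeating headers, and summary totals.
    data = [ln for ln in lines
            if ln.strip() and not ln.strip().startswith(NOISE_PREFIXES)]

    records = {}
    for index, line in enumerate(data):
        # Step 2: index % rows_per_record + 1 is the 1-to-4 counter;
        # the group id goes up by one each time that counter restarts.
        group_id = index // rows_per_record + 1
        # Step 3: collapse runs of whitespace into a single space.
        cleaned = re.sub(r"\s+", " ", line).strip()
        # Step 4: split on whitespace and append the fields to the record.
        records.setdefault(group_id, []).extend(cleaned.split(" "))
    return list(records.values())

# Invented sample: two records of four stacked lines each.
sample = [
    "ID     DATE        AMOUNT",
    "-----  ----------  ------",
    "A100   2020-01-05  250.00",
    "B7     OPEN",
    "X      Y   Z",
    "NOTE1",
    "SUBTOTAL           250.00",
    "ID     DATE        AMOUNT",
    "-----  ----------  ------",
    "A101   2020-01-06  75.50",
    "B8     CLOSED",
    "X      Y   Q",
    "NOTE2",
]

for row in parse_report(sample):
    print(row)
```

Note that a plain whitespace split breaks any field that itself contains spaces (a name or an address, for example), which is the first-column issue mentioned in step 4; a regular expression per line would be needed there.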

 

There are many ways to solve this, and you'll need a bit of trial and error to get it right.

 

Good luck!

Kat

 
