Alteryx Designer Desktop Discussions

oracleoftemple · ‎10-19-2018

I have imported a PDF file using an R script. I get one record for each page with all the text from that page. Using the cell viewer, I can see the fixed widths that I'd need to parse the text.

What is the best way to parse this text?

Claje · ‎10-19-2018

The following recommendation is going to be relatively conceptual, as I don't have a dataset to test this with readily.

First, I would use a RecordID tool to create a unique ID for each document.

Then, I would use the Text to Columns tool, configured to split to new rows with a \n delimiter. This will create a row for each line of your PDF.

From there, I would create a series of formulas to SUBSTRING() the text to the appropriate lengths. You could also use a single regular expression tool instead. I'll play with that concept and update this post with an example of that.

Then, you can look at the actual data itself to make sure you filter out individual pieces that you don't need (like the header lines here from budget name: through the ------ line)

jdunkerley79 · ‎10-19-2018

I would suggest something like:

1. Use a record ID to create a identifier for each page

2. Split the text into lines rows a Text To Column tool splitting on \n

3. Assuming a variable number of title rows you need to find first row which has just dashes and spaces (using a regex expression)

4. Use this to find the index of spaces and create a set of indices

5. Use an Append Field tool to add indices to each line

6. Finally you can use a formula tool to split to individual values

If you can post some sample data am happy to put together an example workflow

oracleoftemple · ‎10-19-2018

I would greatly appreciate that. Here's a few pages from the file. I used an R tool to get it loaded. Is this enough info to put together a workflow?

jdunkerley79 · ‎10-19-2018

@oracleoftemple don't see the attached data files.

2 samples would be plenty to build the workflow.

oracleoftemple · ‎10-19-2018

Sorry I'm hesitant to post it in this string. I tried to attach it to a DM to you, but I don't have that option. Do you have an email I can send it to?

danrh · ‎10-19-2018

This is almost the exact scenario I faced a few months back here. I ended up creating the attached macro to handle these situations --- it looks for the record with the dashes and creates a dynamic fixed width parser based on that line. You might need to tweak it a little to fit your situation, but it should get you most of the way there.

Hope it helps!

Alteryx Designer Desktop Discussions

Parsing text stored in a single record - fixed width