SOLVED

Parsing text stripped from PDF

serendipitytech
8 - Asteroid

I'm working on a project to analyze a city's budgets. The records come from a public records request, but the city sent them as what are basically PDF printouts from its system. I found a useful thread here for converting PDF to plain text, but I end up with horribly formatted text, with some lines like this:

 

001 311000 AD VALOREM TAXES                            .00             .00             .00  -12,320,998.80  -12,097,422.00

 

I need to break this into 9 columns. The last 5 columns are the numbers. Of course, some lines are missing data, so they may end up with only 4.

 

The problem I'm running into is breaking out the columns. I figure I could use a space delimiter and get pretty darn close, except for that 3rd column of text, which in the case above is AD VALOREM TAXES.

So I'm trying to see if I can use regex to wrap all the words in quotes, so that I'd get "AD VALOREM TAXES" as a result. I'm not hitting on the expression that does this, though: I can isolate characters, but I'm missing how to catch all the words and the spaces between them.

 

Of course, any other ideas on how I can best parse this data would be great. This is more of a personal project, just getting involved in the local city government :)

 

I'm attaching the txt version of the PDF as well if it helps.

 

serendipitytech
8 - Asteroid

I think I found the regex I needed and wanted to share it with others. It seems to match complete strings of words, including special characters like - and & that are found in these strings. I restricted it to uppercase letters, as that seems to be fine here.

 

(\b[A-Z]+(.)+[A-\Z]\b)

 

Then I use this as the replacement:

"$1"

 

I think I'm getting closer. Time to call it a night though. 
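
For anyone who wants to try the same idea outside Designer, here is a minimal Python sketch of the find-and-replace described above. It is an illustration under assumptions, not the workflow itself: the character class is written as plain `[A-Z]` (assuming that is the intent of `[A-\Z]`), the inner capture group is dropped, and the name `quote_description` is made up for the example.

```python
import re

# Assumption: plain [A-Z] stands in for the [A-\Z] class in the post.
# The pattern starts at the first all-uppercase word and greedily runs
# to the last uppercase letter that ends a word.
WORDS = re.compile(r"\b[A-Z]+.+[A-Z]\b")


def quote_description(line):
    """Wrap the run of uppercase words in double quotes (hypothetical helper)."""
    return WORDS.sub(lambda m: '"' + m.group(0) + '"', line)


# Sample line taken from the original post.
line = (
    "001 311000 AD VALOREM TAXES                            .00"
    "             .00             .00  -12,320,998.80  -12,097,422.00"
)
quoted = quote_description(line)
print(quoted)
```

Because `.+` is greedy, the match stretches across the spaces and any `-` or `&` between words, which is why the whole description ends up inside one pair of quotes.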

SophiaF
Alteryx

Looks good - only thing I see missing is lines like this:

 

001 335122 8TH "CENT MOTOR FUEL USE TAX"                 .00             .00             .00     -570,924.76             .00

To capture the '8th', you could try:

 

(\b\d*[A-Z]+(.)+\d*[A-\Z]\b)
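
As a rough check of the adjusted expression, here is a short Python sketch that quotes the description and then splits the line on whitespace outside the quotes. The assumptions: the character class is written as plain `[A-Z]`, the input is the sample line above with the quotes removed, `split_columns` is a made-up helper name, and `shlex.split` stands in for a space-delimited parse that respects quotes.

```python
import re
import shlex

# Assumption: plain [A-Z] stands in for the [A-\Z] class in the post.
# The optional \d* lets the match begin at a token like "8TH".
WORDS = re.compile(r"\b\d*[A-Z]+.+\d*[A-Z]\b")


def split_columns(line):
    """Quote the description, then split on whitespace outside quotes."""
    quoted = WORDS.sub(lambda m: '"' + m.group(0) + '"', line, count=1)
    return shlex.split(quoted)


# Assumed raw form of the sample line, before any quoting was applied.
line = (
    "001 335122 8TH CENT MOTOR FUEL USE TAX                 .00"
    "             .00             .00     -570,924.76             .00"
)
fields = split_columns(line)
print(fields)
# ['001', '335122', '8TH CENT MOTOR FUEL USE TAX', '.00', '.00', '.00', '-570,924.76', '.00']
```

The purely numeric codes at the start of the line are skipped because the pattern still requires at least one uppercase letter right after the optional digits.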
Sophia Fraticelli
Senior Solutions Architect
Alteryx, Inc.
serendipitytech
8 - Asteroid

That is a great catch, Sophia! Thank you. I had noticed a couple of instances like that in the reports I'm still trying to clean up, and was not looking forward to going back to figure that out!

 

Super helpful, thanks so much!
