I'm working on a project to analyze budgets from a city. The records come from a public records request, but they sent them as basically PDF printouts of their system. I found a useful thread here for converting PDF to plain text, but I end up with horribly formatted text, with some lines like this:
001 311000 AD VALOREM TAXES .00 .00 .00 -12,320,998.80 -12,097,422.00
I need to break this into 9 columns. The last 5 columns are the numbers. Of course, some lines are missing data, so they may end up with only 4.
The problem I'm running into is breaking out the columns. I figure I could use a space delimiter and get pretty darn close, except for that 3rd column of text, which in the case above is AD VALOREM TAXES.
So I'm trying to see if I can use a regex to wrap all the words in quotes, so I'd have "AD VALOREM TAXES" as a result. I'm not hitting on the expression that does this, though: I can isolate characters, but I'm missing how to catch all the words and the spaces between them.
Of course, any other ideas on how I can best parse this data would be great. This is more of a personal project, just getting involved in the local city government :)
I'm attaching the txt version of the PDF as well if it helps.
I think I found the regex I needed and wanted to share it with others. It seems to match complete strings of words, including special characters like - and & that are found in these strings. I did restrict it to all uppercase, as that seems to be fine here.
(\b[A-Z]+(.)+[A-Z]\b)
Then using the replace of
"$1"
I think I'm getting closer. Time to call it a night though.
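For anyone who wants to try the same find-and-replace outside the original tool, here is a rough Python sketch of it. The function name quote_description is just made up for illustration, and the final character class is spelled [A-Z] because, as far as I know, Python's re module rejects a backslash-escaped Z inside a class.

```python
import re

# Uppercase word, anything in between, ending on an uppercase letter at a
# word boundary. Greedy matching plus backtracking stops at the last
# uppercase letter on the line, which is the end of the description.
DESCRIPTION = re.compile(r"(\b[A-Z]+(.)+[A-Z]\b)")


def quote_description(line: str) -> str:
    """Wrap the run of uppercase words in double quotes (hypothetical helper)."""
    return DESCRIPTION.sub(r'"\1"', line, count=1)


line = "001 311000 AD VALOREM TAXES .00 .00 .00 -12,320,998.80 -12,097,422.00"
print(quote_description(line))
```

This relies on the numeric columns never containing uppercase letters, which holds for the sample line but is an assumption about the rest of the file.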
Looks good - only thing I see missing is lines like this:
001 335122 8TH "CENT MOTOR FUEL USE TAX" .00 .00 .00 -570,924.76 .00
To capture the '8th', you could try:
(\b\d*[A-Z]+(.)+\d*[A-Z]\b)
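As a rough check of that idea, here is a small Python sketch that applies the widened expression and then splits the quoted line into columns on spaces. The helper name split_columns is hypothetical, the csv step is just one way to honor the quotes, and it assumes the columns are separated by single spaces as in the sample.

```python
import csv
import re

# Same expression with optional leading digits, so "8TH" is pulled into
# the quoted description. The final class is written as plain [A-Z].
DESCRIPTION = re.compile(r"(\b\d*[A-Z]+(.)+\d*[A-Z]\b)")


def split_columns(line: str) -> list:
    """Quote the description, then split on spaces (hypothetical helper)."""
    quoted = DESCRIPTION.sub(r'"\1"', line, count=1)
    # csv keeps the quoted description together as a single field.
    return next(csv.reader([quoted], delimiter=" ", quotechar='"'))


line = "001 335122 8TH CENT MOTOR FUEL USE TAX .00 .00 .00 -570,924.76 .00"
print(split_columns(line))
```

Runs of several spaces between columns would show up as empty fields with this approach, so real data may need collapsing whitespace first.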
That is a great catch, Sophia! Thank you. I noticed a couple of instances still left in the reports I've been trying to clean up, and was not looking forward to going back to figure that out!
Super helpful, thanks so much!