Important Community update: The process for changing your account details was updated on June 25th. Learn how this impacts your Community experience and the actions we suggest you take to secure your account here.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Help with parsing out a pdf document

cgoodman3
14 - Magnetar
14 - Magnetar

Hey Community

 

It's Monday morning and my brain hasn't kicked in and I know community is really fast at answering questions.

 

I've used the intelligence suite tool to OCR a PDF document (it's a bank statement), now I need to parse this out into the required parts.

 

Roughly the documents is

 

Date   Narrative   Type   Debit amount   Credit amount   Balance

 

Sometimes the narrative is split over several lines, I've tried text to columns on a new line delimiter (\n) however due to the narrative spreading over several lines it isn't consistent. For example if there was always an empty row it would be fine to then use a multi-row formula to group back to the transaction line items, but erroneous new lines are appearing which makes this logic fail.

 

What I wanted to then do is take something that would be consistent at the start of each transaction which is the date, and using regex_replace I can match on date and replace with a pipe delimiter

 

 

REGEX_Replace([text], "\d{2}\/\d{2}\/\d{4}", "|")

 

 

However this loses the date. Is there a way therefore to maintain the date but just prefix this with the pipe delimiter?

 

 

Chris
Check out my collaboration with fellow ACE Joshua Burkhow at AlterTricks.com
2 REPLIES 2
mpennington
11 - Bolide

@cgoodman3 ,

I am currently on a machine without an active license, but I believe this could be accomplished with capturing groups.  I'm certain I've done this with other characters, but given that the Pipe is a RegEx metacharacter, it may have be escaped.  So maybe something like this:

REGEX_Replace([text], "(\d{2}\/\d{2}\/\d{4})", "\|$1")

At any rate, hope it works for you.  Good luck.

cgoodman3
14 - Magnetar
14 - Magnetar

Yeah that looks to have worked. I was expecting anything in the replace part of the formula to just be put in as a text string.

Chris
Check out my collaboration with fellow ACE Joshua Burkhow at AlterTricks.com
Labels