Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Extract Budget Totals using Regex

hellyars
Pulsar

Regex trouble...

See attached workflow.

I am trying to extract the table fields from a table input (using computer vision tools in a batch macro). 

Unfortunately, the source PDF was of poor quality and Alteryx was not always able to delineate the fields.

 

In my workflow....

 

If RQST and BILL are BOTH NOT NULL then the import was good.  If they are 1 or both are NULL then the import was bad. The screenshot depicts all 3 cases. 

 

The field to split is RAW.  

I only care about

 

  • the 1 to 3 {1,3} digits at the start of RAW.  This is the Line Number (LN).
  • the number group or groups (more often groups) at the end of raw.  these will be  3 or more digits (but with commas).  The first group is RQST.  The second gorup is BILL.  There may be instances when there is only one group.  This is okay.  It just means we don't know if its RQST or BILL that should be 0 or the value found at the end.
  • Note if table import worked correctly there two ending number groups will be separated by a pipe "|"  But, you can see in the that sometimes both, 1 or none are present.

How can I extract this.

 

 

 

 

 

 

 

regexCantGet Numbers.pngregextrbl1.png

6 RÉPONSES 6
danilang
19 - Altair
19 - Altair

Hi @hellyars 

 

Love to help, but there's no workflow.

 

Dan

hellyars
Pulsar

@danilang  Sorry.  There now

 

hellyars
Pulsar

I did as much clean as possible in Alteryx -- e.g., removing unnecessary punctuation, etc.  But, I ended exporting to a text editor and manually fixing as much as possible before reintroducing into Alteryx and then Tableau.  Ended up going thru a few cycles until I got everything to total up correctly, etc.

EN6924
Boule de feu

Please find my solution done using Regex -

 

Regular expression - (\d+).+\|(.+)\|(.+)

 

EN6924_0-1648830367561.png

 

hellyars
Pulsar

@EN6924  Thank you.  In theory that should work.  In practice hundreds of lines lack the correct delimiters due the less than perfect quality of the original PDF/images processed by the Computer Vision Tools.

danilang
19 - Altair
19 - Altair

Hi @hellyars 

 

There may be a way to handle all this in a single regex statement, my regex-fu is not that good.  Instead I decided to break it down into manageable chunks.

danilang_1-1648989348088.png

The first pattern is \d+.*?\|.*?(\d+[,|\.|\s]*\d+)[\||\s](\d*[,|\.|\s]*\d+) and matches 12|abc..yz|xx,xxx|x,xxx with various possibilities for the delimiters and thousands separator.  After reversing all the rows that didn't match, I used this (\d*[,|\.]*\d+)?[\|]+(\d*[,|\.]*\d+)?.* where the BILL and RQST groups are now optional handling the case where only one of them is there.  This left me with only rows with a single number at the from separated from the rest of the string by a space.  I then unioned, reversed the B and R values and finally unioned with the first set of matches. 

 

In the sample output, the R and B columns correspond to your RQST and BILL columns  

 

danilang_0-1648989266446.png

 

Of course you'll have to validate all the rows, but this should get you at least 99% of the way there.  

 

Dan

 

Étiquettes