Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Help with Regex Parsing

HW1
9 - Comet

Hi,

I am parsing a set of pdfs and usually their data is in this format:

 

RecordIDValue
1Page 1 of 2
29
3PO BOX 235 S
4CLEVELAND QLD 4163 Waste &
5Recycling
6A.B.N. 40 000 605 925
7#NAME?
8"= -F—14-B-$1547-P3257 P (07) 4051 0899
9= WASTE OPTIONS E ar.cairns@random.com.au
10= NORTH RICHMOND NSW 2754
11#NAME?
12— P (07) 4051 0855
13ov eeeeeeesssssssseeeeeessessssittaneeeeeesesssssssttneeeessceessssssiteeeeeeeecesestttteneeeeeeeeeees E operations.cairns@jjswaste.com.au
14| BBall Weseciicyal :
15© Month: FEB-2021 :
16Invoice No: 140073852102 :
17' Customer No: 14007385
18: Issue Date: 27/02/2021:
19: | MBBS TTA Tice Ta et ewe eX -Soaa
20m™ Account Summary (including GST) ; : ;
21- Opening Balance: $111.38 : Please email all :
22| Payments made this period: $0.00 7 WM Cureutesiecucccn Om
23: Overdue balance due immediately $111.38: SLR SAE heln
24: Transactions for this period $132.00: :
25: Total Amount Payable $243.38:
26(Total GST for this Invoice $12.00) : :
27TERMS STRICTLY 14 DAYS
28pS <a
29HOW TO PAY PLEASE EMAIL ALL REMITTANCE ADVICES TO AR.REMIT @JJSWASTE.COM.AU
30[==] CREDIT CARD PAYMENT GATEWAY
31} PFN N OW Payment can be made using your How To Register
32credit card via our secure Payment Simply download the Random smart phone app from the App Store or Google
33Gateway or by phoning Invoices and Play and follow the prompts. Or you can visit https://myaccount.random.com.au
34Accounts. See front of invoice for and enter your Customer Number, your preferred email address and follow the
35contact information prompts. Once registered, you will need to confirm your registration using your
36latest invoice number.
374 Download on the < EDP
38@ AppStore Seer Kato
39EFT EFT pebit DIRECT DEBIT BPAY BX] MAIL
40EFT made payable to Payments can be arranged for Bill - 98357 Detach the How To Pay slio from th
41rs Random & Sons Toren, automatic ouctions on Le] ot Ne, sanpyaeno bottom of this invoice and post along
42N BSB: 034 168 e or eacn montn trom your . j bl
43a. Account: 0045 0396 nominated savings, cheque or credit were que mace payanie to
44#NAME?
45oD office to set-up Direct Debit as your phone banking. Mail To:
46ab payment option. Random & Sons Pty Ltd
47< PO BOX 835
48u CLEVELAND QLD 4163
49Page 2 of 2
50»S
51Waste &
52Transaction Details Recvetin
53SUPPLY OF CONTAINERS, REMOVAL & DISPOSAL OF ecyc 1 S
54TRADE WASTE FOR THE PERIOD ENDING: 27/02/2021 ——_—_———————
55A.B.N. 80 000 205 325
56DATE DESCRIPTION DOCKET PRICE GST TOTAL BALANCE
57Balance Brought Forward 111.38
58OVERDUE BALANCE FOR PAYMENT 111.38
59Overdue Balance
60Your invoice shows an overdue balance, this needs to be paid immediately to avoid possible suspension of your services. Should your services be
61suspended you could incur a $45 Overdue Fee. If you have a query regarding this balance please call our Accounts Department on (07) 4051 0855 to
62discuss your account.
63(BOC GAS & GEAR CAIRNS)
64*** 1.5 METRE BIN
6502/02/21 1 BIN 37.50 3.75 41.25
6616/02/21 1 BIN 37.50 3.75 41.25
67*** 1.5 METRE BIN RECYCLING
6803/02/21 1 BIN 22.50 2.25 24.75
6917/02/21 1 BIN 22.50 2.25 24.75
70SUB-TOTAL 120.00 12.00 132.00
71TOTAL DUE 243.38
72Errors and Omissions Excepted
73$< - ----------------------------------------------------
74INVOICE NUMBER: 140073852102
75Remittance Advice
76Comments: IS
77I Waste &
78Recycling
79PLEASE EMAIL ALL REMITTANCE ADVICES TO
80AR.REMIT@RANDOM.COM.AU Invoice No: 140073852102
81Customer No: 14007385
82. Issue Date: 27/02/2021
83Random & Sons Pty Ltd
84PO BOX 235 TOTAL DUE $243.38
85CLEVELAND QLD 4163

 

This is from parsing the pdf invoice and What I need to extract from this is essentially the invoice date, has the "Invoice Date" text(Record ID: 18)

The two $ values are the sum of pre-tax values for each type i.e. 37.5 + 37.5 in this case = 75 and similarly 22.5 + 22.5 = 45, The row is also accompanied by a date but I dont want the date of the columns, just the numbers.

 

I want REGEX to identify all dates and then I can filter out and post process. I have the regex formula as Regex Match : (\d+/\d+/\d+) (.*) but it misses the invoice date

 

And I want the output to be in the format as:


Co./Last NameAddr 1 - Line 2Addr 1 - Line 3DateSupplier Invoice No.BinAmountTax CodeTax AmountJournal MemoDescription
Random & Sons Pty LtdPO Box 235North Richmond NSW 275427/02/20211400738521021.5 METRE BIN$75.00GST$7.50Purchase; Random & Sons Pty LtdBOC GAS & GEAR CAIRNS
Random & Sons Pty LtdPO Box 235North Richmond NSW 275427/02/20211400738521021.5 METRE BIN RECYCLING$45.00GST$4.50Purchase; Random & Sons Pty LtdBOC GAS & GEAR CAIRNS

 

Can you please help with the Regex?

 

Thanks.

3 REPLIES 3
apathetichell
19 - Altair

I can help with one quick thing - the reason why  your regex_match can't find your issue/invoice date is because it has characters in front of it. The other dates don't have preceding characters so they turn up as true. To fix your formula you can use:

REGEX_Match([Field1], ".*(\d+/\d+/\d+)(.*)")

 

When I looked at the rest of your data - I wasn't quite sure the best way to bring it over, but my initial thought was because it looked like the regex was so specific to each column - perhaps just a contains statement and carrying over the field would be less of a headache.

HW1
9 - Comet

Yes, that makes it a step better however, how can I sum or group the "Bins" ? as in this case the numbers from Row nos 65 and 66 are to be clubbed together and likewise Row nos 68 and 69 are to be clubbed together. How can I identify them and then use further?

apathetichell
19 - Altair

those are actually relatively straight forward because they are directly under the *** bin heading - so you are basically looking for anything that is below a *** which starts with a date (just repeat your regex_match info in a multi-row to find these rows). I'd actually keep those line items as-is and with the multi-row parse out the full regexout - or even the full [value]. Then I'd just text to column them with a space delimiter.

 

I'm working on this:

(\d+/\d+/\d+)(.*)|\*\*\*(.*)|\((\u.*)\)|(.*)No\:(.*)) right now in a regex parse - it's not perfect and I'm not sure how much further I'll make it on it. but it gets you towards the right place. I use a text to column (split on space) to turn the specific bin entries into usable formats.

 

You'll need to use multi-row formulas to fill in the type of bin (where regex-out3 isn't null() ) - and once you do that for the line entries you can use a summarize tool with regexout3 (bin type) as a group-by to get the breakout you need.

 

You'll need to filter out a ton of null()s

 

I'm having a bit of an issue with the address and company name - but

Labels