Hi,
I am parsing a set of pdfs and usually their data is in this format:
RecordID | Value |
1 | Page 1 of 2 |
2 | 9 |
3 | PO BOX 235 S |
4 | CLEVELAND QLD 4163 Waste & |
5 | Recycling |
6 | A.B.N. 40 000 605 925 |
7 | #NAME? |
8 | "= -F—14-B-$1547-P3257 P (07) 4051 0899 |
9 | = WASTE OPTIONS E ar.cairns@random.com.au |
10 | = NORTH RICHMOND NSW 2754 |
11 | #NAME? |
12 | — P (07) 4051 0855 |
13 | ov eeeeeeesssssssseeeeeessessssittaneeeeeesesssssssttneeeessceessssssiteeeeeeeecesestttteneeeeeeeeeees E operations.cairns@jjswaste.com.au |
14 | | BBall Weseciicyal : |
15 | © Month: FEB-2021 : |
16 | Invoice No: 140073852102 : |
17 | ' Customer No: 14007385 |
18 | : Issue Date: 27/02/2021: |
19 | : | MBBS TTA Tice Ta et ewe eX -Soaa |
20 | m™ Account Summary (including GST) ; : ; |
21 | - Opening Balance: $111.38 : Please email all : |
22 | | Payments made this period: $0.00 7 WM Cureutesiecucccn Om |
23 | : Overdue balance due immediately $111.38: SLR SAE heln |
24 | : Transactions for this period $132.00: : |
25 | : Total Amount Payable $243.38: |
26 | (Total GST for this Invoice $12.00) : : |
27 | TERMS STRICTLY 14 DAYS |
28 | pS <a |
29 | HOW TO PAY PLEASE EMAIL ALL REMITTANCE ADVICES TO AR.REMIT @JJSWASTE.COM.AU |
30 | [==] CREDIT CARD PAYMENT GATEWAY |
31 | } PFN N OW Payment can be made using your How To Register |
32 | credit card via our secure Payment Simply download the Random smart phone app from the App Store or Google |
33 | Gateway or by phoning Invoices and Play and follow the prompts. Or you can visit https://myaccount.random.com.au |
34 | Accounts. See front of invoice for and enter your Customer Number, your preferred email address and follow the |
35 | contact information prompts. Once registered, you will need to confirm your registration using your |
36 | latest invoice number. |
37 | 4 Download on the < EDP |
38 | @ AppStore Seer Kato |
39 | EFT EFT pebit DIRECT DEBIT BPAY BX] MAIL |
40 | EFT made payable to Payments can be arranged for Bill - 98357 Detach the How To Pay slio from th |
41 | rs Random & Sons Toren, automatic ouctions on Le] ot Ne, sanpyaeno bottom of this invoice and post along |
42 | N BSB: 034 168 e or eacn montn trom your . j bl |
43 | a. Account: 0045 0396 nominated savings, cheque or credit were que mace payanie to |
44 | #NAME? |
45 | oD office to set-up Direct Debit as your phone banking. Mail To: |
46 | ab payment option. Random & Sons Pty Ltd |
47 | < PO BOX 835 |
48 | u CLEVELAND QLD 4163 |
49 | Page 2 of 2 |
50 | »S |
51 | Waste & |
52 | Transaction Details Recvetin |
53 | SUPPLY OF CONTAINERS, REMOVAL & DISPOSAL OF ecyc 1 S |
54 | TRADE WASTE FOR THE PERIOD ENDING: 27/02/2021 ——_—_——————— |
55 | A.B.N. 80 000 205 325 |
56 | DATE DESCRIPTION DOCKET PRICE GST TOTAL BALANCE |
57 | Balance Brought Forward 111.38 |
58 | OVERDUE BALANCE FOR PAYMENT 111.38 |
59 | Overdue Balance |
60 | Your invoice shows an overdue balance, this needs to be paid immediately to avoid possible suspension of your services. Should your services be |
61 | suspended you could incur a $45 Overdue Fee. If you have a query regarding this balance please call our Accounts Department on (07) 4051 0855 to |
62 | discuss your account. |
63 | (BOC GAS & GEAR CAIRNS) |
64 | *** 1.5 METRE BIN |
65 | 02/02/21 1 BIN 37.50 3.75 41.25 |
66 | 16/02/21 1 BIN 37.50 3.75 41.25 |
67 | *** 1.5 METRE BIN RECYCLING |
68 | 03/02/21 1 BIN 22.50 2.25 24.75 |
69 | 17/02/21 1 BIN 22.50 2.25 24.75 |
70 | SUB-TOTAL 120.00 12.00 132.00 |
71 | TOTAL DUE 243.38 |
72 | Errors and Omissions Excepted |
73 | $< - ---------------------------------------------------- |
74 | INVOICE NUMBER: 140073852102 |
75 | Remittance Advice |
76 | Comments: IS |
77 | I Waste & |
78 | Recycling |
79 | PLEASE EMAIL ALL REMITTANCE ADVICES TO |
80 | AR.REMIT@RANDOM.COM.AU Invoice No: 140073852102 |
81 | Customer No: 14007385 |
82 | . Issue Date: 27/02/2021 |
83 | Random & Sons Pty Ltd |
84 | PO BOX 235 TOTAL DUE $243.38 |
85 | CLEVELAND QLD 4163 |
This is from parsing the pdf invoice and What I need to extract from this is essentially the invoice date, has the "Invoice Date" text(Record ID: 18)
The two $ values are the sum of pre-tax values for each type i.e. 37.5 + 37.5 in this case = 75 and similarly 22.5 + 22.5 = 45, The row is also accompanied by a date but I dont want the date of the columns, just the numbers.
I want REGEX to identify all dates and then I can filter out and post process. I have the regex formula as Regex Match : (\d+/\d+/\d+) (.*) but it misses the invoice date
And I want the output to be in the format as:
Co./Last Name | Addr 1 - Line 2 | Addr 1 - Line 3 | Date | Supplier Invoice No. | Bin | Amount | Tax Code | Tax Amount | Journal Memo | Description |
Random & Sons Pty Ltd | PO Box 235 | North Richmond NSW 2754 | 27/02/2021 | 140073852102 | 1.5 METRE BIN | $75.00 | GST | $7.50 | Purchase; Random & Sons Pty Ltd | BOC GAS & GEAR CAIRNS |
Random & Sons Pty Ltd | PO Box 235 | North Richmond NSW 2754 | 27/02/2021 | 140073852102 | 1.5 METRE BIN RECYCLING | $45.00 | GST | $4.50 | Purchase; Random & Sons Pty Ltd | BOC GAS & GEAR CAIRNS |
Can you please help with the Regex?
Thanks.
I can help with one quick thing - the reason why your regex_match can't find your issue/invoice date is because it has characters in front of it. The other dates don't have preceding characters so they turn up as true. To fix your formula you can use:
REGEX_Match([Field1], ".*(\d+/\d+/\d+)(.*)")
When I looked at the rest of your data - I wasn't quite sure the best way to bring it over, but my initial thought was because it looked like the regex was so specific to each column - perhaps just a contains statement and carrying over the field would be less of a headache.
Yes, that makes it a step better however, how can I sum or group the "Bins" ? as in this case the numbers from Row nos 65 and 66 are to be clubbed together and likewise Row nos 68 and 69 are to be clubbed together. How can I identify them and then use further?
those are actually relatively straight forward because they are directly under the *** bin heading - so you are basically looking for anything that is below a *** which starts with a date (just repeat your regex_match info in a multi-row to find these rows). I'd actually keep those line items as-is and with the multi-row parse out the full regexout - or even the full [value]. Then I'd just text to column them with a space delimiter.
I'm working on this:
(\d+/\d+/\d+)(.*)|\*\*\*(.*)|\((\u.*)\)|(.*)No\:(.*)) right now in a regex parse - it's not perfect and I'm not sure how much further I'll make it on it. but it gets you towards the right place. I use a text to column (split on space) to turn the specific bin entries into usable formats.
You'll need to use multi-row formulas to fill in the type of bin (where regex-out3 isn't null() ) - and once you do that for the line entries you can use a summarize tool with regexout3 (bin type) as a group-by to get the breakout you need.
You'll need to filter out a ton of null()s
I'm having a bit of an issue with the address and company name - but