Hi,
I have a dataset of ledger title (string) that is somewhat messy but needs to be converted to a title that is tidier and easier to read. To achieve this, I need to break this down into a table with multiple fields or columns. We operate globally so different people write $values differently e.g. commas instead of dot, that's another issue that needs addressing but not important (lowest priority) as we just need to tell people the "right way" to spell out values in future.
Current dataset:
EUR Approved: POR 2020 APT - BNS8 - (Frankfurt, Germany) Real Estate - ($ 193,8MM )
NAM Approved: 2019 TXD -TDU2 (Charlotte, NC) Full Real Estate & Initial Budget: - RE:$138.8MM Initial: $183.2MM
NAM Approved: 2021 BRS - JBD6 (Quebec, CA) Full Real Estate - (Project Total- $738.3MM)
For context of above:
Region Field = EUR/NAM needs to change to two letters.
Building Type = APT/TXD/BRS
Site ID = BNS8, TDU2, JBD6
Budget Type = Full Real Estate = RE ; Initial Budget = IB
Total Budget needs to be combined total where two numbers are available e.g. 138.8+183.2= 322.0
Description to be adjusted in following pattern: [Site ID] [Building Type] [Budget Type]
Expected output:
Year | Region | Budget Type | Total Budget ($) | Description |
2020 | EU | RE | 193.8 | BNS8 APT Real Estate |
2019 | NA | RE & IB | 322.0 | TDU2 TXD Full Real Estate and Initial Budget |
2021 | NA | RE | 738.3 | JBD6 BRS Full Real Estate |
Am new to Alteryx, so any help welcome!!! Thank you!!!!
Thank you so much .@mpennington this helped tremendously. Some follow up:
REGEX formula: (NA|EU)\u.*(?:POR)*\s(\d{4})\s(\w{3})\s*-\s*(\w{4})
1. YEAR - For some reasons, not all years pulled up
2. REGION - just spotted other regions JPN, IND, INA in the dataset - these would need to be grouped as OT (as in other countries)
Below is the output:
Region | Year | Building Type | Site ID | |
NAM NME Approver: 2021 TNS - ILT1 (New Haven, DE) Construction Supplemental -$22.5MM (Project Total: $188.4MM) | NA | 2021 | TNS | ILG1 |
IND NME Approver: 2022 TRS450G+3 - BRQ2 (Prague, Czech Republic) - $25,613,196 | [Null] | [Null] | [Null] | [Null] |
EUR NME Approver: 2020 S&A -EGN1 (DE) EGN1 S&A expansion - $ 3.47MM | [Null] | [Null] | [Null] | [Null] |
NAM NME Approver: 2022 Custom BRS - TUX1 (Vancouver, BC) Full Real Estate - $294.8MM | [Null] | [Null] | [Null] | [Null] |
EUR NME Approver:: POR 2022 YNS80 - QEJ9 (Berlin, Germany) Real Estate - ($ 193,8MM ) | [Null] | [Null] | [Null] | [Null] |
If you can help that would be greatly appreciated.
Yes, I will see what I can do with this. But one thing to note, working with unstructured strings is challenging. If your data formats and structure changes, it can cause problems. The 5 strings you note above are structured differently than the 3 examples you provided in the first case. RegEx is looking for a very precise set of strings and what precedes and follows it. To use RegEx or any comparable method you need consistensy in the data. Regardless, I will follow up after I reevaluate the new information to see if this is an easy fix.
@AlYap
I have to agree with @mpennington that it is very challenging to parse the unstructured data.
With a bit of change of your input data, the parse statement will not be working most likely.
Thank you so much for helping. This is has kept me spending more time on reporting than I should, including weekends! So anything to improve that would be greatly appreciated.
No problem, glad to help. Just be aware that this solution only works for finite parameters, if your data set gets too big or too variable, this could break. Inconsistent treatment on the dollar amounts is a prime example. They use millions in some cases, commas, periods, the full amount and the abbreviated amounts. Can get real tricky, real fast. I commented up the formulas and broke them into more easily consumable segments. This worked for both the original set and the new data. I had to guess on some of the Site IDs and building types, but it should be easy for you to modify, just look at the comments. Let me know if you have any further questions. Good luck.
@AlYap Did the secondary solution to your updated data set solve your issue?
.@mpennington thanks for checking in. It worked for the examples I provided. Unfortunately it didn't for the entire population of data set and it's my fault to select only a small sample size. You are right, the unstructured data set is difficult to automate, so there are some things we have to live with until a solution is available for the data input process. Thanks a lot though!!