Hi All,
My data is all contained in a single column and I want to split it into columns. The structure looks like this -
2021 507982 GINKOZAC 30 CAPSULE P909 Cash RONZAC HEALTH & WELLNESS NUTRICEUTICALS FOOD SUPPLEMENT BLOOD CIRCULATION GINKOBILOBA Cash 0 90914429750198 2021-04-11 00:00:00.000 Yes 2.00000 300.00000 -150.00000 0.00000 150.00000
I know the order of columns contained but they all have varying lengths. For e.g., the first 4 characters are year 2021, the next 6 characters after the space is product ID. However, beyond that, the next column (product description) etc. will be varying length. I am unable to come up with a unique identifier to split this data into my desired set of columns.
Year | Item ID | Item description | Store ID | Customer type | Item type | Category 1 | Category 2 | Category 3 | Category 4 | Category 5 | Customer type breakdown | 0 | Transaction ID | Date | Loyal? | Qty | Gross sales | Promo | Discount | Net sales
Any support on how I can achieve the structure above will be much appreciated.
If it was just a single row (or not all in one field) then, it's not ideal, but it could've been bruteforced with RegEx or text to columns. However, as you say there are others with varying lengths/no consistent patterns, I'm struggling to think of a way you'll be able to make this dynamic. Where does this data come from? Is there any way you can make changes upstream to add these delimiters or something that can be used to identify or separate each entity within the string?
I received this data from the client - not sure if their system can help output the data with a consistent delimiter.
@icemanlindelof Can you manually split the given sample records based on the header and provide the input file?
If there aren't any consistent patterns or obvious delimiters then as far as I'm aware it'll be impossible to split these dynamically and so the best course of action would be to try and have something changed upstream that would allow this. If not then unless I'm mistaken you'll likely have to resort to manual intervention before adding to your flow.
There are parts of the string that are easy to breakdown, and others that are not so easy. Looks very much like a Regex type approach will be best.
Year | Item ID are simple enough
Item description | Store ID are a bit trickier as these may be multi-word and there may not be a way to logically imply the split between the two.
Customer looks to be from a distinct set of possibilities which can be coded for in Regex
Item type hopefully this can be picked out as just being between Customer and Category 1
Category 1 | Category 2 | Category 3 | Category 4 | Category 5 if these have a distinct set of possibilities then this can be coded in Regex
Customer type breakdown looks to be from a distinct set of possibilities which can be coded for in Regex
0 is easy to pick out in Regex
Transaction ID | Date | Loyal? | Qty | Gross sales | Promo | Discount | Net sales all look to be space separated - so Text to Columns or Regex to parse.
If you're new to Regex then this will take some time.
Perl Regular Expression Syntax - 1.68.0 (boost.org) is a good resource for understanding Regex.
Below is a sample split of two records.
Year | Item ID | Item Descriptipn | Store ID | Customer type | Item type | Category 1 | Category 2 | Category 3 | Category 4 | Category 5 | Customer type breakdown | 0 | Transaction ID | Date | Brand | Loyalty? | Qty | Gross sales | Promo | Discount | Net sales |
2021 | 101375 | MARVELON 1 X 21 TAB | P549 | Cash | LOCAL | PHARMA-RX | RX-ENDOCRINE | CONTRACEPTION | ORAL CONTRACEPTIVE | ETHINYLESTRADIOL,DESOGESTREL | Cash | 0 | 5.49E+13 | 00:00.0 | MARVELON | No | 1 | 14.6 | 0 | 0 | 14.6 |
2021 | 507982 | GINKOZAC 30 CAPSULE | P909 | Cash | RONZAC | HEALTH & WELLNESS | NUTRICEUTICALS | FOOD SUPPLEMENT | BLOOD CIRCULATION | GINKOBILOBA | Cash | 0 | 9.09E+13 | 00:00.0 | Yes | 2 | 300 | -150 | 0 | 150 |
@icemanlindelof one way of doing this with the help of regex tool. You need to mention the position of each fields. If you find some of the fields are not fitting adjust the size mentioned inside the regex