I have a csv comma delimited file that I am trying to parse in Alteryx but running into a few edge cases. It has 8M+ rows and 34 columns.
Some of the columns have multiple values that are also separated by comma that I do not want to separate. I cannot put them in quotes, brackets or parenthesis. I have to work with this file.
Input:
Col1 |
ab,cd,7,6,45,21,gh |
Desired Output:
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
ab | cd | 7,6 | 45 | 21 | gh |
I am working with strings mostly here.
I know what column has multiple values (for example col3 here) if that helps.
Thanks in advance!
Hey -> as described this is impossible.
It's usually pretty doable - but the 7,6 is a big issue. unless this is in quotes -> there's now way to specify this is one term vs a two distinct terms without some logic - which you dictate.
are there always going to be numbers with commas in column 3,4 -> are they always one term?
the core thing you'll end up doing is a record id, text to columns split to rows. you'll use tile with unique value for recordid
you'll then use cross tab -> recordid as a key field -> and tile sequence number as the header, and your column as the value. set it first, or contact mode -> doesn't really matter. This will get you there - except the 3,4 column. They'll be in separate columns because Alteryx isn't a mind reader and can't figure out which commas are part of your number term and which commas are delimiters.
Sorry, its hard to define pattern or length as the column is more of an 'Item Description' type field with multiple commas and no set field length. However to achieve our goal for even 70%, I can put a number to the string length. Let's say the total length is 9 characters including the comma in between. Example value - "9A-1,9A-2".
Can we do something with this?
Agree with you.
The values are not always one term. They are I would say 20% times multiple values.
Thanks for your input. I appreciate it.
@agrawaluk
if fixed field number and only 1 field of desc. it may use split of cutting front and back.
eg:
id, acc, desc, a, b, c
1, 23xx, lambda a, b: a + b, 1, 2, 3
split front
id | acc | desc, a, b, c |
1 | 23xx | lambda x, b: x + b, 1, 2, 3 |
then split end
id | acc | desc | a | b | c |
1 | 23xx | lambda x, b: x + b | 1 | 2 | 3 |
else may use any unique column as anchor in regex,
like column of date, Y/N, or any other fixed length columns
yeah - without an understanding of the semantic construction of the description field and an understanding of the universe -> I can't tell you if you will get to 70%. My assumption is that the value is extracting some type of information from the description field vs just splitting it. I also assume you can kind of do what you want using an LLM but -> costs. You can try to post some more examples -> you can try to follow my split by text to columns/row mode /tile suggestion above - > and see if that gets you to 70% - if it doesn't it may explain what it is that makes sense/doesn't make sense about your data structure.
Maybe pytorch? https://stackoverflow.com/questions/71552396/extracting-data-logically-from-unstructured-text
https://www.reddit.com/r/LangChain/comments/1d3a3ee/parsing_unstructured_text/
langchain? chatgpt?