We are celebrating the 10-year anniversary of the Alteryx Community! Learn more and join in on the fun here.
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

CSV parsing with multiple values in one column

agrawaluk
8 - Asteroid

I have a csv comma delimited file that I am trying to parse in Alteryx but running into a few edge cases. It has 8M+ rows and 34 columns. 

Some of the columns have multiple values that are also separated by comma that I do not want to separate. I cannot put them in quotes, brackets or parenthesis. I have to work with this file.

 

Input:

Col1
ab,cd,7,6,45,21,gh

 

Desired Output:

Col1Col2Col3Col4Col5Col6
abcd7,64521gh

 

I am working with strings mostly here.
I know what column has multiple values (for example col3 here) if that helps. 

Thanks in advance!

6 REPLIES 6
flying008
15 - Aurora

Hi, @agrawaluk 

 

What's the feature of your columns with don't split ? length or pattern ?

apathetichell
20 - Arcturus

Hey -> as described this is impossible.

 

It's usually pretty doable - but the 7,6 is a big issue. unless this is in quotes -> there's now way to specify this is one term vs a two distinct terms without some logic - which you dictate.

 

are there always going to be numbers with commas in column 3,4 -> are they always one term?

 

the core thing you'll end up doing is a record id, text to columns split to rows. you'll use tile with unique value for recordid

you'll then use cross tab -> recordid as a key field -> and tile sequence number as the header, and your column as the value. set it first, or contact mode -> doesn't really matter. This will get you there - except the 3,4 column. They'll be in separate columns because Alteryx isn't a mind reader and can't figure out which commas are part of your number term and which commas are delimiters.

agrawaluk
8 - Asteroid

Sorry, its hard to define pattern or length as the column is more of an 'Item Description' type field with multiple commas and no set field length. However to achieve our goal for even 70%, I can put a number to the string length. Let's say the total length is 9 characters including the comma in between. Example value - "9A-1,9A-2". 

Can we do something with this?

agrawaluk
8 - Asteroid

Agree with you. 

 

The values are not always one term. They are I would say 20% times multiple values. 

 

Thanks for your input. I appreciate it. 

PangHC
13 - Pulsar

@agrawaluk 
if fixed field number and only 1 field of desc. it may use split of cutting front and back.

 

eg:

 

id, acc, desc, a, b, c

1, 23xx, lambda a, b: a + b, 1, 2, 3

 

split front

idaccdesc, a, b, c
123xxlambda x, b: x + b, 1, 2, 3

then split end

idaccdescabc
123xxlambda x, b: x + b123

 

else may use any unique column as anchor in regex,

like column of date, Y/N, or any other fixed length columns

apathetichell
20 - Arcturus

yeah - without an understanding of the semantic construction of the description field and an understanding of the universe -> I can't tell you if you will get to 70%. My assumption is that the value is extracting some type of information from the description field vs just splitting it. I also assume you can kind of do what you want using an LLM but -> costs.  You can try to post some more examples -> you can try to follow my split by text to columns/row mode /tile suggestion above - > and see if that gets you to 70% - if it doesn't it may explain what it is that makes sense/doesn't make sense about your data structure.

 

Maybe pytorch? https://stackoverflow.com/questions/71552396/extracting-data-logically-from-unstructured-text

https://www.reddit.com/r/LangChain/comments/1d3a3ee/parsing_unstructured_text/

langchain? chatgpt?

 

 

Labels
Top Solution Authors