Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Need help to cleanse a horrible data source

jt_edin
8 - Asteroid

Hi All. I've been given some data that should have a simple format. Here are the three fields: ID1, ID2, Comment

 

Both IDs are numbers, and the Comment is free text. The problem is that the comment field hasn't been wrapped in double quotes, so when you parse it, things like commas and carriage returns are interpreted literally rather than escaped as they should in a csv wrapped in "".

 

I opened the data in Alteryx using the '\0' csv delimeter trick. Here is a good example of what I'm dealing with:

 

  1. 13353,5546,Proper line of feedback with no commas in comment seems to work OK
  2. 85575,348568,I think it would be better, I mean, why can't we do the following.
  3.  
  4. The process should be streamlined
  5. 53343,8923,Proper line of feedback

Lines 1 and 5 are OK. Lines 2, 3 and 5 have some problems:

  • Line 2 has commas in the text comment. Since commas are also used to delimit the fields, we need to wrap everything after the second comma in double quotes.
  • Line 3 is an extra carriage return preserved in the text comment
  • Line 4 is a continuation of the comment belonging to the IDs in line 2. How can these comments be appended to the previous line, replacing a carriage return with a space?

This is possibly a quick job in Regex? Ideally I'd like the data to look like this:

 

  1. 13353,5546,"Proper line of feedback"
  2. 85575,348568,"I think it would be better, I mean, why can't we do the following. The process should be streamlined"
  3. 53343,8923,"Proper line of feedback"

I can see that if a line start with a number it's probably an ID, and if it starts with a non-numerical char it's probably a comment that needs to be moved up. But I don't know enough about regex to detect these things.

 

Finally, the two ID fields are numeric, but of varied lengths, so I can't just judge a solution based on character positions.

 

If anyone can point out how to solve this problem I'd be very grateful! Thanks

3 REPLIES 3
JohnJPS
15 - Aurora

A MultiRow formula can be used to join the relevant lines based on a regular expression matching 'number,number,everything else' ... then split to 3 columns.  This will give you everything but will have duplicates with the appended and non-appended rows... to remedy that, sort the third column descending prior to doing a "unique" on the first towo columns: the descending sort will ensure that "unique" grabs only the longer of the two rows. A final Formula can add in the quotes on the parsed third field.  See attached - hope it helps.

 

Capture.PNG

jt_edin
8 - Asteroid

Thank you, this looks incredible!

 

Unfortunately I've discovered that some comments span many lines, depending on how many times the user hit enter when typing their helpful feedback comments! Example attached (I have amended the text input). Since these span more rows than r-1, r, or r+1, can they still be handled by a multi-row formula?

 

Thank you

JohnJPS
15 - Aurora

@jt_edin,

 

EDIT: actually the given solution works for multiple lines since it just continues appending them as it steps through the MultiRow formula.

 

However, you may need to add an initial Select tool, in order to expand the allowed field size, so that the appends don't get truncated... (see new attachment).

Labels