community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
SOLVED

Parsing .csv with multiple character separator

Alteryx Partner

Hello All,

 

I'm trying to parse a .csv file that has a multi-character delimiter "<|>".

 

I am trying to do this by using the regex tool but so far have failed miserably :(

 

As I understand, the expression /<\|>/g should match to all my delimiters, as checked here http://regexr.com/

 

This doesn`t seem to work with Alteryx however, as even when checking for matches I get falses for all rows.

 

I could always use a find/replace on the delimiters, but that would force me to recheck the model with every new dataset, as some of the fileds are free text user input and you never know when someone will use your separator in their text.

 

TL;DR: How do I parse a .csv file delimited by the "<|>" separator using regex?

 

All answers are super appreciated!

Highlighted
Alteryx
Alteryx

I think the attached does what you want.

 

Not the prettiest of regex, but does the job.

 

Key points are:

  1. Read in the file with \0 delimiter (which just reads the raw rows)
  2. Use the regex tool in the tokenize mode marking the data you want to keep (.*?)(?:<\|>|$)
  3. First ? is non greedy
  4. Second ? is non marked group
  5. Have to make that second group match your delimiter or the end of the line
Adam Riley
Principal Software Engineer
Tech Lead Core Engines, Alteryx
Alteryx Partner

Hi Adam,

 

That worked beautifully! Thank you for helping :)

 

To follow up on it, I have two questions:

 

1) Could you explain how the expression used works? I'm fairly new to regex and as such don't understand it completely.

2) Is there a way to name the new columns based on the initial column name? (see image)Capture.PNG

 

 

 

 

Cheers! 

 

 

Alteryx
Alteryx

1. No problem

 

We have two groups

 

(.*?) and (?:<\|>|$)

 

(.*?) is a marked group and what we extract to our Alteryx field. So .* says match any number of any characters and ? says do it non greedily.  Without the ? this first group would just eat up the whole string.

 

(?:<\|>|$) is a non-marked group, as designated by the ?: at the start of it.  This means Alteryx will not see this group, we are just using it to do the OR.  Which brings us to the last part <\|> is your delimiter.  | means OR.  And $ means the end of the string.  This is what allows us to extract the very last field.

 

Here is a version from regex101 with highlighting https://regex101.com/r/tAUlKI/1

 

2. You'll want to keep the fieldnames in the data until after the tokenize and then use a dynamic rename to take the fieldnames from the first row of data.  Example attached.

 

 

 

 

Adam Riley
Principal Software Engineer
Tech Lead Core Engines, Alteryx
Alteryx Partner

Thank you, Adam!

 

Very nice explanation on the regular expression, and the website you shared is incredibly useful.

 

The dynamic rename worked flawlessly as well.

 

Cheers :)

Labels