Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Parsing .csv with multiple character separator

nathanpartel
5 - Atom

Hello All,

 

I'm trying to parse a .csv file that has a multi-character delimiter "<|>".

 

I am trying to do this by using the regex tool but so far have failed miserably :(

 

As I understand, the expression /<\|>/g should match to all my delimiters, as checked here http://regexr.com/

 

This doesn`t seem to work with Alteryx however, as even when checking for matches I get falses for all rows.

 

I could always use a find/replace on the delimiters, but that would force me to recheck the model with every new dataset, as some of the fileds are free text user input and you never know when someone will use your separator in their text.

 

TL;DR: How do I parse a .csv file delimited by the "<|>" separator using regex?

 

All answers are super appreciated!

4 REPLIES 4
AdamR_AYX
Alteryx Alumni (Retired)

I think the attached does what you want.

 

Not the prettiest of regex, but does the job.

 

Key points are:

  1. Read in the file with \0 delimiter (which just reads the raw rows)
  2. Use the regex tool in the tokenize mode marking the data you want to keep (.*?)(?:<\|>|$)
  3. First ? is non greedy
  4. Second ? is non marked group
  5. Have to make that second group match your delimiter or the end of the line
Adam Riley
https://www.linkedin.com/in/adriley/
nathanpartel
5 - Atom

Hi Adam,

 

That worked beautifully! Thank you for helping :)

 

To follow up on it, I have two questions:

 

1) Could you explain how the expression used works? I'm fairly new to regex and as such don't understand it completely.

2) Is there a way to name the new columns based on the initial column name? (see image)Capture.PNG

 

 

 

 

Cheers! 

 

 

AdamR_AYX
Alteryx Alumni (Retired)

1. No problem

 

We have two groups

 

(.*?) and (?:<\|>|$)

 

(.*?) is a marked group and what we extract to our Alteryx field. So .* says match any number of any characters and ? says do it non greedily.  Without the ? this first group would just eat up the whole string.

 

(?:<\|>|$) is a non-marked group, as designated by the ?: at the start of it.  This means Alteryx will not see this group, we are just using it to do the OR.  Which brings us to the last part <\|> is your delimiter.  | means OR.  And $ means the end of the string.  This is what allows us to extract the very last field.

 

Here is a version from regex101 with highlighting https://regex101.com/r/tAUlKI/1

 

2. You'll want to keep the fieldnames in the data until after the tokenize and then use a dynamic rename to take the fieldnames from the first row of data.  Example attached.

 

 

 

 

Adam Riley
https://www.linkedin.com/in/adriley/
nathanpartel
5 - Atom

Thank you, Adam!

 

Very nice explanation on the regular expression, and the website you shared is incredibly useful.

 

The dynamic rename worked flawlessly as well.

 

Cheers :)

Labels
Top Solution Authors