Hi Alteryx Community,
I am trying to tokenize a series of strings with letters, numbers, and special characters by every instance of a capital letter of a word, but I only want to tokenize if there is no "/" in between two words. How would I go about that? The data is very random and inconsistent. Ideally I would like to tokenize by the "/" but every now and then there is a date using "/" so I am attempting to go with each capital letter.
For example:
Polar Bear / Train (CAR) 18-19 / Nickel/Vacation 1/2 - 1/6/ CAT 1/3 Alligator X&Y / FLOMINGO / Vulture/GORILLA / HOMECARE2 Provide
To turn into:
Polar Bear
Train (CAR)
Nickel
Vacation
CAT
Alligator X&Y
FLOMINGO
Vulture
GORILLA
HOMECARE2 Provide
Hi there,
I got to your list by using the following within the Regex tokenize tool:
[^\d-]
To strip out the numbers (and make sure the list was consistent with yours), I followed up with a parse using the following:
([^\d-]+)
Hope this helps but let me know if this doesn't give you what you're after.
Charlie
Hi Charlie,
Thank you for the reply. I tried using regex to tokenize by [^\d-] and then had a regex parsing by ([^\d-]+), but the first regex simply broke out the set one character at a time (and therefore the second regex didn't do much). Please let me know if I misunderstood your response.
The following is a screen shot of the tokenize tool:
Thank you
Matt
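The one-character-at-a-time result Matt describes can be reproduced outside Alteryx. The following is a minimal sketch in Python, assuming `re.findall` as a rough stand-in for the tokenize method (one output per match); the `sample` string is only an illustrative fragment of the original data. Without a quantifier, `[^\d-]` matches a single character per token; adding `+` makes each token a run of characters.

```python
import re

# Illustrative fragment of the original data (assumption: any short piece shows the effect).
sample = "CAT 1/3"

# No quantifier: every non-digit, non-hyphen character becomes its own token.
per_char = re.findall(r"[^\d-]", sample)

# With +: each token is a run of consecutive non-digit, non-hyphen characters.
runs = re.findall(r"[^\d-]+", sample)

print(per_char)  # ['C', 'A', 'T', ' ', '/']
print(runs)      # ['CAT ', '/']
```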
Thank you, this is great. My only concern is what to do when there is a number within the word/phrase like the "HOMECARE2 Provide" at the end.
I love a good Regex question! I'll just add that in your example you wanted to also capture the "HOMECARE2 Provide". In that case I would change the regex in the tokenize to:
(\u.+?)(?:/|$)
The (?:/|$)
is an unmarked (non-capturing) group - it says to stop when you hit either a / or the end of the line ($ in regex).
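For readers who want to try Bob's expression outside Alteryx, here is a rough Python sketch. It rests on two assumptions: Python's `re` module has no `\u` class, so `[A-Z]` (ASCII capitals only) stands in for it, and `re.findall` stands in for the tokenize method. The `tokenize` helper name is hypothetical, and the trimming of surrounding spaces is an extra step not in the original expression.

```python
import re

# [A-Z] replaces Alteryx's \u here (assumption: ASCII capitals are enough for this data).
# (?:/|$) is the unmarked group: stop at a "/" or at the end of the line.
TOKEN = re.compile(r"([A-Z].+?)(?:/|$)")

def tokenize(text):
    """Return each capture (capital letter up to the next '/' or end), trimmed of spaces."""
    return [match.strip() for match in TOKEN.findall(text)]

print(tokenize("Vulture/GORILLA / HOMECARE2 Provide"))
# ['Vulture', 'GORILLA', 'HOMECARE2 Provide']

print(tokenize("CAT 1/3 Alligator X&Y"))
# ['CAT 1', 'Alligator X&Y']
```

Note that the slash inside a date still acts as a delimiter, so fragments such as the trailing "1" in "CAT 1" remain and need a separate cleanup step.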
Thank you Bob. What do you suggest I do for the second regex parsing tool since it is deleting the 2 and everything after that in "HOMECARE2 Provide"? Please keep in mind that the digit (if part of the word/phrase) would not always be at the end of the word itself, but may be in the beginning or middle of the word.
Please see the screen shot below:
Hi Matt,
Regex is fun because how you build it really depends on your data and your knowledge of how quirky it can be.
Based on your original post, it looks like you want to keep "HOMECARE2 Provide", and the numbers you want to get rid of are the dates.
So what I did first was replace the possible dates with nothing.
The first replace gets rid of dates in either 2/17 or 12/15 format, the second in either 2-17 or 2-15 format.
As I said regex is a great tool, but how complex your expressions are depends on how extreme your data is.
Cheers,
Bob
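Bob's actual replace expressions are not shown in the thread, so the following Python sketch is only a guess at what the two replaces might look like. The patterns, the constant names, and the `strip_dates` helper are all assumptions: one or two digits on each side of the separator, with `\b` word boundaries so that a digit embedded in a word (such as the 2 in HOMECARE2) is left alone.

```python
import re

# Assumed patterns, not taken from the thread.
SLASH_DATE = re.compile(r"\b\d{1,2}/\d{1,2}\b")   # e.g. 2/17, 12/15
HYPHEN_DATE = re.compile(r"\b\d{1,2}-\d{1,2}\b")  # e.g. 2-17, 18-19

def strip_dates(text):
    """Replace slash-style dates, then hyphen-style dates, with nothing."""
    text = SLASH_DATE.sub("", text)
    return HYPHEN_DATE.sub("", text)

print(repr(strip_dates("Train (CAR) 18-19")))  # 'Train (CAR) '
print(repr(strip_dates("CAT 1/3")))            # 'CAT '
print(repr(strip_dates("HOMECARE2 Provide")))  # 'HOMECARE2 Provide'
```

One caveat with this sketch: removing a date also removes its slash, so two entries that were separated only by a date (for example "CAT 1/3 Alligator") would no longer have a "/" between them.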