According to the documentation of the RegEx tool (and general practice AFAIK), '\w' in RegEx should be shorthand for [A-Za-z0-9_]. That is, all uppercase and lowercase unaccented Latin letters, digits and the underscore.
However, in versions 2023.1+ (at least), \w matches letters from any alphabet.
Is this expected behaviour? Has this always been the case? Is there a setting I'm missing?
Thanks,
Ollie
I'm on 2021.4 and running your workflow produces the same results for me.
Thanks @apathetichell
So according to Perl's documentation, it looks like Alteryx is behaving as if the /a modifier is not in effect. I also noted that there are differences in behaviour with AMP on and off. From Alteryx's documentation, this looks like it is using Unicode rules.
So this has maybe always been the case, but the documentation in the RegEx tool is misleading.
Interesting.
Using the regex101.com website, it looks like the /u modifier causes the expression \w to include alphanumeric characters from non-Latin scripts.
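To illustrate the difference outside Alteryx, here is a minimal Python sketch. It is only an analogy: Python's `re` module is not the engine Alteryx uses, and `re.ASCII` is Python's own flag, which I'm assuming behaves comparably to Perl's /a for `\w`. The sample string is made up for the example.

```python
import re

# Hypothetical sample: one ASCII token, one Greek-initial token, one Cyrillic token
sample = "data_1 Ωmega данные"

# Python 3 default for str patterns: \w follows Unicode rules
unicode_words = re.findall(r"\w+", sample)

# re.ASCII restricts \w to [A-Za-z0-9_]
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

# Writing the class out explicitly needs no flag at all
explicit_words = re.findall(r"[A-Za-z0-9_]+", sample)

print(unicode_words)
print(ascii_words)
print(explicit_words)
```

With Unicode rules all three tokens match in full; with the ASCII flag or the explicit class, only the unaccented Latin letters, digits and underscore match. If a flag can't be set, spelling out [A-Za-z0-9_] instead of \w might be the simplest way to get the documented behaviour.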
Did you find any way to turn on the /a modifier in Alteryx regex?
Chris
@ChrisTX Unfortunately I don't think we can (de)activate flags in Alteryx's RegEx (other than case insensitivity). Certainly @MarqueeCrew was asking for the ability to change the multiline flag here: