Hi,
I have a field with some data in it for example
[Location]="MSY-MAINT/BAY/1"
The location field should only ever have letters, numbers, the "-" symbol or the "/" symbol and I am willing to take that risk of user error to not have the Regex ridiculously complex. The underlying ERP does a good amount of validation already.
I was trying to do a formula to capture only up to the first "/" so the desired result would be "MSY-MAINT"
This is the 4th formula I started building up after the other methods became too complex to be worth debugging. Note that I've added several qualifiers that appear to be necessary per Regex101, but adding them didn't change the behavior of the Alteryx formula. It did the same thing with or without them (this includes the + for multiple matches and the \ as an escape character).
REGEX_Replace(Uppercase([Location), "([0-9A-Za-z]+)\/(.*)","$1")
This is returning "MSY-MAINT" but it should only return "MSY" as the "-" symbol is not in the capture group. "-" is ascii code 45. Verified with RegEx101 the capture group does not include ascii 45.
regex101: build, test, and debug regex
This should be required for the result Alterxy gave to be correct of "MSY-MAINT"
REGEX_Replace(Uppercase([Locaiton]), "([0-9A-Za-z\-]+)\/(.*)","$1")
I'm sure it's something I misunderstand about Regex or the implemenation of it in Alteryx. Does anyone know why Alteryx appears to give the incorrect result for the first version of the formula?
Thanks.
Solved! Go to Solution.
Hello @ColinG
Great spot, this does appear to be quite a tricky case. I have investigated a little bit, and I believe I have found an answer.
The regex you provided ([0-9A-Za-z]+)\/(.*) actually matches the string "MSY-MAINT/BAY/1" as follows:
([0-9A-Za-z]+) matches "MAINT"
\/ matches "/"
(.*) matches "BAY/1"
As you will notice, "MSY-" is effectively left unmatched. This suggests that the regex replace is also keeping unmatched characters, hence your result is both the unmatched "MSY-" and the matched "MAINT". Admittedly, when using regex from within the formula tool, this is hard to notice, but when I instead used the regex tool itself, this became apparent by toggling the "Copy unmatched text to output option". Please see the images below:
With Copy unmatched text enabled:
With Copy unmatched text disabled:
The reason this is happening is because, like you correctly identified, the character set [0-9A-Za-z] does not include the "-". and you have specified, that immediately after the character set, should be a "/". Therefore the regex statement starts looking at the "M" after the "-".
One way to get around this would be to change the "\/" to a "\b". In Regex, "\b" defines any word boundary, so the regex will now see both the "-" and "/" as word boundaries, returning you the value "MSY". The full regex statement would be: ([0-9A-Za-z]+)\b(.*)
I have attached the workflow with my examples below.
I hope the above all makes sense. Please let me know how you get on.
Regards - Pilsner
I didn't understand what you were saying at first and tend to inherently distrust Alteryx's frequently not quite standard implementation of tools. But this seems to be me not understanding how Regex works under the covers. I should have tried a full validation in Regex101. It confirms what you said and how the RegEx tool worked.
I believed Regex would pattern match from the beginning of a string so I would have expected it to find MSY and stop at the "-". But it appears Regex really finds the end of the first grouping, and then starts pattern matching from the end of the group "T" to the first nonmatching letter (the "-" after the "M") and drops the rest of the leading string.
Not having "-" in the group it stopped at the "-"
Having "-" in the group it continues to the beginning of the string
It would have helped me realize the error of my assumption if Regex_Replace behaved more like standard RegEx, or had a parameter value that specifically called out that copying unmatched characters was a configurable default.
Using \b is an inventive alternative I always forget about trying.
Thanks.
You're absolutely right, regex by default matches the full pattern wherever that may be and will only start at the beginning of the line if specifically told to using the "^" symbol.
I too, found it confusing that there was no parameter for "copy unmatched text" within the Regex_Replace function in the formula tool. I have added it as an idea to the community here , so hopefully this will be added at some point.
Glad I could help.
Regards - Pilsner
I don't understand why, but at least on Regex101, putting the ^ in any position in the formula seemed to be considered valid Regex, but it no longer matched anything, no capture groups were highlighted. I know some of the things I tried below are nonsensical, but sometimes so is Regex
^([0-9A-Za-z]+)\/(.*)
(^[0-9A-Za-z]+)\/(.*)
([^0-9A-Za-z]+)\/(.*)
I believe the reason for that will be that the full pattern cannot be found. By adding the "^", you are forcing the pattern to start at the beginning of the line. It then looks for characters in the set [0-9A-Za-z] and then a "/".
In your string "MSY-MAINT/BAY/1", the "-" is neither in [0-9A-Za-z] or a "/"; therefore, it doesn't fit your pattern. In regex, the entire pattern has to be found for it to be accepted. This was possible without "^" as the regex was not forced to start at the beginning of the line. I have attached a couple of screenshots which I hope make this clearer.
Another way to tackle your issue would be with the following regex : ^([0-9A-Za-z]+)
This begins at the start of the line, then takes all characters from your character set [0-9A-Za-z], up until the first character from outside your character set.
I accept the premise though it does still seem like Regex is behaving differently using ([0-9A-Za-z]+)\/(.*) since it is not forcing a full pattern match from delimiter working back to the beginning of the text input (as far as not returning anything if there isn't a full match).
The only way I can reconcile this discrepancy logically is that Regex treats the given delimiter "/" as a hard terminator to the pattern match, whether it starts from the end or starts from the beginning, whereas the beginning of the string is not treated as an equivalent hard terminator to assess the full pattern match. It just stops because the pattern doesn't continue, and since a terminator of a prior group wasn't found, it is treated as a 100% match and returned.
For human logic this is a little wonky, but isn't Regex anyway...
Thanks for the pointers. This has definitely helped me understand some of the finer points of Regex use and debugging.
I agree, sometimes the logic can seem a bit weird from a human perspective. Whenever I think I understand regex, another curveball will appear.
Glad I could help 😊
User | Count |
---|---|
63 | |
28 | |
23 | |
23 | |
22 |