REGEX_REPLACE - Please explain why this works

Question

By lots of internet searches, and dumb luck, I wrote a REGEX_REPLACE formula that does exactly what I want.  I'm hoping someone can explain why it works.

I wanted to evaluate a string containing 0-x underscores.  If there were less than 2, then return everything to the left of the first underscore.  If there were more than 2, then return everything to the left of the second to last underscore.

So, if the string was "abc_def", then I wanted to return "abc".

If it was "abc_def_hij_klm", then I only wanted "abc_def".

I did it using the following formula:

if REGEX_CountMatches([text_in], "_") < 2 then
REGEX_Replace([text_in], "(.+)_(.+)","$1") else
REGEX_Replace([text_in], "(.+)_(.+_.+)","$1") endif

I understand the if REGEX_CountMatches([text_in], "_") < 2 part.  What I don't understand is why the last two lines work like they do.  Specifically, why/how does the pattern work?

I have attached the workflow.

regex_trial.yxmd

Garabujo7 · Answer

Hello @afinzerillo ,

I'll recommend that you take this interactive tutorial about regex to understand what each piece does:

https://regexone.com/

It's been very useful for me.

Also another useful resource is this interactive regex tester

https://regex101.com/

Hope that helps

Gabriel

atcodedog05 · Answer

Hi @afinzerillo

In REGEX_Replace([text_in], "(.+)_(.+_.+)","$1") formula .+ means one or more any characters. Hence whats Regex is doing is its trying to fit the max in the first bracket such that it still maintains pattern (string)_(string_string) this is how the Regex works it will try to fit the max in the first occurrence of the pattern also keeping into the consideration that it meets the whole pattern.

Hope this helps : )

afinzerillo · Answer

@atcodedog05

I expected the output as written and produced in the workflow I uploaded.  See below.

Notice how "abc_def_efg_hij_klm_nop_qrs_tuv_wxy_z12" is returned as "abc_def_efg_hij_klm_nop_qrs_tuv".  I get everything to the left of the second to last underscore.  the syntax I wrote, "(.+)_(.+_.+)" works like I want it no matter how many underscores there are in the string.  I just don't know how Alteryx is processing "(.+)_(.+_.+)" to make it happen.