Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Regex join, Wildcard join or "Like" Statements

JohnY
7 - Meteor

 

I a trying to join on non exact fields

 

URL extractMapstring
rta/podcast/2015/07/eot_20150723_1105.mp3*/rta/podcast/*/ttt

 

 I have tried using a fuzzy match tool, but it really isnt what I want, it returned a match of 40 for this.

 

you can see from the above Mapstring column that the intention of the mapping is that it SHOULD match the pattern  "rta/podcast/" with "/ttt" following it so the above match is a false positive.

 

Yes I know you can exclude matches below a threshold, but I want more precision in the matches

 

It sounds like a job for REGEX to me, but is there away you can use REGEX to join?

 

Or, in SSIS there is the ability to join two fields and then alter the SQL so that you can use a "Like" statement, is there any similar functionality in Alteryx?

 

 

 

 

6 REPLIES 6
Federica_FF
11 - Bolide

I think there are several ways to accomplish what you desire.

 

The point is: what's the logic behind the matching/joining?

 

Let's say the logic is that the KEY for the match is "*/rta/podcast/*/ttt", the string that comes after "*/" and before "/*/ttt"

 

You could use regex, parsing the pattern \W\W(.+)\W\W\Wttt and creating a new column called for example KEY. Then you could use a Find a replace to find (connector F) in the URL extract column the string in the KEY column and APPENDING the MAPSTRING column connected in the R connector..

 

Immagine.png

 

The point is: the regex pattern will be always good? Are you only looking for a match with "*/rta/podcast/*/ttt" or the column Mapstring can also contains different values?

 

Another way could be to extract with regex the "key" that exactly match both strings, join all the records of both tables (with the append fields tool) - if the tables are not too much big - and only filter where the key is contained in both column (filter tool with the formula contains([Mapstring], [KEY] AND contains ([URL Extract, [KEY]]).

 

Or you could create in both flow a new column with the KEY and then make a classic join on that field.

 

It depends on the data.

JohnY
7 - Meteor

OK I am relatively new at this and haven't explained it very well, apologies

 

And apologies also if I havent correctly understood your response

 

From what I THINK you are suggesting, I dont think it is what I am looking for. A static, explicit statement parse per your Regex will not work, it really needs to be a join on two fields from two sources

 

The point of the exercise is to summarise and group URLs in order to interrogate user clicks, the identifier in the below mapping table is the means by which the URLs will be grouped

 

Before join

 Temp3.JPG

 

After Join:

Temp2.png

 

So, in summary, as you can see above, the astrixes are wildcards, so */rta/podcast/*/zps_* needs to match ONLY URLs that contain  BOTH /rta/podcast/ and /zps_ and it needs to be in that order.

 

zps_/rta/podcast/2014/06/20140615.mp3 HTTP/1.1 for instance (in the wrong order) should not match at all

 

I got SOME of the matches that I wanted using the Fuzzy Match tool, but it is a very blunt instrument for what I need, it would match the above in the wrong order; not what I want

 

What I really need is the ability to match 2 datasets together using wildcards with complete precision. As I said in the original post, it seems to me like regex should do it, but the question is, can it be done in the context of a join?

 

 

 

 

 

Federica_FF
11 - Bolide

Hi,

 

so basically your map column is already the regex pattern you need, you only need to turn/translate it into regex language.

 

In the workflow attached I've translate the map string into regex (replacing the astheriscs with ".+" which means any sequence of characters and making the first astherisc optional) with a formula tool.

 

Then I've joined the tables with an Append tool, matching each record of the first table with each record of the second one, and then filtered only the records where the regex pattern match the URL.

 

Check if it works!

 

jgo
Alteryx Alumni (Retired)

Hi @JohnY,

If the "identifier" value always starts with 'rn_' you can simply use the expression below in a Formula tool to simply parse that part out and concatenate (append) to 'rn_' and forgo a lookup file all together.

 

REGEX_REPLACE([URL],'^rta/podcast/.+(\w{3})_.*$','rn_$1')

 

Hope this helps,

 

JGo

JohnY
7 - Meteor

Hi and thank you

 

Yes in the last couple of days, a colleague and I came up with this solution as well (although yours is more complete and elegant!), but, being a cartesian join, it makes for potentially very large datasets and wait times.

 

We were kind of hoping there was another way to to do it instead, but if not this will get us out of trouble.

JohnY
7 - Meteor

No unfortunately the state


@jgo wrote:

Hi @JohnY,

If the "identifier" value always starts with 'rn_' you can simply use the expression below in a Formula tool to simply parse that part out and concatenate (append) to 'rn_' and forgo a lookup file all together.

 

REGEX_REPLACE([URL],'^rta/podcast/.+(\w{3})_.*$','rn_$1')

 

Hope this helps,

 

JGo

 

Unfortunately the identifier may be completely different each time. I will check out your regex statements though, clearly your regex is more advanced than mine!


 

Labels