Alteryx Designer Desktop Discussions

JohnY · ‎09-17-2016

I a trying to join on non exact fields

URL extract	Mapstring
rta/podcast/2015/07/eot_20150723_1105.mp3	/rta/podcast//ttt

I have tried using a fuzzy match tool, but it really isnt what I want, it returned a match of 40 for this.

you can see from the above Mapstring column that the intention of the mapping is that it SHOULD match the pattern "rta/podcast/" with "/ttt" following it so the above match is a false positive.

Yes I know you can exclude matches below a threshold, but I want more precision in the matches

It sounds like a job for REGEX to me, but is there away you can use REGEX to join?

Or, in SSIS there is the ability to join two fields and then alter the SQL so that you can use a "Like" statement, is there any similar functionality in Alteryx?

Federica_FF · ‎09-18-2016

I think there are several ways to accomplish what you desire.

The point is: what's the logic behind the matching/joining?

Let's say the logic is that the KEY for the match is "*/rta/podcast/*/ttt", the string that comes after "*/" and before "/*/ttt"

You could use regex, parsing the pattern \W\W(.+)\W\W\Wttt and creating a new column called for example KEY. Then you could use a Find a replace to find (connector F) in the URL extract column the string in the KEY column and APPENDING the MAPSTRING column connected in the R connector..

The point is: the regex pattern will be always good? Are you only looking for a match with "*/rta/podcast/*/ttt" or the column Mapstring can also contains different values?

Another way could be to extract with regex the "key" that exactly match both strings, join all the records of both tables (with the append fields tool) - if the tables are not too much big - and only filter where the key is contained in both column (filter tool with the formula contains([Mapstring], [KEY] AND contains ([URL Extract, [KEY]]).

Or you could create in both flow a new column with the KEY and then make a classic join on that field.

It depends on the data.

JohnY · ‎09-18-2016

OK I am relatively new at this and haven't explained it very well, apologies

And apologies also if I havent correctly understood your response

From what I THINK you are suggesting, I dont think it is what I am looking for. A static, explicit statement parse per your Regex will not work, it really needs to be a join on two fields from two sources

The point of the exercise is to summarise and group URLs in order to interrogate user clicks, the identifier in the below mapping table is the means by which the URLs will be grouped

Before join

After Join:

So, in summary, as you can see above, the astrixes are wildcards, so */rta/podcast/*/zps_* needs to match ONLY URLs that contain BOTH /rta/podcast/ and /zps_ and it needs to be in that order.

zps_/rta/podcast/2014/06/20140615.mp3 HTTP/1.1 for instance (in the wrong order) should not match at all

I got SOME of the matches that I wanted using the Fuzzy Match tool, but it is a very blunt instrument for what I need, it would match the above in the wrong order; not what I want

What I really need is the ability to match 2 datasets together using wildcards with complete precision. As I said in the original post, it seems to me like regex should do it, but the question is, can it be done in the context of a join?

Federica_FF · ‎09-21-2016

Hi,

so basically your map column is already the regex pattern you need, you only need to turn/translate it into regex language.

In the workflow attached I've translate the map string into regex (replacing the astheriscs with ".+" which means any sequence of characters and making the first astherisc optional) with a formula tool.

Then I've joined the tables with an Append tool, matching each record of the first table with each record of the second one, and then filtered only the records where the regex pattern match the URL.

Check if it works!

jgo · ‎09-21-2016

Hi @JohnY,

If the "identifier" value always starts with 'rn_' you can simply use the expression below in a Formula tool to simply parse that part out and concatenate (append) to 'rn_' and forgo a lookup file all together.

REGEX_REPLACE([URL],'^rta/podcast/.+(\w{3})_.*$','rn_$1')

Hope this helps,

JGo

JohnY · ‎09-21-2016

Hi and thank you

Yes in the last couple of days, a colleague and I came up with this solution as well (although yours is more complete and elegant!), but, being a cartesian join, it makes for potentially very large datasets and wait times.

We were kind of hoping there was another way to to do it instead, but if not this will get us out of trouble.

JohnY · ‎09-21-2016

No unfortunately the state

@jgo wrote:
Hi @JohnY,
If the "identifier" value always starts with 'rn_' you can simply use the expression below in a Formula tool to simply parse that part out and concatenate (append) to 'rn_' and forgo a lookup file all together.

REGEX_REPLACE([URL],'^rta/podcast/.+(\w{3})_.*$','rn_$1')

Hope this helps,

JGo

Unfortunately the identifier may be completely different each time. I will check out your regex statements though, clearly your regex is more advanced than mine!

Alteryx Designer Desktop Discussions

Regex join, Wildcard join or "Like" Statements

Re: Unable to get an output

Re: Extracting the list of sheet names across mult...

Re: Chaining Apps

Re: Unable to read in all raw xml from an excel fi...

Unable to read in all raw xml from an excel file