Alteryx Designer Desktop Discussions

alts2h · ‎06-07-2018

Dear all,

Here is my problem: I need to identify the identical part between two strings A and B.

Put in another way, I need to know how much first characters are identical between string A and string B.

The length of string A and of string B are variable. The length of the identical part is also variable.

There are no specific characters (like - or /) that I can use to try a regex match.

The only thing I know is that the identical part is necessary at the beginning (left) of the two strings.

For exemple:

comparing "ABC123" and "ABC234" must return "ABC"

comparing "ABCD" and "AD12" must return "A"

comparing "123" and "A123" must return Null

I can't figure out a solution...

Thank you so much for your help!

danrh · ‎06-07-2018

Not as concise as I'd like, but something like this will do the trick:

This splits out every character onto a separate row for both fields, gives them a position ID within that field, then matches them back up. Then, after a filter to only keep the first run of matches (in case there's a random match later in the string), it concatenates them back together and re-joins to your original data. Like I said, a little messier than I'd like, but gets the job done!

BenMoss · ‎06-07-2018

What does your data stream look like?

Is it two sources that you want to compare all values against all values, or is your data already mapped together and you want to do this comparison at a row level?

Could you share say a 10 row sample dataset? I have a solution in mind based on what you said but it perhaps isn't the most performant. See the below image and attached workflow.

Essentially we are parsing each letter out and giving it an ID, then joining this parsed field1 and field2 out on the letterID and the letter value.

Once we have this we can identify for how long does the letterID increment by just one (i.e. consecutive letters) and therefor we then know which letters match at the beginning of the string.

alts2h · ‎06-07-2018

Thank you so much to both of you.

It solved my problem and works great.

In my case, there are really few lines (about 100) so there is no performance point to be solved.

JohnJPS · ‎06-07-2018

There's also a cool RegEx trick for this: (see the RegEx answer to a similar question here, which includes several other answers for doing it with R). So, suppose fields [f1] and [f2], and you'll need a character not found in either, in this case I'm using the pipe character, then the following formula will return the common starting substring:

REGEX_Replace(
  [f1]+"|"+[f2],
  "^([^|]*)[^|]*(?:\|\1[^|]*)$",
  "$1"
)

alts2h · ‎06-08-2018

Wouh: AMAZING.

But I'm not sure I really get it...

I definitely need to dig down into RegEx... Do you have any good link / guide to share?
Thanks in advance

Dana_Jinga · ‎06-06-2019

Found it very useful when comparing the fuzzy matching results!

Divyang21Arora · ‎09-26-2021

What if I want to get all the matching characters in the String.

For example, TNT Express and Express Logistic should return "Express" as the output.

JohnJPS · ‎09-26-2021

@Divyang21Arora ,

Two approaches depending on what you're after:

If you just want common words, it should be easy to tokenize each string (e.g. split on spaces where each split goes to a new row), then join these, requiring matching tokens.
If you want matching characters it can get fairly involved... but let's say you want the longest common substring. That's a well known problem in computer science, and you'll probably just want to grab some published Python code. For example here's a Dynamic Programming solution that is also very well documented: https://www.geeksforgeeks.org/longest-common-substring-dp-29/

Hope that helps!

John

Alteryx Designer Desktop Discussions

Identify identical part between two strings

Re: Unable to get an output

Re: Extracting the list of sheet names across mult...

Example workflow for setting up a custom list to u...

Re: Firm names parse

Re: Help with Multi-Row formula