We are celebrating the 10-year anniversary of the Alteryx Community! Learn more and join in on the fun here.
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Identify identical part between two strings

alts2h
7 - Meteor

Dear all,

 

Here is my problem: I need to identify the identical part between two strings A and B.

Put in another way, I need to know how much first characters are identical between string A and string B.

 

The length of string A and of string B are variable. The length of the identical part is also variable.

There are no specific characters (like - or /) that I can use to try a regex match.

The only thing I know is that the identical part is necessary at the beginning (left) of the two strings.

 

For exemple:

comparing "ABC123"  and "ABC234" must return "ABC"

comparing "ABCD" and "AD12" must return "A"

comparing "123" and "A123" must return Null

 

I can't figure out a solution...

Thank you so much for your help!

8 REPLIES 8
danrh
13 - Pulsar

Not as concise as I'd like, but something like this will do the trick:

image.png

 

This splits out every character onto a separate row for both fields, gives them a position ID within that field, then matches them back up.  Then, after a filter to only keep the first run of matches (in case there's a random match later in the string), it concatenates them back together and re-joins to your original data.  Like I said, a little messier than I'd like, but gets the job done!

BenMoss
ACE Emeritus
ACE Emeritus

What does your data stream look like?

 

Is it two sources that you want to compare all values against all values, or is your data already mapped together and you want to do this comparison at a row level?

 

Could you share say a 10 row sample dataset? I have a solution in mind based on what you said but it perhaps isn't the most performant. See the below image and attached workflow.

 

Essentially we are parsing each letter out and giving it an ID, then joining this parsed field1 and field2 out on the letterID and the letter value.

 

Once we have this we can identify for how long does the letterID increment by just one (i.e. consecutive letters) and therefor we then know which letters match at the beginning of the string.

 

2018-06-07_15-55-21.png

 

 

alts2h
7 - Meteor

Thank you so much to both of you.

It solved my problem and works great.

In my case, there are really few lines (about 100) so there is no performance point to be solved.

JohnJPS
15 - Aurora

There's also a cool RegEx trick for this: (see the RegEx answer to a similar question here, which includes several other answers for doing it with R).  So, suppose fields [f1] and [f2], and you'll need a character not found in either, in this case I'm using the pipe character, then the following formula will return the common starting substring:

REGEX_Replace(
  [f1]+"|"+[f2],
  "^([^|]*)[^|]*(?:\|\1[^|]*)$",
  "$1"
)

 

alts2h
7 - Meteor

Wouh: AMAZING.

But I'm not sure I really get it...

I definitely need to dig down into RegEx... Do you have any good link / guide to share?
Thanks in advance

Dana_Jinga
5 - Atom

Found it very useful when comparing the fuzzy matching results!

Divyang21Arora
6 - Meteoroid

What if I want to get all the matching characters in the String.

For example, TNT Express and Express Logistic should return "Express" as the output.

JohnJPS
15 - Aurora

@Divyang21Arora ,

 

Two approaches depending on what you're after:

  • If you just want common words, it should be easy to tokenize each string (e.g. split on spaces where each split goes to a new row), then join these, requiring matching tokens.
  • If you want matching characters it can get fairly involved... but let's say you want the longest common substring. That's a well known problem in computer science, and you'll probably just want to grab some published Python code. For example here's a Dynamic Programming solution that is also very well documented: https://www.geeksforgeeks.org/longest-common-substring-dp-29/

Hope that helps!

John

Labels
Top Solution Authors