Dear all,
Here is my problem: I need to identify the identical part between two strings A and B.
Put in another way, I need to know how much first characters are identical between string A and string B.
The length of string A and of string B are variable. The length of the identical part is also variable.
There are no specific characters (like - or /) that I can use to try a regex match.
The only thing I know is that the identical part is necessary at the beginning (left) of the two strings.
For exemple:
comparing "ABC123" and "ABC234" must return "ABC"
comparing "ABCD" and "AD12" must return "A"
comparing "123" and "A123" must return Null
I can't figure out a solution...
Thank you so much for your help!
Solved! Go to Solution.
Not as concise as I'd like, but something like this will do the trick:
This splits out every character onto a separate row for both fields, gives them a position ID within that field, then matches them back up. Then, after a filter to only keep the first run of matches (in case there's a random match later in the string), it concatenates them back together and re-joins to your original data. Like I said, a little messier than I'd like, but gets the job done!
What does your data stream look like?
Is it two sources that you want to compare all values against all values, or is your data already mapped together and you want to do this comparison at a row level?
Could you share say a 10 row sample dataset? I have a solution in mind based on what you said but it perhaps isn't the most performant. See the below image and attached workflow.
Essentially we are parsing each letter out and giving it an ID, then joining this parsed field1 and field2 out on the letterID and the letter value.
Once we have this we can identify for how long does the letterID increment by just one (i.e. consecutive letters) and therefor we then know which letters match at the beginning of the string.
Thank you so much to both of you.
It solved my problem and works great.
In my case, there are really few lines (about 100) so there is no performance point to be solved.
There's also a cool RegEx trick for this: (see the RegEx answer to a similar question here, which includes several other answers for doing it with R). So, suppose fields [f1] and [f2], and you'll need a character not found in either, in this case I'm using the pipe character, then the following formula will return the common starting substring:
REGEX_Replace( [f1]+"|"+[f2], "^([^|]*)[^|]*(?:\|\1[^|]*)$", "$1" )
Wouh: AMAZING.
But I'm not sure I really get it...
I definitely need to dig down into RegEx... Do you have any good link / guide to share?
Thanks in advance
Found it very useful when comparing the fuzzy matching results!
What if I want to get all the matching characters in the String.
For example, TNT Express and Express Logistic should return "Express" as the output.
Two approaches depending on what you're after:
Hope that helps!
John