I have 2 data sets that I am trying to compare 1 to 1, but this is split across 3 columns and the way the fields are concatenated is difference.
Data set 1 has 1 column with a concatenation of String 1, String 2, and String 3, while Dataset 2 has one column containing a concatenation of String 1 and String 3 and another column with String 2.
Data set 1
Column 1: String 1 + String 2 + String 3
Data set 2
Column 2: String 1 + String 3
Column 3: String 2
There is no pattern of what type of data could be in each string - it is not limited to words, addresses, IDs, etc and none of the fields have a predictable or set length. I am wondering if it is possible to first compare the left and right of Column 1 to Column 2, and then take the area in the middle that stopped matching from Column 1 to Column 2 to then compare to Column 3. We are looking to not use fuzzy match if at all possible.
Example:
Data set 1
Column 1: UX4759d452regcustdelivery
Data set 2
Column 2: UX4759regcustdelivery
Column 3: d452
Thank you very much in advance for your consideration.
Solved! Go to Solution.
Let's say you take the left and right of both columns and want to compare them. How are you going to determine the amount of character from left and right if each record can have a different amount?
Another option would be if you already have a table with all the IDs, then you could use find and replace to check against that table.
Here's one way that you can do it
The key formula is this
if findstring([Column1],[Column3])>=0 then
Left([Column2],findstring([Column1],[Column3]))+[Column3]+Substring([Column2],findstring([Column1],[Column3]))
else
""
endif
If column 3 is contained in column 1, then take the left part of column 1 to where column 3 starts, add column 3 and then add the rest of column 1
This works for the base case where column 2 is the one that has the piece missing. You may have to modify it to work in edge cases.
Dan
Hi Dan,
Thanks so much for this. This took care of most of the data, but there is a bit remaining that falls under a slightly different pattern where the middle string of Column 1 is not a direct match to Column 3. However, if there was a way to isolate the beginning and end of Column 1 (UX4759 and regcustdelivery) and see if the remaining d452 from Column 1 is contained within Column 3, then that would suffice as a match for my purposes.
I’m thinking the logic might be quite different since you can’t use findstring to search for Column 3 in Column 1 directly, but any ideas?
Data set 1
Column 1: UX4759d452regcustdelivery
Data set 2
Column 2: UX4759regcustdelivery
Column 3: 123d452xyz
If we can see that d452 from Column 1 is present within Column 3 (using a contains?) then that would be enough.
Thanks again very much for your help.
Here's a way that you can accomplish the second part
For each of column 1 and 2, split the string to letters and join on record id and position. Use a Multi-row to find where the strings stop matching and the number of matching letters is the length of the left hand match. Do the same after reversing the strings to find the length of right hand match. Join these back up to you original input and use
Substring([Column 1],[LeftSide],length([Column 1])-([LeftSide]+[RightSide]))to pull out the middle piece. Look for this middle in column 3
Dan
Hi Dan,
Thank you so much for this - this is perfect. I marked your first answer as the solution since that was the fit for the original question, but this is also unbelievably helpful. I really appreciate you taking the time to make these. This is also my first time seeing Regex Tokenize, something I clearly need to work on. Thank you again!
