Goal:
Match a Car (via Car ID field) in one data source to a Trip(s) (via free text Description field) in another data source. Essentially I want to do a left join where instead of a verbatim/exact match, I want to join on instances where the left join field is contained in the right join field.
Question:
Is there a way to get a list of ID's from one data source, and compare it to a list of Descriptions in another data source, to get a match/fuzzy match?
Supplemental Background:
Data Source 1: Contains a list of Cars (15,000 rows) and the Car's ID. This is an inventory of all our vehicles. Each row represents a single Car. The Car IDs are arbitrarily labeled and are inconsistent (i.e., one Car ID may be "15838293" and another might be "Kia_Car_29_2024")
Data Example 1:
Car ID |
Kia-Forte_2018 |
Dougs Car |
329874329874 |
Dougs Car Kia |
AB1 |
Data Source 2: Contains a list of Trips (400,000 rows). This is an inventory of all the Trips that a Car took to say, deliver a package or pick up a client. Each row represents a single Trip. But, there is no "Car ID" field in Data Source 2 to directly link a Trip to a Car. However, a Description field does exist in which users will sometimes note the Car ID (e.g., "Personal. used car 15838293. 16 hours total." or "Delivery to warehouse. Kia_Car_29_2024. July 14th.").
Data Source Example 2: (bold being a match)
Trip ID | Description |
1 | Trip to storage from warehouse - kia-forte_2018 |
2 | July 28 2019 - John - Warehouse to Service - AB112 |
3 | This was a trip from the data centers to the warehouse. |
4 | Trip to Ohio. 329874329874 |
5 | James-1/2/2020-Warehouse |
6 | Authorized vehicle driven to storage kia forte 2018 |
7 | Drove Dougs Car Kia to warehouse |
8 | Vehicle to servicing |
9 | N/A |
Ideal Output:
Trip ID | Description | Matched Car ID |
1 | Trip to storage from warehouse - kia-forte_2018 | Kia-Forte_2018 |
2 | July 28 2019 - John - Warehouse to Service - AB112 | AB1 |
4 | Trip to Ohio. 329874329874 | 329874329874 |
6 | Authorized vehicle driven to storage kia forte 2018 | Kia-Forte_2018 |
7 | Drove Dougs Car Kia to warehouse | Dougs Car |
7 | Drove Dougs Car Kia to warehouse | Dougs Car Kia |
Current Attempts:
Fuzzy Match: Tried this but it only allows you to bring in one data source
Find & Replace: Tried this and got results but I fear that if I had two Car ID's of "Dougs Car" and "Dougs Car Kia" and a Trip Description of "Drove Dougs Car Kia to warehouse" it would match and therefore replace it with the incorrect "Dougs Car"
Any help would be appreciated, thank you!
Solved! Go to Solution.
@KyJoLe this should be what you are looking for. I used an Append/Formula combo to test whether the Car ID strings were inside the Trip Description. Hope that helps!
Edit: Added Generate Rows tool to the top of the workflow to account for Car IDs with special characters omitted (re: Car ID 6).
@SPetrie do you have any documentation on the "question.controlparam" variable? I didn't know this technique existed.
Thank you rzdodson and SPetrie! I'm currently modifying/configuring and testing your solutions.
It's taking a little longer because the data sets I'm comparing are large (2 lists with ~150k+ rows = ~22 billion combinations if I understand the flow right).
I'll let you know the outcome once I finish.
Hi Guys,
I thought this would be helpful for my GL analysis for tax returns, I download GL data line by line description and then find keywords based on which the line items can be deductible and non-deductible. Let me know if this can be tweaked for that?
th
@shash270 Im sure tweaks can be made, but I think it will be best for you to create your own post with example inputs and expected output rather than continue it here.
Thanks @Spetrie !
Long overdue but I've accepted both solutions. SPetrie's worked better with large datasets (comparing dataset of 100K against dataset of 300k yielding 30B total comparisons) and rzdodson worked better for other projects with smaller datasets (comparing dataset of 10K against dataset of 20K). Thank you again everyone!
User | Count |
---|---|
106 | |
82 | |
72 | |
54 | |
40 |