Good Morning!
I'm having a brain freeze trying to join some data. Essentially, I have a concatenated field that I was using as a unique ID, but I found a handful of instances where the data in the concatenated fields can change from year to year. So my join created additional rows for those instances, but I need them on a single row. My example is below, and I was wondering if anyone could think of a way I could join this outlier data. I thought about sorting and then having Alteryx create a unique ID to join on, but I'm not sure what the best approach would be. Could I use Fuzzy Match?
Data Stream 1 (Left Join)

| Year | Client # | Name | Concatenate |
|------|----------|------|-------------|
| 2018 | 1 | John Doe | 1John Doe |

Data Stream 2 (Right Join)

| Year | Client # (number has changed for X reason, but the data is related to the left join) | Name | Concatenate |
|------|------|------|-------------|
| 2019 | 2 | John Doe | 2John Doe |
Hi @Rosaile
I'm not quite sure what the output of your join should look like. Are you saying that the two records for John Doe should be joined into one row? I assume that joining on just the name gives you too many matches.
How do you know that "1John Doe" and "2John Doe" are actually the same record? If you can quantify what additional information implies that result, you should be able to add it to the join. If it's something like an address, then maybe the Fuzzy Match tool can help you.
Dan
Sorry, maybe this is a better example. Think of a single business paying property taxes to a specific county each year. That business may have its account number changed by the taxing jurisdiction from one year to the next; likewise, the business may change its own unique identifier from one year to another, or the business may move addresses, etc.
Either way, I have to keep track of the current-year tax paid and the prior-year tax paid, and both the prior year (PY) and current year (CY) have to be reported on a single line. For most of my report, I can just create a unique ID through concatenation to satisfy this requirement, but in instances where one of the fields used to create the concatenated ID changes, I end up with more than one line.
Data Stream 1 (Left Join)

| Year | Business Name | Business ID | Tax ID |
|------|---------------|-------------|--------|
| 2018 | John's Business | 123 | 456 |

Data Stream 2 (Right Join)

| Year | Business Name | Business ID | Tax ID (changed in the new year) |
|------|---------------|-------------|----------------------------------|
| 2019 | John's Business | 123 | 678 |

What I want to see on my report

| Year | Business Name | Business ID | Tax ID | Input #2 Year | Input #2 Business Name | Input #2 Business ID | Input #2 Tax ID (changed in the new year) |
|------|---------------|-------------|--------|---------------|------------------------|----------------------|-------------------------------------------|
| 2018 | John's Business | 123 | 456 | 2019 | John's Business | 123 | 678 |
As an addendum, either the Business ID or the Tax ID can change, so there really isn't a good unique ID to use. That's why I was thinking maybe Fuzzy Match or a similar tool might work well for the outliers. Unfortunately, I'm not sure how to do that with the Fuzzy Match tool.
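For what a fuzzy name match could look like conceptually, here is a rough Python sketch using the standard library's string-similarity ratio (this is not the Alteryx Fuzzy Match tool itself, and the names and 0.8 threshold are made-up assumptions for illustration):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical prior-year and current-year name fields
py_names = ["John's Business"]
cy_names = ["Johns Business LLC", "Acme Co"]

# Pair each prior-year name with its best current-year candidate,
# keeping only pairs above a similarity threshold
pairs = []
for name in py_names:
    best = max(cy_names, key=lambda c: similarity(name, c))
    if similarity(name, best) >= 0.8:  # threshold is a judgment call
        pairs.append((name, best))
```

Records that pair up this way could then be joined on the matched pair instead of the changed ID.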
Thanks for the assistance @danilang in getting me to think a little more on this.
I decided to try and "waterfall" the joins, if I'm using that term correctly: join first by one of the fields, then by the next one, and so on, until I pick up all the outliers.
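The waterfall-join idea can be sketched outside Alteryx in pandas (the data values are the hypothetical businesses from the example above, plus one made-up second business): join on the primary field first, then route the records that fell out into a second join on an alternate field.

```python
import pandas as pd

# Hypothetical prior-year (py) and current-year (cy) streams
py = pd.DataFrame({"Year": [2018, 2018],
                   "Business Name": ["John's Business", "Acme Co"],
                   "Business ID": [123, 888], "Tax ID": [456, 555]})
cy = pd.DataFrame({"Year": [2019, 2019],
                   "Business Name": ["John's Business", "Acme Co"],
                   "Business ID": [123, 999], "Tax ID": [678, 555]})

# Pass 1: inner join on Business ID keeps the clean matches
pass1 = py.merge(cy, on="Business ID", suffixes=("_PY", "_CY"))

# Pass 2: the leftovers get joined on Tax ID instead
py_left = py[~py["Business ID"].isin(pass1["Business ID"])]
cy_left = cy[~cy["Business ID"].isin(pass1["Business ID"])]
pass2 = py_left.merge(cy_left, on="Tax ID", suffixes=("_PY", "_CY"))

# Union of both passes: every business on a single row
result = pd.concat([pass1, pass2], ignore_index=True)
```

Each pass only sees the records that every earlier pass failed to match, which is what keeps the joins from creating duplicate rows.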
Hi @Rosaile
Is the full Name a good field to join on?
I was thinking about an Append Fields tool, but if there are many records in both data streams, this wouldn't perform well.
Still, appending is a good way to pair data and then use other tools to compare columns with each other.
Cheers,
Thank you for the help @Thableaus . I do have a lot of records, but when I get down to the remaining items I'm having to join, I will definitely look into the Append Fields tool, as that's one I haven't tried out yet.
I think you're on the right track with the waterfall joins.
One approach we've taken before is to create a crosswalk table. Typically in cases like this, there is some individual who knows for sure what the old and new numbers are; someone in accounting, for instance, would probably have to track that.
So we make a table, very much like your output example, putting the old and new numbers in the same row, but you only have to do it for the items that changed.
If you know for sure what's changed, you can match it up at the beginning instead of the end using a text input and a find/replace.
Then when your data proceeds through the rest of your workflow, you can join on the old numbers instead of trying to match old and new.
Another benefit of this method is that any additional companies that "fall out" of the join at the end can be added to the text input, so your unmatched records gradually shrink as you run the workflow, add the exceptions to the text input, and run again.
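The crosswalk approach above can be sketched in pandas (the old/new Tax ID pair is the one from the example tables; in Alteryx the equivalent would be a Text Input plus Find/Replace): normalize the old numbers to the new ones before the join so the IDs line up.

```python
import pandas as pd

# Hand-maintained crosswalk of known ID changes: old -> new
crosswalk = {456: 678}  # Tax ID 456 became 678 in the new year

py = pd.DataFrame({"Year": [2018], "Business Name": ["John's Business"],
                   "Business ID": [123], "Tax ID": [456]})
cy = pd.DataFrame({"Year": [2019], "Business Name": ["John's Business"],
                   "Business ID": [123], "Tax ID": [678]})

# Translate prior-year IDs to the new numbering before joining
py["Tax ID"] = py["Tax ID"].replace(crosswalk)

# Now a plain join on the IDs puts PY and CY on one row
report = py.merge(cy, on=["Business ID", "Tax ID"], suffixes=("_PY", "_CY"))
```

Any business that still falls out of the join just means the crosswalk is missing an entry, which mirrors the run / add exception / run-again loop described above.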
Thank you for your feedback @ThizViz . I'm going to utilize your suggestion and add the "old" numbers in the same row with the new at the beginning of the workflow, so they can be matched up later on.