on 03-03-2016 12:37 PM - edited on 07-27-2021 11:43 PM by APIUserOpsDM
It's not uncommon to have a situation where you need to conditionally join one dataset to another. Perhaps the most common is when you want to join one file to another where a date from the first file is between, greater than or less than a date(s) on a second file. The default tools found in the Join section of the tool palette don't provide a simple way of doing this (it can be done, but you need to string several tools together to make it work. There is a better way! Read on...).
There is great macro available in the public Alteryx gallery called Advanced Join (find it here, but spoiler alert... you can download the attached workflow which includes this macro so you don't have to go to the gallery to get it). The Advanced Join gives you greater latitude than the Join tool. Most notably, you can select records from file A that are unique to A AND intersect with file B. Now you may be thinking, “I can do that by unioning the records from an inner join with records from a left join,” and you would be correct. But it takes two tools to do what one Advance Join does. More importantly, the Advanced Join allows you to put a conditional statement on your join which is something you can't do with the Join tool. And it’s this feature - the ability to use conditional statements in a join - which we will focus on for our purpose here.
Let's get into some examples. I have a file, 'Fruit List’, which contains data about various fruits. This file contains a Column Id, a Fruit Name, a Start DateTime and an End DateTime:
I have a second file, 'Greek Alphabet’, which contains a Column Id, a Greek letter and a Datetime.
I want to join the two files on ColumnId where the Datetime from Greek Alphabet (file B) is BETWEEN Start Datetime and End Datetime from Fruit List (file A). Here's the workflow and a screenshot of how to configure the Advanced Join:
And here are what my results look like:
Only one record from Greek Alphabet matched one from Fruit List on ColumnId where Greek Alphabet's Datetime was between Fruit List's Start Datetime and End Datetime.
In the next example, I have the same Fruit List file and want to join it another file, Greek Alphabet that contains just one datetime filed:
The first thing to note is both files have a field called 'DateTime.' We'll want to give these unique names to avoid ambiguity when we write our conditional state in the Advance Join configuration.
I want to join both files on ColumnId but only when DateTime from Fruit List is LESS THAN DateTime from Greek Alphabet:
And the results...:
Let's look at one last example. This time, I'm going to use the Fruit List and Greek Alphabet files used in the first example (Fruit List has a Start DateTime and an End DateTime). I'm interested in matching records where DateTime from Greek Alphabet is BETWEEN Start Datetime and End DateTime from Fruit List. I'm not matching on ColumnId this time.
For the Advanced Join configuration, I'm going to cross join my files. (CAUTION: the resulting join could contain as many rows as the product of the number of rows of the incoming datasets - a.k.a. Cartesian join - depending on how restrictive your conditional is. This means if you're joining 2 datasets that contain a million records each, the resulting dataset could contain as many as one trillion records! ). If I had wanted to match on ColumnId, I would have had to do that separately using a Join tool. The cross join option only allows you to apply a conditional statement:
Results from our 3rd example:
Notice how 10 records from Greek Alphabet were joined to just one record from Fruit List.
The Advanced Join tool can save you time and a lot of headaches when you want to join files using a conditional statement. It has some limitations - you can only join two datasets and include one conditional statement per tool, cross join limitation mentioned above - but Advanced Join provides greater capability and flexibility than the standard Join tool.
I ran into a situation recently where I had two tables with a conditional join that involved a column from TableA being bound by two columns from TableB and another column from TableA satisfying an inequality from TableB. Also, TableA had over 100 columns, so any approach that involved joining back to TableA on all TableA columns was rather tedious.
A very easy solution - as long as TableA isn't too huge - is to use sqldf in the R tool. This should also be considered for most any complex join condition. My table as noted had over 100 columns and also over a million rows. The sqldf approach wasn't fast - about 3 minutes to complete, and also gave one of those weird errors that isn't an error (see here)... but was very easy to implement and solved my problem.
The first example makes an assumption that does not translate well for broader problems. It assumes that each Greek letter can only occur once. This may be true in that context, but suppose the below tables are what you are working with and you need to calculate total sales for each team by month. In the above example it will error out with this message "Tool #23: ....records were generated by an N-Dimensional Join"
Sale_ID | Sales Person | Sale Date | Sale Amt |
1 | Natasha | 1/1/2018 | 96 |
2 | Steve | 1/1/2018 | 36 |
3 | Tony | 1/1/2018 | 75 |
4 | Natasha | 1/2/2018 | 43 |
5 | Steve | 1/2/2018 | 38 |
6 | Tony | 1/2/2018 | 59 |
7 | Natasha | 1/3/2018 | 24 |
8 | Steve | 1/3/2018 | 79 |
9 | Tony | 1/3/2018 | 66 |
10 | Natasha | 1/4/2018 | 60 |
11 | Steve | 1/4/2018 | 39 |
...........
And
Team Member | Team | Start | End |
Natasha | AAA | 1/1/2018 | 3/1/2018 |
Natasha | BBB | 3/2/2018 | 4/17/2018 |
Natasha | CCC | 4/18/2018 | 8/8/2018 |
Natasha | DDD | 8/9/2018 | 12/31/2099 |
Steve | DDD | 1/1/2018 | 5/1/2018 |
Steve | AAA | 5/2/2018 | 7/14/2018 |
Steve | CCC | 7/15/2018 | 12/31/2099 |
Tony | CCC | 1/1/2018 | 1/28/2018 |
Tony | BBB | 1/29/2018 | 3/2/2018 |
Tony | AAA | 3/3/2018 | 11/25/2018 |
Tony | DDD | 11/26/2018 | 12/31/2099 |
The way around this issue is to use the Advanced Join tool to do a cross join with this condition:
Team_Member = Sales_Person AND Start <= Sales_Date AND Sales_Date <= End
This is the result:
Team_Member | Team | Start | End | Sale_ID | Sales_Person | Sales_Date | Sale Amt |
Natasha | AAA | 1/1/2018 | 3/1/2018 | 1 | Natasha | 1/1/2018 | 96 |
Steve | DDD | 1/1/2018 | 5/1/2018 | 2 | Steve | 1/1/2018 | 36 |
Tony | CCC | 1/1/2018 | 1/28/2018 | 3 | Tony | 1/1/2018 | 75 |
Natasha | AAA | 1/1/2018 | 3/1/2018 | 4 | Natasha | 1/2/2018 | 43 |
Steve | DDD | 1/1/2018 | 5/1/2018 | 5 | Steve | 1/2/2018 | 38 |
Tony | CCC | 1/1/2018 | 1/28/2018 | 6 | Tony | 1/2/2018 | 59 |
Natasha | AAA | 1/1/2018 | 3/1/2018 | 7 | Natasha | 1/3/2018 | 24 |
Steve | DDD | 1/1/2018 | 5/1/2018 | 8 | Steve | 1/3/2018 | 79 |
Tony | CCC | 1/1/2018 | 1/28/2018 | 9 | Tony | 1/3/2018 | 66 |
....
You then can remove unnecessary columns and do a simple aggregation to figure out how much each team did in sales by month.
Of course if both of these tables are in the same database it is much easier just to write this logic in join statement instead of all these weird work arounds :)
Hi All,
I have trying to do an advanced Conventional Join with a condition that visit date is between Start date and End date if the Join field matches.The data types of the join field is int16. But, i keep receiving an error " records were generated by an N- Dimensional Join - Join Key:(_A_:3)"
Please let me know if i am missing on anything. Screenshots attached below.
Thank you in advance!
@bvinaya this has come up before here. Here was my response...
For a conventional join the join tool within the macro (tool #23) is set to error on multidimensional joins of more than 16 records to keep the output from blowing up. So you'll either need to address this within your incoming data (for example by removing records that have duplicates of the join key on one side of the join) or you can change the configuration of the join multiple tool #23 within the macro to "Allow all multidimensional joins".
Hi NeilR,
Thank You for getting back with me on this!
It's worked. :)
Thank You!
Where ranges are involved, although it may seem a little fiddly to do, you could generate spatial objects (trade areas and points) to represent date ranges and dates (you can decide on the scale to use). You can then use the Spatial Match tool to effectively join on the date and date range (date is target, date range is universe, and you select target within universe).
This can also be applied to other ranges and points (not just dates).
Hi NeilR,
How do I change the configuration of the join multiple tool #23 within the macro to "Allow all multidimensional joins"?
@jstney :