Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Join returns too many records

oracleoftemple
9 - Comet

This is my first day using Alteryx. I'm trying to Join transaction level purchase data with data about the vendor.  The Join tool seems to be returning way too many records.  My left input is transactions, and it's coming from four separate Excel files (I'm using the wildcard (*) to bring in all 4).  In total, there are 259,980 records.  My right input is vendors - there are 45,173 of these, and they're coming all from one file.  My Join output looks like this: there are 133,631 records in the L output, 878,918 records in the J output, and 33,362 records in the R output.  I wouldn't have expected the total of all three of those outputs to be that high.  If the Join had matched each transaction record with a vendor, it should have returned 259,980 records.  Even if there were no matches, there only should have been 305,153 records (259,980 + 45,173).  Why is the number of records in the output so high?

12 REPLIES 12
MarqueeCrew
20 - Arcturus
20 - Arcturus

@oracleoftemple,

 

If both the Left and Right inputs have duplicate transaction keys, you'll get a Cartesian product (e.g. 3 * 2 = 6).  This is 100% what is happening.  Either there is another field to add to the join key or you need to:

 

Unique the Right data

Summarize the Right data

 

Where we assume that the Left data is already unique to the transaction level.

 

Cheers,

 

Mark 

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
KOBoyle
11 - Bolide

Is it possible that the Vendor field that you are joining on is not unique? If more than one record in the Vendor source has the same Vendor Id or Vendor Name, one transaction could return more than one vendor in the join.

oracleoftemple
9 - Comet

You were right.  There were duplicate vendors.  I guess a new vendor with the exact same Supplier Number (my join field) is created each time a vendor changes addresses.  Also if they have a physical address and a PO Box, it will be in there twice.  Thanks for the prompt reply!!

oracleoftemple
9 - Comet

One more thing - the left input (transactions) had some info that I could do without.  It's coming from a liability account, so it had invoices to increase the account and payments to decrease it.  My solution was to use a Filter tool to split invoices (True output) from payments (False output) and connect the True output to the left input of the Join.  Is that the best way to handle that, or would you have done something else?

KOBoyle
11 - Bolide

@oracleoftempleIt sounds like your Vendor data source is a dimension table from a data mart/warehouse and that it is a slowly changing dimension. Typically this type of table will have effectivity date columns for each record. What you may want to try is use the Filter Tool with a condition like the following: [Effective Date Start] <= DateTimeNow() AND ( [Effective Date END] > DateTimeNow() OR IsNull([Effective Date END])

 

This approach should give you a distinct list of vendor records with the most current data.

 

KOBoyle
11 - Bolide

In Alteryx there are often several ways to accomplish a similar objective, but from what you described the Filter tool sounds like the best approach. Records on the (F) output can simply be ignored. 

oracleoftemple
9 - Comet

That's exactly right.  There's a Last Update Date column.  I'd be needing the record with the most recent date in that column - that way I'll have their current address also.  This is extremely helpful.  Thank you!

oracleoftemple
9 - Comet

Last question - it looks like I don't quite have the fields to do what you recommended.  Can you think of a way to compare all records with the same Vendor Number and remove all those other than the one with the most current date in the Last Update Date field?

KOBoyle
11 - Bolide

@oracleoftempleThe simplest approach would be to use the Preparation:Unique tool if you don't care which Vendor record you get. If the most recent Last Update date for each Vendor is required, you could use the Transform:Summarize tool ( Vendor - Group By, Last Update Date - Max). You would then need to join the result back to the Vendor source (joining on both Vendor and Last Update Date) to get the rest of the fields.

Labels