Alteryx Designer Desktop Discussions

SeanAdams · ‎09-25-2017

@BenG did a presentation at Inspire on some best practices - one of which is removing unnecessary fields from data BEFORE a join, rather than throwing them away using the join, since join requires a sort which is an expensive operation if you're carrying unnecessary data.

Questions on the internals of the join, so that we can all get better at using this well:

- In cases like this - is the join tool smart enough to throw away the extraneous data before doing the sort, or does it carry all this info through a sort? If it's carrying all this data into the sort (to enable the join), then this feels like a relatively quick optimization to do (i.e. trim out any non-join fields that are discarded in the join as a first step), and I'll log the idea

- does Alteryx actually sort all the data, or does it just sort a list of pointers to memory addresses? If the former, then we should be aggressive in trimming data columns going into a sort of any kind. If the latter, then extra columns really should not make much of an impact?

Example below:

cc: @AdamR_AYX

jdunkerley79 · ‎09-26-2017

Select is done post Join - the L and R outputs contain unmodified data sets

Rough sequence is (I think - but need someone at Alteryx to confirm):

1. Sort L input by left set of fields in join

2. Sort R input by right set of fields in join

3. Perform join (scan down both sorted sets)

4. Apply select on O output as record pushed into it

Worth noting that a Find and Replace does not sort the L input only the R.

SeanAdams · ‎09-26-2017

Thanks James - if this is correct then there's an easy technical optimization that can be done with the Join tool to include a pre-processing step removing unused fields.

I'll keep this unsolved for now to allow the internal dev team at Alteryx to opine.

Ned · ‎09-27-2017

The join does remove unneeded field before the sort. As you say, it is an obvious optimization.

SeanAdams · ‎09-27-2017

Thanks Ned - appreciate you confirming, so there's no overhead in filtering in the join (as opposed to filtering them out with a select before the tool).

Ned · ‎09-28-2017

I should have added - not only does it filter out unneeded fields ahead of time, if there is enough memory it will also just do the sort by moving pointers around, not the whole records. Of course if there isn't room in memory, it does require that the whole records get written to disk.

Also, @jdunkerley79 is correct about the basic order - the select does happen after because often you won't output the join fields from 1 side, even though they are needed in the original sort to do the join. That means there are 2 different filtering of the output fields.

Alteryx Designer Desktop Discussions

Order of operations for Join tool (question on the internals of the join)