@BenG did a presentation at Inspire on some best practices - one of which is removing unnecessary fields from data BEFORE a join, rather than throwing them away using the join, since join requires a sort which is an expensive operation if you're carrying unnecessary data.
Questions on the internals of the join, so that we can all get better at using this well:
- In cases like this - is the join tool smart enough to throw away the extraneous data before doing the sort, or does it carry all this info through a sort? If it's carrying all this data into the sort (to enable the join), then this feels like a relatively quick optimization to do (i.e. trim out any non-join fields that are discarded in the join as a first step), and I'll log the idea
- does Alteryx actually sort all the data, or does it just sort a list of pointers to memory addresses? If the former, then we should be aggressive in trimming data columns going into a sort of any kind. If the latter, then extra columns really should not make much of an impact?
I should have added - not only does it filter out unneeded fields ahead of time, if there is enough memory it will also just do the sort by moving pointers around, not the whole records. Of course if there isn't room in memory, it does require that the whole records get written to disk.
Also, @jdunkerley79 is correct about the basic order - the select does happen after because often you won't output the join fields from 1 side, even though they are needed in the original sort to do the join. That means there are 2 different filtering of the output fields.