Hey everyone,
We are attempting to do some large-scale data processing where order is crucial to our process. Because of the size, the workflow greatly benefits from us using AMP. Does anyone know how long the sort order is retained after using the Sort tool (e.g. for a certain type or number of tools after the Sort tool)?
- In our specific example, we have data being processed in the following sequence: Input data -> Sort tool -> Select tool -> Output data, and want to make sure the sort order is retained in the output file.
Hi @joentrice21, that's a great question! This webpage should list the tools that are AMP-enabled along with any differences in behavior. Specifically, you can Ctrl-F for "Record Order" to see which tools might affect it.
Further, one way to guarantee the order would be to apply a Record ID immediately after the initial sort and then, just before output, sort by Record ID to restore the sorted order.
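To illustrate the idea outside of Designer, here is a minimal Python sketch. It is not Alteryx code: the function names `sort_then_tag` and `restore_order`, the `value` field, and the shuffle that stands in for "some downstream step that reorders records" are all assumptions made for the example.

```python
import random


def sort_then_tag(records, key, reverse=False):
    """Sort the records, then attach a 1-based record ID to each one."""
    ordered = sorted(records, key=key, reverse=reverse)
    return [(record_id, rec) for record_id, rec in enumerate(ordered, start=1)]


def restore_order(tagged):
    """Re-sort by the record ID and drop the ID again."""
    return [rec for _, rec in sorted(tagged, key=lambda pair: pair[0])]


# Hypothetical sample data
rows = [{"value": v} for v in (5, 42, 17, 8)]

tagged = sort_then_tag(rows, key=lambda r: r["value"], reverse=True)
random.shuffle(tagged)  # stand-in for any step that might change the order
restored = restore_order(tagged)

print([r["value"] for r in restored])
```

The point is only that the record ID carries the sorted position along with each record, so the order can be rebuilt right before the output no matter what happened in between.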
Let me know if this helps!
Best,
Peter
Hi @PeterA1, thanks for the response! I work with @joentrice21. We're very familiar with the changes in Record Order in certain tools in AMP (particularly in tools that changed from sort-merge to hash operations as mentioned here). What we're seeing is actually a little different. We're seeing tools that shouldn't alter Record Order seemingly alter Record Order, and Record Order even changing between one tool's output and the subsequent tool's input. For example, in this workflow (also attached as "Alteryx Sort Order Retention.yxmd"):
According to the official documentation link you sent, none of these tools should alter Record Order under AMP (e2) differently than e1. (The sort tool of course alters Record Order in line with its official e1 documentation.) Looking at the results above, that doesn't seem to be the case though. Starting with the first select tool output (red #1 above), Record Order begins to change. The first select tool receives the correct Record Order, but the output is different (first record is #19,745,186). The second select tool input receives it in that Record Order (red #3), but the browse tool input (red #2 above) receives a different Record Order than the first select tool output (red #1, though technically this is the "correct" Record Order based on the preceding select and formula tools). The same thing happens with the second select tool (#3->#4 change, #4->#5 change, #4->#6 consistent). The final select tool then experiences a final Record Order change (back to the "correct" Record Order, again).
(Note: you may not be able to fully reproduce these exact results with the workflow. This problem is frequent but does not appear to be deterministic.)
Analyzing these results, what sticks out to me is that the Record Order appears to be largely maintained (consecutive records are still descending); it is only "chunks" that are mis-ordered. This made us think that it has something to do with AMP's 4Mb packets, since the documentation on them explicitly states "Records process in 4Mb packets for a faster run time, and are processed out of order." (this is the source of our question - I'll circle back to it momentarily). Looking at the above results, we suspect one of the following things might be happening:
With all that said, that's why we're looking to confirm that AMP maintains Record Order.
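For anyone who wants to check their own output for the same "chunk" pattern, here is a minimal Python sketch. It is not part of the attached workflows; the function name `order_breaks` and the sample values are made up for illustration, and it assumes the sort field has been read into a plain list.

```python
def order_breaks(values, descending=True):
    """Return the indexes where a value is out of order relative to the previous one."""
    breaks = []
    for i in range(1, len(values)):
        prev, cur = values[i - 1], values[i]
        out_of_order = cur > prev if descending else cur < prev
        if out_of_order:
            breaks.append(i)
    return breaks


# Hypothetical data: three descending chunks, with the first two swapped
values = [6, 5, 4, 9, 8, 7, 3, 2, 1]
print(order_breaks(values))  # [3]
```

A handful of breaks relative to the total record count would be consistent with whole blocks of records arriving out of order, while breaks scattered throughout would point to something else.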
In our particular use case, the output needs to be in a particular format in order to be used by an external program. The order of this program's input is important for processing, and the input format cannot contain the sort field's data (only one field of data is accepted). Your proposed Record ID tool + sort tool solution unfortunately won't work here, as we have to first drop the sort field with a select tool, but we're not 100% sure the select tool won't alter Record Order as seen above. Here's an example of this (also attached as "Alteryx Sort Order Retention - Use Case.yxmd")
This finally takes us back to our original inquiry - if "records are processed out of order", is there a guaranteed Record Order in AMP between tools and in the output data tool (which would mean the results we're seeing above are a bug)? Or is Record Order not guaranteed (and the above is expected behavior)? Or is there a set number of tools/types of tools after the sort tool that Record Order might be guaranteed for?
PS - sorry for the long reply, hopefully this makes it easier to follow where our question comes from. Thanks for taking the time to read this far!
Hi @jb_ ,
Thanks for the detailed reply. Your analysis is pretty close to being exactly what is happening.
But to answer your initial question first:
Yes, AMP guarantees to maintain record order through tools that do not "scramble" the order by using a hashing algorithm.
Ignoring the "browse everywhere" (what you called the "results window preview") data for the moment, if you think you have an example where this isn't the case then this would definitely be a defect.
Then to come back to what you are seeing: if I understood correctly, you are seeing the difference in order in the "Browse Everywhere" preview data.
So looking at your list
Then yes, it is 3 and 4. AMP will process multiple record packets in parallel in different threads, which means it can actually end up doing the work on them out of sequence. But as you guessed in your point 4, each packet has a sequence number, which means that when sequence is required by a tool (such as the output or browse tool), the engine is able to reassemble the records in the correct order. This is why the browse tools you have show a consistent ordering.
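As a rough illustration only, the behavior described above can be sketched in Python. This is not the AMP engine's actual implementation: the packet size, the per-record work, and the names `split_into_packets`, `process_packet`, and `run` are all assumptions for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def split_into_packets(records, packet_size):
    """Cut the records into packets and give each packet a sequence number."""
    starts = range(0, len(records), packet_size)
    return [(seq, records[i:i + packet_size]) for seq, i in enumerate(starts)]


def process_packet(packet):
    """Order-preserving per-record work on one packet (here: doubling)."""
    seq, rows = packet
    return seq, [row * 2 for row in rows]


def run(records, packet_size=3, workers=4):
    packets = split_into_packets(records, packet_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_packet, p) for p in packets]
        # as_completed yields packets in completion order,
        # which may differ from the order they were submitted in
        finished = [f.result() for f in as_completed(futures)]

    # A consumer that does not require sequence takes packets as they arrive
    arrival_order = [row for _, rows in finished for row in rows]
    # A consumer that requires sequence reassembles by sequence number first
    by_sequence = sorted(finished, key=lambda p: p[0])
    reassembled = [row for _, rows in by_sequence for row in rows]
    return arrival_order, reassembled


records = list(range(10, 0, -1))  # already sorted descending
arrival_order, reassembled = run(records)
print(reassembled)
```

In this sketch `reassembled` always matches the input order, while `arrival_order` can differ from run to run at packet boundaries but never inside a packet, which matches the "chunks" pattern described earlier in the thread.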
But your point 5 was more by design than a defect (though it is a design that we are actively questioning). If you look at the https://help.alteryx.com/20213/designer/tool-use-amp link, for Browse Everywhere we say this: "Record order. For performance reasons, this tool could provide different output, run-to-run, because it doesn't require sequence, and can take any record packet that comes from a tool." The reasoning is that the reassembling step has a cost, and the question was whether we want to pay that cost on every single output anchor in the workflow. So the trade-off is performance against ease of use. As I said, this is something that we are currently re-assessing, and we are actively measuring what that cost is to inform the decision.
I hope that answers your question. Feel free to ask more below if you need any further clarification.
Tagging some members of the Alteryx Engine product team for visibility. @gfilla @TonyaS
Have you upgraded by chance to 21.4 or later? We have addressed various defects where the output order varied from run to run in some tools. We also introduced a feature called "Engine Compatibility Mode": when you select it while using AMP, the engine uses the sort-based grouping of the original Engine rather than the hashed grouping that AMP uses. https://help.alteryx.com/20221/designer/engine-compatibility-mode
Additionally, the 2022.1 version is the most stable and recommended version to use with AMP Engine, since it contains fixes for a good number of issues that were reported against previous versions.