Hey everyone,
I have a large dataset that I use to generate samples using a random seed. For efficiency, I prefer to run the workflow using the AMP engine. However, I've encountered an issue: with multi-thread processing, the sample often changes each time the workflow is executed because the order of the records is altered.
I considered adding a Record ID tool at the start of the workflow, but I believe this would be ineffective if the input order changes when the files are brought into the workflow. Another idea is to split this into two workflows: the first would input the data and add a Record ID without the AMP engine, then output the result to be used in the second workflow, which would run with AMP.
I wanted to get your thoughts on whether there might be a more efficient solution that would still ensure the sample remains consistent across runs.
The reason the order changes is that AMP (Alteryx Multi-threaded Processing) is multi-threaded. In certain tools, the data is chunked and sent to different cores for processing, then reassembled in groups. For example, the Multi-Row Formula will split the data into chunks based on the grouping fields selected. (Disclaimer: this is my own view, not anything official; I find I don't need to sort unless I use a grouping field.) A Sort before the Multi-Row Formula solves the issue.
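To see why this matters for seeded sampling, here's a minimal sketch in plain Python (not Alteryx code, just an analogy): even with a fixed seed, a random sample depends on the order of the incoming rows, so a reshuffled input produces a different sample. Sorting on a stable key first restores determinism. The function name and values are made up for illustration.

```python
import random

records = list(range(100))  # stand-in for the dataset's rows

def seeded_sample(rows, seed=42, k=5):
    rng = random.Random(seed)   # fixed seed, like a seeded Sample tool
    return rng.sample(rows, k)  # result still depends on row order

shuffled = records[::-1]        # simulate AMP delivering rows in a different order

# Same seed, different input order -> different sample.
assert seeded_sample(records) != seeded_sample(shuffled)

# Sorting on a stable key before sampling makes the runs agree.
assert seeded_sample(sorted(records)) == seeded_sample(sorted(shuffled))
```

The point is that the seed only fixes *which positions* get picked; if the rows arrive in a different order, different rows occupy those positions.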
AFAIK, the Input Data tool should always read the data in as is, so a Record ID straight after it would give you something stable to sort on. I'd only expect the order to change on input if there were another selection, an SQL query, etc.
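A quick sketch of that Record ID idea in plain Python (again just an analogy, not Alteryx): tag each row with its arrival position, and after any parallel step you can sort on that tag to restore the original order, no matter how the chunks come back.

```python
rows = ["a", "b", "c", "d"]                              # rows in input order
tagged = [(i + 1, row) for i, row in enumerate(rows)]    # Record ID starting at 1

# Simulate AMP chunking and reassembling the stream out of order.
scrambled = [tagged[2], tagged[0], tagged[3], tagged[1]]

restored = [row for _, row in sorted(scrambled)]         # sort on the Record ID
assert restored == rows
```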
@BaileyCallander you can also use Engine Compatibility mode
Hi @nbondarchuk,
I appreciate your reply. While Engine Compatibility Mode works well for smaller datasets, I noticed that it may not always produce exactly the same output with larger datasets.