What are some best practices for incrementally processing a very large dataset? By that, I mean a workflow that queries a subset of a very large dataset, manipulates it, and either deletes from/appends to the original source or saves the result to a new one.
I have been experimenting with batch macros that have control parameters and no input/output. They run fine on a single batch, but when I feed in a large number of batches they seem to run indefinitely without doing anything. Any ideas why this wouldn't work? My only theory is that it is trying to run the batches in parallel and that is causing problems. Is there a better approach?
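For reference, outside of Alteryx the sequential pattern I'm trying to reproduce would look roughly like this (a minimal Python sketch only; sqlite3 is a stand-in driver, and the table, column, and connection names are placeholders):

```python
# Minimal sketch (plain Python, not Alteryx) of the workflow described above:
# query one subset of the big table at a time, manipulate it, and append the
# result to another table, one batch after another rather than in parallel.
import sqlite3

def transform(rows):
    # Placeholder for whatever manipulation happens per batch.
    return [(row_id, value * 2) for row_id, value in rows]

conn = sqlite3.connect("warehouse.db")  # placeholder connection

# One control-parameter value (here, a month key) per batch.
months = [m for (m,) in conn.execute(
    "SELECT DISTINCT month_key FROM big_table ORDER BY month_key")]

for month in months:
    rows = conn.execute(
        "SELECT id, value FROM big_table WHERE month_key = ?", (month,)).fetchall()
    conn.executemany(
        "INSERT INTO results_table (id, value) VALUES (?, ?)", transform(rows))
    conn.commit()  # finish each batch before starting the next one

conn.close()
```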
@fharper "if asking in the query for one specific month at a time and iteratively executing the query via a batch macro can provide efficient results."
I was really just looking for validation that this was the best approach, as that is exactly what I am trying to do. I was able to run a different year without issue, so there is just something unique going on that I can dig into on my own.
To optimize it, I'd be interested if anyone has an approach for writing in-DB without having to stream out (while running SQL to delete the portion that is being re-appended), but right now the performance is acceptable for what I am trying to accomplish.
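For context, the delete-then-re-append step I'm describing looks roughly like this when written as plain SQL through a generic Python driver (a sketch only; sqlite3 stands in for the actual Redshift connection, and table/column names are placeholders):

```python
# Sketch: remove the slice of the target table being reloaded, then append the
# recomputed rows for that slice, committing both as one unit per month.
import sqlite3

def replace_month(conn, month, new_rows):
    """Delete the month being reloaded, then append its recomputed rows."""
    cur = conn.cursor()
    cur.execute("DELETE FROM big_table WHERE month_key = ?", (month,))
    cur.executemany(
        "INSERT INTO big_table (month_key, id, value) VALUES (?, ?, ?)", new_rows)
    conn.commit()  # the delete and the append land together for that month

# Example use (placeholder values):
# conn = sqlite3.connect("warehouse.db")
# replace_month(conn, "2023-01", [("2023-01", 1, 42.0), ("2023-01", 2, 7.5)])
```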
Thanks for the input everyone!
Just as an FYI - my take on this would be to build the SQL WHERE statement more dynamically, and to concatenate the WHERE clauses into one larger WHERE statement (using the Summarize tool / Append Fields / Formula, etc.). This would prevent lots of little reads/writes and in-DB runs. I get that the Data Stream Out may only be reading a few records at any one time, but it would be better to do this (and to write to Redshift) as few times as possible.
There are a slew of things that can add time to any individual Data Stream Out / Redshift write, so it's better to put more work into fewer in-DB calls.
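Conceptually, something like this (a quick Python/SQL sketch rather than the actual Alteryx tools; the table and column names are made up):

```python
# Sketch: collapse many per-batch filters into a single WHERE clause so the
# database is read once instead of once per batch. sqlite3 is just a stand-in driver.
import sqlite3

months = ["2023-01", "2023-02", "2023-03"]  # values that would otherwise be separate batches

# Build one combined predicate instead of issuing one query per month.
placeholders = ", ".join("?" for _ in months)
query = f"SELECT id, month_key, value FROM big_table WHERE month_key IN ({placeholders})"

conn = sqlite3.connect("warehouse.db")         # placeholder connection
rows = conn.execute(query, months).fetchall()  # a single read instead of len(months) reads
conn.close()
```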
Tile the input with a control parameter and run multiple instances of Alteryx, using the tile number as the filter for each instance.
@DataRangler - the large dataset is In-DB, so the tile/multi-field binning method won't help.
Are you perhaps working remotely? You can get a fantastic improvement by running on a server or some machine where the large datasets don't have to go through the VPN, firewall, and security layers.
@hroderick I work on site, or remote into a machine that is on site, so that's not an issue for me.