I'm in a situation where I need to run iteratively through several data files and join them to my main stream, appending new information to each record. Which data files I'm joining to, what I join on, and even how many I'm joining to will all be variable.
I've built a batch macro that takes a list of data files and the matching keys and iterates through perfectly. The only problem is that the iterations are unioned together at the end instead of joined, so I end up with x sets of repeating records for x iterations.
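To make the union-vs-join distinction concrete, here is a minimal sketch in plain Python (dicts standing in for records; the data and field names are purely illustrative, not from the actual workflow):

```python
# Each "iteration" appends one new field to every record. Unioning the
# per-iteration outputs stacks rows; joining them widens each record.

main = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

# Two lookup files, each keyed on "id", each contributing one new field.
lookups = [
    {1: {"region": "east"}, 2: {"region": "west"}},
    {1: {"tier": "gold"}, 2: {"tier": "silver"}},
]

# What the batch macro does today: each iteration joins main against ONE
# lookup, and the macro output unions the per-iteration results together.
unioned = []
for lk in lookups:
    unioned.extend({**rec, **lk.get(rec["id"], {})} for rec in main)
print(len(unioned))  # 4 rows: 2 records x 2 iterations, each only partially joined

# What is actually wanted: feed each iteration's output into the next,
# so the new fields accumulate on the same records.
joined = main
for lk in lookups:
    joined = [{**rec, **lk.get(rec["id"], {})} for rec in joined]
print(len(joined))  # still 2 rows, now carrying both new fields
```

With n iterations over m records, the unioned stream grows to n × m rows (the "billions of records" problem below), while the chained join stays at m rows.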
The obvious solution is a transpose/crosstab after the macro's union, but with millions of records I really don't want to transpose.
I found another solution that involves stripping off the field names and building XML for a summarize macro, but we're potentially dealing with millions of records and hundreds to thousands of iterations through the joining macro, which can easily translate to a data stream of billions of records after the union.
Has anyone come up with a way to feed output of each iteration of a macro back into the macro, or is there an approach that's just eluding me?
A potential solution is to have your batch macro write to a file instead of a macro output. That same file is also used as an input within the macro and joined on each batch, so the file grows with every iteration. You then use that file as your source for further processing; a YXDB can help performance if you're dealing with many records.
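Outside of Alteryx, the pattern looks roughly like the sketch below: the loop reads its own prior output, joins one new lookup onto it, and overwrites the same file, so the result accumulates across batches. The file path, JSON-lines format, and sample data are illustrative assumptions, not part of the original workflow:

```python
import json
import os
import tempfile

state_path = os.path.join(tempfile.mkdtemp(), "state.jsonl")

# Seed the file before the first iteration so the read side has
# something to open (otherwise the first run would fail).
seed = [{"id": 1}, {"id": 2}]
with open(state_path, "w") as f:
    for rec in seed:
        f.write(json.dumps(rec) + "\n")

lookups = [{1: {"region": "east"}},
           {1: {"tier": "gold"}, 2: {"tier": "silver"}}]

for lk in lookups:
    # Read the whole file in before writing anything back, so we never
    # write to a file that is still being read from.
    with open(state_path) as f:
        records = [json.loads(line) for line in f]
    records = [{**rec, **lk.get(rec["id"], {})} for rec in records]
    with open(state_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

with open(state_path) as f:
    final = [json.loads(line) for line in f]
print(final[0])  # {'id': 1, 'region': 'east', 'tier': 'gold'}
```

Note the row count stays fixed at the seed size: fields accumulate on existing records instead of rows stacking up.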
I think Ryan's solution is probably the simplest if you don't want to deal with a crosstab (although I'd usually take the performance hit of crosstabbing a few million records over this approach due to some of the nuanced complexities involved).
Three things to keep in mind when you have a batch macro that is loading and closing the same file (hereafter the "Read->Write File") on each iteration:
1) You have to seed the Read->Write File before running the macro so that it exists on the first run (otherwise it will error). You can do this upstream in the workflow that runs the batches or manually in a separate workflow. If you do it upstream, be mindful of points two and three.
2) Alteryx is really great at doing as much with data in memory as possible. This means it uses the swap less (which is good), but it also means the first record in a file can run through the entire workflow before the last record has even been read out of the file. If you try to write to a file that is still being read from, it will cause an error. Use a Block Until Done tool immediately after the Input Data tool for your Read->Write File inside your batch macro to make sure all of the records have been read in before you proceed to process them.
3) Try to place your Read->Write File on the canvas first. Placement order can matter in edge cases, especially when it comes to writing output files. If the engine has data at several outputs to write, it will tend to write to the one that was placed on the canvas first. The same goes for reading data. It generally isn't a problem except when a possible race condition exists; this underlying characteristic of the engine can override the Block Until Done tool in some weird instances.
You should also verify your row counts on each iteration (and use the multi-join or a join + union) to make sure your files are the right size; otherwise you might end up with a severely truncated file at the end.
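A row-count guard of the kind suggested above can be sketched like this (the helper name and data are hypothetical; the point is just to assert the count after every join step, which catches an inner join silently dropping or duplicating rows):

```python
# Left-join each lookup onto the record stream, checking the row count
# after every iteration so a bad join key fails loudly instead of
# quietly truncating the final file.

def left_join(records, lookup, key="id"):
    # Records with no match pass through unchanged (left-join semantics).
    return [{**rec, **lookup.get(rec[key], {})} for rec in records]

records = [{"id": 1}, {"id": 2}, {"id": 3}]
for lookup in [{1: {"x": 10}}, {2: {"y": 20}, 3: {"y": 30}}]:
    before = len(records)
    records = left_join(records, lookup)
    assert len(records) == before, "row count changed; check your join keys"
print(len(records))  # 3
```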
I'm inclined to agree with Ryan since you are worried about running the transpose/crosstab which can use a LOT of memory/temp storage.
There are two things to worry about when running large processes like this: 1) memory limitations and 2) run time.
If you run all the joins at once, you might use more memory, but it could run faster. However, if you iterate over the file per batch, opening and closing the file multiple times becomes an expensive part of the process. The direction you go will depend on the system you're running on.
Ways to improve join speed: join on a single field (numeric if possible), eliminate unnecessary fields, and use a Find Replace tool instead of a Join when possible (since it doesn't force a sort).
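The Find Replace suggestion amounts to a hash lookup rather than a sort-merge join. In Python terms, that's a dict keyed on a single numeric field, probed once per record, with neither side needing to be sorted first (data and field names below are illustrative):

```python
# Hash-lookup enrichment: O(1) probe per record on one numeric key,
# no sort of either input required.

records = [{"id": i} for i in (5, 3, 9, 1)]          # unsorted main stream
lookup = {i: {"label": f"L{i}"} for i in range(10)}  # single numeric key

enriched = [{**rec, **lookup[rec["id"]]} for rec in records]
print(enriched[0])  # {'id': 5, 'label': 'L5'}
```

This is also why joining on a single numeric field helps: hashing or comparing one integer is much cheaper than comparing several wide string fields.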
I may be misreading your requirements, so to confirm: you have a large dataset, you're reading in an unknown number of other datasets, joining them (appending their fields) onto your previous dataset, and continuing?
Your concern is Alteryx's inherent behavior of stacking the output of multiple runs of a batch macro: i.e., if you had a dataset of 1 million rows with 10 columns and an iteration added a column, you would get 2 million rows with 11 columns, and so on.
Did I summarize that correctly? I might have a pretty simple solution for you, but I want to make sure I'm not way off base.