Is it possible to specify which incoming connection a custom tool receives data from first? I have a custom tool with 2 inputs, Left and Right. The Right input must always get its data first. So far I have tried:
Neither option made a difference. The execution path seems to be entirely dependent on the order of the tool/macro nodes in the workflow document. I cannot find anything in either the Python or .NET SDKs that might nudge the engine into a more deterministic execution path. Can we do this currently?
I wonder if this idea might be of use to you. The image below represents the Right and Left data inside your custom tool, and the 2 Select tools represent the start of the workflow that makes up your custom tool.
You set the join condition of the Join tool to something that would never produce a matching record between Left and Right, which means that all of the Left data will pass through the left output of the Join tool. The Join tool will, however, hold the Left input data until the Right input data has arrived, because it needs both sides before it can evaluate the join condition.
Thanks @DavidP, it's a good work-around, but it actually defeats the purpose of what I am trying to achieve. Perhaps I should explain further what I am trying to do.
I have a large dataset that will be going through the Left input. My custom tool would be used, repeatedly, to enrich this dataset with additional fields of interest from related tables. Now, Alteryx has 2 tools to accomplish this: Join and Find/Replace. However, due to the size of the dataset, joins are very slow because the Join tool has to presort the data, and each join represents a new point where my data gets presorted. This is causing very long run times in my workflows. I am trying to severely reduce or eliminate the number of times my Left data gets sorted by any tool.
That leads me to Find/Replace, which I believe does NOT presort incoming data. Unfortunately, Find/Replace only works on a single key field. We can fake a multi-field key by creating a macro that concatenates the selected key fields with a non-printable character and then runs the concatenated key through Find/Replace, and I have done this. However, it still feels a bit slow to me and is kinda clunky.
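Just to make the concatenation trick concrete, here is a minimal Python sketch of the idea (the separator character and field names are assumptions on my part; in the macro itself this would be a Formula tool building the key):

SEP = "\x1f"  # ASCII unit separator; any non-printable character the data can't contain works

def composite_key(record, key_fields):
    # Concatenate the selected key fields into one string that Find/Replace can match on.
    return SEP.join(str(record[f]) for f in key_fields)

# e.g. composite_key({"Region": "West", "Year": 2020}, ["Region", "Year"]) -> "West\x1f2020"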
That led me to creating a custom tool (preferably in .NET) that can store the Right values in a dictionary or hash table. As each Left record passes through, I can do a quick look-up and add any matching data to the output stream. However, this requires the Right data to always arrive at the tool first, and the only way I know of to tell Alteryx to run the Right data first is to make sure the Input tools generating the Right data have a smaller ToolID in the underlying XML configuration. So it's a very fragile thing.
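A rough sketch of that dictionary approach in plain Python (the key and value field names here are hypothetical, and the real tool would do this inside the SDK plugin, with the Right side fully loaded before any Left record is processed):

def build_lookup(right_records, key_fields, value_fields):
    # Index the Right input by its key fields so each Left record can be enriched with an O(1) lookup.
    lookup = {}
    for rec in right_records:
        key = tuple(rec[f] for f in key_fields)
        lookup[key] = {f: rec[f] for f in value_fields}
    return lookup

def enrich(left_record, lookup, key_fields, value_fields):
    # Append the matching Right values (or None when there is no match) as the Left record streams by.
    key = tuple(left_record[f] for f in key_fields)
    match = lookup.get(key, {})
    return {**left_record, **{f: match.get(f) for f in value_fields}}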
Hey @tlarsen7572. Since I don't think the engine will support your request, I was trying to brainstorm a way to speed up your operation using the default tools. This is theoretical, but maybe indexing the columns you wish to join on could speed it up. Excel accomplishes this with a shared string table in its file format: every distinct string is assigned a number, and the cells just store the number. I'm sure it won't perform like a database index, but a 4-byte integer should be faster to look up than a variable-length string. You could split the strings off to a separate table and just pass the associated numeric columns through the Join and Find/Replace tools. I'll have to play with this idea one day.
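For what it's worth, a minimal sketch of that shared-string idea in Python (purely illustrative, not tied to any Alteryx tool):

def encode_strings(values):
    # Assign each distinct string a small integer, like Excel's shared string table.
    string_table = {}  # string -> integer id
    ids = [string_table.setdefault(v, len(string_table)) for v in values]
    return ids, string_table  # join/look up on the ids; keep the table to decode afterwards

# e.g. encode_strings(["apple", "pear", "apple"]) -> ([0, 1, 0], {"apple": 0, "pear": 1})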
I haven't had to push millions of rows through Alteryx yet. I did do some analytics against 40 million rows, but that was using IN-DB tools and a SQL Server. I'm interested to see your results.
Not sure if this will help you with the C++ SDK, but I was just reviewing another dev's custom Python tool today and noticed that the IncomingInterface class's ii_close() method sets an input_complete bool value and calls the check_input_complete() method in the AyxPlugin class:
def ii_close(self):
    """
    Called when the incoming connection has finished passing all of its records.
    """
    self.input_complete = True
    self.parent.check_input_complete()
The check_input_complete() method in the "Multiple Inputs" sample tool has an example of using these objects and their bool values to control when the process_output() method is allowed to run:
https://github.com/alteryx/python-sdk-samples/blob/master/Python%20-%20Multiple%20Inputs/Python%20-%20Multiple%20InputsEngine.py
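The gist of that sample, paraphrased from memory as a sketch rather than the exact code (the attribute names below are assumptions), is that the plugin only starts building its output once every incoming interface has reported that it is complete:

def check_input_complete(self):
    # Only run the output logic once both incoming connections have closed.
    if self.left_input.input_complete and self.right_input.input_complete:
        self.process_output()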
There is no way to specify the order in which incoming connections receive their records. Sorry.
However, there may be a different solution to your problem, depending on the details.
You say that the right input must always get its data first. Would it be more accurate to say that the left input cannot do any record processing until it has data from the right input? If so, you should be able to accumulate records on the left input until you know that the right input is finished, and trigger the output processing on either the right or left anchor's ii_close, depending on which one ended last.
In other words, you'll track variables in the AyxPlugin class:
- right_anchor_closed (bool)
- left_anchor_closed (bool)
- right_anchor_records (list or path to a temp file)
- left_anchor_records (list or path to a temp file)
Then you'll have a method to process the data that doesn't care which incoming anchor it's on:
- process_my_data(main_obj) (returns bool?)
And finally, you'll have code in each interface's ii_close method (or perhaps a more efficient version in the ii_push_record method) to toggle which one fires the process_my_data method:
for RightInterface:
    if self.parent.left_anchor_closed:
        process_my_data(self)
    self.parent.right_anchor_closed = True

for LeftInterface:
    if self.parent.right_anchor_closed:
        process_my_data(self)
    self.parent.left_anchor_closed = True
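Putting those pieces together, a simplified, SDK-agnostic Python sketch of the pattern might look like the following (class and method names mirror the pseudocode above rather than the real SDK signatures, and the "key" field is just a placeholder):

class AyxPlugin:
    def __init__(self):
        self.right_anchor_closed = False
        self.left_anchor_closed = False
        self.right_anchor_records = []  # or a path to a temp file for very large inputs
        self.left_anchor_records = []

    def process_my_data(self):
        # Runs exactly once, after both anchors have closed, regardless of arrival order.
        lookup = {rec["key"]: rec for rec in self.right_anchor_records}
        for rec in self.left_anchor_records:
            match = lookup.get(rec["key"])
            # ...push the (possibly enriched) record downstream here...

class RightInterface:
    def __init__(self, parent):
        self.parent = parent

    def ii_push_record(self, record):
        self.parent.right_anchor_records.append(record)

    def ii_close(self):
        if self.parent.left_anchor_closed:
            self.parent.process_my_data()
        self.parent.right_anchor_closed = True

class LeftInterface:
    def __init__(self, parent):
        self.parent = parent

    def ii_push_record(self, record):
        self.parent.left_anchor_records.append(record)

    def ii_close(self):
        if self.parent.right_anchor_closed:
            self.parent.process_my_data()
        self.parent.left_anchor_closed = True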