When the Python Tool operates, it seems to always ingest all the data before processing any of it (i.e. no batch processing). Python can handle this type of functionality with generators, can we update the tool so that it may do some preprocessing (like imports and data prep) and allow a defined generator function to be called repeatedly from a separate input handle and provide batch data frames on output for more parallel-like processing of data?
The Python Tool could be updated as such:
- Multi-Input - Same functionality as now, and also allow this data to be used for preprocessing and setting up the Python functions and a single batch function.
- Data Input - Ingests data in batches (as most other tools operate) where each batch passes in a dataframe (in this case, a subset of processed entries) into an existing Python function (with a name that is in globals()), and returns another dataframe with that desired output. This can give the option of adding/removing rows as necessary to a subset of the data.
- Data Output - Partial set of data after data processing to allow tools further in the chain to process in parallel.
- "On Complete" Multi-Outputs - Same functionality as now, to pass process-complete data to the next tool once all data ingested has been processed. Perhaps give the option to pass the complete set from Data Output.
A simple use-case, if a user wanted to use only the Python Tool:
Let's say a user wants to get all URLs from every post in a thread (containing millions of posts) that are in blacklisted domains.
- Data prep that sends the list of blacklisted domains into the Python Tool's Multi-Input handle, and that data is transformed and stored in a set within the Python tool once.
- A series of posts (strings) are sent in batches (let's say ~10000) to the Data Input of the Python Tool. The tool calls a defined Python function that extracts all the URLs, and filters those in the blacklist.
- That data is then transformed into a DataFrame which is then sent to the Data Output of the Python Tool, and only contains results corresponding to the small batch of posts that were ingested. Alteryx can also use this to track progress during execution.
- Once all posts have been processed, one of the Python Tool's Multi-Outputs can return a total count of URLs found that were NOT in the blacklist (sure this can be a part of the Data Output, but just for the sake of this example). Could also be used to trigger "on-complete events."
I know I used the term "generators" above, and the design could probably be simplified to instead call an Alteryx Python function that yields from a function to await input from the next batch to use actual Python generators. However, I feel my initial approach could be thought of as a simpler process since generators are more of an intermediate functionality.
I hope this makes sense and is elaborate enough to pursue. Thanks for the consideration!