I wanted some help or suggestions on how to approach an issue I'm facing:
- Currently, my workflow connects to a database (Impala SQL) using the Dynamic Input tool. I have around 15 distinct queries that each pull different information per person depending on the Group ID I enter into the Text Box Interface tool (it replaces a "dummy group ID" in each query with the value I type into the text box; see the sketch after this list).
- It takes about 50 minutes to run them all, partly because other people are running queries against the same data lake, which limits how many queries can run at once due to a memory cap.
- Essentially, each query scans billions of rows to find the small slice of data that matches certain conditions, then throws the rest away, and that cost is multiplied by however many people are running similar queries at once.
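For context, here is a simplified sketch of what one of the 15 queries looks like. The table and column names (datalake.member_events, group_id, etc.) are just placeholders, not my real schema:

```sql
-- Hypothetical example of one of the queries; table/column names are placeholders.
-- 'DUMMY_GROUP_ID' is the text the Text Box / Dynamic Input tools swap out at run time.
SELECT m.person_id,
       m.event_date,
       m.event_value
FROM   datalake.member_events m
WHERE  m.group_id = 'DUMMY_GROUP_ID'      -- replaced with the Group ID typed into the app
  AND  m.event_date >= '2023-01-01';
```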
Is there a way I can pull the billions of rows once and then search for the snippets of data that match certain conditions without "throwing it away", to save on run time while also avoiding memory issues?
Ideas:
- Pulling the data once into a CSV file and then referencing that file
- May not be feasible because they are limited to about 2 million rows
- Using the Cache Dataset Macro -- is this the right way to approach this?
- How can I use this macro so that people without an Alteryx license can run the workflow through Alteryx Server?
- How can I set it up so that the large dataset is pulled only once, and when another user runs the app portion (entering a Group ID in the text box), the workflow starts from what was cached? (A rough sketch of the "pull once" idea is below.)
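To illustrate what I mean by "pulling once", here is one rough way it could work on the database side, assuming I had a scratch schema I can write to; all table and column names are placeholders:

```sql
-- Sketch of the "pull once" idea (names and schema permissions are assumptions).
-- Run once (or on a schedule) to materialize only the columns/rows the 15 queries need:
CREATE TABLE scratch.member_events_staged STORED AS PARQUET AS
SELECT group_id, person_id, event_date, event_value
FROM   datalake.member_events
WHERE  event_date >= '2023-01-01';

-- Each app run would then hit the much smaller staged table instead of the full data lake:
SELECT person_id, event_date, event_value
FROM   scratch.member_events_staged
WHERE  group_id = 'DUMMY_GROUP_ID';       -- still swapped in by the Text Box / Dynamic Input tools
```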
I appreciate any ideas or help provided. Thank you!