SOLVED
Pulling Billions of Rows of Data to Store and Then Use
nafong
6 - Meteoroid
2 weeks ago
I wanted some help or suggestions on how I should approach this issue I am facing:
- Currently, my workflow connects to a database (Impala SQL) using the Dynamic Input tool. I have around 15 distinct queries that pull different information per person depending on the Group ID I enter into the Text Box Interface tool (it replaces a "dummy group id" in the query with the value typed into the text box).
- It takes about 50 minutes to run them all, because other people are running queries against the same data lake, which is restricted in how many queries can run at once due to a memory limit.
- Essentially, the queries scan billions of rows of data looking for a snippet that matches certain conditions, and then throw the billions of rows away.. multiplied by however many people are running similar queries at once.
Is there a way I can pull the billions of rows once and then look for those snippets of data matching certain conditions without "throwing it all away", to save on run time while also not being restricted by memory issues?
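For reference, each of my queries follows roughly the pattern sketched below: a template with a dummy group ID that gets swapped out and sent to Impala. This is a minimal sketch, assuming a connection through the impyla package; the host, table, and column names are placeholders, not my real ones.

```python
# Minimal sketch of the current per-group pull. Assumes impyla for the Impala
# connection; host, port, table, and column names are placeholders.
from impala.dbapi import connect

QUERY_TEMPLATE = """
SELECT member_id, event_ts, event_detail
FROM warehouse.member_events              -- placeholder table
WHERE group_id = '{dummy_group_id}'       -- token the text box swaps out
  AND event_detail LIKE '%condition%'     -- "snippet" condition
"""

def run_for_group(group_id: str):
    sql = QUERY_TEMPLATE.format(dummy_group_id=group_id)
    conn = connect(host="impala.example.com", port=21050)  # placeholder host
    try:
        cur = conn.cursor()
        cur.execute(sql)        # Impala scans billions of rows for this one group
        return cur.fetchall()   # only the matching snippet comes back
    finally:
        conn.close()

rows = run_for_group("GROUP_12345")  # repeated ~15 times with different templates
```

Every run repeats that full scan, which is what I am trying to avoid.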
Ideas:
- Pulling the data once into a CSV file and then using that file as the reference
  - May not be feasible because those files are limited to about 2 million rows
- Using the Cache Dataset macro -- is this the right way to approach this?
  - How can I use this tool so that people without an Alteryx license can run the workflow through the Alteryx Server?
  - How can I set it up so that the large dataset is pulled only once, and when another user runs the app portion (entering a Group ID in the text box), it starts from what was cached? (A rough sketch of this "pull once, filter locally" idea follows below.)
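To make that last idea concrete, here is a rough sketch of what I mean by "start from what was cached", assuming the one-time heavy pull could be written out to a Parquet file that later runs just filter; the path and column names are made up for illustration (pandas with the pyarrow engine assumed):

```python
# Rough sketch of "pull once, filter many times". Assumes the big extract has
# already been written once to a shared Parquet file; path and column names
# are placeholders.
import pandas as pd

EXTRACT_PATH = "//shared/cache/member_events.parquet"  # written once by the heavy pull

def snippet_for_group(group_id: str) -> pd.DataFrame:
    # The group filter is pushed down to the file scan (pyarrow engine), so a
    # per-group run never has to load the full extract into memory.
    return pd.read_parquet(EXTRACT_PATH, filters=[("group_id", "==", group_id)])

# Each app run (Group ID typed into the text box) becomes a cheap local filter
# instead of another multi-billion-row scan against the data lake.
df = snippet_for_group("GROUP_12345")
```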
I appreciate any ideas or help provided. Thank you!
Labels: Datasets, Gallery, Help, Optimization
2 REPLIES
lwolfie
11 - Bolide
2 weeks ago
You could try running it once and storing the data as a .yxdb file, then have a separate workflow that references that .yxdb file. I've also broken the data up and stored it in separate files based on year (for example), then brought those individual files back in together to continue the research.
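If it helps to see the split-and-recombine idea outside of Alteryx, a sketch of the same pattern with Parquet files (not yxdb; the directory and column names are made up) could look like this:

```python
# Sketch of the split-by-year / recombine idea, illustrated with Parquet files
# rather than yxdb; directory and column names are placeholders.
from pathlib import Path
import pandas as pd

CACHE_DIR = Path("shared/event_cache")  # one-time extract lives here

def write_yearly_files(big_extract: pd.DataFrame) -> None:
    # The one heavy pull is broken up by year so no single file gets huge.
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    for year, chunk in big_extract.groupby(big_extract["event_ts"].dt.year):
        chunk.to_parquet(CACHE_DIR / f"events_{year}.parquet", index=False)

def read_all_years() -> pd.DataFrame:
    # Later workflows just stitch the yearly files back together.
    files = sorted(CACHE_DIR.glob("events_*.parquet"))
    return pd.concat((pd.read_parquet(f) for f in files), ignore_index=True)
```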
2 weeks ago
This worked beautifully, thank you!
