Hello!
I am new to Alteryx and any help/suggestion is appreciated.
Problem:
I need to processing billions of data points for my analysis. If I take all data together then Alteryx runs for hours. There is no way I can pull this data outside db.
I am doing everything in-DB and using summarize to get more insights.
Is there a way to split the initial (Billions of record) into smaller batches in Alteryx without pulling it outside DB?
Would this reduce my temp DB consumption?
Thanks!
Hi @gelvira ,
Are you using the in-database tool palette In-Database | Alteryx Help
Use the Connect-indb to make a connection to your database. It's way more efficient as it doesn't pull the records to your local instance of designer.
You can use the query in the connect-indb tool or filter-in-db tool to take a subset of date before transferring the data out with the data stream out tool (converting it to a normal workflow/bringing it in locally).
Hope this helps!
Greetings,
Seb
Thank you for your response, Seb.
Yes, I am only using the in-database tool palette In-Database | Alteryx Help.
Also, connect-indb is used to make a connection. I have filtered down the data as much as I can in the initial query itself which is - "1 day's data" for testing.
Yet the data is huge to pull it outside-db. So I am using In-Database | Alteryx Help formulas, filters and summarizes.
I need a way to split this 1 day's data into batches ( for instance divide the data into 4 equal parts) and I can use containers to run one specific batch with the same workflow. I tried to create a function to do the split but nothing worked for me.
what kind of database are you using btw?
Which one are you trying to do:
1) Are you trying to process billions of instances In-DB?
2) Are you trying to bring billions of records out of In-DB and into the standard canvas?
responses:
1): This should be fine. No issue. You do your processing In-DB and get down a few hundred thousand or whatever and use datastream out or write In-DB as needed. You do all of your processing before datastream out.
2) Don't do this. Go back to 1. Alteryx can handle tens of millions of records. Maybe hundreds of millions. Maybe billions - depending upon your system memory - but getting that data from your DB source into in-memory computation will be bad.
can you split by adding:
something like:
SELECT ROW_NUMBER() OVER(ORDER BY - use your date/time or other field here) AS RecId,
RecID/COUNT(ROW_NUMBER) AS Bucket
User | Count |
---|---|
107 | |
82 | |
70 | |
54 | |
40 |