Bring your best ideas to the AI Use Case Contest! Enter to win 40 hours of expert engineering support and bring your vision to life using the powerful combination of Alteryx + AI. Learn more now, or go straight to the submission form.
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Summarize tool getting stuck at 50% with AMP, seeking tips for loading big data

clmc9601
13 - Pulsar
13 - Pulsar

Dear community,

 

I have a workflow that loads 680 gb of data from .csv.gz files to SQL Server. I'm running it on a machine with 767 gb of memory, but my current Alteryx memory per user settings is limited to 256 gb. I'm an advanced Alteryx user but still fairly new to the Alteryx-SQL interaction and big data like this.

 

This workflow loads four different tables into a SQL Server which is stored physically in our office, not the cloud. All the data streams have a summarize tool right before the output data tool. Three of them ran without issue. The fourth data stream keeps getting stuck at 50% summarize. I expected it to at least make progress, even if slowly, like the others. Why might it be getting stuck, and/or what can I do to get it to finish running?

 

Connection progress gets stuck hereConnection progress gets stuck here

(summarize tool's 50% is partially hidden by the other connection label)

 

clmc9601_3-1643597298319.png

 

 

 

 

 

 

 

 

 

Here are more contextual details:

- I'm currently using AMP with the goal of maximizing my available RAM

- AMP gives the compacting packets message and the converting packets to temp files message later in the workflow. Does this significantly slow down the workflow?

- Here is my output data connection string: odbc:DRIVER={ODBC Driver 17 for SQL Server};UID=xx;PWD=__EncPwd1__;DATABASE=xx;SERVER=xx

- loading the csv data into Alteryx takes 11 hours, and the summarize tool has been stuck on 50% for at least the next 45 hours with no visible progress. There are 11 other tools in my workflow prior to the summarize tool, mostly filters, formulas, and unions.

- Alteryx is not even using my full 256 gb according to the task manager

clmc9601_1-1643596163794.png

 

- doing a SQL query in my SQL Server management tool outside of Alteryx counting the rows results in 0.

- the full csv data has 297 million rows and 141 columns

 

Here's what I've already tried:

- autosizing fields (with my own macros, not Auto Field) to their smallest necessary size. This makes a difference in SQL Server, but it made no visible difference in Alteryx. The data made it past this macro just fine, so I don't think this is the issue.

- using a select to remove all but the 62 fields required by the summarize. They are all "group by" fields. Removing unnecessary fields reduces the data from 680 gb to 496 gb, but it still got stuck at the same place.

- making sure Alteryx is stored on the machine (it is), not in a network user folder per this discussion

- making sure I don't have a cartesian join (I don't), per this discussion

 

Here are options I've thought of but aren't my preference:

- trying to process and load the data in batches

- increasing my Alteryx user setting memory to closer to 700 gb (but I am one of several users on the machine)

 

Does anyone have other insight or tips for me? Are there other ways to load this much data into SQL Server efficiently? Thank you for your time and advice!!

11 REPLIES 11
TonyaS
Alteryx
Alteryx

@theinsideguy 

Oh sorry, I just saw that you mentioned the Unique tool. I saw that this post was about Summarize so that is why I replied that way. 

 

Have you tried adding a sort before the Unique to sort by the field that you are doing the Unique on? It's worth trying to see if that helps at all.  Also have you tried increasing your Memory Limit? 

Tonya Smith
Sr. Product Manager, Adoption & Expansion
theinsideguy
7 - Meteor

Have not tried the sort tool prior to unique—not sure how that would help, but I can try. I have increased my memory limit to as high as it will let me (we only have 16GB on our machines). 

Labels
Top Solution Authors