Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Summarize tool getting stuck at 50% with AMP, seeking tips for loading big data

clmc9601
13 - Pulsar
13 - Pulsar

Dear community,

 

I have a workflow that loads 680 gb of data from .csv.gz files to SQL Server. I'm running it on a machine with 767 gb of memory, but my current Alteryx memory per user settings is limited to 256 gb. I'm an advanced Alteryx user but still fairly new to the Alteryx-SQL interaction and big data like this.

 

This workflow loads four different tables into a SQL Server which is stored physically in our office, not the cloud. All the data streams have a summarize tool right before the output data tool. Three of them ran without issue. The fourth data stream keeps getting stuck at 50% summarize. I expected it to at least make progress, even if slowly, like the others. Why might it be getting stuck, and/or what can I do to get it to finish running?

 

Connection progress gets stuck hereConnection progress gets stuck here

(summarize tool's 50% is partially hidden by the other connection label)

 

clmc9601_3-1643597298319.png

 

 

 

 

 

 

 

 

 

Here are more contextual details:

- I'm currently using AMP with the goal of maximizing my available RAM

- AMP gives the compacting packets message and the converting packets to temp files message later in the workflow. Does this significantly slow down the workflow?

- Here is my output data connection string: odbc:DRIVER={ODBC Driver 17 for SQL Server};UID=xx;PWD=__EncPwd1__;DATABASE=xx;SERVER=xx

- loading the csv data into Alteryx takes 11 hours, and the summarize tool has been stuck on 50% for at least the next 45 hours with no visible progress. There are 11 other tools in my workflow prior to the summarize tool, mostly filters, formulas, and unions.

- Alteryx is not even using my full 256 gb according to the task manager

clmc9601_1-1643596163794.png

 

- doing a SQL query in my SQL Server management tool outside of Alteryx counting the rows results in 0.

- the full csv data has 297 million rows and 141 columns

 

Here's what I've already tried:

- autosizing fields (with my own macros, not Auto Field) to their smallest necessary size. This makes a difference in SQL Server, but it made no visible difference in Alteryx. The data made it past this macro just fine, so I don't think this is the issue.

- using a select to remove all but the 62 fields required by the summarize. They are all "group by" fields. Removing unnecessary fields reduces the data from 680 gb to 496 gb, but it still got stuck at the same place.

- making sure Alteryx is stored on the machine (it is), not in a network user folder per this discussion

- making sure I don't have a cartesian join (I don't), per this discussion

 

Here are options I've thought of but aren't my preference:

- trying to process and load the data in batches

- increasing my Alteryx user setting memory to closer to 700 gb (but I am one of several users on the machine)

 

Does anyone have other insight or tips for me? Are there other ways to load this much data into SQL Server efficiently? Thank you for your time and advice!!

11 REPLIES 11
BrandonB
Alteryx
Alteryx

Have you tried using a bulk loader as opposed to just a regular odbc connection? This can make a HUGE difference. https://help.alteryx.com/20213/designer/connect-sql-server-database You need to define an ODBC connection in your connections manager, but this may be a great way for you to speed up your workflows. 

TonyaS
Alteryx
Alteryx

@clmc9601 

 

Thank you for using the AMP Engine and for providing feedback. 

 

I encourage you to continue to report any use case issues that you find with running workflows with AMP Engine enabled. We worked hard to identify differences from the original Engine as well as provide guidance on how to better optimize workflows to run with AMP. 

 

In order to comment on or respond to your particular issue, I need to understand which version of our software you are currently using. We have made many improvements and fixed a large number of discovered issues with AMP since it was first released to GA with the 2020.2 Release. 

 

I also wanted to provide some helpful links to available documentation about AMP Engine:

 

AMP Engine webinar (32 minutes): https://community.alteryx.com/t5/Analytics/AMP-Engine-Discover-the-Untapped-Performance-of-Your-Anal...

 

https://help.alteryx.com/current/designer/alteryx-amp-engine

https://help.alteryx.com/current/designer/Alteryx-Engine-and-AMP-Main-Differences

https://help.alteryx.com/current/designer/AMP-Memory-Use

https://help.alteryx.com/current/designer/tool-use-amp

 

https://community.alteryx.com/t5/Analytics/Accelerate-Your-Analytic-Processes-with-the-New-AMP-Engin...

https://community.alteryx.com/t5/Engine-Works/AMPlify-your-Workflows/ba-p/617590

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Best-practice-with-the-new-Multi-threa...

AlterEverything Podcast: https://community.alteryx.com/t5/Alter-Everything-Podcast/66-The-Alteryx-AMP-Engine-Explained/ba-p/5...

https://community.alteryx.com/t5/Engine-Works/AMP-Engine-Technical-Deep-Dive-Part-1-Why-AMP/ba-p/570...  

https://community.alteryx.com/t5/Engine-Works/AMP-Engine-Technical-Deep-Dive-Part-2-Key-concepts-of-... 

 

Tonya Smith
Sr. Technical Product Manager, cloud App Builder
clmc9601
13 - Pulsar
13 - Pulsar

Hi @BrandonB and @TonyaS,

 

Thank you so much for getting back to me! 

 

I will try switching my odbc to a bulk loader. I've actually been using bulk loaders for my other SQL Server instances but clearly forgot to create one here. Hopefully this will make the load times more efficient for the other data streams that did fully make it to the output tool.

 

I am using Alteryx 2021.2. This is my first time seeing Alteryx make no visible progress for so long, but it's also my first time working with this volume of data. Thank you for the resources! I will also try disabling AMP to see if my workflow progresses past the summarize sticking point. If running the workflow without AMP removes the no-progress point (even if the workflow runs slower), would you agree that would seem like an AMP vs E1 issue?

TonyaS
Alteryx
Alteryx

Hi @clmc9601  & @BrandonB ,

 

I did not find anything that was specific to Summarize with AMP in our list of either fixed or open issues, but I did find one related to ODBC and SQL. 

 

Do you both think this is specific to ODBC? Here is the description of the recent defect that we fixed: 

AMP engine hangs if ODBC post SQL fails in Input Data

 

Does this sound like it could describe your issue? This was reported internally and fixed and the fix will be in the 2022.1 Release version when that comes out. 

I do not have the timing on that release yet. 2021.4 is coming out on February 2, 2022 and then 2022.1 would be the next Release after that. 

 

 

Here are the steps to reproduce the issue with AMP: 

1) Create workflow and make sure it is set to use AMP engine
2) Add Input Data to some ODBC source (e.g. SQL Server) and pick a table to read
3) In Post SQL Statement option of the Input Data, set it to "DO SOMETHING" (ie a bogus statement)
4) Run the workflow and wait

Tonya Smith
Sr. Technical Product Manager, cloud App Builder
clmc9601
13 - Pulsar
13 - Pulsar

Hi @TonyaS,

 

Thank you for finding that! I doubt it describes my issue though because my ODBC loader is on the Output tool (not Input) and I using neither post nor pre SQL statements.

 

Also, running the workflow without AMP failed miserably. I received disk space errors such as:

-  Dynamic Input (150) Insufficient disk space - extracting this file will result in less than 10% (or less than 10GB) of free disk space in your Alteryx temp folder.

-  Union (84) FileUnbuffered: Error in GetOverlappedResult: There is not enough space on the disk. (112)

 

Do you know specifically what the memory limit is doing? It seems to involve disk space (per varied experiences of my coworkers, too) and RAM, and it doesn't always maximize the memory limit. For example, while my memory limit here was 256 gb, Alteryx was only using 230 gb, even though it clearly had more data than that available to process. I've looked through all the articles you sent and a few I found on my own, but I didn't find details about this. Thank you again for your time and advice!

 

TonyaS
Alteryx
Alteryx

@clmc9601 

I've talked with a few folks here at Alteryx. We have done significant improvements around Memory management and preventing deadlocking with the 21.3 and 21.4 Release versions. It is likely that this is a symptom of some of the issues we were seeing prior to those fixes. It was not a single fix, but a significant grouping of work to deliver the changes to how we manage memory. I wonder if someone from Support could work with you to see if they can try running your workflow on our latest version and see if the issue you are experiencing no longer exists.

Tonya Smith
Sr. Technical Product Manager, cloud App Builder
clmc9601
13 - Pulsar
13 - Pulsar

Sounds good. Thank you, @TonyaS! We moved the workflow to a different machine with more temp space, and it stopped deadlocking. I would still like to learn more about optimization, so I will reach out to support. Thanks again! 

theinsideguy
7 - Meteor

I also have the same Amp-Engine issues running 21.2.1. For datasets over a certain amount, often times it is the "unique" tool that it gets stuck on. As soon as I see the "Beginning to compact waiting packets to reduce memory usage" message, I know there is a 50/50 shot it's going to get stuck. 

 

Hoping to update to 21.3 shortly (21.4 is not supported by my company yet). 

TonyaS
Alteryx
Alteryx

For Summarize, it still falls back to original Engine for some actions (documented here: https://help.alteryx.com/20221/designer/tool-use-amp)

But also there were definitely still some deadlocking issues with 21.2. Some of those were fixed for 21.3, but more were fixed in 21.4. Release 22.1 is the most stable Release for AMP. 

There is a performance cost to fall back to original Engine, so if you are using some of the unsupported Summarize functionality that could be adding to the performance issues. 

 

TonyaS_0-1655389844035.png

 

Tonya Smith
Sr. Technical Product Manager, cloud App Builder
Labels