I believe I'm doing something wrong in Alteryx, since I'm a new user. My company uses Alteryx as its dataflow tool, and I need to extract the data from a simple query (below) and save it to a local file. I'm using CSV. The results have 5.7M rows.
Issue: When exporting the data from this query using MS SQL Server Management Studio, the total time is 40 to 60 seconds. When running the same query in Alteryx using either the Input tool or the In-DB tools (Connect In-DB + Data Stream Out), it takes around 20-25 minutes. My guess is that I'm doing something very wrong.
I followed all the connection instructions on the Alteryx website for both data connections and In-DB connections, and I'm still getting very slow extraction times. My colleagues have the same issues (though most of them have never used other tools before, so they believe this to be normal).
Before posting here, I browsed the forum for similar issues, and while I found some complaints about extraction being slow, no real solution was provided. So I decided to come here and ask the community how to make SQL Server data extractions faster in Alteryx, as we (my team and I) must be doing something wrong.
Thanks in advance.
Complete Query:
SELECT a.LeadID, a.Employee_ID, a.Form_Name, a.Date_Submitted, l.SalesCycle
FROM [Lead_Activity_Reporting].[dbo].[Activity] a
INNER JOIN [Lead_Activity_Reporting].[dbo].[Lead] l ON a.LeadID = l.LeadID
WHERE l.SalesCycle IN ('Nov 2018', 'Dec 2018', 'Jan 2019')
The key question is - how are you implementing this in Alteryx ? As in, what does your equivalent workflow look like ?
Very simple:
Input Tool with the SQL Query -> Output Tool to CSV.
Also tried to use
DB Connector -> Data Stream Out -> Output tool to CSV.
I don't think it can be more basic than that.
Additional information: it seems it is not just MS SQL Server Management Studio; extracting data using DBeaver through a JDBC connection takes the same time to query the data, plus an additional minute to extract it to CSV.
I can't find any settings in these Alteryx tools. If there are any, they are located somewhere else.
That's the reason I believe something is really wrong with our Alteryx. How can I get more information for troubleshooting?
Hi,
So whether you are doing this with the process you have outlined using the standard tools or with the in-database tools, you are still streaming 5 million records through a 'pipe' to your local machine. In the case of the In-DB tools, this happens at the point where you use Data Stream Out; the best use of the In-DB tools is to do some preparation, aggregate your data to the level needed for the analysis, and only then stream it out.
My guess for why this process is slower than SQL Server Management Studio is that it may be generating the .csv file on the server machine and then passing the file across to your local computer (provided that you are indeed running the query on your machine).
To me, this is not a great test nor a great use of Alteryx's capabilities. I would focus on what happens to the .csv that is generated: is there a series of processes performed on this data that might be better translated into Alteryx?
Ben
Thanks for the reply.
Well, this was just a test case to try to figure out what I was doing wrong. The real workflow includes other tools, but since about 93% of the time was being spent on the data extraction, I created a simple workflow to test the extraction tools.
It's important to mention that it is not slow just in the creation of the CSV file; any tool that comes immediately after the Input tool (SQL Server) or the Data Stream Out takes a long time to receive the records. The file generation using, for example, DBeaver takes about two minutes in total (41 seconds to finish the query plus 1.1 minutes to download the data and save it to the CSV file). But hey, if the problem were the CSV generation, then something would be very wrong with Alteryx; you can't have any serious data software struggling to generate comma-separated text files.
Finally, you said this is not a great test case for Alteryx, right? I'm not sure I agree with you on this one. The extraction of data is, well, a standard part of ETL; additionally, every tool I tested (including reading directly into pandas from Python) performs similarly, between 1 and 3 minutes. I'm not questioning why Alteryx is not superior to these other tools; I'm questioning why it is orders of magnitude slower than any other tool I tested. It is so slow that we are considering replacing the Input tool with the Python Script tool to download the data (but that would make the Input tool worthless for SQL Server data, don't you agree?).
Right now we have a workflow that takes around 24-27 minutes, 20-22 of which is just the database extraction. It runs every 30 minutes on the Alteryx server, so if the workflow is delayed for some reason (busy database, overloaded Alteryx server, etc.) by just a couple of minutes, we lose one data processing cycle in the hour. And around 88% of the time is spent downloading, not processing, the data.
What would be the best way to diagnose whether the problem is really with the Alteryx software or with some configuration (like the ODBC driver or some SQL Server option that is incompatible with Alteryx)?
My thanks!
Since I believe other people will have the same issue with Alteryx, I'm going to post my solution here. It is not the ideal way, more of a dirty workaround, but it is blazing fast.
Use the Python Script tool. You will need the pyodbc and pandas packages installed in the Alteryx Python environment; follow the instructions in the Python Script window on how to install these packages.
import pandas as pd
import pyodbc
from ayx import Alteryx

# Read server information from the Text Input tool
p = Alteryx.read('#1')

# Connection parameters
param = {
    'DRIVER': p['Driver'][0],
    'SERVER': p['Server'][0],
    'DATABASE': p['Database'][0],
    'SQL': p['SQL'][0]
}

# The SQL code can also be hard-coded here instead, e.g.:
# param['SQL'] = """SELECT a.LeadID, a.Sales_ID, a.D2D_Rep, a.Form_Name, a.Date_Submitted, l.SalesCycle
#                   FROM [Lead_Activity_Reporting].[dbo].[Activity] a
#                   INNER JOIN [Lead_Activity_Reporting].[dbo].[Lead] l ON a.LeadID = l.LeadID
#                   WHERE l.SalesCycle IN ('Nov 2018', 'Dec 2018', 'Jan 2019')"""

# Query the server and load the data into a pandas DataFrame
with pyodbc.connect(driver=param['DRIVER'], server=param['SERVER'],
                    database=param['DATABASE'], trusted_connection='yes') as cnxn:
    df = pd.read_sql_query(param['SQL'], cnxn)

# Export the DataFrame as a table on output 1
Alteryx.write(df, 1)

# Clean up the memory
del df
Finally, connect a Text Input tool with four columns to the Python Script tool: Driver, Server, Database, SQL. In the first row, just below the column headers, put the ODBC driver string, the server name or address, the database (usually master), and the SQL code.
See image below:
With that you can get blazing fast speeds.
Known caveats:
- If the table/query is bigger than the available memory, this method won't work. There are ways around this using the chunksize option of pandas.read_sql_query, but there is no easy way to generalize the solution in these cases, so you will need to read the documentation.
- Only tested with SQL Server. I'm not sure whether the Input tool has the same issues with other databases. If it does, the solution should be similar, just with a different ODBC driver.
- The "There is no valid metadata..." message will appear once you click on other tools in the workflow, and in some cases it can cause the settings in the downstream tools to disappear. I don't know the reason and couldn't find a solution either.
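As a sketch of the chunked approach mentioned in the first caveat: passing chunksize to pandas.read_sql_query turns the result into an iterator of DataFrames, so the whole table never has to sit in memory at once. The example below uses an in-memory SQLite database with made-up column names as a stand-in; the same pattern works with a pyodbc connection.

```python
import sqlite3
import pandas as pd

# Stand-in database; with SQL Server you would pass a pyodbc connection instead
cnxn = sqlite3.connect(':memory:')
pd.DataFrame({'LeadID': range(10), 'SalesCycle': ['Nov 2018'] * 10}).to_sql(
    'Activity', cnxn, index=False)

# chunksize makes read_sql_query return an iterator of DataFrames,
# so only one chunk is held in memory at a time
chunks = pd.read_sql_query('SELECT LeadID, SalesCycle FROM Activity',
                           cnxn, chunksize=4)

# If the full table does not fit in memory, process each chunk here
# (e.g. append it to a CSV) instead of concatenating them all
df = pd.concat(chunks, ignore_index=True)
```

Writing each chunk out as it arrives (for example with df.to_csv(..., mode='a')) keeps peak memory bounded by the chunk size rather than the table size.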
Alteryx Designer developers: I'm assuming the reason the Input tool is so sluggish is that you extract a small chunk, save it to a temp file, get the next chunk, save it, and so on. That's a feasible approach to dealing with limited memory. However, users should be able to customize that in the Input tool. The way it is designed makes it worthless for SQL data extraction, and many alternatives are preferable. You can't call Alteryx an ETL tool if its extraction is that poor.
Hi,
I read your post and was intrigued by the performance issues you were mentioning, so I ran some local tests.
I ran a query against one of our tables to pull in 5 fields (that usually have data) and grab ~6 million rows. I avoided a Join (as utilized in your example) because I'm unaware of indexing/performance considerations in your database, but I don't expect that this would cause the issue you are referring to.
I ran my query in SSMS to get a baseline, and found that it completed in ~45 seconds - about in line with your tests - including populating the results grid.
I then hopped over to Alteryx and set up a basic OleDB connection and copied my SSMS query into Alteryx. I added a CSV output to the temp directory (%temp%), and ran it. This completed in ~27 seconds, and created a ~340KB file.
I also ran this process and pointed it to a networked drive, and again found that I got a successful output to .csv in pretty short order - around 36 seconds this time - which makes sense, as local space is much faster to write to than a networked drive.
Finally, I had a coworker try running the same process in Alteryx first and then SSMS so that I had a second test case. I also had them try outputting to a file instead of the results grid.
Their results:
Writing to a temp file in Alteryx: ~30 seconds.
Writing to the results grid in SSMS: ~40 seconds.
Writing to a CSV file directly from SSMS: ~35 seconds.
So overall, in my environment/test case, Alteryx outperformed SSMS for this approximate volume of data.
All of that being said, I have encountered issues in the past where certain drivers can suffer from performance issues in certain cases with SQL. I saw in your python parameters that you are using ODBC driver 17 for SQL Server. This driver is not on the Alteryx list of supported drivers.
My two questions - are you using the same exact connection string in the Alteryx Input data tool as you are in Python? And have you/can you try using a different SQL driver for this connection?
Hi Claje,
First, let me tell you about the drivers. I started with ODBC Driver 13 for SQL Server (like everyone on my team). When I started doing the tests, I tried ODBC 11, ODBC 17, and SQL Server Native Client 11.0. Of all of these, the one that performed best was ODBC 17. However, this may be because the data was already cached on the server and it just seemed faster. I didn't try the OleDB driver (wasn't it deprecated by Microsoft?). I can definitely give it a try.
One question: are you sure the file was 340KB for 6 Million rows? Maybe 340MB?
The numbers you posted are consistent with what I have been seeing. The only tool that performed far worse was Alteryx, which took about 20 minutes for the total process.
What I've tried in Alteryx: both Connect In-DB -> Data Stream Out -> Output Data, and Input Data -> Output Data. I used a very basic connection string (odbc:DRIVER={ODBC Driver 17 for SQL Server};DATABASE=;SERVER=NDCMSQLP15;Trusted_Connection=yes), which is exactly the same one I use from Python. I also tried customizing the ODBC driver by creating a System DSN with some preset options (combinations switching on and off subnet failover, transparent network IP resolution, perform translation for character data, etc.) but couldn't get any improvement.
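One way to isolate the driver from Alteryx is to assemble the exact same ODBC connection string and time the query directly through pyodbc; if that path is fast, the driver is not the bottleneck. A minimal sketch, reusing the driver and server names quoted above (the helper function is hypothetical, not an Alteryx or pyodbc API):

```python
# Hypothetical helper to assemble the same ODBC connection string used
# in the Alteryx Input tool, so the query can be timed outside Alteryx.
def build_conn_str(driver, server, database='master', trusted=True):
    parts = [f'DRIVER={{{driver}}}', f'SERVER={server}', f'DATABASE={database}']
    if trusted:
        parts.append('Trusted_Connection=yes')
    return ';'.join(parts)

conn_str = build_conn_str('ODBC Driver 17 for SQL Server', 'NDCMSQLP15')

# Against the real server (requires pyodbc and pandas):
# import time, pyodbc
# import pandas as pd
# t0 = time.time()
# with pyodbc.connect(conn_str) as cnxn:
#     df = pd.read_sql_query("SELECT ...", cnxn)
# print(f"{len(df)} rows in {time.time() - t0:.1f} s")
```

Comparing this wall-clock time against the Alteryx run with the identical string narrows the problem to either the driver or Alteryx's use of it.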
What takes time (and I can see it by watching the workflow) is retrieving the results. Once the results have been retrieved, the rest of the workflow runs fast. I will try OleDB to see if it improves the transfer and will post the result here.
Thanks for your reply.
Nelson
UPDATE: Just finished running the same query running OleDB and ODBC 13 drivers. The difference is savage:
Now I think it's reasonable to say that Alteryx indeed has some problems with ODBC drivers. And it seems to be an Alteryx issue, because Python works flawlessly with the same driver.
I also saw on Microsoft's site that they un-deprecated OleDB in 2017, which seems to explain why people are still using it. That's good to know. But the Alteryx developers will still need to fix this issue with ODBC (I used a driver approved for Alteryx this time).
You're correct regarding volume - 340 MB, not KB - the challenge of writing up a post quickly!
It would seem, based on your testing, like an issue pertaining to the ODBC driver in Alteryx. In my testing, I leveraged an OleDB driver yesterday. I just ran my test case again to identify a read time for Alteryx using ODBC Driver 13 for SQL Server and that driver seems significantly slower - My process ran for about 9 minutes using ODBC 13.
However, I configured this using the "Microsoft SQL Server" option within the Alteryx input data tool to end up with this performance.
If I configure an ODBC connection on my local machine instead (using the windows ODBC manager), and use the same driver, I return results in ~56 seconds. A little slower than OleDB, but about on par overall with the technologies we've tested so far.
I also reran my connection leveraging OleDB today to ensure that activity on the server was taken into account. That process ran in 23 seconds today.
As such, it seems like something about the "SQL Server Database Connection" option is causing this performance issue. That's also right about the limit of my knowledge on the topic.
If you can, try testing the ODBC connection by creating it using the ODBC Data Source Administrator and see if that runs any faster.
If it does, I'd recommend reaching out to alteryx support (support@alteryx.com) to discuss this performance issue with them. They may be able to provide more information as to configuration options I may have missed, and they will be able to tell you if this is expected behavior.
Hi Claje,
I'm replying to you now just to summarize the problem and also mark this as the solution, since there are at least two other related posts regarding ODBC slowdowns in Alteryx, and none of them have solutions.
In summary, the issue is that Alteryx doesn't work very well with the ODBC driver; more specifically, when you select Microsoft SQL Server while creating the connection.
There are three solutions:
- Use OleDB Driver instead;
- Create the ODBC connection using the ODBC Admin (did not test this one but Claje confirmed it works);
- Use a Python script;
I'm contacting Alteryx support to check whether this is an issue or a bug, but the options above should do the trick for now.