Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
Samantha_Jayne
Alteryx


Introduction 

 

One of my topics for INSPIRE EMEA 2022 is updating data within Snowflake at speed, and this three-part blog series will lead you right into INSPIRE this year. 

If you find that the UPSERT functionality in the Output Tool takes too long, you are in the right place. This blog will show you how to combine what Snowflake provides with what Alteryx provides to give you ultimate power! (As Tim "The Toolman" Taylor would say) 

So instead of being an Al Borland, waiting around, let’s strap in and learn some stuff about our favorite tools. 

C’mon Al, it will be fun and so rewarding, I promise! 😃 

Two topics we need to understand are UPSERT and MERGE, and the differences between the two. 

UPSERT updates data that already exists within the database using a primary key. In Alteryx, this is controlled by the transaction size setting in the Output Tool, which ensures the data is managed carefully and nothing is dropped. Alteryx calls the API with the first chunk of data, the database updates it and then sends a message back to Alteryx to say all is complete and ask for the next set of data, and so on until the end of the file. Depending on your data size and transaction size, this can take a long time. 
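To picture what that means on the database side, here is a rough, hypothetical sketch of a single transaction chunk (purely illustrative; the exact statements Alteryx generates will differ, and the table name and values here are made up):

BEGIN; -- one transaction per chunk, sized by the Output Tool's transaction size
UPDATE Database.Schema.My_Table -- hypothetical target table
SET OPEN_DATE = '2022-09-01' -- hypothetical new value for these rows
WHERE ID IN (101, 102, 103); -- the primary keys in this chunk (e.g. 100 of them)
COMMIT; -- Alteryx waits for the acknowledgement, then sends the next chunk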

 


MERGING updates data that already exists within your database, again using a primary key (or keys). You don’t send the data up in sections from Alteryx; you send the whole file up and get Snowflake to do the hard work for you. The outcome is exactly the same; only the method is slightly different.  
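In its simplest form, a MERGE statement names the two tables, the key to match on, and what to do with matched and unmatched rows. Here is a bare-bones sketch with hypothetical table names (we will build the full version step by step below):

MERGE INTO production_table t1 -- the table being updated
USING staging_table t2 -- the freshly uploaded data
ON t1.ID = t2.ID -- match on the primary key
WHEN MATCHED THEN UPDATE SET t1.OPEN_DATE = t2.OPEN_DATE -- update rows that already exist
WHEN NOT MATCHED THEN INSERT (ID, OPEN_DATE) VALUES (t2.ID, t2.OPEN_DATE); -- insert rows that are new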

 


The Use Case 

 

As a retailer, we need to keep on top of third-party datasets daily: we receive files that require updating in databases. However, the files are becoming larger, and the reports need to be ready first thing. With the new datasets, the update currently takes 3 hours using the Output Tool with UPSERT. The data is available from 08:30 and the reports need to be ready for 09:00. HELP! 

 

For this kind of demonstration, I love to use the Geolytix Open Supermarkets dataset. They create a fabulous dataset for use within modelling across the UK. It can be downloaded here: UK Retail Points. 

 

In our use case, we have around 17K points from the Geolytix Retail Points dataset, and I would like to update the dates on them. I fix the date and then try to update them with Alteryx, which seems to take a long time: at roughly 100 rows a minute, it would take hours to update my data. 

 

17,000 / 100 = 170 packets, at around 1 minute per packet, so 170 / 60 = 2.83 hours ≈ 3 hours! 

 

There must be a better way! Don’t worry, there is 😊 

 

The Solution 

 


 

So, let’s think about this differently: bulk push the whole dataset up into a temp table, then MERGE the data on the primary key using Snowflake’s merge functionality. We then have the best of both worlds: Snowflake takes care of all that hard work, matching the data up and inserting rows where the data is new, to ensure we have a great clean dataset at the end. 

 

The result: we have completed our data update, which used to take HOURS, in around 11 seconds. 

 

That’s great, so how do I do it? 

 

Well, the one thing we must learn is how to write Snowflake MERGE statements within Alteryx. So here is a handy guide...

 

Step by Step Guide to Writing Snowflake Merge Statements

 

  1. I have the original table in Snowflake named Geolytix_Retail_Points_Merge. 
  2. I write the data up to the Snowflake database using the same table name plus the suffix “_Temp”, and set the output option to Overwrite Table (Drop), using the bulk loader option for speed.
  3. I add a MERGE statement in the Post Create SQL Statement.  

 


 

 

The MERGE statement is a story of 4 parts: 

 

Part 1 – Tables & Primary Keys 

 

You need to tell Snowflake the tables you are working with. I recommend using aliases here to make your life easier; an alias gives your table name a shorthand reference to use throughout your SQL code. 😉  

Examples of aliases in the code below are t1 and t2; the alias appears directly after the table name. 

 

The two tables we are using in my example are: 

 

MERGE INTO Database.Schema.Geolytix_Retail_Points_Merge t1 -- your production table 
USING Database.Schema.Geolytix_Retail_Points_Merge_Temp t2 -- your updated data 
ON t1.ID = t2.ID 

 

If you have more than one primary key, no problem: you just add AND and then the next primary key condition, and so on, i.e. ON t1.ID = t2.ID AND t1.ID2 = t2.ID2 

 

Part 2 – Matched Statement 

 

When we have a match on our primary key, i.e. when t1.ID = t2.ID, what do you want the merge statement to do?

 

Ideally, you want it to update the fields that need updating… here you can expand your logic as far as you want, but I personally just need the whole row updating. 

 

WHEN MATCHED THEN UPDATE SET  
t1.RETAILER = t2.RETAILER,  
t1.FASCIA = t2.FASCIA,  
t1.STORE_NAME = t2.STORE_NAME, 
t1.ADDRESS_LINE1 = t2.ADDRESS_LINE1, 
t1.ADDRESS_LINE2 = t2.ADDRESS_LINE2, 
t1.ADDRESS_SUBURB = t2.ADDRESS_SUBURB, 
t1.ADDRESS_TOWN = t2.ADDRESS_TOWN, 
t1.ADDRESS_COUNTY = t2.ADDRESS_COUNTY, 
t1.ADDRESS_POSTCODE = t2.ADDRESS_POSTCODE, 
t1.LONG_WGS = t2.LONG_WGS, 
t1.LAT_WGS = t2.LAT_WGS, 
t1.BNG_E = t2.BNG_E, 
t1.BNG_N  = t2.BNG_N, 
t1.PQI = t2.PQI, 
t1.OPEN_DATE = t2.OPEN_DATE, 
t1.SIZE_BAND = t2.SIZE_BAND 

 

Part 3 – Not Matched Statement 

 

When we have no matches, we then want to insert the data into the table. Normal SQL insert statements follow the pattern of: 

 

INSERT (into these columns) VALUES (from these columns) 

Don’t forget those aliases! (t1 = production table, t2 = temp table) 

 

 

WHEN NOT MATCHED THEN 

INSERT (t1.ID, t1.RETAILER, t1.FASCIA,  t1.STORE_NAME, t1.ADDRESS_LINE1, t1.ADDRESS_LINE2, t1.ADDRESS_SUBURB, t1.ADDRESS_TOWN, t1.ADDRESS_COUNTY, t1.ADDRESS_POSTCODE, t1.LONG_WGS, t1.LAT_WGS, t1.BNG_E, t1.BNG_N, t1.PQI, t1.OPEN_DATE, t1.SIZE_BAND) 

VALUES (t2.ID, t2.RETAILER, t2.FASCIA,  t2.STORE_NAME, t2.ADDRESS_LINE1, t2.ADDRESS_LINE2, t2.ADDRESS_SUBURB, t2.ADDRESS_TOWN, t2.ADDRESS_COUNTY, t2.ADDRESS_POSTCODE, t2.LONG_WGS, t2.LAT_WGS, t2.BNG_E, t2.BNG_N, t2.PQI, t2.OPEN_DATE, t2.SIZE_BAND); 

 

Part 4 – Drop the Temporary Table 

 

This is a simple clean-up step to keep database administrators happy: we delete the _Temp table so that everything is ready for the next update. 

 

DROP TABLE Database.Schema.Geolytix_Retail_Points_Merge_Temp; 

 

Put it all together into your Post Create SQL Statement box, as shown below, and you can now update at speed. Try it with a basic table first while you get familiar with the concept, then bookmark this page for reference, as you will be back. 😉 
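For reference, here are all four parts assembled into a single Post Create SQL statement, using the example tables and columns from above (swap in your own database, schema and column names):

MERGE INTO Database.Schema.Geolytix_Retail_Points_Merge t1 -- your production table
USING Database.Schema.Geolytix_Retail_Points_Merge_Temp t2 -- your updated data
ON t1.ID = t2.ID
WHEN MATCHED THEN UPDATE SET
t1.RETAILER = t2.RETAILER,
t1.FASCIA = t2.FASCIA,
t1.STORE_NAME = t2.STORE_NAME,
t1.ADDRESS_LINE1 = t2.ADDRESS_LINE1,
t1.ADDRESS_LINE2 = t2.ADDRESS_LINE2,
t1.ADDRESS_SUBURB = t2.ADDRESS_SUBURB,
t1.ADDRESS_TOWN = t2.ADDRESS_TOWN,
t1.ADDRESS_COUNTY = t2.ADDRESS_COUNTY,
t1.ADDRESS_POSTCODE = t2.ADDRESS_POSTCODE,
t1.LONG_WGS = t2.LONG_WGS,
t1.LAT_WGS = t2.LAT_WGS,
t1.BNG_E = t2.BNG_E,
t1.BNG_N = t2.BNG_N,
t1.PQI = t2.PQI,
t1.OPEN_DATE = t2.OPEN_DATE,
t1.SIZE_BAND = t2.SIZE_BAND
WHEN NOT MATCHED THEN
INSERT (t1.ID, t1.RETAILER, t1.FASCIA, t1.STORE_NAME, t1.ADDRESS_LINE1, t1.ADDRESS_LINE2, t1.ADDRESS_SUBURB, t1.ADDRESS_TOWN, t1.ADDRESS_COUNTY, t1.ADDRESS_POSTCODE, t1.LONG_WGS, t1.LAT_WGS, t1.BNG_E, t1.BNG_N, t1.PQI, t1.OPEN_DATE, t1.SIZE_BAND)
VALUES (t2.ID, t2.RETAILER, t2.FASCIA, t2.STORE_NAME, t2.ADDRESS_LINE1, t2.ADDRESS_LINE2, t2.ADDRESS_SUBURB, t2.ADDRESS_TOWN, t2.ADDRESS_COUNTY, t2.ADDRESS_POSTCODE, t2.LONG_WGS, t2.LAT_WGS, t2.BNG_E, t2.BNG_N, t2.PQI, t2.OPEN_DATE, t2.SIZE_BAND);

DROP TABLE Database.Schema.Geolytix_Retail_Points_Merge_Temp;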

 

Conclusion 

 

This blog is about the need for speed and utilising the right technology at the right time. By handing the updates over to Snowflake from within Alteryx, we dramatically improve the time it takes to update existing data in a database. 

 

We would love for you all to try this in your organisations, if you are Snowflake users, and let us know the before and after times in the comments. It would be great to see the impact this has had for you. 

 

And that is not all: tune in at the same time next week, when we will use this method for bulk uploading spatial data into Snowflake! 

 

Yes, you heard me right! 😄

 

Handy Tip

 

Snowflake doesn't like the max string size in Alteryx, so when writing data, be sure to check your data types first and adjust those with a size of 1073741823 to the actual size of your data. Unsure what to change it to? Try using the Auto Field tool.

 

Samantha Clifton
CGeog | ACE Emeritus | Sales Engineering

Samantha is well known for sharing her knowledge and evangelising Alteryx with new users, both in her role as a Sales Engineer and at INSPIRE. Her passion lies with making everyday business chaos simple and automated, giving people hours back in their day jobs and personal lives. With over 15 years' experience in data, she enjoys solving puzzles, spatial analytics, writing blogs and helping others develop themselves into gurus like herself.


Comments
Brinker
8 - Asteroid

I am trying out this exercise for our company so we can show clients the benefits of Snowflake and Alteryx. I had to change the merge statement provided, since the data columns in the file provided were different from the ones in the merge statement. I got it working, but it was too quick on our normal SQL Server to show the improvement, so I took a large dataset of 5 million rows.

What size Snowflake warehouse would you recommend in this situation to show off the improved speed best?

Samantha_Jayne
Alteryx

@Brinker thank you for engaging with me on this topic. I would love to talk to you more about your current setup and what you have tried. I have INSPIRE next week, but please reach out to samantha.hughes@alteryx.com and let's talk about your speeds and examples when I return. As SQL Server is a more native driver, I think you have the best setup there. The Snowflake speeds compare updating the data in the database using the Output Tool versus merging via bulk updating, not different platforms, but with your setup we could potentially show more insights here. Would love to talk more. Samantha 

mkosmicki1
8 - Asteroid

I keep seeing the error: "Redshift Bulk Loader strings should be in the form: rsbl:ConnectString|table"

My data fields and table name do have spaces (not in snake case or with the spaces removed). Is that the issue?

Also... I'm working with 485K records and the temp table still takes 2:30 minutes to load... so how exactly is this making the process faster? Are you using Designer Cloud or a dedicated T1 line? Something has to be different with the setup.

Samantha_Jayne
Alteryx

Hi @mkosmicki1, this blog was written with Snowflake connections in mind. Is it possible you have the wrong connection option selected for your output?

If you are trying to apply it in Redshift, let me know. Sam

mkosmicki1
8 - Asteroid

@Samantha_Jayne I'm uploading to our Snowflake instance using the Snowflake ODBC connection. I set everything up like the screen snips you have for the output tool... but your available options look a bit different from what I'm seeing. I'm guessing an Alteryx version difference?

Samantha_Jayne
Alteryx

@mkosmicki1 I think it would be best to jump on a quick call so I can see your connection setup, as the error suggests it's connecting to Redshift and not Snowflake Bulk. Let's start there. Please email samantha.clifton@alteryx.com and I will set up some time.

Samantha_Jayne
Alteryx

@mkosmicki1 and I got together, and there is a little Snowflake tip I want to share with you all: if you have Snowflake column names which are lowercase, or contain spaces or special characters, then you need to add quotes to the names, like this: t1."retailer" = t2."retailer". If everything is in UPPERCASE with no spaces or special characters, then you don't need to worry. 

 

Thanks for your time today, and well done on your massive time saving!

maamek
7 - Meteor

Don't you need to filter the records each time you run? Because it may add a record ID that already exists with each run, no?

Samantha_Jayne
Alteryx

Hi there @maamek 

Merge statements have two parts: if the Record ID is present, it will update; if it is not, it will insert.

Hope that helps. 

Samantha

mtwilliamson
7 - Meteor

Hi Samantha,

I'm running into the same Redshift errors that mkosmicki1 was running into. I made sure to put double quotes around my fields and I'm still getting errors. Any suggestions?

One other thing: I don't have primary keys on the Snowflake tables I will be loading. 

Samantha_Jayne
Alteryx

@mtwilliamson Hi there, apologies for the delay; I have been on holiday.

When you first set up your connection within the In-DB tools, you need to select Snowflake from the top drop-down. The first database listed is Redshift. Please set your connection up again with Snowflake selected and all should be great. 😉

Reach out if you would like to walk through it.