Hello everyone,
I have a requirement to append new rows to a Snowflake table, but I need to ensure that duplicate rows are not inserted. Unfortunately, I cannot use an "Update; Insert if New" approach because the table does not have a primary key.
The table includes columns such as T_Date, Total_customers, and Total_Income, and I am generating rows for the past 3 days (T_Date).
Note: I cannot delete and recreate the table with a primary key due to the large amount of production data it contains.
Could you please suggest any alternative methods, such as pre or post SQL queries, to achieve this?
@swapsingh2712 Is removing duplicates in a Post SQL statement possible? You could write a script that checks if any duplicates are present and have it keep the most recent record.
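To make that concrete: in Snowflake, a Post SQL dedup can be as simple as rewriting the table without duplicates. A minimal sketch, assuming the table from the question is named DAILY_INCOME (the name is a placeholder) and that duplicate rows are fully identical across all three columns:

-- Post SQL sketch: rewrite the table, keeping one copy of each row.
-- INSERT OVERWRITE truncates and reloads the table in a single statement.
INSERT OVERWRITE INTO DAILY_INCOME
SELECT DISTINCT T_Date, Total_customers, Total_Income
FROM DAILY_INCOME;

If the rows carried a load timestamp and you wanted to keep only the most recent copy per T_Date, a QUALIFY ROW_NUMBER() OVER (PARTITION BY T_Date ORDER BY load_ts DESC) = 1 filter in place of DISTINCT would do the same job.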
Bacon
We have had the same issue and adopted a method using joins. We take the data we are about to load and build a NEW_KEY (date, id, whatever), then build the same key from the final table as an OLD_KEY, and do an in-db FULL OUTER JOIN between the two. From the join you can classify every row:
- OLD_KEY is NULL: the record is new and needs to be inserted.
- Both keys match: the record is essentially updating what is already there.
- NEW_KEY is NULL (OLD_KEY populated): these are historical records you simply want to retain.
Split out the INSERT/UPDATE rows and join the keys back to your incoming dataset; take the RETAIN keys, join them back to your existing data, and simply UNION the two. We use this method to set created dates, updated dates, etc., and to run a "MERGE/APPEND/KEEP" process as we build the final output table. We write the result to a temp table and then do one final step to overwrite the final table, as a safety precaution in case any values error on insertion. Depending on how much data you are talking about, it is sometimes faster to reload a table than to run deletes.
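The classification step above can be sketched in Snowflake SQL. All object names here are placeholders (new_data for the incoming rows, final_table for the existing table), and the key is simplified to T_Date; substitute whatever columns your NEW_KEY/OLD_KEY concatenates:

-- Sketch of the full-outer-join classify/split approach (names assumed).
CREATE OR REPLACE TEMPORARY TABLE merged AS
SELECT
    COALESCE(n.T_Date, o.T_Date)                   AS T_Date,
    COALESCE(n.Total_customers, o.Total_customers) AS Total_customers,
    COALESCE(n.Total_Income, o.Total_Income)       AS Total_Income,
    CASE
        WHEN o.T_Date IS NULL THEN 'INSERT'  -- new key only: brand-new row
        WHEN n.T_Date IS NULL THEN 'RETAIN'  -- old key only: historical row
        ELSE 'UPDATE'                        -- keys match: incoming row wins
    END AS action
FROM new_data n
FULL OUTER JOIN final_table o
    ON n.T_Date = o.T_Date;  -- NEW_KEY = OLD_KEY comparison

-- Final, safest step: overwrite the real table from the temp table.
INSERT OVERWRITE INTO final_table
SELECT T_Date, Total_customers, Total_Income
FROM merged;

Because the COALESCE prefers the new-side columns, UPDATE rows take the incoming values while RETAIN rows fall back to the historical ones, which is exactly the MERGE/APPEND/KEEP split described above.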
So for tables that are not huge, you should be able to use logic like:
MERGE INTO target_table AS tgt
USING source_table AS src
ON tgt.id = src.id -- join on whichever column(s) identify a matching record (all columns, if the table has no key)
WHEN NOT MATCHED THEN
INSERT (id, column1, column2)
VALUES (src.id, src.column1, src.column2);
For larger tables, though, this is a bigger problem. You should not be doing incremental loads from Alteryx into Snowflake like this. You should have some form of staging/raw table plus a cleaned table. You don't need a full-on Unity Catalog like Databricks, but you should have a raw table (which is what Alteryx would write to) and some kind of CDC process in Snowflake that writes to your clean table/view. If your system isn't set up this way, that's the first issue you should be addressing.
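One way to sketch that raw-to-clean flow, with all object names assumed (raw_sales as the landing table Alteryx appends to, clean_sales as the deduplicated table): a scheduled MERGE, or a Snowflake stream plus task wrapping the same statement, keeps the clean table free of duplicates no matter what lands in raw.

-- Assumed names: raw_sales (Alteryx target), clean_sales (deduplicated).
-- Matching on all columns works even though the table has no primary key.
MERGE INTO clean_sales AS tgt
USING (
    SELECT DISTINCT T_Date, Total_customers, Total_Income
    FROM raw_sales
) AS src
ON  tgt.T_Date          = src.T_Date
AND tgt.Total_customers = src.Total_customers
AND tgt.Total_Income    = src.Total_Income
WHEN NOT MATCHED THEN
    INSERT (T_Date, Total_customers, Total_Income)
    VALUES (src.T_Date, src.Total_customers, src.Total_Income);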