This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
Hi Community! Please read this post before giving the "go to" answer of "Just use GroupBy!" :) I have an In Database workflow and in the middle I want to remove duplicate records by the first available date. For example, customer Bob shops at the store during February and he returns to shop again during that same month. I am only concerned about Bob's first shopping trip, and want to remove the second shopping trip from my results. My data has a date field and a customer id field along with transaction data: qty sold, amt, etc. The summarize tool with Group By only rolls up the records and doesn't really remove the duplicates, nor does it pick the first date. (This is bad news as it makes Bob look like a big spender when he visits!) I tried writing some SQL code in a Filter tool, but I keep getting an error when I use the Select Distinct command. Basically I'm trying to replicate the "Select Unique" tool, which works perfectly. The only problem is I have to exit the In-DB stream, which takes forever when you are dealing with 50 million records. Thanks!
The approach I would take is to (I have read your post in full I promise), use the summerize tool, grouping by your customer and returning the FIRST date (if you want the first date in each month then you would need to create a formula that replicates a 'month-year' parse and group by this too).
You can then join this table against your original data stream on all fields too pull out exactly the one record you want to keep.
Thanks for your response and thanks for the quick tip on the first date in each month which will help when I expand my query for multiple years.
When I use the summarize tool I can group by customer ID, but how do I return the FIRST date in the series?
During lunch I was also thinking about trying to add a count to the records with the first date at 1 and subsequent dates at increasing integers. Then I could filter by this new "rank" column for only the first occurance?
I think my other struggle here is that I have a lot of sql code from other team members but I just can't take snippets of the code and paste it into Alteryx In-DB tools like Filter and Formula?
Unfortunately I'm not overly familar with the In-DB intergration so can't comment on the embedding script.
There should be an operation 'First', according to the documentation at least but someone did themselves suggest this wasn't the case last week when I answered another in-db question, check out the documentation here: https://help.alteryx.com/11.7/LockInSummarize.htm, is there a min operator coming to think of it? That would achieve the same thing and I believe there is.
Whilst the incriminating idea will work, i'm not to sure how we would achieve that without using the functionality outlined above anyway so if we can't do one we can't do both!
Okay, I'm going to leave this question open incase an Alteryx Wizard comes by to figure out how to use the Filter or Formula In-DB tools to accomplish this task.
I took you advice Ben and appreciate the help!
I had to use the summarize tool to group by customer ID and select the MIN date. I then took this output and performed a Left Outer Join to join back into the workflow so that I could pick up the remaining fields. A quick Select tool dropped the duplicate Customer ID and Date fields.
***To anyone that does not want to make the same mistake as I did. When you use the summarize tool you can not group by all of the records to try and save a step! The tool can't identify the unique records on the customer ID field then so it ends up keeping all the records, including the duplicates.