To give an example scenario.
I have a UserInformation in a table with below columns.
UserId,
UserName,
UserAddress,
UpdatedDate.
I need to get the latest Updated record for each User.(only one record per user)
I am using Hive Query to get the data like below.
SELECT u.UserId, u.UserName, u.UserAddress FROM ( SELECT ROW_NUMBER() OVER (PARTITION BY UserId ORDER BY UpdatedDate DESC) AS RowRank, UserId, UserName, UserAddress, UpdatedDate, FROM UserInformation ) u WHERE u.RowRank = 1;
Is it possible to do the same in Alteryx with In-DB modules?
Solved! Go to Solution.
I am not quite sure how efficient it will be but you could:
I don't have access to a Hive server but connected to a SQL SERVER you can create the RowNumber using a formula in-db and then user a filter in-db to select just the latest record:
Both Answers work on Hive, the second one is faster.
This first answer works but doesn't look to be very efficient, The answer by "jdunkerley79" looks better to me.
Thanks for the contributing.
I have one additional question on the same dataset. I am trying to capture change of Name or Address. I use HQL to do this and looking to see if this can be done in-DB using Alteryx.
Select UserId from ( Select UserId, UserName, (LEAD(UserName, 1) OVER (PARTITION BY UserId ORDER BY UpdatedDate DESC)) AS prev_UserName, UserAddress, (LEAD(UserAddress, 1) OVER (PARTITION BY UserId ORDER BY UpdatedDate DESC)) AS prev_UserAddress From UserInformation ) u Where u.UserName <> u.prev_UserName or u.UserAddress <> u.prev_UserAddress
A formula-InDB can solve the LAG and LEAD window function as well.
LEAD(UserAddress, 1) OVER (PARTITION BY UserAddress ORDER BY UpdatedDate DESC).
Thanks
@JoeS thank you, this suggestion helped me in my use case and was pretty effective considering i have a large data set to work with.