
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Filter Rows If ...

df
8 - Asteroid

Hey,

I want to look at all rows within a 10-minute window and filter out all rows that have the same values in "nr1" and "nr2".

 

For example, for the first row: look at all rows with a timestamp between "12.01.2018 08:00:20" and "12.01.2018 08:10:20", and if they have the same content in "nr1" and "nr2", filter them out.

 

Row 2 should be filtered (same content in "nr1" and "nr2", and within the timespan).

Row 3 should not be filtered (not the same content in "nr1" and "nr2").

Row 4 should not be filtered (not within the timespan).

 

| Date | Date+10min | Nr1 | Nr2 |
| --- | --- | --- | --- |
| 12.01.2018 08:00:20 | 12.01.2018 08:10:20 | X | X |
| 12.01.2018 08:05:20 | 12.01.2018 08:15:20 | X | X |
| 12.01.2018 08:07:30 | 12.01.2018 08:17:30 | A | A |
| 12.01.2018 08:15:00 | 12.01.2018 08:25:00 | X | X |

 

I think I need something like the Multi-Row Formula tool, but I don't need a specific number of rows.

7 REPLIES
mceleavey
17 - Castor

Hi @df,

 

I've attached the solution to this.

Yes, you're right, the multi-row formula tool would be the correct approach.

Basically, you're building a nested if-then-else formula that references the row above.

What I've done is convert the datetime fields to a format Alteryx recognises, then used those fields in the Multi-Row Formula tool to determine the time difference from the row above. Coupled with this is a comparison of the other two fields against the row above, producing a Y/N flag field that says whether each row should be filtered. I then fed this into a Filter tool to remove the appropriate records.
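The same row-above comparison can be sketched outside Alteryx. A minimal Python sketch, assuming the data is sorted by timestamp and using the field names from the table in the question (the dict layout and function name are made up for illustration):

```python
from datetime import datetime, timedelta

# Sample rows mirroring the table in the question (field names assumed).
rows = [
    {"date": "12.01.2018 08:00:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:05:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:07:30", "nr1": "A", "nr2": "A"},
    {"date": "12.01.2018 08:15:00", "nr1": "X", "nr2": "X"},
]

def flag_rows(rows, window=timedelta(minutes=10)):
    """Mimic the Multi-Row Formula: after sorting by timestamp, flag a row
    'Y' (filter it out) when nr1/nr2 match the row above and the two
    timestamps are within the window."""
    parsed = sorted(
        ({**r, "ts": datetime.strptime(r["date"], "%d.%m.%Y %H:%M:%S")} for r in rows),
        key=lambda r: r["ts"],
    )
    prev = None
    for r in parsed:
        same_keys = prev is not None and r["nr1"] == prev["nr1"] and r["nr2"] == prev["nr2"]
        in_window = prev is not None and (r["ts"] - prev["ts"]) <= window
        r["filter"] = "Y" if (same_keys and in_window) else "N"
        prev = r
    return parsed

for r in flag_rows(rows):
    print(r["date"], r["nr1"], r["nr2"], r["filter"])
```

On the sample data this flags only the second row, matching the expected result. Like the Multi-Row Formula, it only compares each row to the one directly above it.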

 

You can tweak the logic in the multi-row if you need to.

 

Hope this helps.



Bulien

danrh
13 - Pulsar

Here's an attempt at a scalable version:

image.png

 

I'm converting to a datetime and generating every possible time between then and 10 minutes out.  At that point I join on all three fields (the new TimeDate field as well as nr1 and nr2), and I add a Summarize at the end to stick it all back on one row.  If you're interested just in a particular record, you can toss a filter at the end with a Contains() function (I've included it in the attached).  Note that if you've already got some sort of unique identifier, you can sub that in for wherever I'm using RecordID.

 

The hiccup here is really going to be the number of records you're dealing with: this adds 600 records per record in the dataset. I've got an idea or two to get around this, but this is the more straightforward solution if that's not going to be an issue.
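For illustration, the generate-and-join idea can be sketched in Python (a sketch of the approach under assumed field names, not the attached workflow): generate one key per second of each row's window, then look up each row's exact timestamp against those keys.

```python
from datetime import datetime, timedelta

rows = [
    {"date": "12.01.2018 08:00:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:05:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:07:30", "nr1": "A", "nr2": "A"},
    {"date": "12.01.2018 08:15:00", "nr1": "X", "nr2": "X"},
]

def parse(s):
    return datetime.strptime(s, "%d.%m.%Y %H:%M:%S")

# Generate one key per second of each row's 10-minute window (600 keys per
# row), tagged with the generating record's ID -- the expensive step.
window_keys = {}
for rid, r in enumerate(rows):
    ts = parse(r["date"])
    for s in range(1, 601):  # the 600 seconds after the row's own timestamp
        key = (ts + timedelta(seconds=s), r["nr1"], r["nr2"])
        window_keys.setdefault(key, set()).add(rid)

# "Join": a row is flagged when its exact timestamp plus nr1/nr2 lands in
# the window of an earlier record.
flags = []
for rid, r in enumerate(rows):
    hits = window_keys.get((parse(r["date"]), r["nr1"], r["nr2"]), set())
    flags.append("Y" if any(other < rid for other in hits) else "N")
print(flags)
```

Note that this also flags the fourth sample row, because it falls inside the second row's window. That chained case is exactly what a fixed Multi-Row lookback struggles with, as discussed later in the thread.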

df
8 - Asteroid

Thank you for your solutions.
Both of them work.

 

I think the second solution is a bit impractical, since it adds 600 rows per record and I already have over 1 million rows per input file.

 

For the first solution, I guess I need to sort the data by "nr1" and "nr2" for it to work correctly.

 

mceleavey
17 - Castor

You would need to sort the data sequentially, whether that be by datetime or by those columns.

Sorry, I don't know your data so that's over to you, but just be aware that the logic compares a row to the row above. 



Bulien

df
8 - Asteroid

Your solution takes really long: after 10 minutes it is at 23% (344 MB). I guess that is because of the 600 extra rows for each entry? The workflow has already produced 209 GB from only 315 MB of input.

 

grafik.png

mceleavey
17 - Castor

This is because that method uses the Generate Rows tool, which works well for smaller datasets. My solution uses a Multi-Row Formula, which does not generate additional data, so it should run more quickly.



Bulien

danrh
13 - Pulsar

Yeah, it's taking a while because you have a lot of data and it multiplies it out by 600 records.  The reason I'd avoid using the Multi-Row Formula is that you have to define how many rows up you are checking, and it sounds like from your prompt that it could be variable.  Additionally, it's difficult to define that a given record doesn't match the next row up, but does match two rows higher.

 

Another attempt is attached. In this one I'm rounding to minutes, which cuts the multiplication down from 600 to 11. This means you need a couple of extra tools to check the "bookend" generated records and make sure the rounding isn't messing things up. Give it a shot and see if it runs in a manageable amount of time.
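The minute-rounding trick can be sketched the same way (again an illustrative Python sketch under assumed field names, not the attached workflow): round each timestamp down to the minute so a row generates only 11 minute-level keys, then re-check the exact seconds to handle the rounding "bookends".

```python
from datetime import datetime, timedelta

rows = [
    {"date": "12.01.2018 08:00:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:05:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:07:30", "nr1": "A", "nr2": "A"},
    {"date": "12.01.2018 08:15:00", "nr1": "X", "nr2": "X"},
]

def parse(s):
    return datetime.strptime(s, "%d.%m.%Y %H:%M:%S")

# 11 minute-level keys per row instead of 600 second-level ones.
minute_keys = {}
for rid, r in enumerate(rows):
    base = parse(r["date"]).replace(second=0)
    for m in range(11):
        key = (base + timedelta(minutes=m), r["nr1"], r["nr2"])
        minute_keys.setdefault(key, set()).add(rid)

flags = []
for rid, r in enumerate(rows):
    ts = parse(r["date"])
    # Candidates at minute granularity, then the "bookend" check: verify the
    # exact second-level difference, since rounding can admit near-misses.
    hits = minute_keys.get((ts.replace(second=0), r["nr1"], r["nr2"]), set())
    real = any(
        other < rid
        and timedelta(0) < ts - parse(rows[other]["date"]) <= timedelta(minutes=10)
        for other in hits
    )
    flags.append("Y" if real else "N")
print(flags)
```

Same result as the second-level version, with roughly a 55x smaller intermediate dataset.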

 

If it's still too slow, you could look at doing a join just on Nr1 and Nr2 and then a filter to make sure the date is within the timeframe.  Once again you're creating extra records (this is more or less an open join), so this would depend on how many different combinations of Nr1 and Nr2 you have.  If you have a wide variety, it might work.  If you only have a handful, this will create a LOT of records.
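That last idea, joining on Nr1/Nr2 only and then filtering on the time difference, is easy to sketch as well (hedged Python illustration with assumed field names); how badly it multiplies records depends entirely on how many rows share each Nr1/Nr2 combination:

```python
from collections import defaultdict
from datetime import datetime, timedelta

rows = [
    {"date": "12.01.2018 08:00:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:05:20", "nr1": "X", "nr2": "X"},
    {"date": "12.01.2018 08:07:30", "nr1": "A", "nr2": "A"},
    {"date": "12.01.2018 08:15:00", "nr1": "X", "nr2": "X"},
]

def parse(s):
    return datetime.strptime(s, "%d.%m.%Y %H:%M:%S")

# "Join" on (nr1, nr2) only: bucket rows by key...
by_key = defaultdict(list)
for rid, r in enumerate(rows):
    by_key[(r["nr1"], r["nr2"])].append((rid, parse(r["date"])))

# ...then "filter" each candidate pair on the actual time difference.
flags = []
for rid, r in enumerate(rows):
    ts = parse(r["date"])
    dup = any(
        other_rid < rid and timedelta(0) < ts - other_ts <= timedelta(minutes=10)
        for other_rid, other_ts in by_key[(r["nr1"], r["nr2"])]
    )
    flags.append("Y" if dup else "N")
print(flags)
```

Here each row is only compared against rows sharing its exact key, so the cost is driven by the size of the largest key bucket rather than a fixed 600x blow-up.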
