Hi!
I am looking to build a workflow with:
1. If the 'Speed' column has 4 consecutive 0's and the difference between the first 0 and last 0 is at least 3 minutes
a. then create a group containing records from the start index to the end index (before the first 0).
b. delete those rows that have speed = 0 above, and point the new start index to the next row after this deletion
I have the following python code, but I am having trouble vectorizing it -- since the data is 1Million+ rows, it is taking too long to run. I am wondering if there is a better way to do this; I tried this using the multi-row formula in alteryx but was having issues with that. I appreciate your help.
import tqdm
start_idx = 0
shift = 0
df['Group'] = np.nan
for idx in tqdm.tqdm(df.index):
if df['Group'][idx] == 0 and df['Group'][idx+1] == 0 and df['Group'][idx + 2] == 0:
if df['timediff'][idx] + df['timediff'][idx+1] + df['timediff'][idx+2] >= 3:
df['Group'][start_idx:idx-1] = shift
shift+=1
start_idx = idx + 3
df.drop([idx, idx +1, idx+2])
Sample dataset attached
Solved! Go to Solution.
I am not sure I completely understand A and B, but I think the logic in the attached workflow should handle the first set of qualifications. What should the output look like after A and B are complete?
Nice!
Sorry I was not clear enough -- I'd like the original rows to remain intact. I think you removed the other rows where speed != 0. Each of the current Groups contains 4 records, but this is not necessarily the case.
There should be a new group created once there are 4 consecutive 0's in the speed column. ( we can assign these rows containing the 4 consecutive 0's the previous group's value, and forget about what I said in B.)
Please let me know if this makes sense.
Thanks again.
Hello @hamzam ,
based on your description and the code
3 consequent records,
if they have a speed of 0 then
check timediff
if >= 3 , then
this is a match , mark and execlude from the next check
if the record is not a match leave as is.
have a look at the workflow, it will give you an idea. there is no deletion but exclusion. At the end, you will have a filter with the matched records and unmatched. you can then do what you desire with them.
I hope you find it helpful or at least it will give you an idea.
Hello @hamzam ,
Apologies, I read your reply after I posted. However, we are still aligned.
Is it possible to provide some pseudo code of the exact logic. like the one in the first reply may be.
it is a nice and cool challenge. 🙂
Hi @ImadZidan ,
I have this updated code snippet:
import tqdm #just to see the progress of the for loop
start_idx = 0
group = 0 #groups start from 0
df['Group'] = np.nan
for idx in tqdm.tqdm(df.index):
if df['Group'][idx] == 0 and df['Group'][idx+1] == 0 and df['Group'][idx + 2] == 0 and df['Group'][idx+3] == 0:
if df['timediff'][idx] + df['timediff'][idx+1] + df['timediff'][idx+2] + df['Group'][idx+3] >= 4:
df['Group'][start_idx:idx+3] = shift
group+=1
start_idx = idx + 4 # start index is the one after the 4th zero value
#df.drop([idx, idx +1, idx+2])# no longer needed
Hi @ImadZidan
Thanks for your help. But, as I had mentioned to @Greg_Murray , I'd like the original rows to remain intact. I think you removed the other rows, the ones that are not "hit".
Each row in the original dataset should be assigned a group -- the group number can start from 0; it will increase to 1, etc. after there are 4 consecutive 0's in speed (and the timediff condition is met)
I think your first workflow was more aligned with the goal.
It seems like we need to figure a way out to assign like this:
Null ---- group 1
Null ---- group 1
Hit ---- group 1
Null -- group 2
...
Null -- group 2
Hit -- group 2
@hamzam ,
Got you. We will do it.
It helps that you fully understand your data which is perfect. makes things very clear