I need to regularly process a large flat file (csv) containing a few thousand rows. The data is messy, but generally organized by [unit], [group], and [activity] where each file has up to 10 units, each unit has upwards of 300 groups and each group has 8 or so activities. The number of records (rows) for each activity is highly variable and not known in advance. I am interested in all the rows of any particular [unit] AND [group] AND [activity] WHEN [activity]=”act_nn_end” AND [result1]=”val1” AND [result2]=”val2” (for example the act_2_* rows in the table below)
This could probably done in a database by a SELECT for the condition and retaining the [unit], [group], and [activity] (wildcard the act_nn*) in a result table that I could then JOIN with the input to get the rows I need. But I am wondering if this could also be done in Alteryx. However I cannot see how to do this and thought I would ask the community for some advice or suggestions here.
I have an example of the file below. Since I will get 4-5 flat files a week it would certainly be helpful to automate selecting the portions of interest out of each file without dropping them into a db every time. And there will be multiple combinations of [result1] and [result2] that will be interesting – certain combinations will indicate a particular action to take for example.
Thanks for looking at this. Any hints or suggestions would be appreciated,
Tim
timestamp |
unit |
group_id |
Activity |
result1 |
result2 |
other fields …. (60) |
hh:mm:00 |
u1 |
gid1 |
act_1_start |
[Null] |
[Null] |
[Null] |
hh:mm:10 |
u1 |
gid1 |
act_1 |
[Null] |
[Null] |
[Null] or data |
hh:mm:20 |
u1 |
gid1 |
act_1 |
[Null] |
[Null] |
[Null] or data |
hh:mm:30 |
u1 |
gid1 |
act_1 |
[Null] |
[Null] |
[Null] or data |
hh:mm:40 |
u1 |
gid1 |
act_1 |
[Null] |
[Null] |
[Null] or data |
hh:mm:50 |
u1 |
gid1 |
act_1 |
[Null] |
[Null] |
[Null] or data |
hh:mm:52 |
u1 |
gid1 |
act_1_end |
val1 |
val2 |
[Null] or data |
hh:mm:60 |
u1 |
gid1 |
[Null] |
[Null] |
[Null] |
… |
hh:mm:ss |
u1 |
gid1 |
.. |
… |
… |
… |
hh:m2:00 |
u1 |
gid1 |
act_2_start |
[Null] |
[Null] |
[Null] |
hh:m2:10 |
u1 |
gid1 |
act_2 |
[Null] |
[Null] |
[Null] or data |
hh:m2:20 |
u1 |
gid1 |
act_2 |
[Null] |
[Null] |
[Null] or data |
hh:m2:38 |
u1 |
gid1 |
act_2_end |
val1 |
val4 |
[Null] or data |
hh:m2:40 |
u1 |
gid1 |
… |
[Null] |
[Null] |
… |
hh:mm:ss |
u1 |
… |
… |
… |
… |
… |
hh:m3:00 |
u1 |
gid2 |
act_1_start |
[Null] |
[Null] |
[Null] |
Solved! Go to Solution.
Unless I misunderstand your issue, it sunds like all you need is a Filter tool to view the rows of interest
[activity] == 'act_nn_end'
AND
[result1] == 'val1'
AND
[result2] == 'val2'
In this example, the rows of interest would come from the T output of the Filter tool.
You can connect any file to this tool using the Input tool, then filter to see the relevant records.
you'll need to use a REGEX_Match for act_nn_end
Something like:
REGEX_Match([activity], 'act_\d+_end')
Otherwise exactly what I was about to post
Absolutely, I wasn't looking carefully at the data format.
Here is a combined version of my post and @jdunkerley79
Oops. Jumped too soon. This only selects the last record of the group. What I want to do is get the other records for that group - including the _start and _end and the variable number of records between them. Thanks.
This is very close. I need to scope the JOIN to only the current [group_id] and then I think it will work. I will try prepending the [group_id] in the parse tool. thanks!