Hello Alteryx community,
I hope everyone is doing as well as you can be during this pandemic. I have what I hope to be a relatively simple question. Here's what I'm trying to do:
I have a file with three columns: the first contains a unique ID, the second contains a string that categorizes the row, and the third contains a string relating to an open-text question response. For example:
Response ID | Category | Text |
A1B1 | Small | I loved it! |
A1B2 | Small | It was OK! |
A1B3 | Giant | It was the best! |
New rows are downloaded every morning, and there will eventually be hundreds of thousands of them, so manual coding doesn't work here.
What I need to do is filter the total dataset so that, at the end of the workflow, I have a file that contains only the rows that have been added since the last time I ran the workflow.
For example, let's say I ran the workflow on July 1 at 12:00 PM and collected the three rows above. Then, I run the workflow on July 2 at 12:00 PM and I see the following:
Response ID | Category | Text |
A1B1 | Small | I loved it! |
A1B2 | Small | It was OK! |
A1B3 | Giant | It was the best! |
A1B4 | Moderate | The Alteryx community is the best! |
A1B5 | Giant | No really, they're the best! |
I would want to filter those two "new" rows into a separate table, so that they can be outputted to a file.
Alternatively, I'd like to create a column that assigns a number to the row based when it was added to the master dataset. For example, using the second table in this thread as a starting point, let's say:
Response ID | Category | Text | Run |
A1B1 | Small | I loved it! | 1 |
A1B2 | Small | It was OK! | 1 |
A1B3 | Giant | It was the best! | 1 |
A1B4 | Moderate | The Alteryx community is the best! | 2 |
A1B5 | Giant | No really, they're the best! | 3 |
In this case, I'd know the last "Run" that had been filtered out and I would be able to say: "OK, filter all rows with 2 or 3 in the Run column and output them to a separate file."
I'm open to other approaches as well, but the final goal is to have a file that contains ONLY newly added rows.
Solved! Go to Solution.
Hi @essemMLB,
your text examples are lovely! Here is how I'd attempt to solve this problem:
I'd split it up in three files:
- one file provides all the input ('New File')
- one saves all data ('Stored files')
- one saves news lines ('New Lines output')
The workflows joins the data together in two ways:
- combined stored + new rows to save the for the next run (including run ID)
- only new lines
You are probably able to use the multi-row formula for your run ID.
I'll attach the rough sample workflow. Let me know if it helps.
Best
Alex
@essemMLB - how about including a new timestamp column to your existing table? that way you know when the data was loaded, and can filter out records based on max dates?
This worked like a charm! Thank you!
(I hope you are safe and healthy!)