Hi
I have two datasets:
RecordID | Date |
1 | 01/08/2019 |
2 | 02/08/2019 |
3 | 03/08/2019 |
4 | 04/08/2019 |
5 | 05/08/2019 |
and
Date | Column1 | Column2 | Column3 |
01/08/2019 | A | E | 123 |
01/08/2019 | B | F | 456 |
01/08/2019 | C | G | 785 |
01/08/2019 | D | H | 7546 |
02/08/2019 | A | X | 1321 |
02/08/2019 | B | Y | 12302 |
02/08/2019 | C | Z | 1322 |
02/08/2019 | D | AA | 15152 |
05/08/2019 | A | FF | 1213 |
05/08/2019 | B | SS | 154 |
05/08/2019 | C | GG | 2189 |
05/08/2019 | D | RW | 5968 |
When I try to join them I get the following:
RecordID | Date | Column1 | Column2 | Column3 |
1 | 01/08/2019 | A | E | 123 |
1 | 01/08/2019 | B | F | 456 |
1 | 01/08/2019 | C | G | 785 |
1 | 01/08/2019 | D | H | 7546 |
2 | 02/08/2019 | A | X | 1321 |
2 | 02/08/2019 | B | Y | 12302 |
2 | 02/08/2019 | C | Z | 1322 |
2 | 02/08/2019 | D | AA | 15152 |
3 | 03/08/2019 | [Null] | [Null] | [Null] |
4 | 04/08/2019 | [Null] | [Null] | [Null] |
5 | 05/08/2019 | A | FF | 1213 |
5 | 05/08/2019 | B | SS | 154 |
5 | 05/08/2019 | C | GG | 2189 |
5 | 05/08/2019 | D | RW | 5968 |
What I'm trying to achieve is to carry the values from the last date on which data exist (02/08/2019) forward into the dates on which data do not exist (03/08/2019 and 04/08/2019).
My output should look like:
RecordID | Date | Column1 | Column2 | Column3 |
1 | 01/08/2019 | A | E | 123 |
1 | 01/08/2019 | B | F | 456 |
1 | 01/08/2019 | C | G | 785 |
1 | 01/08/2019 | D | H | 7546 |
2 | 02/08/2019 | A | X | 1321 |
2 | 02/08/2019 | B | Y | 12302 |
2 | 02/08/2019 | C | Z | 1322 |
2 | 02/08/2019 | D | AA | 15152 |
3 | 03/08/2019 | A | X | 1321 |
3 | 03/08/2019 | B | Y | 12302 |
3 | 03/08/2019 | C | Z | 1322 |
3 | 03/08/2019 | D | AA | 15152 |
4 | 04/08/2019 | A | X | 1321 |
4 | 04/08/2019 | B | Y | 12302 |
4 | 04/08/2019 | C | Z | 1322 |
4 | 04/08/2019 | D | AA | 15152 |
5 | 05/08/2019 | A | FF | 1213 |
5 | 05/08/2019 | B | SS | 154 |
5 | 05/08/2019 | C | GG | 2189 |
5 | 05/08/2019 | D | RW | 5968 |
Please note that it is not just a single row that I want to carry forward, but a whole set of rows. Also, even if I have 4 consecutive days without data, the workflow should carry the data of the last available date (5 days earlier) into all 4 of those days.
How could someone do that? It looks like some kind of dynamic join is needed, based on a condition like Column1 = [Null].
Many thanks in advance
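For readers tackling the same problem in code rather than Alteryx, the whole-block carry-forward can be sketched in pandas. This is a sketch only; all variable names are illustrative, and `merge_asof` stands in for the "last date with data" lookup:

```python
import pandas as pd

# The two datasets from the question (dd/mm/yyyy dates written as ISO here).
df_records = pd.DataFrame({
    "RecordID": [1, 2, 3, 4, 5],
    "Date": pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-03",
                            "2019-08-04", "2019-08-05"]),
})
df_data = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-01"] * 4 + ["2019-08-02"] * 4
                           + ["2019-08-05"] * 4),
    "Column1": ["A", "B", "C", "D"] * 3,
    "Column2": ["E", "F", "G", "H", "X", "Y", "Z", "AA",
                "FF", "SS", "GG", "RW"],
    "Column3": [123, 456, 785, 7546, 1321, 12302, 1322, 15152,
                1213, 154, 2189, 5968],
})

# For every calendar date, find the latest date <= it that actually has data.
avail = (df_data[["Date"]].drop_duplicates()
         .rename(columns={"Date": "SourceDate"}).sort_values("SourceDate"))
lookup = pd.merge_asof(df_records.sort_values("Date"), avail,
                       left_on="Date", right_on="SourceDate",
                       direction="backward")

# Pull the whole block of rows for that source date, keeping the target date.
src_rows = df_data.rename(columns={"Date": "SourceDate"})
result = (lookup.merge(src_rows, on="SourceDate")
                .drop(columns="SourceDate")
                .sort_values(["RecordID", "Column1"])
                .reset_index(drop=True))
```

Because `merge_asof` with `direction="backward"` maps each date to the most recent date that has rows, the fill works no matter how many consecutive dates are missing.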
Hi @alexankyr
How many missing dates do you typically have?
In the attached workflow I have used a Formula tool to create a new date field from the Left output anchor of the Join (the dates which don't have corresponding data).
This new field holds the previous day's date, which you can then use to join back to the original dataset.
Once this process has been completed, you can union the two streams back together.
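Under the stated assumptions, the same previous-day logic could be sketched in pandas (names are illustrative; note this only covers a one-day gap):

```python
import pandas as pd

# Toy data with a single-day gap (2019-08-03 has no rows).
data = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-01", "2019-08-01",
                            "2019-08-02", "2019-08-02"]),
    "Column1": ["A", "B", "A", "B"],
})
all_dates = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-03"]),
})

# The dates with no matching rows (Alteryx: Left output anchor of the Join).
missing = all_dates[~all_dates["Date"].isin(data["Date"])].copy()

# Formula-tool step: shift each unmatched date back one day...
missing["PrevDate"] = missing["Date"] - pd.Timedelta(days=1)

# ...and join that previous day back to the original data.
prev = data.rename(columns={"Date": "PrevDate"})
filled = missing.merge(prev, on="PrevDate").drop(columns="PrevDate")

# Union the filled rows with the original data.
result = (pd.concat([data, filled], ignore_index=True)
            .sort_values(["Date", "Column1"]).reset_index(drop=True))
```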
Would this work for you?
Regards
Will
Hi
Thanks for the tip. I was thinking about a structure like this, but I quickly abandoned it because it is not dynamic: it only works for a fixed number of missing dates (2 in your example).
For example, in August 2019 I had 10 missing dates, so if I employed this structure I would need 10 sequential joins.
In theory I can have even more than 10 missing dates if we fall into a bank holiday period (Easter or Christmas) or if some external statements (datasets) are not available.
Ideally, the number of sequential joins in the workflow should adapt to the number of missing dates, repeating until the dataset is complete. That's why I referred to a dynamic join earlier.
thanks
Hi @alexankyr
For a dynamic solution that will handle any number of missing dates, try this:
1. Start with a Unique tool to pull out the list of possible Column1 values.
2. Append these to the date table to get one row for each date/Column1 combination.
3. Join this to the original transactions and add a Union tool to include the non-matched rows.
4. Sort this list by Column1 and then Date.
5. Use a couple of Multi-Row Formula tools to fill in the missing values for Column2 and Column3.
6. A final sort brings back the original transaction order, resulting in the complete dataset.
This solution uses 2 Multi-Row Formula tools since you only have 2 columns to fill. If your actual dataset has more columns, or a varying number of columns, transpose them so that you only need one Multi-Row tool, then Cross Tab to restore the original column structure.
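The same pipeline can be sketched in pandas (illustrative names throughout; `groupby(...).ffill()` plays the role of the Multi-Row Formula tools):

```python
import pandas as pd

# Toy data with a gap on 2019-08-02.
data = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-01", "2019-08-01",
                            "2019-08-03", "2019-08-03"]),
    "Column1": ["A", "B", "A", "B"],
    "Column2": ["E", "F", "X", "Y"],
    "Column3": [123, 456, 1321, 12302],
})
dates = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-03"]),
})

# Unique tool: the list of possible Column1 values.
keys = data[["Column1"]].drop_duplicates()

# Append Fields: one row per date/Column1 combination.
scaffold = dates.merge(keys, how="cross")

# Join + Union: attach the real rows, leaving nulls where a date has no data.
joined = scaffold.merge(data, on=["Date", "Column1"], how="left")

# Sort by Column1 then Date, and fill gaps downward per Column1
# (the job of the two Multi-Row Formula tools).
joined = joined.sort_values(["Column1", "Date"])
joined[["Column2", "Column3"]] = (
    joined.groupby("Column1")[["Column2", "Column3"]].ffill())

# Final sort restores the original order.
result = joined.sort_values(["Date", "Column1"]).reset_index(drop=True)
```

Filling per Column1 group rather than globally is what keeps a whole set of rows intact for each missing date.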
Dan
Hi @danilang
Your workflow worked like a charm on my dataset.
Luckily, I only have 2 columns to fill; my dataset contains more columns, but I don't need them, so I dropped them.
It would be great to see how the workflow would be if I had a varying number of columns, just for future reference.
Many thanks!
Hi @alexankyr
Here's another approach that's very dynamic. It relies on the number of unique rows per Date (in your case, 4) and uses Generate Rows a couple of times: once to generate the right number of rows for each unmatched date, and again to generate the dates that are missing from your data.
When a date is imputed into the dataset, it points back at the rows from which to pull its data (Record ID and Match ID in my workflow), so you get a row-to-row match. This replaces the several Multi-Row tools, or the transpose/Multi-Row combination, proposed in @danilang's excellent solution.
The filled in data is unioned with the original data, and all is sorted.
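As a sketch of the same idea outside Alteryx (illustrative names; assuming each date has the same number of rows), the Match ID pairing could look like:

```python
import pandas as pd

# Toy data: 2019-08-03 and 2019-08-04 are missing, 2 rows per present date.
data = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-02", "2019-08-02",
                            "2019-08-05", "2019-08-05"]),
    "Column1": ["A", "B", "A", "B"],
})
missing = pd.DataFrame({
    "Date": pd.to_datetime(["2019-08-03", "2019-08-04"]),
})

# Number each row within its date: the role of the Match ID.
data["MatchID"] = data.groupby("Date").cumcount()
n_per_date = int(data.groupby("Date").size().max())  # unique rows per date

# Generate Rows, step 1: one row per (missing date, Match ID).
gen = missing.merge(pd.DataFrame({"MatchID": range(n_per_date)}), how="cross")

# Generate Rows, step 2: point each missing date at the last date with data.
src = (data[["Date"]].drop_duplicates()
       .rename(columns={"Date": "SourceDate"}).sort_values("SourceDate"))
gen = pd.merge_asof(gen.sort_values("Date"), src,
                    left_on="Date", right_on="SourceDate")

# Row-to-row match on (source date, Match ID), then union with the original.
src_rows = data.rename(columns={"Date": "SourceDate"})
filled = (gen.merge(src_rows, on=["SourceDate", "MatchID"])
             .drop(columns="SourceDate"))
result = (pd.concat([data, filled], ignore_index=True)
            .drop(columns="MatchID")
            .sort_values(["Date", "Column1"]).reset_index(drop=True))
```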
A bit messy because of crossing connections, but gets the job done.
Let me know if this works.
Cheers!
Esther
@alexankyr
Try out my solution with a different number of columns and see if it works for you. It should, at least in the way I've tested it (though you'll have to clean up the last Join tool to get rid of the extra columns on the Right side).
Cheers
Esther