Dear Community,
I would like to introduce some random values into a dataset for research purposes (i.e. to determine the impact of abnormal values on different algorithms robustness).
The idea is that I have a raw dataset with a few columns (Xs) and the predicted column (Y).
I would like to introduce abnormal values within the range of each column, like this:
Original dataset:
X1 | X2 | X3 | X4 | X5 | Y |
1 | 11 | 403 | 55 | 19 | 641 |
2 | 42 | 361 | 65 | 17 | 1110 |
6 | 6 | 206 | 71 | 17 | 350 |
2 | 46 | 267 | 20 | 17 | 1061 |
3 | 40 | 261 | 70 | 17 | 979 |
4 | 40 | 148 | 55 | 17 | 860 |
8 | 33 | 350 | 74 | 20 | 1055 |
4 | 29 | 484 | 59 | 17 | 1011 |
9 | 38 | 308 | 22 | 16 | 936 |
Modified dataset:
X1 | X2 | X3 | X4 | X5 | Y |
1 | 11 | 403 | 55 | 19 | 641 |
2 | 42 | 122 | 65 | 17 | 1110 |
6 | 6 | 206 | 71 | 20 | 350 |
10 | 46 | 267 | 64 | 17 | 1061 |
3 | 40 | 261 | 70 | 17 | 979 |
4 | 40 | 148 | 55 | 17 | 860 |
8 | 5 | 350 | 74 | 20 | 1055 |
4 | 29 | 298 | 59 | 17 | 1011 |
9 | 38 | 308 | 22 | 16 | 936 |
So far, I am calculating the min and max of each column, add them to the data set and create the updated “X” with a formula:
X1 | X1_Min | X1_Max | X2 | X2_Min | X2_Max | X3 | … | Y |
1 | 1 | 9 | 11 | 6 | 46 | 403 | … | 641 |
2 | 1 | 9 | 42 | 6 | 46 | 361 | … | 1110 |
6 | 1 | 9 | 6 | 6 | 46 | 206 | … | 350 |
2 | 1 | 9 | 46 | 6 | 46 | 267 | … | 1061 |
3 | 1 | 9 | 40 | 6 | 46 | 261 | … | 979 |
4 | 1 | 9 | 40 | 6 | 46 | 148 | … | 860 |
8 | 1 | 9 | 33 | 6 | 46 | 350 | … | 1055 |
4 | 1 | 9 | 29 | 6 | 46 | 484 | … | 1011 |
9 | 1 | 9 | 38 | 6 | 46 | 308 | … | 936 |
For a 1% introduction of abnormal values in X1, I proceed like this with a formula:
IF RandInt(100) = 100 THEN
[Min_X1]+[Max_X1]*Rand()
ELSE
[X1]
ENDIF
So it means that I have to do it for each column... (I have some datasets with more than 50 "X" columns...)
Is there a way to use the multi-field formula tool like this:
IF RandInt(100) = 100 THEN
[Next Column]+[Next+1 Column]*Rand()
ELSE
[_CurrentField_]
ENDIF
I have attached the dummy dataset and workflow it this can help J
Thanks,
Pierre-Louis
Solved! Go to Solution.
There's loads of ways to do this. My first instinct was to take your current logic and wrap it up into a macro that runs through every column. To do this, you just need to transpose the data first, then shift it back at the end in a cross tab.
Transposing means we can refer to [Name] and [Value] as opposed to specific field names, making it dynamic. Probably could be done through a multi field too though - like i say, this would just be my default method.
(note that i've switched to "IF RandInt(10) = 10" just for testing, you can switch that back...
Example attached.
Hope this helps
Andy
Thanks a lot @andyuttley !
I am happy to see that there was no "obvious" answer which I have missed... and understand that the "transpose" option is regularly an answer to the questions I am raising. I need to deep dive on these possible applications!
I am going to "mimic" your solution on real and large datasets 🙂
Hi @JoeS !
Wow, I had never touched the dynamic replace tool and this is a very powerful one!
I will do my best to transpose it to my real datasets... the hardest challenge will be to keep it in mind for future problems 🙂
Indeed, it's definitely an under utilised tool!
@JoeS wrote:Indeed, it's definitely an under utilised tool!
Yes, was great to see it in a relevant use case here; definitely one I want to get using
@andyuttley wrote:Yes, was great to see it in a relevant use case here; definitely one I want to get using
Yeah, a clue to remind me to consider it is if the data itself is needed to write the formula, then it may be the right way to go.
Like in this instance you needed the min and max values to write the formula then it works.
I was about to write other reasons, then thought "I should check to see if there is a mastery of it" and lo and behold, its here