Hi Everyone, I have a variable called SUM which has around 16 million entries I need to assign them to buckets for e.g.
SUM | Bucket |
90000 | 50,001 - 100,000 |
35000 | 0 - 50,000 |
I know that something like this can be quickly done with an ifelse statement however I have a large number of values and they need to go into several buckets. The bucket breakdown is as follows:
0 - 50000 |
50001 - 100000 |
100001 - 250000 |
250001 - 500000 |
500001 - 1000000 |
1000001 - 1500000 |
1500001 - 2000000 |
2000001 - 2500000 |
2500001 - 3000000 |
3000001 - 3500000 |
3500001 - 4000000 |
4000001 - 4500000 |
4500001 - 5000000 |
5000001 - 6000000 |
6000001 - 7000000 |
7000001 - 8000000 |
8000001 - 9000000 |
9000001 - 10000000 |
10000001 - 11000000 |
11000001 - 12000000 |
12000001 - 13000000 |
13000001 - 9999999999 |
is there a more efficient way then using ifelse statements? I was thinking of using Tiles function but wasn't sure if that would work here.
Thanks in advance!
Solved! Go to Solution.
@HadiElmo you could use an Input Text tool, with the Group Start and Group End dates specified for all groups you need, then use a filter tool to bring back the row where the value falls in between the group start and end values. A formula tool could then be used to put the group start and group end columns into one Group column. See the attached example.
If you are doing this for all 16 million rows, however, the Append Fields option would create 16 million rows x the number of groups, which would make the workflow take forever to run.
the problem is - your buckets are arbitrary. Sometimes the range is 50000, sometimes its 100,000 sometimes it's 1,000,000. This makes sense to you - but putting it into logic is complicated - it is a pain. Your solution is exactly what you think it is - a massive if statement or a batch macro which matches a value. if you create a 50k threshold you can take the ceil of the number and divide by 50k. You can then adjust your values as needed and further parse it out based upon that value (if it it's value 1, take a ceil by 10k or whatever) and if it's value 13-19 you can turn it into value 12, and 20+ can become X - whatever your logic is.
you can use the tile tool in manual mode - but that will still require some work. you can do something nifty like bring in your range as a datasource, use summarize tool to turn it into xml - edit the xml of the manual part of the tile tool to automatically create your thresholds. It's not intuitive but it's possible.
and here's an alternative append fields way... we make some modifications to the ranges to have them split by | - we then crosstab. we append a single field. we then use multi-field formula with regex to bin. We overwrite those values into true/false range grid - which would actually work if you were doing one hot encoding.... If that's the goal - hey! You/re done. But if that's not the goal - then I transpose and add a filter.
Please have a look at another solution with Dynamic Replace Tool, which looks simple and "dynamic".😀
Workflow
Expression in Formula Tool
[New Field] = "SUM"
[Expression] =
"[_CurrentField_] >= " + GetWord([Bucket], 0) +
" AND [_CurrentField_] <= " + GetWord([Bucket], 2)
Dynamic Replace configuration
Field Name Field : New Field
Boolean Expression Field : Expression
Output Value Field : Bucket
Output
SUM | Bucket |
0 | 0 - 50000 |
1 | 0 - 50000 |
50000 | 0 - 50000 |
50001 | 50001 - 100000 |
50002 | 50001 - 100000 |
100000 | 50001 - 100000 |
100001 | 100001 - 250000 |