a large dataset (200m+ records) is reduced to the following 2 fields (read from a table, tiled and group by/count), file attached
the file needs to be split sequentially into X pieces with the number of records written to each file at or below a specified threshold.
the threshold is determined based on the total record count and the number of records allowed in each file
the split needs to be determined by a range of codes so the ranges can be used in other processing. codes cannot be split across ranges
for example using the sample
file looks like this
code count
1001 308
1003 2424
1005 1849
1007 777
1009 869
1013 340
1015 279
1017 141
1019 142
3 files should be created if the threshold is set to 3000
the resulting files would be split as follows
code count
1001 308
1003 2424
total count 2732 -- file1
1005 1849
1007 777
total count 2626 --file2
1009 869
1013 340
1015 279
1017 141
1019 142
1021 254
total count 2025 file3
etc
the ranges need to be used in another process (outside alteryx)
there is no specific output format required but needs to identify the start and end points of range. i.e.
| 1001 | file1 |
| 1003 | file1 |
| 1005 | file2 |
| 1007 | file2 |
| 1009 | file3 |
| 1013 | file3 |
| 1015 | file3 |
| 1017 | file3 |
| 1019 | file3 |
| 1021 | file3 |
or
1001 - 1003 1
1005 - 1007 2
1009 - 1021 3
Thank you!