Hello,
I have 100+ columns of data where I need to create a fill rate table that gives me a breakdown of how many times each element within that field is present. For example, if I have a field called "Vehicle Make" I will want to know how many times "Honda", "Mazda", "Ford" and etc appears. If I just had to do this on one field it would be super simple and I would just need the one summarize tool. However, I need to do this on all 100+ fields and do not want to have 100+ different streams of data grouping the different fields. I was thinking a macro might be able to accomplish this where I tell the macro to group the Nth field each each iteration (Iteration 1 group/count the 1st field, iteration 2 group/count the 2nd field and so on).
I did some benchmarking, starting with 10 records and 294 columns and running 5x the amount of records on each new iteration. It appears that the macro doesn't scale better than the traditional transpose/summarize method (bummer!).
So, I will try to run all 250m records and use the transpose/summarize approach, however, if anyone knows of a computationally faster way I am all ears because it makes my knees shake thinking of a summarize tool having to sumarize 250M*294=73.5B records. Thank you all!
Thinking about this some more, I know there are ways to split data files horizontally, but is there a way to split data files vertically? Namely, what if I split up my 1 data file consisting of 294 columns into 294 data files consisting of 1 column. Then I create a macro that reads the files one at a time, groups it and outputs the summary results. This way all 294 columns do not need to be in the machines memory all at the same time and it just needs to think about one column (or one data file) at a time. Thoughts? Thank you for all the brain power you have lent me thus far.
Hi @trevorwightman ,
The macro itself is basically trying to address the problem that I want it to select a single column, the transponse and summarise the data all without knowing what the column is actually called.
Basically I created a column with the name "1" just so I could pre-configure what the transpose and summarise are supposed to do.
The dynamic select only keeps columns named "1" or whatever the column we're inputting is, I make 1 = to whatever values are in our column of interest.
That way the preconfigured Transpose and Summarise will still work, I then added a header column and set that value to our input column name.
It's confusing because if you try and see how it's configured it won't work, because the dynamic select won't update the available fields.
It's not ideal the way it's set up but it was supposed to be a quick demo. It actually contains an error now I've looked at it again - the test I did yesterday improved to 20 seconds from 30+, however, it's still slower than the other method.
It's still about twice as slow as the non-macro method - but then that might just me setting it up incorrectly!
Regards,
Ben