I'm trying to run a Market Basket (MB) Analysis on a very large dataset: 500+ million rows with ~2,000 unique item identifiers. The MB Inspect / Rules tool looks like a dead end, as it seems to be too much data whenever I've tried it. So I'm now attempting the MB Affinity tool, which according to a few other forum posts is a lot quicker. Is there an optimum number of records per data chunk in the tool's configuration, though? (Or is it counter-intuitive, like sort/join memory usage, where a lower number is better?)
Mine is currently set at 256,000 records per chunk, but should this be higher or lower to optimise the workflow? I'm running on a server with 256 GB of RAM, so memory isn't too much of an issue, I hope!
Sorry if this is a ridiculous question, but can you confirm that you are referring to a Market Basket analysis when you mention "MB Analysis"? I've not used these tools much, so I just wanted to make sure I'm on the right track.
There are a few people who frequent this community who are skilled in R or statistical analysis, so I've tagged them below.
Market Basket Analysis is a modelling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. It's a subset of affinity analysis.
It's an iterative process that needs to build matrices of combinations of the items/transactions in order to derive the association rules or frequent itemsets, so it can be quite memory- and processor-intensive, depending on your machine specs.
There isn't necessarily an "optimum" number of records (it depends on the number of transactions and/or the number of items per transaction), so you're definitely on the right track with chunking.
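The reason chunking is safe here, at least for the counting stage, is that item-pair counts are additive: counts computed per chunk can simply be summed, and the totals match a single pass over the whole dataset. A hedged sketch in plain Python (the helper names and chunk size are my own, for illustration):

```python
from collections import Counter
from itertools import combinations, islice

def chunked(iterable, size):
    """Yield lists of up to `size` transactions at a time."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def count_pairs(transactions):
    """Count co-occurring item pairs across a batch of baskets."""
    counts = Counter()
    for basket in transactions:
        counts.update(combinations(sorted(basket), 2))
    return counts

# Per-chunk counts are additive, so summing them gives the same
# result as processing everything in one go -- only peak memory differs.
transactions = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
total = Counter()
for chunk in chunked(transactions, size=2):  # tiny chunk size for demo
    total += count_pairs(chunk)
print(total[("a", "b")])  # 2
```

So a smaller chunk size mainly trades lower peak memory for more merge overhead; the final counts shouldn't change.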
I'm interested in what the tagged Community users have experienced using these tools. Please post your findings back to this thread; it will undoubtedly help other users.
I can't offer much advice here, as I've never used the Alteryx MB Affinity tool myself. I can, however, point you to a blog post I've written on Market Basket analysis, which explains why it can be so memory-intensive with large datasets.
My gut feeling is that chunking your dataset into smaller segments is likely to improve performance (please don't quote me on this; it's just based on my understanding).
It would be interesting to test, using a subset of the data, whether the size of the chunks actually affects the output result, and of course to identify which is quickest. Perhaps this is something you could have a go at before, as Criston says, feeding your findings back to the community so others can use your knowledge!
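One way to run that experiment outside of Alteryx is a quick timing harness in plain Python: build a small synthetic subset, count item pairs at several chunk sizes, and check that the results agree while comparing runtimes. This is only a sketch under assumed data sizes (the basket counts, item pool, and chunk sizes below are illustrative, not recommendations):

```python
import random
import time
from collections import Counter
from itertools import combinations, islice

def count_pairs(transactions):
    """Count co-occurring item pairs across a batch of baskets."""
    counts = Counter()
    for basket in transactions:
        counts.update(combinations(sorted(basket), 2))
    return counts

# Synthetic subset: 10,000 baskets of 3-8 items drawn from 200 items.
random.seed(0)
items = [f"item{i}" for i in range(200)]
baskets = [set(random.sample(items, random.randint(3, 8)))
           for _ in range(10_000)]

# Time the same work at several chunk sizes; the merged counts must
# agree, so any difference between runs is purely runtime/memory.
for size in (1_000, 5_000, 10_000):
    start = time.perf_counter()
    total = Counter()
    it = iter(baskets)
    while chunk := list(islice(it, size)):
        total += count_pairs(chunk)
    print(size, round(time.perf_counter() - start, 3), "s")
```

If chunk size ever changes the output counts (not just the speed), that would suggest the tool is approximating per chunk rather than merging exactly, which would be well worth reporting back here.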