Weighting a matrix using TF-IDF

Question

Hi everyone,

I'm trying to put together a complete data science workflow for my class. I'm therefore working through the TalkingData Mobile User Demographics competition on Kaggle (https://www.kaggle.com/c/talkingdata-mobile-user-demographics). One of the issues, I'm having is that I want to reduce the matrix of users vs application categories used. After showing that the results are worthless after the PCA, I've decided to use the Term-Frequency Inverse Document Frequency (TF-IDF) formula on the data to downweight common categories and users that use a lot of applications. I have no problem with the TF part, but the IDF part seems to require use of the multi-row formula tool, which I'm not great with yet.

Any help appreciated. Data and formula in the attached workflow.

Kai :-)

SwallowingPrideandAskingforHelp.yxzp

KaneG · Accepted Answer

Hi @KaiLarsen,

I would do this with summarise tools as below:

However, if you wanted to use the multi-row formula, you would need to get the groupings right in the tool and run a running total for each element. So, to get the total number of people, sort by device_id and then use multi-row to create a field: [People]=IIF([device_id]!=[Row-1:device_id],[Row-1:People]+1,[Row-1:People]), then do similar for People/Category, but grouping by Category in the Multi-row field. You'll then be able to perform your calculation, however, that will perform it on all 100 rows, whereas the sumamrise will get the value per category and then join it back on by category.

Kane

KaiLarsen · Accepted Answer

That was really helpful Kane. I was able to use what you provided to improve the workflow.

I've tried to make it as understandable as possible, so would love to know whether folks can follow it and even if anyone can see improvements. I know that at a minimum, the unique tool is an indication that I could simplify this earlier. I'm sure my students would thank you for saving them from my thinking.

Kai :-)