Hi everyone,
I'm trying to put together a complete data science workflow for my class, so I'm working through the TalkingData Mobile User Demographics competition on Kaggle (https://www.kaggle.com/c/talkingdata-mobile-user-demographics). One issue I'm having is that I want to reduce the dimensionality of the matrix of users vs. application categories used. PCA gave worthless results, so I've decided to apply the term frequency-inverse document frequency (TF-IDF) formula to the data instead, to downweight both common categories and users who use a lot of applications. I have no problem with the TF part, but the IDF part seems to require the multi-row formula tool, which I'm not great with yet.
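To make clear what I'm after, here is a minimal Python sketch of the weighting (not the attached workflow itself). It assumes one common TF-IDF variant: TF is a category's share of a user's total usage, and IDF is the log of the number of users divided by the number of users with any usage in that category. The function name `tf_idf` and the toy counts are made up for illustration.

```python
import math

def tf_idf(counts):
    """counts: one row per user, one column per application category."""
    n_users = len(counts)
    n_categories = len(counts[0]) if counts else 0

    # Document frequency: number of users with nonzero usage in each category.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_categories)]

    # IDF: log(N / df). Categories nobody uses get weight 0 (an assumption).
    idf = [math.log(n_users / d) if d > 0 else 0.0 for d in df]

    weighted = []
    for row in counts:
        total = sum(row)
        # TF: share of this user's total usage that falls in each category.
        tf = [c / total if total > 0 else 0.0 for c in row]
        weighted.append([t * w for t, w in zip(tf, idf)])
    return weighted

# Made-up toy data: 3 users x 3 categories.
counts = [
    [2, 0, 2],
    [1, 1, 0],
    [5, 0, 0],
]
weights = tf_idf(counts)
```

In this toy data the first category is used by every user, so its IDF is log(1) = 0 and it drops out entirely. The IDF step needs a per-category count taken across all user rows, which is the part I'm stuck on.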
Any help appreciated. Data and formula in the attached workflow.
Kai :-)