
Alteryx Designer Desktop Discussions

SOLVED

Weighting a matrix using TF-IDF

KaiLarsen
10 - Fireball

Hi everyone,

 

I'm trying to put together a complete data science workflow for my class, so I'm working through the TalkingData Mobile User Demographics competition on Kaggle (https://www.kaggle.com/c/talkingdata-mobile-user-demographics). One issue I'm having is that I want to reduce the matrix of users vs. application categories used. After finding that the results were worthless after PCA, I've decided to apply the Term Frequency-Inverse Document Frequency (TF-IDF) formula to the data to downweight common categories and users who use a lot of applications. I have no problem with the TF part, but the IDF part seems to require the Multi-Row Formula tool, which I'm not great with yet.

 

Any help appreciated. Data and formula in the attached workflow.
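
For readers without the attachment: the exact variant used in the attached workflow is not shown in this thread, so the following is only a common textbook form of TF-IDF, written for this setting as an assumption. The symbols are illustrative: f_{d,c} is the number of rows for device d in category c, N is the total number of distinct devices, and n_c is the number of distinct devices that used category c.

```latex
\mathrm{tf}_{d,c} = \frac{f_{d,c}}{\sum_{c'} f_{d,c'}}, \qquad
\mathrm{idf}_{c} = \log\frac{N}{n_c}, \qquad
w_{d,c} = \mathrm{tf}_{d,c} \cdot \mathrm{idf}_{c}
```

Under this form, a category used by every device gets idf = log(1) = 0, which is the downweighting of common categories described above.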

 

Kai :-)

2 REPLIES
KaneG
Alteryx Alumni (Retired)

Hi @KaiLarsen,

 

I would do this with summarise tools as below:

 

Image 001 - 20160808 - 130841.png

 

However, if you wanted to use the Multi-Row Formula tool, you would need to get the groupings right in the tool and keep a running total for each element. So, to get the total number of people, sort by device_id and then use the Multi-Row Formula tool to create a field: [People]=IIF([device_id]!=[Row-1:device_id],[Row-1:People]+1,[Row-1:People]), then do something similar for people per category, but grouping by Category in the Multi-Row Formula tool. You'll then be able to perform your calculation; however, that will perform it on all 100 rows, whereas the summarise approach gets the value once per category and then joins it back on by category.
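
For anyone who wants to check the logic outside Designer, here is a minimal Python sketch of the summarise-then-join idea described above. It is not the workflow from the screenshot: the sample rows, the `tf_idf` function name, and the log(N / n_c) form of IDF are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical sample data: one row per (device_id, category) usage record.
rows = [
    ("d1", "games"), ("d1", "games"), ("d1", "finance"),
    ("d2", "games"),
    ("d3", "games"), ("d3", "travel"),
]

def tf_idf(rows):
    """Return (weights, idf) using summarise-style aggregates joined by key."""
    pair_counts = defaultdict(int)           # rows per (device, category)
    device_totals = defaultdict(int)         # rows per device
    devices_per_category = defaultdict(set)  # distinct devices per category

    # "Summarise" step: build the grouped counts in a single pass.
    for device_id, category in rows:
        pair_counts[(device_id, category)] += 1
        device_totals[device_id] += 1
        devices_per_category[category].add(device_id)

    n_devices = len(device_totals)  # total number of distinct devices

    # One IDF value per category (assumed form: log(N / n_c)).
    idf = {
        category: math.log(n_devices / len(devices))
        for category, devices in devices_per_category.items()
    }

    # "Join" step: attach each category's IDF back to the per-device rows.
    weights = {
        (device_id, category): (count / device_totals[device_id]) * idf[category]
        for (device_id, category), count in pair_counts.items()
    }
    return weights, idf

weights, idf = tf_idf(rows)
for key in sorted(weights):
    print(key, round(weights[key], 4))
```

The IDF is computed once per category and then looked up by category, which mirrors getting the value per category and joining it back rather than recomputing it on every row.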

 

Kane

KaiLarsen
10 - Fireball

That was really helpful Kane. I was able to use what you provided to improve the workflow.  

 

I've tried to make it as understandable as possible, so I would love to know whether folks can follow it and whether anyone can see improvements. I know that, at a minimum, the Unique tool is an indication that I could simplify things earlier in the workflow. I'm sure my students would thank you for saving them from my thinking.

 

Screen Shot 2016-08-08 at 1.50.02 AM.png

Kai :-)
