SOLVED

Counting hashtag combinations

JefBus
7 - Meteor

Hi,

 

I'm cooking up a little visualisation project and have hit a slight roadblock:

So, I have a data dump of messages (somewhere up in the 7-digit range) with hashtags embedded,

e.g.
- Amy, don't have a #cow! http://f.co/Asp4fdkX #farmlife

- I love #Farmlife #vacation #midwest
- Cows & cows & #cow, as far as the eye can see #farmlife #midwest http://www.midwest.com

 

Now I want to count the occurrences of all hashtag combinations (pairs) to see which clusters emerge.
The 3 messages above would then have the following combinations:

tweet 1
  #cow        #farmlife    1
tweet 2
  #farmlife   #midwest     1
  #farmlife   #vacation    1
  #midwest    #vacation    1
tweet 3
  #cow        #farmlife    1
  #midwest    #cow         1
  #farmlife   #midwest     1

 

After summarizing, this would become:

All combos
  #cow        #farmlife    2
  #farmlife   #midwest     2
  #farmlife   #vacation    1
  #midwest    #vacation    1
  #midwest    #cow         1

 

This sounds like the right input for a non-directional network graph to visualize how these hashtags are related and clustered: which are often combined, and which are not.

I'm still in the contemplation + parsing-out-invalid-characters + cleaning stage at this point (human-written messages and CSV format apparently do not play well together!),

but I have a few minor issues / concerns:

 

How to extract the pairs?
Messages have anywhere between 1 and 8 hashtags, so anywhere between 0 and 28 pairs per tweet (a single hashtag gives no pair), based on the combinations formula with r = 2:

    n! / ( r! (n - r)! )
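
As a quick sanity check on those numbers, here is a small Python sketch (Python is only used for illustration here, it is not part of the Alteryx workflow, and the helper name `pair_count` is made up) that evaluates the formula with r = 2 for 1 to 8 hashtags:

```python
from math import comb

def pair_count(n):
    """Number of unordered hashtag pairs in a message with n distinct hashtags."""
    return comb(n, 2)  # n! / (2! (n - 2)!), and 0 when n < 2

# Pair counts for messages with 1 to 8 hashtags.
pair_counts_by_n = {n: pair_count(n) for n in range(1, 9)}
print(pair_counts_by_n)
```

So the number of pairs grows roughly with the square of the number of hashtags, but stays small for 8 tags or fewer.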


I haven't got much of a clue how to handle this specific dynamic in Alteryx yet,
any of you ever had a similar problem?


I think there might be an iterative or batch macro in here, one that

- takes a single tweet
- extracts all hashtags and orders them alphabetically
- somehow builds an array of all combinations <<-- (this is my main problem)
- outputs this into a table with 2 columns (or summarizes them)
- goes to the next tweet, repeats, and appends this result to the previous one,
until all tweets are finished
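
Outside Alteryx, the steps listed above can be sketched in a few lines of Python. This is only an illustration under some assumptions, not a tested solution for the real dataset: the regex `#\w+` is a simplistic hashtag pattern (it would also pick up URL fragments), hashtags are lowercased so that #Farmlife and #farmlife count as the same tag, and the function names are made up.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")

def hashtag_pairs(message):
    """Return the alphabetically ordered, unique hashtag pairs of one message."""
    # The set drops repeated hashtags; sorting makes every pair come out as (a, b) with a < b.
    tags = sorted({tag.lower() for tag in HASHTAG.findall(message)})
    return list(combinations(tags, 2))

def count_pairs(messages):
    """Count each hashtag pair over all messages."""
    counts = Counter()
    for message in messages:
        counts.update(hashtag_pairs(message))
    return counts

messages = [
    "Amy, don't have a #cow! http://f.co/Asp4fdkX #farmlife",
    "I love #Farmlife #vacation #midwest",
    "Cows & cows & #cow, as far as the eye can see #farmlife #midwest http://www.midwest.com",
]
pair_counts = count_pairs(messages)
for pair, n in sorted(pair_counts.items()):
    print(pair, n)
```

Because the tags are sorted before the pairs are built, every pair already has a fixed order and no separate de-duplication step is needed afterwards.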

 

How to de-duplicate:
(there is no directionality!)

  #cows       #farmlife    4
  #farmlife   #cows        6

to become

  #cows       #farmlife    10

 

(I suspect that once I get to this point, alphabetically sorting the hashtags before summarizing them may very well clear up this problem, but maybe the pairs should be alphabetically sorted at the extraction stage? I do not know if there is a simple way to do this kind of cross-column compare (and replace?).)

 

I also haven't figured out yet what to do if someone uses a single hashtag twice (there shouldn't be 2 nodes with the same hashtag/label), so I guess I could add a filter based on an expression like [column1#tag] != [column2#tag].
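
To make the two ideas above concrete (put each pair in alphabetical order before summing, and drop pairs where both sides are the same tag), here is a minimal Python sketch. It assumes the directional counts are already available as a mapping from (tag1, tag2) to a count; the function name `merge_undirected` and the sample numbers are only for illustration.

```python
from collections import Counter

def merge_undirected(directed_counts):
    """Sum counts of (a, b) and (b, a) into one alphabetically ordered pair."""
    merged = Counter()
    for (a, b), n in directed_counts.items():
        if a == b:
            continue  # same hashtag on both sides: not a valid edge
        merged[tuple(sorted((a, b)))] += n
    return merged

directed = {
    ("#cows", "#farmlife"): 4,
    ("#farmlife", "#cows"): 6,
    ("#cows", "#cows"): 3,  # a hashtag paired with itself
}
merged = merge_undirected(directed)
print(merged)
```

Sorting at the extraction stage instead would give the same totals; the only difference is how many rows reach the summarize step.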

 

Thought I'd post it here, as it seems like an interesting conundrum, and I'm figuring out a lot of the specifics right now by writing it up here.
Sadly, I cannot share the original dataset, so I hope it's OK to include a little part of the dataset that was used in weekly challenges #89 and #90 (Analyzing social data),
as it already has similar data and the hashtags have already been split off.

Any 2 cents or pointers for an approach that could work would be #hugely welcome.

2 REPLIES
danrh
13 - Pulsar

Give this one a try:

[Image: image.png, screenshot of the workflow]

The top input is your data; the bottom one adds a delimiter so I could keep the #'s. To get the unique pairs, I joined the data back to itself and then removed the records where the two hashtags are the same. I think you're spot on with alphabetizing the hashtags prior to counting them.
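
For readers without the screenshot, the self-join idea can be sketched roughly like this in Python. It is not a transcription of the workflow, just one possible variant: it assumes the hashtags are already split into (message id, hashtag) rows, and it keeps only pairs where the left tag sorts before the right one, which removes same-tag pairs and fixes the alphabetical order in a single comparison. The names `rows` and `self_join_pairs` are made up for the example.

```python
from collections import Counter, defaultdict

# Assumed input shape: one row per (message id, hashtag).
rows = [
    (1, "#cow"), (1, "#farmlife"),
    (2, "#farmlife"), (2, "#vacation"), (2, "#midwest"),
    (3, "#cow"), (3, "#farmlife"), (3, "#midwest"),
]

def self_join_pairs(rows):
    """Join the rows to themselves on message id and count unordered pairs."""
    by_message = defaultdict(set)
    for message_id, tag in rows:
        by_message[message_id].add(tag)  # the set drops repeated tags

    counts = Counter()
    for tags in by_message.values():
        for left in tags:
            for right in tags:
                # left < right drops (x, x) and keeps one orientation per pair.
                if left < right:
                    counts[(left, right)] += 1
    return counts

join_counts = self_join_pairs(rows)
print(sorted(join_counts.items()))
```

The join produces every combination within a message in one pass, so no per-tweet loop or macro is needed.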

 

Hope it helps!

JefBus
7 - Meteor

Thanks @danrh!

A very clean and elegant solution.

I was making it much harder than it was supposed to be.
Joining the data back onto itself to get the combinations: great idea, never thought of that.

I knew I was missing something when I was thinking about an iterative or batch macro to do this!
This is probably A LOT faster than iteratively cycling through every row/tweet to extract the tags (600K rows in the base dataset, that could take a while :) ).
The adapted flow now blazes through the entire data set in something like 15-23 seconds!
