Alteryx Designer Desktop Discussions

SOLVED

Counting hashtag combinations

JefBus
7 - Meteor

Hi,

 

I'm cooking up a little visualisation project and am at a slight roadblock:

So, I have a data dump of messages (somewhere up in the 7-digit range) with hashtags embedded

e.g.
- Amy, don't have a #cow! http://f.co/Asp4fdkX #farmlife

- I love #Farmlife #vacation #midwest
- Cows & cows & #cow, as far as the eye can see #farmlife #midwest http://www.midwest.com

 

Now I'd want to count occurrences of all hashtag combinations to see which clusters emerge.
The 3 messages above would then have the following combinations:

tweet 1
 #cow       #farmlife   1
tweet 2
 #farmlife  #midwest    1
 #farmlife  #vacation   1
 #midwest   #vacation   1
tweet 3
 #cow       #farmlife   1
 #midwest   #cow        1
 #farmlife  #midwest    1

 

after summarizing this would become:

All combos
 #cow       #farmlife   2
 #farmlife  #midwest    2
 #farmlife  #vacation   1
 #midwest   #vacation   1
 #midwest   #cow        1

 

Sounds like the right input for a non-directional network graph to visualize how these hashtags are related and clustered: which are often combined, and which are not.

I'm still in the contemplation + parsing out invalid characters + cleaning stage at this point (human-written messages and CSV format apparently do not play well together!)

but I have a few minor issues / concerns:

 

How to extract the pairs?
Messages have anywhere between 1 and 8 hashtags, so anywhere between 0 and 28 pairs per tweet (a single hashtag gives no pair), based on the formula with r = 2:

C(n, r) = n! / (r! (n - r)!)
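
As a quick sanity check of those numbers, here is a minimal sketch in Python (used purely for illustration; it is not part of any Alteryx workflow). It relies only on the standard library's `math.comb` with r = 2, and the name `pairs_per_message` is made up for this example.

```python
from math import comb

# Number of unordered hashtag pairs for a message with n distinct hashtags (r = 2)
pairs_per_message = {n: comb(n, 2) for n in range(1, 9)}

for n, pairs in pairs_per_message.items():
    print(f"{n} hashtags -> {pairs} pairs")
```

A message with one hashtag contributes no pair at all, two hashtags give one pair, and eight hashtags give the maximum of 28.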


I haven't got much of a clue how to handle this specific dynamic in Alteryx yet,
any of you ever had a similar problem?


I think there might be an iterative or batch macro in here, one that

- takes a single tweet
- extracts all hashtags and orders them alphabetically
- somehow builds an array of all combinations <<-- ( this is my main problem)
- outputs this into a table with 2 columns (or summarizes them)
- goes to the next tweet, repeats, and appends this result to the previous one,
until all tweets are finished
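
The steps listed above could be sketched outside Alteryx roughly as follows. This is only an illustration in Python under a few assumptions: hashtags match the simple pattern `#\w+`, tags are compared case-insensitively (so `#Farmlife` and `#farmlife` are the same tag), and a tag repeated within one message counts once. The names `HASHTAG` and `hashtag_pairs` are hypothetical.

```python
import re
from itertools import combinations

# Assumed hashtag pattern: '#' followed by letters, digits or underscores
HASHTAG = re.compile(r"#\w+")

def hashtag_pairs(message):
    """Return every unordered hashtag pair in one message, alphabetically ordered."""
    # Lower-case and de-duplicate, then sort so each pair comes out in one fixed order
    tags = sorted({tag.lower() for tag in HASHTAG.findall(message)})
    return list(combinations(tags, 2))

messages = [
    "Amy, don't have a #cow! http://f.co/Asp4fdkX #farmlife",
    "I love #Farmlife #vacation #midwest",
    "Cows & cows & #cow, as far as the eye can see #farmlife #midwest http://www.midwest.com",
]

# One flat list of pairs across all messages, ready to be counted
pairs = [pair for message in messages for pair in hashtag_pairs(message)]
print(pairs)
```

Because the tags are sorted before pairing, every pair already comes out alphabetically, so no later cross-column swap is needed. One caveat: a URL containing a `#` fragment would also match this pattern, so URLs may need stripping first.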

 

How to de-duplicate:
( there is no directionality!)

#cows      #farmlife  4
#farmlife  #cows      6


to become

#cows      #farmlife  10

 

(I suspect that, once I get to this point, alphabetically sorting the hashtags before summarizing them may very well clear up this problem, but maybe the pairs should be alphabetically sorted at the extraction stage? I do not know if there is a simple way to do this kind of cross-column compare (and replace?).)

 

Also haven't figured out yet what to do if someone uses a single hashtag twice (there shouldn't be 2 nodes with the same hashtag/label, so I guess I could integrate a filter based on an expression like [column1#tag] != [column2#tag]).
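
Both concerns (direction and a hashtag paired with itself) can be handled in one pass. A minimal Python sketch, assuming the pairs have already been counted per direction; the rows below are made-up illustration data, not taken from the real dataset:

```python
from collections import Counter

# Hypothetical pre-counted rows: (tag in column 1, tag in column 2, count)
rows = [
    ("#cows", "#farmlife", 4),
    ("#farmlife", "#cows", 6),
    ("#cows", "#cows", 1),  # same hashtag used twice in one message
]

totals = Counter()
for tag1, tag2, count in rows:
    if tag1 == tag2:
        continue  # a tag paired with itself is not an edge, so skip it
    # Sorting the two tags gives one canonical key per undirected pair
    key = tuple(sorted((tag1, tag2)))
    totals[key] += count

print(totals)
```

Sorting the two tags inside each row is the cross-column compare and swap: both directions collapse onto the same key, so their counts add up (4 + 6 = 10 here).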

 

Thought I'd post it here, seems like an interesting conundrum. I'm figuring out a lot of the specifics right now by writing it up here.
Sadly, I cannot share the original dataset, so I hope it's OK to include a little part of the dataset that was used in weekly challenges #89 and #90 (Analyzing social data),
as it already has similar data and the hashtags have already been split off.

Any 2cents or pointers for an approach that could work would be #hugely welcome

2 REPLIES
danrh
13 - Pulsar

Give this one a try:

[Image: image.png (workflow screenshot)]

The top input is your data, the bottom is adding a delimiter so I could keep the #'s.  To get the unique pairs, I joined the data back to itself and then removed the records where the two hashtags are the same.  I think you're spot on with alphabetizing the hashtags prior to counting them.
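
For readers without the workflow at hand, the self-join idea can be sketched in Python with the standard library's `sqlite3`. This is not the workflow from the screenshot, only an illustration under assumptions: the table `tags`, its columns `tweet_id` and `tag`, and the sample rows are made up, with one row per hashtag per tweet and no repeated tag within a tweet. It also uses a single `a.tag < b.tag` condition in place of two separate steps (dropping identical tags, then alphabetizing).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tags (tweet_id INTEGER, tag TEXT)")
con.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [
        (1, "#cow"), (1, "#farmlife"),
        (2, "#farmlife"), (2, "#vacation"), (2, "#midwest"),
        (3, "#cow"), (3, "#farmlife"), (3, "#midwest"),
    ],
)

# Join the table to itself on tweet_id. The condition a.tag < b.tag drops
# identical tags and keeps each unordered pair once, in alphabetical order.
query = """
    SELECT a.tag, b.tag, COUNT(*) AS n
    FROM tags AS a
    JOIN tags AS b
      ON a.tweet_id = b.tweet_id AND a.tag < b.tag
    GROUP BY a.tag, b.tag
    ORDER BY n DESC, a.tag, b.tag
"""
pair_counts = con.execute(query).fetchall()

for tag1, tag2, n in pair_counts:
    print(tag1, tag2, n)
```

The join works on the whole table at once, so no per-tweet loop is needed, which is the same reason a join-based workflow avoids an iterative macro.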

 

Hope it helps!

JefBus
7 - Meteor

Thanks @danrh !

A very clean and elegant solution.

I was making it much harder than it was supposed to be.
Joining the data back onto itself to get the combinations: great idea, never thought of that.

I knew I was missing something when thinking about an iterative or batch macro to do this!
This is probably A LOT faster than iteratively cycling through every row/tweet to extract the tags (600K rows in the base dataset, that could take a while :) ).
The adapted flow now blazes through the entire data set in something like 15-23 seconds!
