Hi,
I'm cooking up a little visualisation project and am at a slight roadblock:
So, i have a data dump of a number of messages (somewhere up in the 7 digit-range) with hashtags embedded
e.g.
- Amy, don't have a #cow! http://f.co/Asp4fdkX #farmlife
- I love #Farmlife #vacation #midwest
- Cows & cows & #cow, as far as the eye can see #farmlife #midwest http://www.midwest.com
Now i'd want to count occurences of all hashtag combinations to see which clusters emerge
the 3 messages above would then have the following combinations
tweet 1 | |||
#cow | #farmlife | 1 | |
tweet 2 | |||
#farmlife | #midwest | 1 | |
#farmlife | #vacation | 1 | |
#midwest | #vacation | 1 | |
tweet 3 | |||
#cow | #farmlife | 1 | |
#midwest | #cow | 1 | |
#farmlife | #midwest | 1 |
after summarizing this would become:
All combos | |||
#cow | #farmlife | 2 | |
#farmlife | #midwest | 2 | |
#farmlife | #vacation | 1 | |
#midwest | #vacation | 1 | |
#midwest | #cow | 1 |
sounds like the right input for a non-directional network graph to visualize how these hashtags are related and clustered, which are often combined, and which are not
I'm still in the contemplation+ parsing out invalid characters and cleaning stage at this point ( human-written messages and csv format apparently do not play well together!)
but I have a few minor issues / concerns:
how to extract the pairs?
messages have anywhere between 1 and 8 hashtags, so anywhere between 1 and 28 combinations per tweet (
based on the formula:
n!
--------------
r! (n - r )!
I haven't got much of a clue how to handle this specific dynamic in Alteryx yet,
any of you ever had a similar problem?
I think there might be an iterative or batch macro in here, one that
- takes a single tweet
- extracts all hashtags and orders them alphabetically
- somehow builds an array of all combinations <<-- ( this is my main problem)
-outputs this into a table with 2 columns ( or summarizes them)
- goes to the next tweet repeats, and appends this result to the previous one,
until all tweets are finished
How to de-duplicate:
( there is no directionality!)
#cows | #farmlife | 4 |
#farmlife | #cows | 6 |
to become
#cows | #farmlife | 10 |
( I suspect, that once I get to this point, alphabetically sorting the hashtags before summarizing them may very well clear up this problem, but maybe the pairs should be aphabetically sorted at the extraction stage? I do not know if there is a simple way to do this kind of cross-column compare (and replace?) )
Also haven't figured out yet what to do if someone uses a single hashtag twice ( there shouldn't be 2 nodes with the same hashtag/label, so i guess I could integrate a filter to filter out based on an expression like [column1#tag] != [column2#tag].
Thought i'd post it here, seems like an interesting conundrum, I'm figuring out a lot of the specifics right now by writing it up here
sadly, i cannot share the original dataset, so I hope it's ok to include a little part of the dataset that was used in weekly challenge #89 and #90 ( Analyzing social data)
as it already has similar data and the hashtags have already been split off.
Any 2cents or pointers for an approach that could work would be #hugely welcome
Solved! Go to Solution.
Give this one a try:
The top input is your data, the bottom is adding a delimiter so I could keep the #'s. How I handled getting the unique pairs was to join the data back to itself, then removing records where the two hashtags are the same. I think you're spot on with alphabetizing the hashtags prior to counting them.
Hope it helps!
Thanks @danrh !
a very clean and elegant solution
I was making it much harder then it was supposed to be,
joining the data back onto itself to get the combbinations: Great idea, never thought of that.
i knew i was missing something thinking about an iterative or batch macro to do this!
This is probably A LOT faster then iteratively cycling through every row/tweet to extract the tags, ( 600K rows in the base dataset, that could take a while :) )
the adapted flow now blazes throuugh the entire data set in sth like 15-23 seconds!