Alteryx Designer Desktop Discussions

JefBus · ‎11-21-2017

Hi,

I'm cooking up a little visualisation project and am at a slight roadblock:

So, i have a data dump of a number of messages (somewhere up in the 7 digit-range) with hashtags embedded

e.g.
- Amy, don't have a #cow! http://f.co/Asp4fdkX #farmlife

- I love #Farmlife #vacation #midwest
- Cows & cows & #cow, as far as the eye can see #farmlife #midwest http://www.midwest.com

Now i'd want to count occurences of all hashtag combinations to see which clusters emerge
the 3 messages above would then have the following combinations

tweet 1
	#cow	#farmlife	1
tweet 2
	#farmlife	#midwest	1
	#farmlife	#vacation	1
	#midwest	#vacation	1
tweet 3
	#cow	#farmlife	1
	#midwest	#cow	1
	#farmlife	#midwest	1

after summarizing this would become:

All combos
	#cow	#farmlife	2
	#farmlife	#midwest	2
	#farmlife	#vacation	1
	#midwest	#vacation	1
	#midwest	#cow	1

sounds like the right input for a non-directional network graph to visualize how these hashtags are related and clustered, which are often combined, and which are not

I'm still in the contemplation+ parsing out invalid characters and cleaning stage at this point ( human-written messages and csv format apparently do not play well together!)

but I have a few minor issues / concerns:

how to extract the pairs?
messages have anywhere between 1 and 8 hashtags, so anywhere between 1 and 28 combinations per tweet (
based on the formula:

n!
--------------

r! (n - r )!

I haven't got much of a clue how to handle this specific dynamic in Alteryx yet,
any of you ever had a similar problem?

I think there might be an iterative or batch macro in here, one that

- takes a single tweet
- extracts all hashtags and orders them alphabetically
- somehow builds an array of all combinations <<-- ( this is my main problem)
-outputs this into a table with 2 columns ( or summarizes them)
- goes to the next tweet repeats, and appends this result to the previous one,
until all tweets are finished

How to de-duplicate:
( there is no directionality!)

#cows	#farmlife	4
#farmlife	#cows	6

to become

#cows

#farmlife

10

( I suspect, that once I get to this point, alphabetically sorting the hashtags before summarizing them may very well clear up this problem, but maybe the pairs should be aphabetically sorted at the extraction stage? I do not know if there is a simple way to do this kind of cross-column compare (and replace?) )

Also haven't figured out yet what to do if someone uses a single hashtag twice ( there shouldn't be 2 nodes with the same hashtag/label, so i guess I could integrate a filter to filter out based on an expression like [column1#tag] != [column2#tag].

Thought i'd post it here, seems like an interesting conundrum, I'm figuring out a lot of the specifics right now by writing it up here
sadly, i cannot share the original dataset, so I hope it's ok to include a little part of the dataset that was used in weekly challenge #89 and #90 ( Analyzing social data)
as it already has similar data and the hashtags have already been split off.

Any 2cents or pointers for an approach that could work would be #hugely welcome

danrh · ‎11-21-2017

Give this one a try:

The top input is your data, the bottom is adding a delimiter so I could keep the #'s. How I handled getting the unique pairs was to join the data back to itself, then removing records where the two hashtags are the same. I think you're spot on with alphabetizing the hashtags prior to counting them.

Hope it helps!

JefBus · ‎11-28-2017

Thanks @danrh !

a very clean and elegant solution

I was making it much harder then it was supposed to be,
joining the data back onto itself to get the combbinations: Great idea, never thought of that.

i knew i was missing something thinking about an iterative or batch macro to do this!
This is probably A LOT faster then iteratively cycling through every row/tweet to extract the tags, ( 600K rows in the base dataset, that could take a while :) )
the adapted flow now blazes throuugh the entire data set in sth like 15-23 seconds!

Alteryx Designer Desktop Discussions

Counting hashtag combinations