Weekly Challenge

Solve the challenge, share your solution and summit the ranks of our Community!
IDEAS WANTED

We're actively looking for ideas on how to improve Weekly Challenges and would love to hear what you think!

Submit Feedback
We've recently made an accessibility improvement to the community and therefore posts without any content are no longer allowed. Please use the spoiler feature or add a short message in the message body in order to submit your weekly challenge.

Challenge #205: Taynalysis

Highlighted
16 - Nebula
16 - Nebula

Wow - this took much longer than I expected - mainly due to the specifics of how we deal with different types of replacement.

 

Did this in Alteryx; and then in Python - the line counts match but other than that the answers are different than the model solution.

(i.e. my answer is wrong - don't copy my exam)

 

Spoiler
Solution.pngPython1.pngPython2.png

#SnakingMyWayThruChallenges

 

 

Highlighted
8 - Asteroid

Only for album Red and 1989, some words are different. It is Quiet challenging for the beginner. But I loved it.

 

Thanks,

Sudhakar S

12 - Quasar
Spoiler
Spoiler AlertSpoiler Alert
Highlighted
8 - Asteroid

in attach my solution

Highlighted
Alteryx Certified Partner

here it is

Highlighted
8 - Asteroid

My numbers don't tie exactly either. Are we counting null lines? Also the answers seem different when you use "Remove unwanted characters - Punctuation" using Data Cleanse vs using the formula as provided in the solution. Anyways, here is my answer.

Highlighted
7 - Meteor

Challenge 205 Solution

Spoiler
Challenge 205_LHZ.JPG
Highlighted
14 - Magnetar
14 - Magnetar

House Johnson is a BIG fan of Tay Tay... this was a fun one

 

Spoiler
... and extra fun since I figured out (after MUCH GOOGLE RESEARCH) how to do it both in Alteryx and Python! #SnakingMyWayThruChallenges 

WeeklyChallenge205.JPG

Here's the Python script I came up with (not elegant by any means, but hey, it works.)

#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
import pandas as pd
import numpy as np

#Package.installPackages(['pandas','numpy'])


#################################
# Input source data, creating a dataframe for lyrics and a list for stopwords.

from ayx import Alteryx
df_lyrics = Alteryx.read("#1")

stopwords = Alteryx.read("#2")['Stop_Words'].values.tolist()

df_lyrics[:3]


#################################
# Split lyrics, replace stopwords with None values, and filter out None values.

list_lyrics = df_lyrics.lyric.str.split().tolist()
list_lyrics = [[j.lower() for j in i] for i in list_lyrics]

# Replace specific characters in split words.

for i, x in enumerate(list_lyrics):
for j, a in enumerate(x):
list_lyrics[i][j] = a.replace(',', '') \
.replace('"','') \
.replace("(","") \
.replace(")","") \
.replace(".","") \
.replace("?","") \
.replace("!","")

# Replace stopwords in the list of lyrics, and convert to Series > DataFrame.

list_replaced = list()

for i in range(len(list_lyrics)):
list_iteration = [x for x in list_lyrics[i] if x not in stopwords]
list_replaced.append(list_iteration)

df_listlyrics = pd.DataFrame(pd.Series(list_replaced))


#################################
# Join new list of lyrics to original df_lyrics.

df_join = df_lyrics.join(df_listlyrics, how='left') \
.drop(["lyric"], axis = 1) \
.rename(columns = {0:'Lyrics'})

# Join lyrics back to album information, and cleanup/pivot data so that each lyric word is in a row.

df_join = df_join.Lyrics.apply(pd.Series) \
.merge(df_join, right_index = True, left_index = True) \
.drop(['Lyrics','artist'], axis = 1) \
.melt(id_vars = ['album','track_title','track_n','line','year'], value_name = 'lyric') \
.drop(['variable'], axis = 1)

# Assign value of 1 to each lyric for counting purposes.

df_join['value'] = 1

# Perform group by and aggregation functions to derive the list of top ten words.

df_count = df_join.drop(['track_title','track_n','line'], axis = 1) \
.groupby(['year','album','lyric'], as_index = False).sum() \

df_count.sort_values(by=['year','album','value'], inplace = True, ascending = False)

df_topten = df_count.groupby(['album'], as_index = True).head(10) \
.drop(['value'], axis = 1) \
.groupby(['year','album'], as_index = False).agg({'lyric': ' '.join})


#################################
# Count total lines in album.

lyrics_count = df_lyrics.groupby(['album'])['line'].count()


#################################
# Perform group by and aggregation functions to derive the count of unique lines.

unique_lyrics = pd.DataFrame(pd.Series(list_lyrics)) \
.rename(columns = {0:'lines'}) \
.join(df_lyrics['album'])

unique_lyrics['liststring'] = [' '.join(map(str,l)) for l in unique_lyrics['lines']]

unique_lyrics = unique_lyrics.drop(['lines'], axis = 1) \
.drop_duplicates() \
.groupby(['album'])['liststring'].count()


#################################
# Join different results together.

df_final = df_topten.merge(lyrics_count, on = 'album') \
.merge(unique_lyrics, on = 'album') \
.rename(columns = {'year':'Year', 'album':'Album', 'lyric':'Top 10 Lyrics', 'line':'Total Lines', 'liststring':'Unique Lines'})

df_final['Duplicate Lines'] = df_final['Total Lines'] - df_final['Unique Lines']
df_final['Repetitiveness Percentage'] = df_final['Duplicate Lines'] / df_final['Total Lines']


#################################
Alteryx.write(df_final,1)


#################################

Cheers!

NJ

Highlighted
8 - Asteroid
Spoiler
Solution Attached

balajilolla_0-1588876501045.png

 

Highlighted
12 - Quasar
12 - Quasar

head scratcher at first.... but after reading the comments it started to make sense. Also added a python version (probably not efficient though) - that one was tough...

Spoiler
jarrod_0-1588946737271.png

 

#SnakingMyWayThruChallenges