Wow - this took much longer than I expected - mainly due to the specifics of how we deal with different types of replacement.
I did this in Alteryx and then in Python; the line counts match, but otherwise my answers differ from the model solution.
(i.e. my answer is wrong - don't copy my exam!)
#SnakingMyWayThruChallenges
Challenge 205 Solution
House Johnson is a BIG fan of Tay Tay... this was a fun one 🙂
Here's the Python script I came up with (not elegant by any means, but hey, it works.)
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
import pandas as pd
import numpy as np
#Package.installPackages(['pandas','numpy'])
#################################
# Input source data, creating a dataframe for lyrics and a list for stopwords.
from ayx import Alteryx
df_lyrics = Alteryx.read("#1")
stopwords = Alteryx.read("#2")['Stop_Words'].values.tolist()
df_lyrics[:3]
#################################
# Split lyrics, replace stopwords with None values, and filter out None values.
list_lyrics = df_lyrics.lyric.str.split().tolist()
list_lyrics = [[j.lower() for j in i] for i in list_lyrics]
# Replace specific characters in split words.
for i, x in enumerate(list_lyrics):
    for j, a in enumerate(x):
        list_lyrics[i][j] = a.replace(',', '') \
            .replace('"','') \
            .replace("(","") \
            .replace(")","") \
            .replace(".","") \
            .replace("?","") \
            .replace("!","")
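As an aside, that chain of .replace() calls could be collapsed into a single pass with str.translate; a minimal sketch on made-up sample tokens (not part of the script above):

```python
# Sketch only: strip the same punctuation set in one pass with str.translate.
punct_table = str.maketrans('', '', ',"().?!')
demo_words = ['"shake,', '(it', 'off!']  # invented sample tokens
cleaned = [w.translate(punct_table) for w in demo_words]
# cleaned == ['shake', 'it', 'off']
```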
# Replace stopwords in the list of lyrics, and convert to Series > DataFrame.
list_replaced = list()
for i in range(len(list_lyrics)):
    list_iteration = [x for x in list_lyrics[i] if x not in stopwords]
    list_replaced.append(list_iteration)
df_listlyrics = pd.DataFrame(pd.Series(list_replaced))
#################################
# Join new list of lyrics to original df_lyrics.
df_join = df_lyrics.join(df_listlyrics, how='left') \
    .drop(["lyric"], axis = 1) \
    .rename(columns = {0:'Lyrics'})
# Join lyrics back to album information, and cleanup/pivot data so that each lyric word is in a row.
df_join = df_join.Lyrics.apply(pd.Series) \
    .merge(df_join, right_index = True, left_index = True) \
    .drop(['Lyrics','artist'], axis = 1) \
    .melt(id_vars = ['album','track_title','track_n','line','year'], value_name = 'lyric') \
    .drop(['variable'], axis = 1)
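As an aside, on pandas 0.25+ the apply(pd.Series) + melt pivot can be done with DataFrame.explode, which emits one list element per row; a toy sketch with invented data:

```python
import pandas as pd

# Sketch only: one row per word via explode (toy data, not the challenge input).
toy = pd.DataFrame({'album': ['Red'], 'Lyrics': [['we', 'are', 'never']]})
exploded = toy.explode('Lyrics')
# exploded has three rows, one word per row
```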
# Assign value of 1 to each lyric for counting purposes.
df_join['value'] = 1
# Perform group by and aggregation functions to derive the list of top ten words.
df_count = df_join.drop(['track_title','track_n','line'], axis = 1) \
    .groupby(['year','album','lyric'], as_index = False).sum()
df_count.sort_values(by=['year','album','value'], inplace = True, ascending = False)
df_topten = df_count.groupby(['album'], as_index = True).head(10) \
    .drop(['value'], axis = 1) \
    .groupby(['year','album'], as_index = False).agg({'lyric': ' '.join})
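Another way to get the per-album top N (instead of sort_values + head) is groupby with nlargest; a toy sketch with invented counts:

```python
import pandas as pd

# Sketch only: top-2 words per album by summed count (toy numbers).
counts = pd.DataFrame({'album': ['Red', 'Red', 'Red', '1989'],
                       'lyric': ['shake', 'off', 'it', 'style'],
                       'value': [9, 5, 2, 7]})
top2 = (counts.groupby('album', group_keys=False)
              .apply(lambda g: g.nlargest(2, 'value')))
# at most two rows survive per album
```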
#################################
# Count total lines in album.
lyrics_count = df_lyrics.groupby(['album'])['line'].count()
#################################
# Perform group by and aggregation functions to derive the count of unique lines.
unique_lyrics = pd.DataFrame(pd.Series(list_lyrics)) \
    .rename(columns = {0:'lines'}) \
    .join(df_lyrics['album'])
unique_lyrics['liststring'] = [' '.join(map(str,l)) for l in unique_lyrics['lines']]
unique_lyrics = unique_lyrics.drop(['lines'], axis = 1) \
    .drop_duplicates() \
    .groupby(['album'])['liststring'].count()
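As an aside, the drop_duplicates + count route for unique lines can also be written with nunique; a toy sketch with made-up lines:

```python
import pandas as pd

# Sketch only: count distinct line strings per album (toy data).
lines = pd.DataFrame({'album': ['Red', 'Red', 'Red'],
                      'liststring': ['a b c', 'a b c', 'd e f']})
unique_per_album = lines.groupby('album')['liststring'].nunique()
# unique_per_album['Red'] == 2
```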
#################################
# Join different results together.
df_final = df_topten.merge(lyrics_count, on = 'album') \
    .merge(unique_lyrics, on = 'album') \
    .rename(columns = {'year':'Year', 'album':'Album', 'lyric':'Top 10 Lyrics', 'line':'Total Lines', 'liststring':'Unique Lines'})
df_final['Duplicate Lines'] = df_final['Total Lines'] - df_final['Unique Lines']
df_final['Repetitiveness Percentage'] = df_final['Duplicate Lines'] / df_final['Total Lines']
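To sanity-check the repetitiveness math with made-up numbers: 100 total lines of which 70 are unique leaves 30 duplicates, i.e. 30% repetitiveness:

```python
# Sketch only: invented totals, same arithmetic as the columns above.
total_lines, unique_lines = 100, 70
duplicate_lines = total_lines - unique_lines
repetitiveness = duplicate_lines / total_lines
# repetitiveness == 0.3
```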
#################################
Alteryx.write(df_final,1)
#################################
Cheers!
NJ
A head scratcher at first... but after reading the comments it started to make sense. I also added a Python version (probably not efficient, though) - that one was tough...
#SnakingMyWayThruChallenges