Weekly Challenges

SeanAdams · ‎04-10-2020

Wow - this took much longer than I expected - mainly due to the specifics of how we deal with different types of replacement.

Did this in Alteryx; and then in Python - the line counts match but other than that the answers are different than the model solution.

(i.e. my answer is wrong - don't copy my exam)

Spoiler

#SnakingMyWayThruChallenges

Sudha8990AI · ‎04-10-2020

Only for album Red and 1989, some words are different. It is Quiet challenging for the beginner. But I loved it.

Thanks,

Sudhakar S

RobertOdera · ‎04-19-2020

Spoiler

Spoiler Alert

JohanRanschaert · ‎04-22-2020

in attach my solution

JeremyGonzva · ‎04-24-2020

here it is

deepaprash · ‎04-27-2020

My numbers don't tie exactly either. Are we counting null lines? Also the answers seem different when you use "Remove unwanted characters - Punctuation" using Data Cleanse vs using the formula as provided in the solution. Anyways, here is my answer.

leehaozhong · ‎05-01-2020

Challenge 205 Solution

Spoiler

NicoleJohnson · ‎05-06-2020

House Johnson is a BIG fan of Tay Tay... this was a fun one 🙂

Spoiler

... and extra fun since I figured out (after MUCH GOOGLE RESEARCH) how to do it both in Alteryx and Python! #SnakingMyWayThruChallenges

Here's the Python script I came up with (not elegant by any means, but hey, it works.)

#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
import pandas as pd
import numpy as np

#Package.installPackages(['pandas','numpy'])

#################################
# Input source data, creating a dataframe for lyrics and a list for stopwords.

from ayx import Alteryx
df_lyrics = Alteryx.read("#1")

stopwords = Alteryx.read("#2")['Stop_Words'].values.tolist()

df_lyrics[:3]

#################################
# Split lyrics, replace stopwords with None values, and filter out None values.

list_lyrics = df_lyrics.lyric.str.split().tolist()
list_lyrics = [[j.lower() for j in i] for i in list_lyrics]

# Replace specific characters in split words.

for i, x in enumerate(list_lyrics):
for j, a in enumerate(x):
list_lyrics[i][j] = a.replace(',', '') \
.replace('"','') \
.replace("(","") \
.replace(")","") \
.replace(".","") \
.replace("?","") \
.replace("!","")

# Replace stopwords in the list of lyrics, and convert to Series > DataFrame.

list_replaced = list()

for i in range(len(list_lyrics)):
list_iteration = [x for x in list_lyrics[i] if x not in stopwords]
list_replaced.append(list_iteration)

df_listlyrics = pd.DataFrame(pd.Series(list_replaced))

#################################
# Join new list of lyrics to original df_lyrics.

df_join = df_lyrics.join(df_listlyrics, how='left') \
.drop(["lyric"], axis = 1) \
.rename(columns = {0:'Lyrics'})

# Join lyrics back to album information, and cleanup/pivot data so that each lyric word is in a row.

df_join = df_join.Lyrics.apply(pd.Series) \
.merge(df_join, right_index = True, left_index = True) \
.drop(['Lyrics','artist'], axis = 1) \
.melt(id_vars = ['album','track_title','track_n','line','year'], value_name = 'lyric') \
.drop(['variable'], axis = 1)

# Assign value of 1 to each lyric for counting purposes.

df_join['value'] = 1

# Perform group by and aggregation functions to derive the list of top ten words.

df_count = df_join.drop(['track_title','track_n','line'], axis = 1) \
.groupby(['year','album','lyric'], as_index = False).sum() \

df_count.sort_values(by=['year','album','value'], inplace = True, ascending = False)

df_topten = df_count.groupby(['album'], as_index = True).head(10) \
.drop(['value'], axis = 1) \
.groupby(['year','album'], as_index = False).agg({'lyric': ' '.join})

#################################
# Count total lines in album.

lyrics_count = df_lyrics.groupby(['album'])['line'].count()

#################################
# Perform group by and aggregation functions to derive the count of unique lines.

unique_lyrics = pd.DataFrame(pd.Series(list_lyrics)) \
.rename(columns = {0:'lines'}) \
.join(df_lyrics['album'])

unique_lyrics['liststring'] = [' '.join(map(str,l)) for l in unique_lyrics['lines']]

unique_lyrics = unique_lyrics.drop(['lines'], axis = 1) \
.drop_duplicates() \
.groupby(['album'])['liststring'].count()

#################################
# Join different results together.

df_final = df_topten.merge(lyrics_count, on = 'album') \
.merge(unique_lyrics, on = 'album') \
.rename(columns = {'year':'Year', 'album':'Album', 'lyric':'Top 10 Lyrics', 'line':'Total Lines', 'liststring':'Unique Lines'})

df_final['Duplicate Lines'] = df_final['Total Lines'] - df_final['Unique Lines']
df_final['Repetitiveness Percentage'] = df_final['Duplicate Lines'] / df_final['Total Lines']

#################################
Alteryx.write(df_final,1)

#################################

... and extra fun since I figured out (after MUCH GOOGLE RESEARCH) how to do it both in Alteryx and Python! #SnakingMyWayThruChallenges Here's the Python script I came up with (not elegant by any means, but hey, it works.)################################## List all non-standard packages to be imported by your# script here (only missing packages will be installed)from ayx import Packageimport pandas as pdimport numpy as np#Package.installPackages(['pandas','numpy'])################################## Input source data, creating a dataframe for lyrics and a list for stopwords.from ayx import Alteryxdf_lyrics = Alteryx.read("#1")stopwords = Alteryx.read("#2")['Stop_Words'].values.tolist()df_lyrics[:3]################################## Split lyrics, replace stopwords with None values, and filter out None values.list_lyrics = df_lyrics.lyric.str.split().tolist()list_lyrics = [[j.lower() for j in i] for i in list_lyrics]# Replace specific characters in split words.for i, x in enumerate(list_lyrics):for j, a in enumerate(x):list_lyrics[i][j] = a.replace(',', '') \.replace('"','') \.replace("(","") \.replace(")","") \.replace(".","") \.replace("?","") \.replace("!","")# Replace stopwords in the list of lyrics, and convert to Series > DataFrame.list_replaced = list()for i in range(len(list_lyrics)):list_iteration = [x for x in list_lyrics[i] if x not in stopwords]list_replaced.append(list_iteration)df_listlyrics = pd.DataFrame(pd.Series(list_replaced))################################## Join new list of lyrics to original df_lyrics.df_join = df_lyrics.join(df_listlyrics, how='left') \.drop(["lyric"], axis = 1) \.rename(columns = {0:'Lyrics'})# Join lyrics back to album information, and cleanup/pivot data so that each lyric word is in a row.df_join = df_join.Lyrics.apply(pd.Series) \.merge(df_join, right_index = True, left_index = True) \.drop(['Lyrics','artist'], axis = 1) \.melt(id_vars = ['album','track_title','track_n','line','year'], value_name = 'lyric') \.drop(['variable'], axis = 1)# Assign value of 1 to each lyric for counting purposes.df_join['value'] = 1# Perform group by and aggregation functions to derive the list of top ten words.df_count = df_join.drop(['track_title','track_n','line'], axis = 1) \.groupby(['year','album','lyric'], as_index = False).sum() \df_count.sort_values(by=['year','album','value'], inplace = True, ascending = False)df_topten = df_count.groupby(['album'], as_index = True).head(10) \.drop(['value'], axis = 1) \.groupby(['year','album'], as_index = False).agg({'lyric': ' '.join})################################## Count total lines in album.lyrics_count = df_lyrics.groupby(['album'])['line'].count()################################## Perform group by and aggregation functions to derive the count of unique lines.unique_lyrics = pd.DataFrame(pd.Series(list_lyrics)) \.rename(columns = {0:'lines'}) \.join(df_lyrics['album'])unique_lyrics['liststring'] = [' '.join(map(str,l)) for l in unique_lyrics['lines']]unique_lyrics = unique_lyrics.drop(['lines'], axis = 1) \.drop_duplicates() \.groupby(['album'])['liststring'].count()################################## Join different results together.df_final = df_topten.merge(lyrics_count, on = 'album') \.merge(unique_lyrics, on = 'album') \.rename(columns = {'year':'Year', 'album':'Album', 'lyric':'Top 10 Lyrics', 'line':'Total Lines', 'liststring':'Unique Lines'})df_final['Duplicate Lines'] = df_final['Total Lines'] - df_final['Unique Lines']df_final['Repetitiveness Percentage'] = df_final['Duplicate Lines'] / df_final['Total Lines']#################################Alteryx.write(df_final,1)#################################

Cheers!

NJ

balajilolla · ‎05-07-2020

Spoiler

Solution Attached

jarrod · ‎05-08-2020

head scratcher at first.... but after reading the comments it started to make sense. Also added a python version (probably not efficient though) - that one was tough...

Spoiler

#SnakingMyWayThruChallenges

Weekly Challenges

IDEAS WANTED

Challenge #205: Taynalysis