Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Weekly Challenges

Solve the challenge, share your solution and summit the ranks of our Community!

Also available in | Français | Português | Español | 日本語
IDEAS WANTED

Want to get involved? We're always looking for ideas and content for Weekly Challenges.

SUBMIT YOUR IDEA

Challenge #205: Taynalysis

SeanAdams
17 - Castor
17 - Castor

Wow - this took much longer than I expected - mainly due to the specifics of how we deal with different types of replacement.

 

Did this in Alteryx; and then in Python - the line counts match but other than that the answers are different than the model solution.

(i.e. my answer is wrong - don't copy my exam)

 

Spoiler
Solution.pngPython1.pngPython2.png

#SnakingMyWayThruChallenges

 

 

Sudha8990AI
8 - Asteroid

Only for album Red and 1989, some words are different. It is Quiet challenging for the beginner. But I loved it.

 

Thanks,

Sudhakar S

RobertOdera
13 - Pulsar
Spoiler
Spoiler AlertSpoiler Alert
JohanRanschaert
8 - Asteroid

in attach my solution

JeremyGonzva
8 - Asteroid

here it is

deepaprash
8 - Asteroid

My numbers don't tie exactly either. Are we counting null lines? Also the answers seem different when you use "Remove unwanted characters - Punctuation" using Data Cleanse vs using the formula as provided in the solution. Anyways, here is my answer.

leehaozhong
8 - Asteroid

Challenge 205 Solution

Spoiler
Challenge 205_LHZ.JPG
NicoleJohnson
ACE Emeritus
ACE Emeritus

House Johnson is a BIG fan of Tay Tay... this was a fun one 🙂

 

Spoiler
... and extra fun since I figured out (after MUCH GOOGLE RESEARCH) how to do it both in Alteryx and Python! #SnakingMyWayThruChallenges 

WeeklyChallenge205.JPG

Here's the Python script I came up with (not elegant by any means, but hey, it works.)

#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
import pandas as pd
import numpy as np

#Package.installPackages(['pandas','numpy'])


#################################
# Input source data, creating a dataframe for lyrics and a list for stopwords.

from ayx import Alteryx
df_lyrics = Alteryx.read("#1")

stopwords = Alteryx.read("#2")['Stop_Words'].values.tolist()

df_lyrics[:3]


#################################
# Split lyrics, replace stopwords with None values, and filter out None values.

list_lyrics = df_lyrics.lyric.str.split().tolist()
list_lyrics = [[j.lower() for j in i] for i in list_lyrics]

# Replace specific characters in split words.

for i, x in enumerate(list_lyrics):
for j, a in enumerate(x):
list_lyrics[i][j] = a.replace(',', '') \
.replace('"','') \
.replace("(","") \
.replace(")","") \
.replace(".","") \
.replace("?","") \
.replace("!","")

# Replace stopwords in the list of lyrics, and convert to Series > DataFrame.

list_replaced = list()

for i in range(len(list_lyrics)):
list_iteration = [x for x in list_lyrics[i] if x not in stopwords]
list_replaced.append(list_iteration)

df_listlyrics = pd.DataFrame(pd.Series(list_replaced))


#################################
# Join new list of lyrics to original df_lyrics.

df_join = df_lyrics.join(df_listlyrics, how='left') \
.drop(["lyric"], axis = 1) \
.rename(columns = {0:'Lyrics'})

# Join lyrics back to album information, and cleanup/pivot data so that each lyric word is in a row.

df_join = df_join.Lyrics.apply(pd.Series) \
.merge(df_join, right_index = True, left_index = True) \
.drop(['Lyrics','artist'], axis = 1) \
.melt(id_vars = ['album','track_title','track_n','line','year'], value_name = 'lyric') \
.drop(['variable'], axis = 1)

# Assign value of 1 to each lyric for counting purposes.

df_join['value'] = 1

# Perform group by and aggregation functions to derive the list of top ten words.

df_count = df_join.drop(['track_title','track_n','line'], axis = 1) \
.groupby(['year','album','lyric'], as_index = False).sum() \

df_count.sort_values(by=['year','album','value'], inplace = True, ascending = False)

df_topten = df_count.groupby(['album'], as_index = True).head(10) \
.drop(['value'], axis = 1) \
.groupby(['year','album'], as_index = False).agg({'lyric': ' '.join})


#################################
# Count total lines in album.

lyrics_count = df_lyrics.groupby(['album'])['line'].count()


#################################
# Perform group by and aggregation functions to derive the count of unique lines.

unique_lyrics = pd.DataFrame(pd.Series(list_lyrics)) \
.rename(columns = {0:'lines'}) \
.join(df_lyrics['album'])

unique_lyrics['liststring'] = [' '.join(map(str,l)) for l in unique_lyrics['lines']]

unique_lyrics = unique_lyrics.drop(['lines'], axis = 1) \
.drop_duplicates() \
.groupby(['album'])['liststring'].count()


#################################
# Join different results together.

df_final = df_topten.merge(lyrics_count, on = 'album') \
.merge(unique_lyrics, on = 'album') \
.rename(columns = {'year':'Year', 'album':'Album', 'lyric':'Top 10 Lyrics', 'line':'Total Lines', 'liststring':'Unique Lines'})

df_final['Duplicate Lines'] = df_final['Total Lines'] - df_final['Unique Lines']
df_final['Repetitiveness Percentage'] = df_final['Duplicate Lines'] / df_final['Total Lines']


#################################
Alteryx.write(df_final,1)


#################################

Cheers!

NJ

balajilolla
8 - Asteroid
Spoiler
Solution Attached

balajilolla_0-1588876501045.png

 

jarrod
ACE Emeritus
ACE Emeritus

head scratcher at first.... but after reading the comments it started to make sense. Also added a python version (probably not efficient though) - that one was tough...

Spoiler
jarrod_0-1588946737271.png

 

#SnakingMyWayThruChallenges