Weekly Challenges

nmacpherson · ‎03-02-2020

patrick_digan · ‎03-02-2020

TonyA · ‎03-02-2020

Getting some slight differences. I checked some of the other solutions and they seem to be seeing the same discrepancies.

Kenda · ‎03-02-2020

Spoiler

cgoodman3 · ‎03-02-2020

My answer differs slightly from the provided example.

Spoiler

Chris
Check out my collaboration with fellow ACE Joshua Burkhow at AlterTricks.com

aanandkumar · ‎03-02-2020

Here is my solution. I couldn't figure out how the expected numbers were calculated, so some of mine are off.

mbogusz · ‎03-02-2020

Spoiler

Some slight differences in expected vs. actual

sambitd · ‎03-02-2020

Hi all,

This is my very first weekly challenge response 🙂

I am excited to share that my paper "Workday Data Migration: How we saved over 2000 hours of manual effort" was chosen for the Excellence Award!

For this weekly challenge, I used the summarize function with count and count distinct on the lyric field to get the total and distinct line counts per album. From those counts I derived the duplicate lines. The results matched the expected output for some records but were off by one for a few. My workflow is attached.
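The count and count distinct approach described above can be sketched in plain Python. This is only an illustration, not the attached workflow: the sample rows and the `line_stats` helper are hypothetical, and it assumes duplicates are defined as total lines minus distinct lines.

```python
from collections import defaultdict

# Hypothetical sample rows of (album, lyric line); not the challenge data.
rows = [
    ("Album A", "la la la"),
    ("Album A", "la la la"),
    ("Album A", "oh no"),
    ("Album A", "la la la"),
    ("Album B", "hey"),
    ("Album B", "you"),
]

def line_stats(rows):
    """Return {album: (total, distinct, duplicates)} for the given rows."""
    lines_by_album = defaultdict(list)
    for album, line in rows:
        lines_by_album[album].append(line)

    stats = {}
    for album, lines in lines_by_album.items():
        total = len(lines)            # count
        distinct = len(set(lines))    # count distinct
        stats[album] = (total, distinct, total - distinct)
    return stats

stats = line_stats(rows)
for album, (total, distinct, dups) in sorted(stats.items()):
    print(album, total, distinct, dups)
```

Under this definition, a line that appears three times contributes two duplicates. A different definition of "duplicate" would give different counts.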

Regards

Sambit

cam_w · ‎03-02-2020

Spoiler

#################################
# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])


#################################
from ayx import Alteryx
from collections import Counter 

import re

dfExpected = Alteryx.read("#Output")
dfLyrics = Alteryx.read("#Lyrics")
dfStopwords = Alteryx.read("#Stopwords")


#################################
# Create a simple list of stopwords
stopwords = [w[0] for w in dfStopwords.values.tolist()]


#################################
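# Concatenate every lyric line of an album into one string
dfTop10 = dfLyrics.groupby(['year','album'])['lyric'].apply(" ".join).reset_index()

def get_top_10_words(word_list):
    # Strip punctuation, keeping letters, digits, whitespace and apostrophes
    word_list = re.sub(r"[^a-zA-Z0-9\s\']", r'', word_list)
    list_ = word_list.split()
    # Drop stopwords (compared in lower case); the counts stay case-sensitive
    not_stop = [word for word in list_ if word.lower() not in stopwords]
    counter = Counter(not_stop)
    # Join the ten most frequent remaining words into one string
    return " ".join([word for (word, count) in counter.most_common(10)])

dfTop10['lyric'] = dfTop10['lyric'].apply(get_top_10_words)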


#################################
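# Per album: distinct lyric lines (nunique) and total lyric lines (count)
dfLines = dfLyrics.groupby(['year','album'])['lyric'].agg(['nunique','count']).reset_index()

# Duplicates are the lines beyond the first occurrence of each distinct line
dfLines['dups'] = dfLines['count'] - dfLines['nunique']
# Duplicate lines as a percentage of all lines in the album
dfLines['percent'] = dfLines['dups'] * 100 / dfLines['count']

dfLines = dfLines[['year', 'album', 'nunique', 'dups', 'count', 'percent']]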


#################################
dfOutput = dfTop10.merge(dfLines).rename(columns=
                                         {"year": "Album Year",
                                          "album": "Album Name",
                                          "lyric": "Top_10_Lyrics",
                                          "nunique": "Unique_Lines_Per_Album",
                                          "dups": "Duplicate_Lines_Per_Album",
                                          "count": "Total_Lines_Per_Album",
                                          "percent": "Repetativeness_Percentage"
                                         }
                                        )


#################################
Alteryx.write(dfOutput, 1)

I wanted to practice my data frames with this one, so I used the Python tool. As others have mentioned, my results are very close to the expected output counts.

On a whim I also tried training an RNN (not attached) to generate new T-Swift songs, but after 30 epochs it was overfitting. Reducing the number of epochs produced incoherent lyrics. At approximately 33k words, there wasn't enough data to satisfy the network. We'll have to wait for more Taylor albums! 🙂
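To see exactly where results differ from the expected output, a cell-by-cell comparison helps. The sketch below uses plain Python so it runs outside the Python tool. The album names, column names, numbers, and the `diff_counts` helper are all hypothetical, made up for illustration rather than taken from the challenge data.

```python
# Hypothetical per-album counts; these numbers are invented for illustration.
expected = {
    "Album A": {"unique": 10, "duplicates": 4},
    "Album B": {"unique": 7, "duplicates": 1},
}
actual = {
    "Album A": {"unique": 10, "duplicates": 5},
    "Album B": {"unique": 7, "duplicates": 1},
}

def diff_counts(expected, actual):
    """Return (album, column, expected, actual) for every mismatching value."""
    mismatches = []
    for album in sorted(expected):
        for column, want in expected[album].items():
            # A missing album or column shows up as None
            got = actual.get(album, {}).get(column)
            if got != want:
                mismatches.append((album, column, want, got))
    return mismatches

mismatches = diff_counts(expected, actual)
for row in mismatches:
    print(row)
```

Listing only the mismatching cells makes it easier to tell whether the differences are consistent, such as always off by one, or scattered.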

chris_ramsay_dup_425 · ‎03-03-2020

Thanks for the challenge! Here's my solution

Challenge #205: Taynalysis