Challenge #205: Taynalysis

Alteryx Alumni (Retired)

Getting some slight differences. I checked some of the other solutions and they seem to be seeing the same discrepancies.

Slight differences in my answer to the provided example

Chris
Here is my solution. I couldn't figure out how the expected numbers were calculated, so some of mine are off.

Some slight differences in expected vs. actual
Hi all,

This is my very first weekly challenge response 🙂

I am excited to share the news that my paper "Workday Data Migration: How we saved over 2000 hours of manual effort" was chosen for the Excellence Award!

For this weekly challenge, I used the Summarize tool with Count and Count Distinct on the lyric field to get the total and distinct line counts per album. From those counts I derived the number of duplicate lines. The results matched the expected output for some records but were off by 1 for a few. My workflow is attached.

Regards

Sambit
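The counting approach described above (total lines, distinct lines, and their difference per album) can be sketched in plain Python. The rows below are invented toy data, not the challenge dataset, and the `normalize` option is a hypothetical illustration: distinct counting is an exact string match, so choices such as trimming whitespace or ignoring case change the duplicate count. Whether that accounts for the off-by-one differences reported in this thread is not confirmed.

```python
from collections import defaultdict

# Toy rows of (album, lyric line), invented for illustration only.
rows = [
    ("A", "Hello there"),
    ("A", "hello there"),
    ("A", "Hello there"),
    ("A", "Goodbye"),
    ("B", "One line"),
    ("B", "One line "),  # trailing space makes this a different string
]

def line_stats(rows, normalize=False):
    """Return {album: (total, distinct, duplicates)}."""
    lines_by_album = defaultdict(list)
    for album, line in rows:
        if normalize:
            # Hypothetical normalization: trim whitespace and ignore case.
            line = line.strip().lower()
        lines_by_album[album].append(line)

    stats = {}
    for album, lines in lines_by_album.items():
        total = len(lines)
        distinct = len(set(lines))
        stats[album] = (total, distinct, total - distinct)
    return stats

exact = line_stats(rows)
normalized = line_stats(rows, normalize=True)
print(exact)       # exact-match counting
print(normalized)  # counting after trimming and lowercasing
```

With exact matching, album A has 1 duplicate and album B has none; after normalization they have 2 and 1, so a small preprocessing difference is enough to shift the counts.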

```python
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])

#################################
from ayx import Alteryx
from collections import Counter

import re

#################################
# Create a simple list of stopwords
# (dfStopwords and dfLyrics are not defined in this excerpt; they are
# presumably read from the tool's input connections in an earlier cell)
stopwords = [w[0] for w in dfStopwords.values.tolist()]

#################################
# Concatenate all lyric lines of each album into one string
dfTop10 = dfLyrics.groupby(['year','album'])['lyric'].apply(" ".join).reset_index()

def get_top_10_words(word_list):
    # Keep only letters, digits, whitespace and apostrophes
    word_list = re.sub(r"[^a-zA-Z0-9\s\']", r'', word_list)
    list_ = word_list.split()
    # Stopwords are matched case-insensitively; counting itself is case-sensitive
    not_stop = [word for word in list_ if word.lower() not in stopwords]
    counter = Counter(not_stop)
    return " ".join([word for (word, count) in counter.most_common(10)])

dfTop10['lyric'] = dfTop10['lyric'].apply(get_top_10_words)

#################################
# Distinct and total line counts per album
dfLines = dfLyrics.groupby(['year','album'])['lyric'].agg(['nunique','count']).reset_index()

# Duplicates = total lines minus distinct lines
dfLines['dups'] = dfLines['count'] - dfLines['nunique']
dfLines['percent'] = dfLines['dups'] * 100 / dfLines['count']

dfLines = dfLines[['year', 'album', 'nunique', 'dups', 'count', 'percent']]

#################################
# Join the two results on their shared keys and rename for output
dfOutput = dfTop10.merge(dfLines, on=['year', 'album']).rename(columns={
    "year": "Album Year",
    "album": "Album Name",
    "lyric": "Top_10_Lyrics",
    "nunique": "Unique_Lines_Per_Album",
    "dups": "Duplicate_Lines_Per_Album",
    "count": "Total_Lines_Per_Album",
    "percent": "Repetativeness_Percentage"
})

#################################
Alteryx.write(dfOutput, 1)
```

I wanted to practice working with data frames on this one, so I used the Python tool. As others have mentioned, my counts are very close to the expected output but not identical.
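For anyone who wants to try the word-counting step outside the Python tool, here is a minimal standalone sketch of the same idea. The stopword set, the sample text, and the `top_words` name are hypothetical stand-ins invented for this example; the challenge's real stopword input is not reproduced here.

```python
import re
from collections import Counter

# Hypothetical stopword set and sample text, invented for this sketch.
stopwords = {"the", "a", "and", "i", "you"}
text = "The river, the river and the rain; I hear the rain and the river"

def top_words(text, n=10):
    # Drop everything except letters, digits, whitespace and apostrophes.
    cleaned = re.sub(r"[^a-zA-Z0-9\s']", "", text)
    # Stopwords are compared in lowercase; the kept words retain their case.
    words = [w for w in cleaned.split() if w.lower() not in stopwords]
    return [word for word, _ in Counter(words).most_common(n)]

top_two = top_words(text, n=2)
print(top_two)
```

Because the kept words retain their original case, "Rain" and "rain" would be counted as two different words; lowercasing before counting is one design choice that would change the top-10 list.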

On a whim I also tried training an RNN (not attached) to generate new T-Swift songs, but after 30 epochs it was overfitting, and reducing the number of epochs produced incoherent lyrics. At approximately 33k words, there wasn't enough data to satisfy the network. We'll have to wait for more Taylor albums! 🙂

Thanks for the challenge! Here's my solution.