Slight differences in my answer to the provided example
Hi All,
This is my very first weekly challenge response 🙂
I am excited to share the news that my paper "Workday Data Migration: How we saved over 2000 hours of manual effort" was chosen for the Excellence Award!
For this weekly challenge, I used the Summarize tool with Count and Count Distinct on the lyric field to get the total and distinct number of lines per album. From those two counts I derived the number of duplicate lines. The results matched the expected output for some albums but were off by 1 for a few. My workflow is attached.
Regards
Sambit
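One possible reason for off-by-one differences like these (a guess, not something confirmed in this thread) is how lines are compared when counting distinct values: stray whitespace or letter case can make two otherwise identical lines count as different. Below is a minimal pandas sketch with made-up sample rows; the `line_counts` helper and its `normalize` flag are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical sample rows; the real challenge data is not reproduced here.
lyrics = pd.DataFrame({
    "album": ["A", "A", "A", "A", "B", "B"],
    "lyric": ["Shake it off", "shake it off ", "Shake it off", "Stay", "Red", "Red"],
})

def line_counts(df, normalize):
    """Return total, distinct and duplicate line counts per album."""
    lines = df["lyric"]
    if normalize:
        # Assumption: trim whitespace and ignore case before comparing lines.
        lines = lines.str.strip().str.lower()
    out = (
        df.assign(lyric=lines)
        .groupby("album")["lyric"]
        .agg(total="count", distinct="nunique")
        .reset_index()
    )
    # Duplicates are the lines left over once each distinct line is counted once.
    out["duplicates"] = out["total"] - out["distinct"]
    return out

raw = line_counts(lyrics, normalize=False)
clean = line_counts(lyrics, normalize=True)
print(raw)
print(clean)
```

If the two tables differ for an album, the comparison rules are a likely suspect; if they agree, the difference probably comes from somewhere else, such as how blank or null lines are counted.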
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
#################################
from ayx import Alteryx
from collections import Counter
import re
dfExpected = Alteryx.read("#Output")
dfLyrics = Alteryx.read("#Lyrics")
dfStopwords = Alteryx.read("#Stopwords")
#################################
# Create a simple list of stopwords
stopwords = [w[0] for w in dfStopwords.values.tolist()]
#################################
dfTop10 = dfLyrics.groupby(['year','album'])['lyric'].apply(" ".join).reset_index()
def get_top_10_words(word_list):
    word_list = re.sub(r"[^a-zA-Z0-9\s\']", r'', word_list)
    list_ = word_list.split()
    not_stop = [word for word in list_ if word.lower() not in stopwords]
    counter = Counter(not_stop)
    return " ".join([word for (word, count) in counter.most_common(10)])
dfTop10['lyric'] = dfTop10['lyric'].apply(get_top_10_words)
#################################
dfLines = dfLyrics.groupby(['year','album'])['lyric'].agg(['nunique','count']).reset_index()
dfLines['dups'] = dfLines['count'] - dfLines['nunique']
dfLines['percent'] = dfLines['dups'] * 100 / dfLines['count']
dfLines = dfLines[['year', 'album', 'nunique', 'dups', 'count', 'percent']]
#################################
dfOutput = dfTop10.merge(dfLines).rename(columns={
    "year": "Album Year",
    "album": "Album Name",
    "lyric": "Top_10_Lyrics",
    "nunique": "Unique_Lines_Per_Album",
    "dups": "Duplicate_Lines_Per_Album",
    "count": "Total_Lines_Per_Album",
    "percent": "Repetativeness_Percentage",
})
#################################
Alteryx.write(dfOutput, 1)
I wanted to practice working with data frames on this one, so I used the Python tool. As others have mentioned, my results are very close to the expected output counts.
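To see exactly which albums differ and by how much, the result can be joined to the expected output and the counts subtracted. This is only a sketch with made-up numbers: `actual` and `expected` stand in for `dfOutput` and `dfExpected`, it assumes both share the column names used in the rename step (which may not hold for the real expected data), and `count_differences` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical stand-ins for dfOutput and dfExpected (made-up numbers).
actual = pd.DataFrame({
    "Album Name": ["A", "B", "C"],
    "Total_Lines_Per_Album": [100, 80, 60],
    "Unique_Lines_Per_Album": [70, 50, 45],
})
expected = pd.DataFrame({
    "Album Name": ["A", "B", "C"],
    "Total_Lines_Per_Album": [100, 81, 60],
    "Unique_Lines_Per_Album": [70, 50, 44],
})

def count_differences(actual, expected, key, columns):
    """Return the rows where any of the given count columns disagree."""
    merged = actual.merge(expected, on=key, suffixes=("_actual", "_expected"))
    diffs = merged[[key]].copy()
    for col in columns:
        # Positive means our count is higher than the expected one.
        diffs[col + "_diff"] = merged[col + "_actual"] - merged[col + "_expected"]
    diff_cols = [col + "_diff" for col in columns]
    # Keep only albums with at least one non-zero difference.
    return diffs[(diffs[diff_cols] != 0).any(axis=1)].reset_index(drop=True)

mismatches = count_differences(
    actual,
    expected,
    key="Album Name",
    columns=["Total_Lines_Per_Album", "Unique_Lines_Per_Album"],
)
print(mismatches)
```

Inside the Python tool, the same helper could be pointed at the real frames instead of the sample ones, provided the key and count columns exist in both.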
On a whim I also tried training an RNN (not attached) to generate new T-Swift songs, but after 30 epochs it was over-fitting. Reducing the number of epochs produced incoherent lyrics. At approximately 33k words, there wasn't enough data to satisfy the network. We'll have to wait for more Taylor albums! 🙂