Slight differences in my answer to the provided example
Hi all,
This is my very first weekly challenge response 🙂
I am excited to share the news that my paper "Workday Data Migration: How we saved over 2000 hours of manual effort" was chosen for the Excellence Award!
For this weekly challenge, I used the summarise function with count and count distinct on the lyric field to get the total and distinct number of lines per album. From those two numbers I derived the duplicate records. The results matched the expected output for some records but were off by 1 for a few. Attached is my workflow.
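For anyone who wants to see the count / count distinct idea outside of a workflow, here is a minimal Python sketch. The rows are made up for illustration (they are not the challenge data), and `line_counts` is a hypothetical helper name, not part of any tool.

```python
from collections import defaultdict

# Made-up (album, lyric line) rows, purely for illustration.
rows = [
    ("Album A", "first line"),
    ("Album A", "chorus line"),
    ("Album A", "chorus line"),
    ("Album B", "another line"),
    ("Album B", "another line"),
    ("Album B", "Another line"),
]

def line_counts(rows):
    """Return total, unique and duplicate line counts per album."""
    lines_by_album = defaultdict(list)
    for album, lyric in rows:
        lines_by_album[album].append(lyric)

    result = {}
    for album, lines in lines_by_album.items():
        total = len(lines)        # count
        unique = len(set(lines))  # count distinct (exact, case-sensitive match)
        result[album] = {
            "total": total,
            "unique": unique,
            "duplicates": total - unique,
        }
    return result

counts = line_counts(rows)
for album, stats in counts.items():
    print(album, stats)
```

Note that "another line" and "Another line" count as two distinct lines here. Whether lines are compared case-sensitively, or with surrounding whitespace trimmed, changes the distinct count. That is one possible, unconfirmed source of small differences between answers.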
Regards
Sambit
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
#################################
from ayx import Alteryx
from collections import Counter
import re
dfExpected = Alteryx.read("#Output")
dfLyrics = Alteryx.read("#Lyrics")
dfStopwords = Alteryx.read("#Stopwords")
#################################
# Create a simple list of stopwords
stopwords = [w[0] for w in dfStopwords.values.tolist()]
#################################
dfTop10 = dfLyrics.groupby(['year','album'])['lyric'].apply(" ".join).reset_index()
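def get_top_10_words(word_list):
    """Return the 10 most common non-stopwords in a lyric string, space-separated."""
    # Drop punctuation, keeping letters, digits, whitespace and apostrophes
    word_list = re.sub(r"[^a-zA-Z0-9\s\']", r'', word_list)
    list_ = word_list.split()
    # Stopword matching is case-insensitive; counting itself is case-sensitive
    not_stop = [word for word in list_ if word.lower() not in stopwords]
    counter = Counter(not_stop)
    return " ".join([word for (word, count) in counter.most_common(10)])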
dfTop10['lyric'] = dfTop10['lyric'].apply(get_top_10_words)
#################################
dfLines = dfLyrics.groupby(['year','album'])['lyric'].agg(['nunique','count']).reset_index()
dfLines['dups'] = dfLines['count'] - dfLines['nunique']
dfLines['percent'] = dfLines['dups'] * 100 / dfLines['count']
dfLines = dfLines[['year', 'album', 'nunique', 'dups', 'count', 'percent']]
#################################
dfOutput = dfTop10.merge(dfLines).rename(columns=
    {"year": "Album Year",
     "album": "Album Name",
     "lyric": "Top_10_Lyrics",
     "nunique": "Unique_Lines_Per_Album",
     "dups": "Duplicate_Lines_Per_Album",
     "count": "Total_Lines_Per_Album",
     "percent": "Repetativeness_Percentage"
    }
)
#################################
Alteryx.write(dfOutput, 1)
I wanted to practice working with data frames on this one, so I used the Python tool. As others have mentioned, my counts are very close to the expected output but do not match it exactly.
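To see exactly where the counts diverge, the result can be lined up against the expected output album by album. This is only a sketch: the numbers are made up, `count_differences` is a hypothetical helper, and it assumes both frames share the same numeric columns and an "Album Name" key.

```python
import pandas as pd

# Made-up numbers purely for illustration; not the challenge's real data.
actual = pd.DataFrame({
    "Album Name": ["Album A", "Album B"],
    "Total_Lines_Per_Album": [100, 250],
    "Unique_Lines_Per_Album": [80, 199],
})
expected = pd.DataFrame({
    "Album Name": ["Album A", "Album B"],
    "Total_Lines_Per_Album": [100, 250],
    "Unique_Lines_Per_Album": [80, 200],
})

def count_differences(actual, expected, key="Album Name"):
    """Return one row per (album, column) where the two frames disagree."""
    # Shared non-key columns get the _actual / _expected suffixes
    merged = actual.merge(expected, on=key, suffixes=("_actual", "_expected"))
    rows = []
    for col in actual.columns:
        if col == key:
            continue
        diff = merged[f"{col}_actual"] - merged[f"{col}_expected"]
        for album, delta in zip(merged[key], diff):
            if delta != 0:
                rows.append({key: album, "column": col, "difference": int(delta)})
    return pd.DataFrame(rows, columns=[key, "column", "difference"])

diffs = count_differences(actual, expected)
print(diffs)
```

In the Python tool the same function could, in principle, be pointed at `dfOutput` and `dfExpected`, provided the column names and types line up.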
On a whim I also tried training an RNN (not attached) to generate new T-Swift songs, but after 30 epochs it was over-fitting. Reducing the number of epochs produced incoherent lyrics. At approximately 33k words, there wasn't enough data to satisfy the network. We'll have to wait for more Taylor albums! 🙂