Want to get involved? We're always looking for ideas and content for Weekly Challenges.
SUBMIT YOUR IDEADidn't get every record but close.
Different results from the original, but that seems to be a common theme. I join the resentment in working with XML as the data size snowballs without any way to put in countermeasures.
Here's my solution. I used a combination of the Python tool and traditional Alteryx tools. For this challenge, I don't think there's anything within Alteryx better than the Python tool.
The entire challenge can be done within the Python tool. Anyway, the Python tool also allows for (by using pandas and the ElementTree XML API) the easiest understanding of the problem. I believe my output file is significantly more complete than the sample output file.
Workflow
Code
from ayx import Alteryx
import pandas as pd
import xml.etree.ElementTree as ET
def parse_xml(root: str) -> list:
programs = root.findall('program')
out_list = list()
# Loop over programs
for program in programs:
program_id = program.find('programID').text
orchestra = program.find('orchestra').text
season = program.find('season').text
# Can have multiple concerts
concerts = program.findall('concertInfo')
works_info = program.find('worksInfo')
# Can have multiple works
for work in works_info.findall('work'):
# If the work is an intermission, go to the next work
if work.find('interval') is not None:
continue
# E.g., 5733*
work_id = work.attrib['ID']
# E.g., One Sweet Morning
if work.find('movement') is not None:
movement = work.find('movement').text
else:
movement = None
# E.g., Bernstein, Leonard
if work.find('composerName') is not None:
composer = work.find('composerName').text
else:
composer = None
# E.g., WEST SIDE STORY (WITH FILM)
if work.find('workTitle') is not None:
work_title = work.find('workTitle').text
else:
work_title = None
# E.g., Newman, David
if work.find('conductorName') is not None:
conductor_name = work.find('conductorName').text
else:
conductor_name = None
# Initialize as an empty string -> e.g., Duke, Cherry (Mezzo-Soprano) Gosling, Stephen (Piano)
soloists = ''
# Test for soloists
if work.find('soloists') is not None:
# Can have multiple soloists
for soloist in work.find('soloists'):
if soloist.find('soloistName').text != None:
soloist_name = soloist.find('soloistName').text
else:
soloist_name = 'NONE'
if soloist.find('soloistInstrument').text != None:
soloist_instrument = soloist.find('soloistInstrument').text
else:
soloist_instrument = 'NONE'
concat_soloist = soloist_name + " (" + soloist_instrument + ")"
soloists += concat_soloist + ' '
# Not using soloist roles in the output
# soloist_roles = soloist.find('soloistRoles').text
# Trim the extra space on the right
soloists = soloists.rstrip()
# For each concert, add the relevant work data
for concert in concerts:
concert_location = concert.find('Location').text
event_type = concert.find('eventType').text
venue = concert.find('Venue').text
concert_date = concert.find('Date').text
concert_time = concert.find('Time').text
out_list.append(
(program_id,
work_id,
orchestra,
season,
concert_location,
event_type,
venue,
concert_date,
concert_time,
conductor_name,
composer,
work_title,
movement,
soloists
)
)
return out_list
if __name__ == '__main__':
df_out = pd.DataFrame(
data=parse_xml(root=ET.fromstring(Alteryx.read('#1')['DownloadData'][0])),
columns=['Program ID',
'Piece ID',
'Ensemble',
'Season',
'Location',
'Venue',
'Event Type',
'Performance Date',
'Time',
'Conductor',
'Composer',
'Work',
'Movement',
'Soloists',
]
)
Alteryx.write(df_out, 1)
Not perfectly matching output like some others and took me a hell of a lot longer than I expected. But I'm happy with what I learned with this one
LOTs of trial and error involved with this one. I still don't quite understand the XML tree structure and why some parts need to be parsed as Roots and some as Child, when they all look like Child...
Solved...
I think.
Got more results then the sample output, however my information seems perfectly valid. There are even ID#s now in the output that dont show up at all in the initial raw download, so I am guessing something changed further upstream than we can access. Or at least that is my excuse. :)