Weekly Challenges

mmontgomery · ‎05-31-2022

Challenge #116

FrederikE · ‎08-07-2022

Spoiler

Didn't get every record but close.

JalenBraegelmann · ‎08-26-2022

Spoiler

Different results from the original, but that seems to be a common theme. I join the resentment in working with XML as the data size snowballs without any way to put in countermeasures.

grazitti_sapna · ‎09-11-2022

Sapna Gupta

acarter881 · ‎10-14-2022

Here's my solution. I used a combination of the Python tool and traditional Alteryx tools. For this challenge, I don't think there's anything within Alteryx better than the Python tool.

The entire challenge can be done within the Python tool. Anyway, the Python tool also allows for (by using pandas and the ElementTree XML API) the easiest understanding of the problem. I believe my output file is significantly more complete than the sample output file.

Workflow

Spoiler

Code

Spoiler

from ayx import Alteryx
import pandas as pd
import xml.etree.ElementTree as ET

def parse_xml(root: str) -> list:
    programs = root.findall('program')
    
    out_list = list()
    
    # Loop over programs
    for program in programs:
        program_id = program.find('programID').text
        orchestra = program.find('orchestra').text
        season = program.find('season').text

        # Can have multiple concerts
        concerts = program.findall('concertInfo')

        works_info = program.find('worksInfo')

        # Can have multiple works
        for work in works_info.findall('work'):
            # If the work is an intermission, go to the next work
            if work.find('interval') is not None:
                continue
            # E.g., 5733*
            work_id = work.attrib['ID']
            # E.g., One Sweet Morning
            if work.find('movement') is not None:
                movement = work.find('movement').text
            else:
                movement = None
            # E.g., Bernstein,  Leonard
            if work.find('composerName') is not None:
                composer = work.find('composerName').text
            else:
                composer = None
            # E.g., WEST SIDE STORY (WITH FILM)
            if work.find('workTitle') is not None:
                work_title = work.find('workTitle').text
            else:
                work_title = None
            # E.g., Newman, David
            if work.find('conductorName') is not None:
                conductor_name = work.find('conductorName').text
            else:
                conductor_name = None
                
            # Initialize as an empty string -> e.g., Duke, Cherry (Mezzo-Soprano) Gosling, Stephen (Piano) 
            soloists = ''
            
            # Test for soloists
            if work.find('soloists') is not None:
                # Can have multiple soloists
                for soloist in work.find('soloists'):
                    if soloist.find('soloistName').text != None:
                        soloist_name = soloist.find('soloistName').text
                    else:
                        soloist_name = 'NONE'
                    if soloist.find('soloistInstrument').text != None:
                        soloist_instrument = soloist.find('soloistInstrument').text
                    else:
                        soloist_instrument = 'NONE'
                        
                    concat_soloist = soloist_name + " (" + soloist_instrument + ")"
                    soloists += concat_soloist + ' '
                    # Not using soloist roles in the output
                    # soloist_roles = soloist.find('soloistRoles').text

            # Trim the extra space on the right
            soloists = soloists.rstrip()
     
            # For each concert, add the relevant work data
            for concert in concerts:
                concert_location = concert.find('Location').text
                event_type = concert.find('eventType').text
                venue = concert.find('Venue').text
                concert_date = concert.find('Date').text
                concert_time = concert.find('Time').text

                out_list.append(
                (program_id,
                 work_id,
                 orchestra,
                 season,
                 concert_location,
                 event_type,
                 venue,
                 concert_date,
                 concert_time,
                 conductor_name,
                 composer,
                 work_title,
                 movement,
                 soloists
                )
            )
    
    return out_list
        
if __name__ == '__main__':
    df_out = pd.DataFrame(
                        data=parse_xml(root=ET.fromstring(Alteryx.read('#1')['DownloadData'][0])),
                        columns=['Program ID',
                                'Piece ID',
                                'Ensemble',
                                 'Season',
                                 'Location',
                                 'Venue',
                                 'Event Type',
                                 'Performance Date',
                                 'Time',
                                 'Conductor',
                                 'Composer',
                                 'Work',
                                 'Movement',
                                 'Soloists',
                                ]
                        )
    Alteryx.write(df_out, 1)

from ayx import Alteryx import pandas as pd import xml.etree.ElementTree as ET def parse_xml(root: str) -> list: programs = root.findall('program') out_list = list() # Loop over programs for program in programs: program_id = program.find('programID').text orchestra = program.find('orchestra').text season = program.find('season').text # Can have multiple concerts concerts = program.findall('concertInfo') works_info = program.find('worksInfo') # Can have multiple works for work in works_info.findall('work'): # If the work is an intermission, go to the next work if work.find('interval') is not None: continue # E.g., 5733* work_id = work.attrib['ID'] # E.g., One Sweet Morning if work.find('movement') is not None: movement = work.find('movement').text else: movement = None # E.g., Bernstein, Leonard if work.find('composerName') is not None: composer = work.find('composerName').text else: composer = None # E.g., WEST SIDE STORY (WITH FILM) if work.find('workTitle') is not None: work_title = work.find('workTitle').text else: work_title = None # E.g., Newman, David if work.find('conductorName') is not None: conductor_name = work.find('conductorName').text else: conductor_name = None # Initialize as an empty string -> e.g., Duke, Cherry (Mezzo-Soprano) Gosling, Stephen (Piano) soloists = '' # Test for soloists if work.find('soloists') is not None: # Can have multiple soloists for soloist in work.find('soloists'): if soloist.find('soloistName').text != None: soloist_name = soloist.find('soloistName').text else: soloist_name = 'NONE' if soloist.find('soloistInstrument').text != None: soloist_instrument = soloist.find('soloistInstrument').text else: soloist_instrument = 'NONE' concat_soloist = soloist_name + " (" + soloist_instrument + ")" soloists += concat_soloist + ' ' # Not using soloist roles in the output # soloist_roles = soloist.find('soloistRoles').text # Trim the extra space on the right soloists = soloists.rstrip() # For each concert, add the relevant work data for concert in concerts: concert_location = concert.find('Location').text event_type = concert.find('eventType').text venue = concert.find('Venue').text concert_date = concert.find('Date').text concert_time = concert.find('Time').text out_list.append( (program_id, work_id, orchestra, season, concert_location, event_type, venue, concert_date, concert_time, conductor_name, composer, work_title, movement, soloists ) ) return out_list if __name__ == '__main__': df_out = pd.DataFrame( data=parse_xml(root=ET.fromstring(Alteryx.read('#1')['DownloadData'][0])), columns=['Program ID', 'Piece ID', 'Ensemble', 'Season', 'Location', 'Venue', 'Event Type', 'Performance Date', 'Time', 'Conductor', 'Composer', 'Work', 'Movement', 'Soloists', ] ) Alteryx.write(df_out, 1)

JamesCharnley · ‎11-07-2022

Spoiler

Not perfectly matching output like some others and took me a hell of a lot longer than I expected. But I'm happy with what I learned with this one

martinding · ‎11-16-2022

Spoiler

LOTs of trial and error involved with this one. I still don't quite understand the XML tree structure and why some parts need to be parsed as Roots and some as Child, when they all look like Child...

ahsanaali · ‎12-14-2022

Spoiler

joshbennett · ‎01-02-2023

Fun XML parsing!

My solution:

Spoiler

DanielG · ‎01-19-2023

Solved...

I think.

Got more results then the sample output, however my information seems perfectly valid. There are even ID#s now in the output that dont show up at all in the initial raw download, so I am guessing something changed further upstream than we can access. Or at least that is my excuse. :)

Weekly Challenges

IDEAS WANTED

Challenge #116: A Symphony of Parsing Tools!