Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Weekly Challenges

Solve the challenge, share your solution and summit the ranks of our Community!

Also available in | Français | Português | Español | 日本語
IDEAS WANTED

Want to get involved? We're always looking for ideas and content for Weekly Challenges.

SUBMIT YOUR IDEA

Challenge #116: A Symphony of Parsing Tools!

mmontgomery
10 - Fireball

Challenge #116

FrederikE
13 - Pulsar
Spoiler

Didn't get every record but close. 

FrederikE_0-1659909725393.png

 

JalenBraegelmann
8 - Asteroid
Spoiler
JalenBraegelmann_0-1661548577075.png

 

Different results from the original, but that seems to be a common theme. I join the resentment in working with XML as the data size snowballs without any way to put in countermeasures.

grazitti_sapna
17 - Castor

grazitti_sapna_0-1662959390608.png

 

Sapna Gupta
acarter881
12 - Quasar

Here's my solution. I used a combination of the Python tool and traditional Alteryx tools. For this challenge, I don't think there's anything within Alteryx better than the Python tool.

The entire challenge can be done within the Python tool. Anyway, the Python tool also allows for (by using pandas and the ElementTree XML API) the easiest understanding of the problem. I believe my output file is significantly more complete than the sample output file.

 

Workflow

Spoiler
my_weekly_challenge_116_solution.png

Code

Spoiler
from ayx import Alteryx
import pandas as pd
import xml.etree.ElementTree as ET

def parse_xml(root: str) -> list:
    programs = root.findall('program')
    
    out_list = list()
    
    # Loop over programs
    for program in programs:
        program_id = program.find('programID').text
        orchestra = program.find('orchestra').text
        season = program.find('season').text

        # Can have multiple concerts
        concerts = program.findall('concertInfo')

        works_info = program.find('worksInfo')

        # Can have multiple works
        for work in works_info.findall('work'):
            # If the work is an intermission, go to the next work
            if work.find('interval') is not None:
                continue
            # E.g., 5733*
            work_id = work.attrib['ID']
            # E.g., One Sweet Morning
            if work.find('movement') is not None:
                movement = work.find('movement').text
            else:
                movement = None
            # E.g., Bernstein,  Leonard
            if work.find('composerName') is not None:
                composer = work.find('composerName').text
            else:
                composer = None
            # E.g., WEST SIDE STORY (WITH FILM)
            if work.find('workTitle') is not None:
                work_title = work.find('workTitle').text
            else:
                work_title = None
            # E.g., Newman, David
            if work.find('conductorName') is not None:
                conductor_name = work.find('conductorName').text
            else:
                conductor_name = None
                
            # Initialize as an empty string -> e.g., Duke, Cherry (Mezzo-Soprano) Gosling, Stephen (Piano) 
            soloists = ''
            
            # Test for soloists
            if work.find('soloists') is not None:
                # Can have multiple soloists
                for soloist in work.find('soloists'):
                    if soloist.find('soloistName').text != None:
                        soloist_name = soloist.find('soloistName').text
                    else:
                        soloist_name = 'NONE'
                    if soloist.find('soloistInstrument').text != None:
                        soloist_instrument = soloist.find('soloistInstrument').text
                    else:
                        soloist_instrument = 'NONE'
                        
                    concat_soloist = soloist_name + " (" + soloist_instrument + ")"
                    soloists += concat_soloist + ' '
                    # Not using soloist roles in the output
                    # soloist_roles = soloist.find('soloistRoles').text

            # Trim the extra space on the right
            soloists = soloists.rstrip()
     
            # For each concert, add the relevant work data
            for concert in concerts:
                concert_location = concert.find('Location').text
                event_type = concert.find('eventType').text
                venue = concert.find('Venue').text
                concert_date = concert.find('Date').text
                concert_time = concert.find('Time').text

                out_list.append(
                (program_id,
                 work_id,
                 orchestra,
                 season,
                 concert_location,
                 event_type,
                 venue,
                 concert_date,
                 concert_time,
                 conductor_name,
                 composer,
                 work_title,
                 movement,
                 soloists
                )
            )
    
    return out_list
        
if __name__ == '__main__':
    df_out = pd.DataFrame(
                        data=parse_xml(root=ET.fromstring(Alteryx.read('#1')['DownloadData'][0])),
                        columns=['Program ID',
                                'Piece ID',
                                'Ensemble',
                                 'Season',
                                 'Location',
                                 'Venue',
                                 'Event Type',
                                 'Performance Date',
                                 'Time',
                                 'Conductor',
                                 'Composer',
                                 'Work',
                                 'Movement',
                                 'Soloists',
                                ]
                        )
    Alteryx.write(df_out, 1)

 

 

JamesCharnley
13 - Pulsar
Spoiler
JamesCharnley_0-1667833753182.png

 

Not perfectly matching output like some others and took me a hell of a lot longer than I expected. But I'm happy with what I learned with this one

martinding
13 - Pulsar
Spoiler
116.png

LOTs of trial and error involved with this one. I still don't quite understand the XML tree structure and why some parts need to be parsed as Roots and some as Child, when they all look like Child...

ahsanaali
11 - Bolide
Spoiler
ahsanaali_1-1671039629850.png

 

joshbennett
11 - Bolide
11 - Bolide

Fun XML parsing!

 

My solution:

Spoiler
solution.png

DanielG
12 - Quasar

Solved...

 

I think.

 

Got more results then the sample output, however my information seems perfectly valid.  There are even ID#s now in the output that dont show up at all in the initial raw download, so I am guessing something changed further upstream than we can access.  Or at least that is my excuse.  :)