
Challenge #40: Parsing a HTML File

GeneR
Alteryx Alumni (Retired)

Happy Monday… oh wait, it's Tuesday already.  Sorry for the delay if you are an international Alteryx community member; yesterday the USA and Canada celebrated Labor Day in honor of working people.

 

Hopefully everyone had fun debugging the macro last week; the link to the solution for that challenge (#39) is HERE.  This week we look at what needs to be done to process raw HTML data after using the Download tool to scrape the web.

  

One of the features of the Alteryx Download tool is that it can pull down the raw HTML code from a web page.  This practice, sometimes referred to as web scraping, is useful when there is embedded data in the page that you want to access from Alteryx.  The challenge is that the raw HTML needs to be parsed to prepare the data for use.

 

Use case:  5280 Magazine in Denver published a list of the best doctors in the Denver metro area, and you need to download that list in database form. (Note: the raw HTML has been provided in the workflow.)

 

Objective:  Parse the HTML into a database format containing fields for the ID, Physician, Address, City and Practice
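
For anyone who wants to prototype the parsing logic outside Alteryx, here is a minimal Python sketch of the idea. The markup in RAW_HTML is hypothetical (the real 5280 Magazine HTML is not reproduced in this thread), so the tag names, the RECORD pattern and the parse_records helper are all assumptions for illustration. Only the five target field names come from the objective above.

```python
import re
from html import unescape

# HYPOTHETICAL markup: the tag layout below is invented for illustration
# and is not the actual 5280 Magazine HTML. The second row is placeholder data.
RAW_HTML = """
<li id="493"><b>Yuko Kitahama-D&#39;Ambrosia</b><span>4500 E. Ninth Ave., Suite 200</span><i>Denver</i><em>Obstetrics and Gynecology</em></li>
<li id="1"><b>Jane Example</b><span>1 Example St.</span><i>Denver</i><em>Example Practice</em></li>
"""

# One named group per output field; each record is assumed to sit on one line.
RECORD = re.compile(
    r'<li id="(?P<ID>\d+)">'
    r'<b>(?P<Physician>.*?)</b>'
    r'<span>(?P<Address>.*?)</span>'
    r'<i>(?P<City>.*?)</i>'
    r'<em>(?P<Practice>.*?)</em></li>'
)


def parse_records(raw_html):
    """Return one dict per record, with HTML entities decoded."""
    rows = []
    for match in RECORD.finditer(raw_html):
        # unescape() turns entities such as &#39; back into plain characters.
        rows.append({name: unescape(value) for name, value in match.groupdict().items()})
    return rows


records = parse_records(RAW_HTML)
for row in records:
    print(row)
```

Decoding entities at parse time is one possible way to avoid stray character codes in names; whether the provided data actually contains entities like this is an assumption.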

 

Good luck! I hope you are having fun with these challenges and expanding your knowledge of Alteryx.  Thanks to all who participate and have provided feedback.

MattD
Alteryx Alumni (Retired)

Here's a solution:

Spoiler
40 solution.PNG
Former Alteryx, Inc. Support Engineer, Community Data Architect, Data Scientist then Data Engineer
brianprestidge
8 - Asteroid

My Solution (I Reg-ex'd the **** out of it!) :-)

 

FYI - ID 649 was wrong in the provided output solution as the Practice was in the City field:

 

PS. Loving These Challenges - Keep Em Coming!

 

Error

 

 

Spoiler
My Solution
Solution

 

brianprestidge
8 - Asteroid

Hello My Alteryx Friends....

 

Is this not the same challenge as Week 40, or am I missing something?

 

Week 40: http://community.alteryx.com/t5/Alteryx-Knowledge-Base/Weekly-Exercise-40-Data-Prep-HTML-Parsing-Dr-...

GeneR
Alteryx Alumni (Retired)

@brianprestidge, you're not missing anything; I must have liked that one so much that I posted it twice.  I will make sure I have something original for next Monday.  Thanks for playing along and keeping us honest!

 

 

brianprestidge
8 - Asteroid

Haha - My pleasure!

 

I agree, it was a good one so why not do it again!! :-)

TaraM
Alteryx Alumni (Retired)

A solution has been posted

Spoiler
2016-10-17 08_44_39-Alteryx Designer x64 - DataPrep_HTMLParsing_DrNames_Solution.yxmd_.png

 

Tara McCoy
Joe_Mako
12 - Quasar

I saw this in another thread; you can find my attached workbook at:
http://community.alteryx.com/t5/Dublin-IRL/Weekly-Exercise-9/gpm-p/36238#M47

 

Here are three points on the differences between your output and what I came up with:

1. You have an issue with a character encoding in your output
My Results:
493 Yuko Kitahama-D'Ambrosia Denver 4500 E. Ninth Ave., Suite 200 Obstetrics and Gynecology
Your Results:
493 Yuko Kitahama-D'Ambrosia 4500 E. Ninth Ave., Suite 200 Denver Obstetrics and Gynecology

 

2. For Jesse Mills, the "(..)" text is in the span tag, and in all others the span tag contains the address, but your output has that text in the city field, and then the Practice in the City.
My Results:
649 Jesse Mills (No longer practicing in the Denver area) Reproductive Endocrinology and Infertility [Null]
Your Results:
649 Jesse Mills [Null] (No longer practicing in the Denver area) Reproductive Endocrinology and Infertility

 

3. 51 physicians have multiple practices, for example, Reginald Bell. Your results only kept the first. I output them as a comma-separated list in the field.
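
To illustrate point 3, here is a rough Python sketch of collapsing several practice rows for one physician into a single comma-separated field. The sample rows, the ID value and the combine_practices helper are hypothetical; the thread does not list Reginald Bell's actual practices, so placeholder names stand in for them.

```python
# HYPOTHETICAL input: one row per (physician, practice) pair.
# "Practice A" and "Practice B" are placeholders, not real data.
rows = [
    {"ID": "10", "Physician": "Reginald Bell", "Practice": "Practice A"},
    {"ID": "10", "Physician": "Reginald Bell", "Practice": "Practice B"},
    {"ID": "11", "Physician": "Jane Example", "Practice": "Practice A"},
]


def combine_practices(rows):
    """Merge rows that share an ID, joining distinct practices with ', '."""
    combined = {}
    for row in rows:
        key = row["ID"]
        if key not in combined:
            # First time this ID is seen: copy the row and start a practice list.
            combined[key] = {**row, "Practice": [row["Practice"]]}
        elif row["Practice"] not in combined[key]["Practice"]:
            combined[key]["Practice"].append(row["Practice"])
    # Dicts keep insertion order, so output follows first appearance of each ID.
    return [{**rec, "Practice": ", ".join(rec["Practice"])} for rec in combined.values()]


merged = combine_practices(rows)
for rec in merged:
    print(rec)
```

Keeping every practice in one field keeps one row per physician, which matches the comma-separated output described above.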

 

XML Parse

GeneR
Alteryx Alumni (Retired)

Nice! @Joe_Mako

 

Thanks!

SeanAdams
17 - Castor

I found the same as @Joe_Mako - row 649 in the provided solution has some data corruption.

 

For @GeneR and @TaraM - for some reason the raw data for this exercise seems to have dropped off the posting, but it is still available at the link to the Dublin User Group that @Joe_Mako provided - would you mind adding this to the original challenge posting so that the folk who try this have the data set to work with?

 

Finally - I felt a little silly when I looked at the posted solution from @TaraM which uses the natural tags to split the data - clearly I did this the hard way.

 

Have a good weekend all

Sean