Do you have the skills to make it to the top? Subscribe to our weekly challenges. Try your best to solve the problem, share your solution, and see how others tackled the same problem. We share our answer too.
Weekly Challenge

Challenge #40: Parsing a HTML File

Alteryx Alumni (Retired)

Happy Monday… oh wait, it's Tuesday already. Sorry for the delay if you are an international Alteryx community member; yesterday the USA and Canada celebrated Labor Day in honor of working people.

 

Hopefully everyone had fun debugging the macro last week; the link to the solution for that challenge (#39) is HERE. This week we look at what needs to be done to process raw HTML data after using the Download tool to scrape the web.

  

One of the features of the Alteryx Download tool is that it can pull down the raw HTML code from a web page. This practice, sometimes referred to as web scraping, is useful when there is embedded data in the page that you want to access from Alteryx. The challenge is that the raw HTML needs to be parsed to prepare the data for use.

 

Use case:  5280 Magazine in Denver published a list of the best doctors in the Denver metro area, and you need to download that list in database form. (Note: the raw HTML has been provided in the workflow.)

 

Objective:  Parse the HTML into a database format containing fields for the ID, Physician, Address, City and Practice
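
For anyone who wants to see the shape of the task outside Designer, here is a minimal sketch in Python. The raw HTML is not shown in this thread, so the markup below (the `div class="doctor"` wrapper, the `h3`, `span`, `em` and `p` tags) and the sample doctors are all made up for illustration. Only the five output field names come from the objective above, and the ID is assumed to be a simple running number.

```python
import re

# Hypothetical markup: the real page's tags and class names are not shown
# in this thread, and these sample records are invented for illustration.
SAMPLE_HTML = """
<div class="doctor"><h3>Jane Doe</h3><span>123 Example St., Suite 1</span><em>Denver</em><p>Cardiology</p></div>
<div class="doctor"><h3>John Roe</h3><span>456 Sample Ave.</span><em>Aurora</em><p>Dermatology</p></div>
"""

# One named group per output field; non-greedy matches stop at the first closing tag.
RECORD = re.compile(
    r'<div class="doctor">\s*'
    r'<h3>(?P<Physician>.*?)</h3>\s*'
    r'<span>(?P<Address>.*?)</span>\s*'
    r'<em>(?P<City>.*?)</em>\s*'
    r'<p>(?P<Practice>.*?)</p>',
    re.DOTALL,
)


def parse_doctors(html):
    """Return one dict per record, with a running ID assigned in document order."""
    rows = []
    for record_id, match in enumerate(RECORD.finditer(html), start=1):
        row = {"ID": record_id}
        row.update(match.groupdict())
        rows.append(row)
    return rows


rows = parse_doctors(SAMPLE_HTML)
for row in rows:
    print(row)
```

In Designer the same idea would map onto splitting the HTML into one record per doctor and then pulling each field out by its tag.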

 

Good luck! I hope you are having fun with these challenges and expanding your knowledge of Alteryx. Thanks to all who participate and have provided feedback.

Community Data Engineer

Here's a solution:

Spoiler
40 solution.PNG
Alteryx Partner

My Solution (I Reg-ex'd the **** out of it!) :-)

 

FYI - ID 649 was wrong in the provided output solution as the Practice was in the City field:

 

PS. Loving These Challenges - Keep Em Coming!

 

Error

 

 

Spoiler
My Solution
Solution

 

Alteryx Partner

Hello My Alteryx Friends....

 

Is this not the same challenge as Week 40, or am I missing something?

 

Week 40: http://community.alteryx.com/t5/Alteryx-Knowledge-Base/Weekly-Exercise-40-Data-Prep-HTML-Parsing-Dr-...

Alteryx Alumni (Retired)

@brianprestidge you're not missing anything. I must have liked that one so much I posted it twice. I will make sure I have something original for next Monday. Thanks for playing along and keeping us honest!

 

 

Alteryx Partner

Haha - My pleasure!

 

I agree, it was a good one so why not do it again!! :-)

Creative Director

A solution has been posted

Spoiler
2016-10-17 08_44_39-Alteryx Designer x64 - DataPrep_HTMLParsing_DrNames_Solution.yxmd_.png

 

Tara McCoy
Quasar

I saw this in another thread. You can find my attached workbook at:
http://community.alteryx.com/t5/Dublin-IRL/Weekly-Exercise-9/gpm-p/36238#M47

 

Here are three points on the differences between your output and what I came up with:

1. You have a character encoding issue in your output.
My Results:
493 Yuko Kitahama-D'Ambrosia Denver 4500 E. Ninth Ave., Suite 200 Obstetrics and Gynecology
Your Results:
493 Yuko Kitahama-D'Ambrosia 4500 E. Ninth Ave., Suite 200 Denver Obstetrics and Gynecology

 

2. For Jesse Mills, the span tag contains the "(..)" text, whereas for all the others the span tag contains the address. Your output has that text in the City field, and then the Practice in the City.
My Results:
649 Jesse Mills (No longer practicing in the Denver area) Reproductive Endocrinology and Infertility [Null]
Your Results:
649 Jesse Mills [Null] (No longer practicing in the Denver area) Reproductive Endocrinology and Infertility

 

3. 51 physicians have multiple practices, for example, Reginald Bell. Your results only kept the first. I output them as a comma-separated list in the field.
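
Point 3 can be sketched in Python as follows. This is only an illustration of joining multiple practices into one comma-separated field, not the workflow itself. The input rows are invented placeholders (not the real physician data), and it assumes a physician is identified by the same name, address and city repeating.

```python
# Hypothetical parsed rows: one row per (physician, practice) pair.
# Names and addresses are placeholders, not data from the 5280 list.
parsed_rows = [
    {"Physician": "Dr. A", "Address": "1 Example Way", "City": "Denver", "Practice": "Cardiology"},
    {"Physician": "Dr. A", "Address": "1 Example Way", "City": "Denver", "Practice": "Internal Medicine"},
    {"Physician": "Dr. B", "Address": "2 Sample Rd.", "City": "Aurora", "Practice": "Dermatology"},
]


def merge_practices(rows):
    """Collapse repeated physicians into one row with comma-separated practices."""
    grouped = {}  # dicts keep insertion order, so first-seen order is preserved
    for row in rows:
        key = (row["Physician"], row["Address"], row["City"])
        grouped.setdefault(key, []).append(row["Practice"])
    return [
        {
            "ID": record_id,
            "Physician": physician,
            "Address": address,
            "City": city,
            "Practice": ", ".join(practices),
        }
        for record_id, ((physician, address, city), practices)
        in enumerate(grouped.items(), start=1)
    ]


merged = merge_practices(parsed_rows)
for row in merged:
    print(row)
```

Keeping every practice in a list until the final join means none of them is dropped, unlike an approach that keeps only the first match per physician.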

 

XML Parse

Alteryx Alumni (Retired)

Nice! @Joe_Mako

 

Thanks!

Nebula

I found the same as @Joe_Mako - row 649 in the provided solution has some data corruption.

 

For @GeneR @TaraM - for some reason the raw data for this exercise seems to have dropped off the posting, but it is still available on the link to the Dublin User Group that @Joe_Mako provided below. Would you mind adding this to the original challenge posting so that the folks who try this have the data set to work with?

 

Finally - I felt a little silly when I looked at the posted solution from @TaraM, which uses the natural tags to split the data. Clearly I did this the hard way.

 

Have a good weekend all

Sean