Challenge #40: Parsing an HTML File

Happy Monday… oh wait, it's Tuesday already. Sorry for the delay if you are an international Alteryx community member; yesterday the USA and Canada celebrated Labor Day in honor of working people.

Hopefully everyone had fun debugging the macro last week; the link to the solution for that challenge (#39) is HERE. This week we look at what needs to be done to process raw HTML data after using the Download tool to scrape the web.

One of the features of the Alteryx Download tool is that it can pull down the raw HTML code from a web page. This practice, sometimes referred to as web scraping, is useful when the page contains embedded data that you want to access from Alteryx. The challenge is that the raw HTML needs to be parsed to prepare the data for use.

Use case: 5280 Magazine in Denver published a list of the best doctors in the Denver metro area, and you need to download that list in database form. (Note: the raw HTML has been provided in the workflow.)

Objective: Parse the HTML into a database format containing fields for the ID, Physician, Address, City and Practice
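
For readers who want to see the shape of the task outside Alteryx, here is a minimal Python sketch of turning repeated HTML blocks into rows with the five target fields. The markup in `SAMPLE_HTML`, including the tag and class names and the doctors themselves, is invented for illustration; the real page structure is in the workflow's raw HTML and is not reproduced here, so the pattern would need adjusting to match it.

```python
import re

# Hypothetical markup: the tag names, class name, and people below are
# made up for this sketch and do not come from the 5280 page.
SAMPLE_HTML = """
<div class="doctor"><h3>Jane Doe</h3><span>123 Main St., Suite 1</span><em>Denver</em><p>Cardiology</p></div>
<div class="doctor"><h3>John Roe</h3><span>456 Oak Ave.</span><em>Aurora</em><p>Dermatology</p></div>
"""

# One named group per output field; lazy quantifiers stop at the first closing tag.
RECORD = re.compile(
    r'<div class="doctor">'
    r"<h3>(?P<Physician>.*?)</h3>"
    r"<span>(?P<Address>.*?)</span>"
    r"<em>(?P<City>.*?)</em>"
    r"<p>(?P<Practice>.*?)</p>"
    r"</div>",
    re.DOTALL,
)


def parse_doctors(html):
    """Return one dict per matched record, numbering records from 1 as the ID."""
    rows = []
    for record_id, match in enumerate(RECORD.finditer(html), start=1):
        row = {"ID": record_id}
        row.update(match.groupdict())
        rows.append(row)
    return rows


rows = parse_doctors(SAMPLE_HTML)
for row in rows:
    print(row)
```

Assigning the ID by position is an assumption of this sketch; the challenge data may carry its own identifier.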

Good luck! I hope you are having fun with these challenges and expanding your knowledge of Alteryx. Thanks to all who participate and have provided feedback.

challenge_40_start_file.yxmd

challenge_40_solution.yxmd

downloadeddata.yxdb

Join

Preparation

Advanced

Intermediate

Data Preparation

Transform

MattD

Here's a solution:

brianprestidge

My Solution (I Reg-ex'd the **** out of it!) :-)

FYI - ID 649 was wrong in the provided output solution as the Practice was in the City field:

PS. Loving These Challenges - Keep Em Coming!

Error

My Solution

brianprestidge

Hello My Alteryx Friends....

Is this not the same challenge as Week 40, or am I missing something?

Week 40: http://community.alteryx.com/t5/Alteryx-Knowledge-Base/Weekly-Exercise-40-Data-Prep-HTML-Parsing-Dr-Names-Intermediate/ta-p/32333

GeneR

@brianprestidge you're not missing anything, I must have liked that one so much I posted it twice. I will make sure I have something original for next Monday. Thanks for playing along and keeping us honest!

brianprestidge

Haha - My pleasure!

I agree, it was a good one so why not do it again!! :-)

TaraM

A solution has been posted

2016-10-17 08_44_39-Alteryx Designer x64 - DataPrep_HTMLParsing_DrNames_Solution.yxmd_.png

Joe_Mako

I saw this in another thread, you can find my attached workbook at:
http://community.alteryx.com/t5/Dublin-IRL/Weekly-Exercise-9/gpm-p/36238#M47

Here are three points on the differences between your output and what I came up with:

1. You have an issue with a character encoding in your output
My Results:
493 Yuko Kitahama-D'Ambrosia Denver 4500 E. Ninth Ave., Suite 200 Obstetrics and Gynecology
Your Results:
493 Yuko Kitahama-D'Ambrosia 4500 E. Ninth Ave., Suite 200 Denver Obstetrics and Gynecology

2. For Jesse Mills, the "(..)" text is in the span tag, and in all others the span tag contains the address, but your output has that text in the city field, and then the Practice in the City.
My Results:
649 Jesse Mills (No longer practicing in the Denver area) Reproductive Endocrinology and Infertility [Null]
Your Results:
649 Jesse Mills [Null] (No longer practicing in the Denver area) Reproductive Endocrinology and Infertility

3. 51 physicians have multiple practices, for example, Reginald Bell. Your results kept only the first. I output them as a comma-separated list in the field.
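
Points 1 and 3 above can be sketched in a few lines of Python. This is only an illustration under assumptions: the rows in `parsed` are made up, and it assumes the encoding problem comes from HTML entities such as `&#39;` left undecoded, which is one common cause but is not confirmed for this data set.

```python
from html import unescape

# Hypothetical parsed rows: (ID, Physician, Practice), one row per practice.
# The names and the entity in the first row are invented for this sketch.
parsed = [
    (1, "Ann O&#39;Hara", "Cardiology"),
    (2, "Sam Lee", "Dermatology"),
    (2, "Sam Lee", "Pediatrics"),
]


def combine_practices(rows):
    """Decode HTML entities and join each physician's practices with commas."""
    combined = {}
    for record_id, physician, practice in rows:
        entry = combined.setdefault(
            record_id, {"Physician": unescape(physician), "Practice": []}
        )
        entry["Practice"].append(unescape(practice))
    return {
        record_id: {
            "Physician": entry["Physician"],
            "Practice": ", ".join(entry["Practice"]),
        }
        for record_id, entry in combined.items()
    }


result = combine_practices(parsed)
for record_id, entry in result.items():
    print(record_id, entry["Physician"], entry["Practice"])
```

Grouping by ID rather than by name avoids merging two different physicians who happen to share a name.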

XML Parse

GeneR

Nice! @Joe_Mako

Thanks!

SeanAdams

I found the same as @Joe_Mako - row 649 in the provided solution has some data corruption.

For @GeneR & @TaraM - for some reason the raw data for this exercise seems to have dropped off the posting, but it is still available at the Dublin User Group link that @Joe_Mako provided in this thread. Would you mind adding it to the original challenge posting so that folks who try this have the data set to work with?

Finally - I felt a little silly when I looked at the posted solution from @TaraM which uses the natural tags to split the data - clearly I did this the hard way.

Have a good weekend all

Sean

challenge_40_SeanSolution.yxmd

NicoleJohnson

My solution. I was feeling pretty snazzy with all my new tips & tricks for RegEx (thanks for the links, @SeanAdams!)... only to realize with this challenge that I also know next to nothing about XML parsing. But a few searches later and I figured it out enough to fumble my way through this one... I'm finding all sorts of new tools with these challenges that I'll never ever use at work.

challenge_40_NicoleJohnson.yxmd

JoeM

Some of the data was reported missing but has now been added to the original post!

estherb47

Thanks for including the data!

Do I get bonus points for fewest tools used? First picture is a solution with RegEx parsing. 7 tools! I'm not counting the Browse.

Second solution is with formulas/cross tab/text to columns

challenge_40_EHB_solution.yxmd

LordNeilLord

Took me ages to get the right parse going but once I had it I was on a roll..

challenge_40_LNL.yxmd

SeanAdams

The best hint I ever got on regex came from Mark ( @MarqueeCrew ).

The tip is that http://regex101.com allows you to play with Regex phrases in real time, and it's a really easy way to learn and practice. I now use it for all my regex work, especially web-scraping!

Have a look - it really is a fantastic site, and if Alteryx could build something similar into the product, it would be legendary!

MsBindy

Whoops....put it on the wrong week. This is week 41 solution.

patrick_digan

HTML parsing is just another excuse to use James' XML input tool to parse everything.

challenge_40_start_file.yxzp

A_Twa

Solution attached.

challenge_40_start_file_AJT.yxmd

dominiklz

challenge_40_Dominik.yxmd

JoshKushner

On Challenge #1, Regex and I HATED each other. Now we're BEST FRIENDS. How did I never use Regex before this. I was wasting so much effort with sub-strings and unnecessary logic...

challenge_40_start_file.yxzp

samN

Regex will never let me down! Pretty happy with how compact I was able to make this. I wish that parse would find more than just the first case. Could have limited the number of widgets to 4! Ah well, can't have everything. Fun stuff.
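
The first-match limitation mentioned above has a close analogue in general-purpose regex libraries. As a rough Python sketch (not Alteryx, and using a made-up `fragment` with assumed `<li>` tags), `re.search` stops at the first hit while `re.findall` returns every hit for the same pattern:

```python
import re

# Hypothetical fragment: one physician with two practices. The <li> tags
# are an assumption for this sketch, not the real page markup.
fragment = "<li>Cardiology</li><li>Internal Medicine</li>"
pattern = r"<li>(.*?)</li>"

# search() returns only the first match, like a parse that keeps one case.
first_only = re.search(pattern, fragment).group(1)

# findall() returns the captured group from every non-overlapping match.
all_matches = re.findall(pattern, fragment)

print(first_only)
print(all_matches)
```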

challenge_40.yxmd

ggruccio

Needed a bit of help getting started - but then got it fairly easily from there. Lots of parsing.

challenge_40_gg_finish.yxmd

philipmannering

More XML parse fun times. Solution attached.

challenge_40 - Parsing HTML for List of Doctors.yxzp

LandonG

Solution attached.

challenge_40_start_file.yxzp

jasperlch

Solution attached.

challenge_40_JL.yxmd

msicak

Another great challenge! I also concatenated the practices together if there was more than one.

challenge_40_MS.yxmd

jamielaird

Here's my solution. Thank god for Regex :-)

challenge_40_solution_JL.yxzp

Natasha

Instead of defaulting to regex, I decided to use XML Parse, since I really dislike it and thought it would be good practice. Also, after scraping a few websites recently, XML parsing doesn't look so bad anymore.

My results are slightly different compared to output as I concatenated multiple practices for the same physician, while in the output only the first one is captured.

Screen Shot 2017-12-28 at 00.13.11.png

challenge_40_Natasha.yxmd

Waynemk

and another

challenge_40_waynek.yxmd

dsmdavid

Some RegEx fun

solved#40.yxzp

JosephSerpis

Challenge Completed

Challenge_40_Joe_Serpis.yxmd

kcgreen

Done!

challenge_40_Regex_For_HTML.yxmd

CHarrison

Always love a bit of parsing

Challenge 40.yxmd

johnemery

Very challenging, especially for an intermediate challenge. Learning more about RegEx is great, and I attempted a secondary solution using formulas and such. Tricky tricky tricky.

Challenge 40 Solution.yxmd
