Alteryx Designer Discussions


Parsing HTML tables

8 - Asteroid

Many thanks to the weekly challenge for this idea.

  

One of the weekly challenges was to parse HTML and extract table data, which got me thinking about building a generic workflow (and eventually an application) to get table data from any page.
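For readers who want to see the idea outside Alteryx, here is a minimal Python sketch of generic table extraction using only the standard library. It is not the workflow described in this post; the names `TableExtractor` and `extract_tables` are made up for illustration, and the sketch assumes tables are not nested.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> on a page as a list of rows of cell text.

    Hypothetical helper for illustration; assumes tables are not nested.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        # Store the current cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables in `html` as nested lists: tables -> rows -> cells."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables
```

Calling `extract_tables(page_html)` returns one list of rows per table, with header cells (`<th>`) and data cells (`<td>`) treated the same way.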

 

I hope to publish further improvements, since web scraping is a passion of mine. The next step would be to add a multi-page feature.

 

Looking for community feedback.

 

 

Alteryx Certified Partner

Hi Mark,

 

A generic table parser sounds awesome. For the multi-page feature, you can have a look at http://community.alteryx.com/t5/Engine-Works-Blog/Web-Scraping-the-Community/ba-p/21210

 

I have built an iterative macro that lets you scrape multiple pages, such as the community: http://community.alteryx.com/t5/Data-Preparation-Blending/Insights-to-the-Alteryx-community-get-an-o...
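The iterative idea (fetch a page, find the link to the next one, repeat until there is none) can be sketched in Python roughly as follows. This is not the macro itself. It assumes the site marks its next-page link with `rel="next"`, which many sites do not; `find_next_url`, `scrape_all_pages`, and the caller-supplied `fetch` function are hypothetical names used only for this example.

```python
import re
from urllib.parse import urljoin

A_TAG_RE = re.compile(r"<a\b[^>]*>", re.I)
HREF_RE = re.compile(r'href="([^"]*)"', re.I)


def find_next_url(html):
    """Return the href of the first <a> tag carrying rel="next", or None.

    Assumption: the site marks pagination this way; adjust for other layouts.
    """
    for tag in A_TAG_RE.findall(html):
        if 'rel="next"' in tag.lower():
            match = HREF_RE.search(tag)
            if match:
                return match.group(1)
    return None


def scrape_all_pages(start_url, fetch, max_pages=50):
    """Follow next-page links from start_url and return each page's HTML.

    `fetch` is any callable that takes a URL and returns the page as a string.
    The loop stops on a missing link, a repeated URL, or after max_pages.
    """
    pages = []
    seen = set()
    url = start_url
    while url and url not in seen and len(pages) < max_pages:
        seen.add(url)
        html = fetch(url)
        pages.append(html)
        next_href = find_next_url(html)
        # Resolve relative links against the page they were found on.
        url = urljoin(url, next_href) if next_href else None
    return pages
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP library, and the `seen` set and `max_pages` cap guard against pages that link back to themselves.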

 

A couple of tips:

  • This could be built into a macro that lets the user supply inputs such as the page, the number of columns, etc.
  • It would be nice to have the table header included in the output as well.

 

Thanks a lot for sharing.

 

Best,

Daniel

8 - Asteroid

Hi Daniel,

 

Thanks for the reply. Yes, I have seen both of those topics, and they have some very good ideas I can use. One site I tested had tables built with <br /> tags. Some of the options for an application: include headers, row/column delimiters, method for finding rows, single/multi page, table # (for pages with multiple tables), and regex for parsing rows. The list is getting longer than I had imagined.
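To make a few of these options concrete, here is a rough Python sketch of how table number, header inclusion, and row/column delimiters might look as parameters. It is an illustration only, not the Alteryx application; `parse_table` and `parse_br_rows` are hypothetical names, the regular expressions assume reasonably well-formed, non-nested markup, and the `|` column delimiter is just an example default.

```python
import re

TABLE_RE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.I | re.S)
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.I | re.S)
CELL_RE = re.compile(r"<(td|th)\b[^>]*>(.*?)</\1>", re.I | re.S)
TAG_RE = re.compile(r"<[^>]+>")
BR_RE = re.compile(r"<br\s*/?>", re.I)


def parse_table(html, table_index=0, include_headers=True):
    """Return the rows of the table at position table_index (0-based).

    Rows made up entirely of <th> cells are treated as header rows and
    are skipped when include_headers is False.
    """
    tables = TABLE_RE.findall(html)
    rows = []
    for row_html in ROW_RE.findall(tables[table_index]):
        cells = CELL_RE.findall(row_html)
        is_header = bool(cells) and all(tag.lower() == "th" for tag, _ in cells)
        if is_header and not include_headers:
            continue
        # Strip any markup left inside a cell, such as <b> or <a>.
        rows.append([TAG_RE.sub("", text).strip() for _, text in cells])
    return rows


def parse_br_rows(html, column_delimiter="|"):
    """Parse a 'table' laid out with <br /> as the row delimiter.

    Each non-empty line is split into columns on column_delimiter.
    """
    rows = []
    for line in BR_RE.split(html):
        if line.strip():
            text = TAG_RE.sub("", line)
            rows.append([cell.strip() for cell in text.split(column_delimiter)])
    return rows
```

For example, `parse_table(page_html, table_index=1, include_headers=False)` would return only the data rows of the second table on a page, while `parse_br_rows` covers the case of a page that separates rows with `<br />` tags.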

 

This will likely end up being more of a tool that can be tweaked to handle the wide variety of page layouts. I've only been using Alteryx for a few months, but I do enjoy a challenge. I plan to post my progress on this endeavor, so stay tuned.

8 - Asteroid

Bless you for posting this. I am just starting to learn RegEx to parse dozens of tables on websites, and this is a super helpful start to learn from!
