
Alteryx Designer Desktop Discussions

SOLVED

Help with RegEx for Web Scraping Workflow

ivesbr
7 - Meteor

Hi:

I developed a basic workflow to scrape data from a fantasy football site and then begin parsing out the data from the HTML.

I'm able to successfully pull out the player's first and last name along with their fantasy data. But for some reason my RegEx isn't pulling out the position and team data. I've attached my workflow.

I also realize that I still need to figure out how to tell the workflow to distinguish between receptions, TDs, etc. But one step at a time. Appreciate any help. Thanks!

9 REPLIES
CarlDi
Alteryx

Hi @ivesbr,

 

See @markp201's post below:

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Parsing-HTML-tables/m-p/25886

 

It's not regex but it helped me understand HTML parsing when I first started using Designer. I got the result below by merely switching the text input to your URL. Hope that helps!

 

html.jpg

 

ivesbr
7 - Meteor

Hi Carl:

 

Thanks for the quick note back!  Any chance you can share the workflow so I can review the detailed steps? 

 

All the best,

CarlDi
Alteryx

Here you go, @ivesbr.

ivesbr
7 - Meteor

Oh wait ... you just used Mark's workflow with my desired website, correct?

geraldo
13 - Pulsar

Hi,

 

I tweaked your workflow so that it brings back the results from the site in the same way. It may be of some use, so please take a look and try it.

The workflow is attached.

 


ivesbr
7 - Meteor

Dang!  You and Carl are too good at this.  I added two minor steps at the end of the workflow (attached) to parse out the team and position into separate columns ... otherwise this looks really good.  Thank you!
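For readers following along outside Designer, here is a minimal Python sketch of the split into team and position columns. It assumes the combined field looks like "TE - ATL" (position, hyphen, team); that shape and the name split_position_team are illustrative assumptions, not something taken from the attached workflow.

```python
import re

# Assumed shape of the combined field, e.g. "TE - ATL"; the real site may differ.
POS_TEAM = re.compile(r"^\s*(?P<position>[A-Z]{1,3})\s*-\s*(?P<team>[A-Z]{2,3})\s*$")

def split_position_team(text):
    """Return (position, team), or (None, None) when the text does not match."""
    match = POS_TEAM.match(text)
    if match is None:
        return None, None
    return match.group("position"), match.group("team")

print(split_position_team("TE - ATL"))
print(split_position_team("free agent"))
```

Returning (None, None) on a miss keeps unmatched rows visible instead of silently dropping them, which makes it easier to spot rows whose HTML differs from the rest.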

 

Two quick follow up questions for you guys.  First, I would have figured that the RegEx parse function would be the more powerful solution for this use case.  Any reason why you guys chose to use the RegEx parse via the formula function?

 

Second, can you provide a little additional color around the formulas you both developed?  Just trying to understand the step by step process.  Thanks!

 

All the best,

ivesbr
7 - Meteor

And one last thing ... I copied and pasted the entire workflow so I could run the same parsing process for the next web page of data.  But it looks like my limited understanding of the RegEx function in the formula is preventing me from taking that approach.  Maybe this would be a better example to work with to help explain the formula.  See attached workflow ... thanks!

 

geraldo
13 - Pulsar

Hi @ivesbr 

 

I added some notes to the workflow to document the formulas.

About your question: depending on the complexity, it is better to use regex in a formula, because it is more flexible and several expressions can be chained together.

I've attached the workflow with the formula information.
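The chaining idea can be sketched in Python, where each replacement works on the result of the previous one, much like nesting regex replace calls inside a single formula expression. The function name clean_cell and the sample fragment are assumptions made for illustration; the actual formulas in the attached workflow may differ.

```python
import re

def clean_cell(html_fragment):
    """Chain several regex replacements, each applied to the previous result."""
    text = re.sub(r"<[^>]+>", " ", html_fragment)  # 1. drop HTML tags
    text = re.sub(r"&nbsp;", " ", text)            # 2. replace the non-breaking-space entity
    text = re.sub(r"\s+", " ", text)               # 3. collapse runs of whitespace
    return text.strip()

print(clean_cell('<td class="stat"><span>12</span>&nbsp;</td>'))
```

Keeping each step small makes it easier to test one replacement at a time when a particular row comes out wrong.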

 


ivesbr
7 - Meteor

Hi Geraldo:

 

Thank you for the additional documentation.  Very helpful!

 

I've now started to expand the overall workflow so it pulls data from each of the pages.  In total I believe there are 950 players with data.  So if there's a better way to scrape the information, I'm all ears.
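One common alternative to copying the workflow once per page is to generate the list of page URLs and feed them through a single parsing step. The Python sketch below shows only the URL generation. The page size of 25, the offset query parameter, and the example.com address are all assumptions for illustration; the real site's paging scheme would need to be checked.

```python
def page_offsets(total_rows, page_size):
    """Return the starting offset of each page needed to cover total_rows."""
    return list(range(0, total_rows, page_size))

# Hypothetical URL template; the real site and its parameter name may differ.
BASE_URL = "https://example.com/players?offset={offset}"

offsets = page_offsets(950, 25)  # 25 rows per page is an assumption
urls = [BASE_URL.format(offset=o) for o in offsets]
print(len(urls), urls[0], urls[-1])
```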

 

But one thing I noticed is that the third workflow, for page 3 (see Austin Hooper below), has two instances where the replace doesn't seem to be working.

 

clipboard_image_0.png

 

I checked the source html text and saw that the next player down (Russell Wilson) has the same html structure, but that data is being scraped correctly.  See below.

 

Austin Hooper - </td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2555415

 

Russell Wilson - </td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2532975

 

Any thoughts on what may need to be tweaked or added to the formula?  Thanks!
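As a quick check outside Designer, the two fragments quoted above can be compared in Python, with the IDs pulled out by a pattern that tolerates whitespace after the hyphen. This is a diagnostic sketch only: the pattern and the name extract_ids are assumptions, and it does not reproduce the workflow's actual replace formula.

```python
import re

# Fragments copied from the post above (note the space after "playerId-").
hooper = '</td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2555415'
wilson = '</td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2532975'

# \s* tolerates optional whitespace between the hyphen and the digits.
IDS = re.compile(r"statId-\s*(\d+)\s+playerId-\s*(\d+)")

def extract_ids(fragment):
    """Return (stat_id, player_id) as integers, or None if the pattern is absent."""
    match = IDS.search(fragment)
    return (int(match.group(1)), int(match.group(2))) if match else None

print(extract_ids(hooper), extract_ids(wilson))

# True when the fragments differ only in the player ID.
print(hooper.replace("2555415", "") == wilson.replace("2532975", ""))
```

If the quoted fragments differ only in the player ID, the difference that breaks the replace is probably in another part of the row, so comparing the full source rows for both players would be the next thing to try.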

 

  
