Alteryx Designer Desktop Discussions

ivesbr · ‎09-10-2019

Hi:

I developed a basic workflow to scrape data from a fantasy football site and then begin parsing out the data from the HTML.

I'm able to successfully pull out the player's first and last name along with their fantasy data. But for some reason my RegEx isn't pulling out the position and team data. I've attached my workflow.

I also realize that I still need to figure out how to tell the workflow to distinguish between receptions, TDs, etc. But one step at a time. Appreciate any help. Thanks!

CarlDi · ‎09-10-2019

Hi @ivesbr,

see @markp201's post below:

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Parsing-HTML-tables/m-p/25886

It's not regex but it helped me understand HTML parsing when I first started using Designer. I got the result below by merely switching the text input to your URL. Hope that helps!

ivesbr · ‎09-10-2019

Hi Carl:

Thanks for the quick note back! Any chance you can share the workflow so I can review the detailed steps?

All the best,

CarlDi · ‎09-10-2019

here you go, @ivesbr

ivesbr · ‎09-10-2019

Oh wait ... you just used Mark's workflow with my desired website, correct?

geraldo · ‎09-11-2019

Hi,

I tweaked your workflow so that it returns the results laid out the same way as on the site. It may be of some use. Please take a look and try it.

The workflow is attached.


ivesbr · ‎09-11-2019

Dang! You and Carl are too good at this. I added two minor steps at the end of the workflow (attached) to parse out the team and position into separate columns ... otherwise this looks really good. Thank you!
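For readers without the attachment, the team/position split can be sketched outside Designer. This is a minimal Python illustration, not the attached workflow: it assumes the combined field looks like "QB - SEA" (position, hyphen, team abbreviation), which is a guess at the site's format, and `split_position_team` is a hypothetical helper name.

```python
import re

# ASSUMPTION: the combined field looks like "<position> - <team>", e.g. "QB - SEA".
# The real site markup may differ; adjust the pattern to whatever the scrape returns.
POS_TEAM = re.compile(
    r"^\s*(?P<position>[A-Z]{1,3})\s*-\s*(?P<team>[A-Z]{2,3})\s*$"
)


def split_position_team(text):
    """Return (position, team), or (None, None) when the text does not match."""
    match = POS_TEAM.match(text)
    if match is None:
        return None, None
    return match.group("position"), match.group("team")


if __name__ == "__main__":
    for value in ["QB - SEA", "TE - ATL", "n/a"]:
        print(value, "->", split_position_team(value))
```

Returning a pair of `None` values for non-matching rows makes it easy to filter out records where the pattern needs adjusting, instead of silently producing empty columns.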

Two quick follow-up questions for you guys. First, I would have figured that the RegEx tool in Parse mode would be the more powerful solution for this use case. Any reason why you chose to run the regex through the Formula tool instead?

Second, can you provide a little additional color around the formulas you both developed? Just trying to understand the step by step process. Thanks!

All the best,

ivesbr · ‎09-11-2019

And one last thing ... I copied and pasted the entire workflow so I could run the same parsing process for the next web page of data. But it looks like my limited understanding of the RegEx function in the formula is preventing me from taking that approach. Maybe this would be a better example to work with to help explain the formula. See attached workflow ... thanks!

geraldo · ‎09-11-2019

Hi @ivesbr

I put some information in the workflow to get the formulas documented.

About your question: depending on the complexity, it can be better to use regex in the Formula tool, because it is more flexible and several regex expressions can be chained together.
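To illustrate what chaining means here, the sketch below applies two regex replacements in sequence, much like nesting one REGEX_Replace call inside another in a Formula expression. It is written in Python only for illustration and is not the formula from the attached workflow; the sample markup and the name `strip_tags` are assumptions.

```python
import re


def strip_tags(html):
    """Chain two regex replacements: drop the tags, then tidy the whitespace."""
    # Step 1: replace every HTML tag with a single space.
    no_tags = re.sub(r"<[^>]+>", " ", html)
    # Step 2: collapse the runs of whitespace left behind by step 1.
    return re.sub(r"\s+", " ", no_tags).strip()


# ASSUMPTION: hypothetical markup, not copied from the site.
sample = '<td class="playerNameAndInfo"><a>Russell Wilson</a> <em>QB - SEA</em></td>'

if __name__ == "__main__":
    print(strip_tags(sample))
```

Each step stays simple and testable on its own, which is the main benefit of chaining over a single large pattern.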

I've attached the workflow with the formula documentation.


ivesbr · ‎09-11-2019

Hi Geraldo:

Thank you for the additional documentation. Very helpful!

I've now started to expand the overall workflow so it pulls data from each of the pages. In total I believe there are 950 players with data. So if there's a better way to scrape the information, I'm all ears.
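One common alternative to copying the workflow once per page is to generate the list of page URLs first and run a single parsing step over all of them (in Designer, roughly a Generate Rows tool feeding the Download tool). A minimal Python sketch of the URL generation follows. The base URL, the `offset` parameter name, and the page size of 25 are all assumptions, so check them against the address the site shows when you move to the next page.

```python
from urllib.parse import urlencode

# ASSUMPTIONS: placeholder URL, hypothetical "offset" parameter, guessed page size.
BASE_URL = "https://example.com/research/players"
PAGE_SIZE = 25
TOTAL_PLAYERS = 950  # figure mentioned in this thread


def page_urls(base_url=BASE_URL, total=TOTAL_PLAYERS, page_size=PAGE_SIZE):
    """Build one URL per page, with offsets 1, 1 + page_size, and so on."""
    urls = []
    for offset in range(1, total + 1, page_size):
        urls.append(f"{base_url}?{urlencode({'offset': offset})}")
    return urls


if __name__ == "__main__":
    urls = page_urls()
    print(len(urls), "pages")
    print(urls[0])
    print(urls[-1])
```

With one row per URL, the same parsing logic runs on every page, so a fix to the regex only has to be made once.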

But one thing I noticed is that the 3rd workflow for page 3 (see Austin Hooper below), has two instances where the replace doesn't seem to be working.

I checked the source html text and saw that the next player down (Russell Wilson) has the same html structure, but that data is being scraped correctly. See below.

Austin Hooper - </td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2555415

Russell Wilson - </td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2532975

Any thoughts on what may need to be tweaked or added to the formula? Thanks!
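One way to narrow this down is to run a single pattern over both quoted fragments and compare the captures. The Python sketch below does that with the two fragments exactly as pasted above; it is a diagnostic aid under that assumption, not the formula from the workflow, and `extract_ids` is a hypothetical helper name. The pattern tolerates optional whitespace after `playerId-`, since the fragments show a space there.

```python
import re

# Matches the span class and captures the stat id and player id.
# \s* after "playerId-" allows for the space seen in the quoted fragments.
STAT_SPAN = re.compile(r'class="playerStat\s+statId-\s*(\d+)\s+playerId-\s*(\d+)')

# Fragments copied from the post above.
snippets = {
    "Austin Hooper": '</td><td class="stat stat_5 numeric">'
                     '<span class="playerStat statId-5 playerId- 2555415',
    "Russell Wilson": '</td><td class="stat stat_5 numeric">'
                      '<span class="playerStat statId-5 playerId- 2532975',
}


def extract_ids(html):
    """Return (stat_id, player_id) as integers, or None if the span is not found."""
    match = STAT_SPAN.search(html)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    for name, html in snippets.items():
        print(name, extract_ids(html))
```

If both fragments produce the same kind of result here, the difference between the two rows probably lies in the part of the HTML that follows the quoted text, which would be the next place to compare.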
