Hi:
I developed a basic workflow to scrape data from a fantasy football site and then begin parsing out the data from the HTML.
I'm able to successfully pull out the player's first and last name along with their fantasy data. But for some reason my RegEx isn't pulling out the position and team data. I've attached my workflow.
I also realize that I still need to figure out how to tell the workflow to distinguish between receptions, TDs, etc. But one step at a time. Appreciate any help. Thanks!
Hi @ivesbr,
see @markp201's post below:
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Parsing-HTML-tables/m-p/25886
It's not regex but it helped me understand HTML parsing when I first started using Designer. I got the result below by merely switching the text input to your URL. Hope that helps!
Hi Carl:
Thanks for the quick note back! Any chance you can share the workflow so I can review the detailed steps?
All the best,
here you go, @ivesbr
Oh wait ... you just used Mark's workflow with my desired website, correct?
Dang! You and Carl are too good at this. I added two minor steps at the end of the workflow (attached) to parse out the team and position into separate columns ... otherwise this looks really good. Thank you!
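For readers following along outside Designer, the team/position split can be sketched in Python. This is only an illustration, not the steps in the attached workflow: it assumes the combined field looks like "QB - SEA" (position, hyphen, team), which the thread does not show, and the names `POS_TEAM` and `split_position_team` are made up for the example.

```python
import re

# ASSUMPTION: the combined field looks like "<POSITION> - <TEAM>", e.g. "QB - SEA".
# The real field on the site may differ; adjust the pattern if so.
POS_TEAM = re.compile(r"^\s*([A-Z]{1,3})\s*-\s*([A-Z]{2,3})\s*$")


def split_position_team(value):
    """Return (position, team), or (None, None) when the value does not match."""
    match = POS_TEAM.match(value)
    if match is None:
        return None, None
    return match.group(1), match.group(2)


print(split_position_team("QB - SEA"))
print(split_position_team("TE - ATL"))
```

Returning `(None, None)` for non-matching values makes it easy to spot rows where the assumed format does not hold, instead of silently producing wrong columns.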
Two quick follow-up questions for you guys. First, I would have figured that the RegEx parse function would be the more powerful solution for this use case. Any reason why you guys chose to run the regex through the formula function instead?
Second, can you provide a little additional color around the formulas you both developed? Just trying to understand the step by step process. Thanks!
All the best,
And one last thing ... I copied and pasted the entire workflow so I could run the same parsing process for the next web page of data. But it looks like my limited understanding of the RegEx function in the formula is preventing me from taking that approach. Maybe this would be a better example to work with to help explain the formula. See attached workflow ... thanks!
Hi @ivesbr
I put some information in the workflow to get the formulas documented.
As for your question: depending on the complexity, it can be better to use regex in a formula, since it is more flexible and several expressions can be chained.
I've attached the workflow with the formula information.
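The idea of chaining several regex steps can be sketched in Python as below. This is not the formula from the attached workflow, only an illustration of the pattern: the `clean_cell` helper and the sample HTML cell are hypothetical.

```python
import re


def clean_cell(raw):
    """Apply several regex replacements in sequence to one HTML table cell."""
    # Step 1: replace anything that looks like an HTML tag with a space.
    text = re.sub(r"<[^>]*>", " ", raw)
    # Step 2: collapse the runs of whitespace left behind by step 1.
    text = re.sub(r"\s+", " ", text)
    # Step 3: trim leading and trailing whitespace.
    return text.strip()


# Hypothetical sample cell, not taken from the real page.
sample = '<td class="stat numeric"><span>12</span></td>'
print(clean_cell(sample))
```

Each step is small enough to test separately, which is the main appeal of chaining over a single large expression.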
Hi Geraldo:
Thank you for the additional documentation. Very helpful!
I've now started to expand the overall workflow so it pulls data from each of the pages. In total I believe there are 950 players with data. So if there's a better way to scrape the information, I'm all ears.
But one thing I noticed is that the third workflow, for page 3 (see Austin Hooper below), has two instances where the replace doesn't seem to be working.
I checked the source html text and saw that the next player down (Russell Wilson) has the same html structure, but that data is being scraped correctly. See below.
Austin Hooper - </td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2555415
Russell Wilson - </td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2532975
Any thoughts on what may need to be tweaked or added to the formula? Thanks!
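One way to narrow this down is to run both quoted fragments through the same expression outside the workflow and compare the results. The sketch below does that in Python. It is not the workflow's formula, and the pattern and the `extract_ids` name are illustrative. It tolerates optional whitespace after `playerId-`, as seen in the fragments above, though whether whitespace is the actual cause of the issue is not known from the thread.

```python
import re

# The two fragments quoted in the post above, copied verbatim.
rows = {
    "Austin Hooper": '</td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2555415',
    "Russell Wilson": '</td><td class="stat stat_5 numeric"><span class="playerStat statId-5 playerId- 2532975',
}

# \s* allows zero or more whitespace characters after "playerId-".
PATTERN = re.compile(r"statId-(\d+)\s+playerId-\s*(\d+)")


def extract_ids(fragment):
    """Return (stat_id, player_id) as integers, or None if the pattern is absent."""
    m = PATTERN.search(fragment)
    return (int(m.group(1)), int(m.group(2))) if m else None


for name, fragment in rows.items():
    print(name, extract_ids(fragment))
```

If both fragments parse identically here, the difference is more likely in the surrounding HTML for that row or in an earlier step of the workflow than in these fragments themselves.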