on 06-29-2016 12:11 PM (edited)
Web scraping, the process of extracting information (usually tabulated) from websites, is an extremely useful way to gather web-hosted data that isn’t supplied via APIs. In many cases, if the data you are looking for is stand-alone or captured completely on one page (no need for dynamic API queries), scraping can even be faster than developing direct API connections to collect it.
With the wealth of data already published on websites, easy access to it can be a great supplement to your analyses, whether to provide context or to supply the underlying data for new questions. Although there are a handful of approaches to web scraping (two detailed on our community, here and here), there are also a number of great, free tools online (parsehub and import.io, to name a few) that can streamline your web scraping efforts. This article details one approach that I find particularly easy: using import.io to create an extractor specific to your desired websites, then integrating calls to it into your workflow via the live query API link the service provides. You can do this in a few quick steps:
2. Once you’re signed up to use the service, navigate to your dashboard (a link can be found in the same corner of the homepage once logged in) to manage your extractors.
3. Click “New Extractor” in the top left-hand corner and paste the URL that contains the data you’re trying to scrape into the “Create Extractor” pop-up. Since fantasy football drafting season is just ahead of us, we’ll use as an example the tabulated data on last year’s top scorers provided by ESPN, so you don’t end up like this guy (thank me later). We know our users go hard and the stakes are probably pretty high, so we want to get this right the first time, using an approach that is reproducible enough to supply us with the information needed to keep us among the top teams each year.
4. After a few moments, import.io will have scraped all the data from the webpage and will display it to you in their “Data view.” Here you can add, remove, or rename columns in the table by selecting elements on the webpage. This is an optional step that can help you refine your dataset before generating your live query API URL for transfer; you can just as easily perform most of these operations in the Designer. For my example, I renamed the columns to reflect the statistic names on ESPN and added the “Misc TD” field that escaped the scraping algorithm.
5. Once your data is ready for import, click the red “Done” button in the top right-hand corner. You’ll be redirected back to your dashboard, where you can now see the extractor you created in the last step. Select this extractor and look for the puzzle piece “Integrate” tab just below the extractor name in your view. You can paste the “Live query API” URL listed here (there’s also an option to download a CSV file of your data) into a browser window to copy the JSON response that contains your data, or you can call it directly from your workflow using the Download Tool (just be sure to de-select “Encode URL Text” when you specify the URL field).
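If you'd like to check the live query link outside of the Designer, here is a minimal Python sketch of the same idea: request the URL exactly as copied and parse the JSON that comes back. The names `fetch_json` and `encode_again` are my own, the URLs are made-up placeholders, and I'm not assuming anything about the layout of import.io's response, so peek at the structure before relying on any field. The second half illustrates one plausible reason for de-selecting “Encode URL Text” (an assumption on my part, not something documented here): if the copied link already contains percent-encoded characters, encoding it again changes what the URL means.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def fetch_json(url, timeout=30):
    """Request the URL exactly as given (no extra encoding) and parse the JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def encode_again(url):
    """Percent-encode every reserved character, including existing '%' signs."""
    return quote(url, safe="")


# Made-up URL that already carries a percent-encoded page address as a parameter.
already_encoded = "https://example.com/query?url=http%3A%2F%2Fexample.org%2Fpage"
twice_encoded = encode_again(already_encoded)
print(twice_encoded)  # every '%' becomes '%25', so '%3A' turns into '%253A'

# With your own link pasted in, the call would look like:
# data = fetch_json("PASTE-YOUR-LIVE-QUERY-API-URL-HERE")
# print(json.dumps(data, indent=2)[:500])  # peek at the structure first
```

Printing the first few hundred characters of the response before doing anything else is a cheap way to learn which keys hold the table rows, since that layout depends on your extractor.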
That’s it! You should now have an integrated live query API for your webpage, along with an extractor that can be reused to scrape other pages on that website if you want to try them as well. If you’d like to learn more about the approach, or how to customize it with external scripts, try the import.io community. The sample I used above is attached here in the v10.5 workflow Webscrape.yxmd; you just have to update the live query API with one specific to your account, extractor, and webpage URL. If you decide to give it a try with the example above, be sure to let us know if we helped your fantasy team win big!