
Alteryx Designer Knowledge Base

Definitive answers from Designer experts.

Web Scraping

Community Data Engineer

Web scraping, the process of extracting information (usually tabulated) from websites, is an extremely useful way to gather web-hosted data that isn’t supplied via APIs. In many cases, if the data you are looking for is stand-alone or captured completely on one page (no need for dynamic API queries), scraping it is even faster than developing a direct API connection.

 

With the wealth of data already published on websites, easy access to it can be a great supplement to your analyses, whether to provide context or to supply the underlying data for new questions. Although there are a handful of approaches to web scraping (two detailed on our community, here and here), there are also a number of great, free tools online (ParseHub and import.io, to name a few) that can streamline your web scraping efforts. This article details one approach that I find particularly easy: using import.io to create an extractor specific to your desired website, then integrating calls to it into your workflow via the live query API link the service provides. You can do this in a few quick steps:


1. Navigate to the import.io homepage, https://www.import.io/, and click “Sign up” in the top right-hand corner:


1.png


2. Once you’re signed up to use the service, navigate to your dashboard (a link can be found in the same corner of the homepage once logged in) to manage your extractors.


3. Click “New Extractor” in the top left-hand corner and paste the URL that contains the data you’re trying to scrape into the “Create Extractor” pop-up. Since fantasy football drafting season is just ahead of us, we’ll use as an example the tabulated data on last year’s top scorers provided by ESPN so you don’t end up like this guy (thank me later). We know our users go hard and the stakes are probably pretty high, so we want to get this right the first time, using an approach that is reproducible enough to supply the information needed to keep us among the top teams each year.


4. After a few moments, import.io will have scraped all the data from the webpage and will display it in its “Data view.” Here you can add, remove, or rename columns in the table by selecting elements on the webpage. This step is optional: it can help you refine your dataset before generating your live query API URL, but you can just as easily perform most of these operations in the Designer. For my example, I renamed the columns to reflect the statistic names on ESPN and added the “Misc TD” field that escaped the scraping algorithm.


5. Once your data is ready for import, click the red “Done” button in the top right-hand corner. You’ll be redirected back to your dashboard, where you can now see the extractor you created in the last step. Select this extractor and look for the puzzle-piece “Integrate” tab just below the extractor name. You can paste the “Live query API” URL listed there (there’s also an option to download a CSV file of your data) into a browser window to view and copy the JSON response that contains your data, or you can call it directly from your workflow using the Download Tool (just be sure to de-select “Encode URL Text” when specifying the URL field):

 

2.png


3.PNG
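
For readers who want to see what handling the JSON response looks like outside the Designer, here is a minimal Python sketch of the same idea as the Download Tool call in step 5: fetch a URL, decode the JSON, and flatten the records into a table. The `results` key, the field names, and the sample payload are assumptions made up for illustration, not the service's documented response format, so inspect your own response before relying on them.

```python
import json
import urllib.request


def fetch_json(url, timeout=30):
    """Download a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def rows_from_payload(payload, records_key="results"):
    """Return (column_names, rows) from a payload holding a list of records.

    ASSUMPTION: the records sit in a list of objects under `records_key`.
    The real response layout may differ.
    """
    records = payload.get(records_key, [])
    columns = []
    for record in records:
        for name in record:
            if name not in columns:
                columns.append(name)  # keep first-seen column order
    # Records missing a column get None in that position.
    rows = [[record.get(name) for name in columns] for record in records]
    return columns, rows


# Made-up sample standing in for a live response, so no network call is needed.
sample = json.loads(
    '{"results": [{"Player": "A", "PTS": 310.5},'
    ' {"Player": "B", "PTS": 298.0, "Misc TD": 1}]}'
)
columns, rows = rows_from_payload(sample)
```

With a real live query API URL copied from the “Integrate” tab, `fetch_json(url)` would supply the payload in place of `sample`.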


That’s it! You should now have an integrated live query API for your webpage, along with an extractor that can be reused to pull data from other pages of that website. If you’d like to learn more about the approach, or about how to customize it with external scripts, try the import.io community. The sample I used above is attached here in the v10.5 workflow Webscrape.yxmd; you just have to update the live query API URL with one specific to your account, extractor, and webpage URL. If you decide to give it a try with the example above, be sure to let us know if we helped your fantasy team win big!

Attachments
Comments
Meteor

 I love ScrapingHub as well to get data-- free with awesome support. 

Bolide

Harbinger, I agree & I would give you a star if I could! 🙂

Bolide

On a side note - is webscraping legal? We had an internal discussion about it & I did some research and findings are mixed...

Bolide

@simon

 

My company is also investigating the legal risks behind web scraping. This came after my team had built a really powerful web scraping process with Alteryx! Based on what we've found, the answer depends on the website you are scraping from and what you are doing with the data you scrape. Take a look at any Terms and Conditions that may be present on the webpage that contains the data, or on the website that governs that webpage and data. Your legal team will hopefully be able to gauge the legality of scraping and storing such data based on what is written there.

Meteor

You've wandered into a grey area 🙂. Generally, I do a good amount of reading of the website's legal documentation before I scrape. You may also want to reach out to your company's counsel before you scrape, that is, if you have one! If you don't, ask a lawyer buddy; they should be able to help.

Often, the act of scraping is not what gets you in legal trouble (though it may be in a few cases); it is what you use the data for. For instance, if you simply want to make an informed purchase based on Amazon comments, you could scrape the site for a set of particular products, say coffee makers, run sentiment analysis on the data, and choose which one you'd like waking your family up at 5:30AM each morning ;). However, if you were to scrape all of the possible coffee makers from Amazon and use that data for your own profit, say, to build a database of coffee makers that you then resell, that probably isn't cool. My general rule of thumb is: if you're using data from a web scraper to profit, it isn't cool. If you're using said data to inform a personal or purely academic decision, that's probably cool. Now, these are my opinions and not those of my company or their clients, and I am in no way legal counsel; I'd recommend you talk to a legal expert either from the site you're scraping or from your own organization. I hope this helps!

 

--JH

Bolide

@_Harbinger_ @DultonM

 

I agree with both of you. It's definitely a grey area and maybe not everyone is aware of this! I've read some cases: Facebook vs., eBay vs., and airline vs. Legality 'seems' to boil down to whether you're in a competitive space and making a profit. Each case/situation can be different...

So I feel scraping store addresses would be fine. Scraping stores with products and prices (can't get API access) maybe. In my case, this would be used to help the same company increase profits by porting data to FB ads. Either way, I'm glad I'm not the only one wondering about this. When in doubt - look at terms and talk to counsel.

Thanks for your input!

Asteroid

I think the integrate tab is not free anymore. On clicking the integrate tab it asks you to upgrade to the paid services. So I am guessing you can't get the "Live query API" for free on import.io

Alteryx Partner

@princejindal Yes, I found the same when clicking the 'integrate' button.

Community Data Engineer

@princejindal and @VegasBeans, thank you for bringing that change in the service to our attention!  We'll start working on documenting alternative methodologies 🙂

Meteor

Every website should tell you what you can and cannot scrape. Refer to the robots.txt file on the server for details.

 

Sample: https://www.linkedin.com/robots.txt
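
As a rough illustration of how such a file can be checked programmatically, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `my-scraper` user agent, and the `example.com` URLs are made up for illustration; a real check would load the file from the site you intend to scrape.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network call."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "my-scraper" is a hypothetical user-agent name; the URLs are placeholders.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/stats/page1")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")
```

Note that robots.txt only expresses the site's crawling preferences; it does not replace reading the site's terms, as discussed in the comments above.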

Comet

@MattD (or any other user) I am trying to update my weekly challenge tracker and learn web scraping as well. I can't seem to get page 4 of my posts. That part of my workflow uses a Download tool pointed to the URL https://community.alteryx.com/t5/forums/recentpostspage/user-id/39450/page/4. The other pages (e.g. https://community.alteryx.com/t5/forums/recentpostspage/user-id/39450/page/3) come in fine, but on this page I am getting the following error after the dropdown data:

 

Your request failed. Please contact your system administrator and provide the date and time you received the error and this Exception ID: 713EC718.

Click your browser's Back button to continue.

 

I sent an email to support@alteryx.com as well, since I think this is a webpage XML issue, but I'm not an expert in that area. The rest of the web scraping from the Alteryx Community in the workflow is working fine.