
Alteryx Designer Desktop Ideas

Improvement to Download Tool to allow for Web Scraping

The Download tool is currently a general-purpose tool that is used for many different things, from downloading FTP files to scraping websites.

 

However, as a general-purpose tool, it cannot serve the specific need of scraping a website without a huge amount of work to get there. What makes Alteryx great is that it drops the barrier so that regular folks can do some really powerful analytics, but the web scraping capabilities are not yet there and still require a tremendous amount of technical skill.

 

I'll go through this from top to bottom:

  • Split capability: The Download tool tries to be too many things to too many people. Break it up into its component parts - one for FTP, one for web scraping, etc. - each with deep specialty. You can still keep the Download tool as the super-user version, but by creating the specialized tools we can make this much more user-friendly.
  • Connection: For enterprise users with locked-down connectivity to the internet, there is no way to scrape web content without using cURL. We need the ability to connect to websites in a way that does not require cURL or complex connectivity setups for users to navigate web proxy settings.
    • Alteryx could auto-detect settings by allowing the user to point to the site within a controlled browse form, as Excel does
  • Parameters: Many websites explicitly support named parameters (using ? notation). It would be very useful to allow the user to supply these parameters explicitly without having to do complex string concatenations or %20 scrubbing to get rid of non-URL-friendly characters - see the first sketch after this list.
  • Content: Alteryx gives the user no native ability to process HTML, so all the scrubbing to pull out a specific field has to be done by reading through the underlying source of the website (delivered in "DownloadedData"), guessing at patterns for how the site does tables or spans etc., and then writing complex regex - see the second sketch after this list.
    • Instead, we could present the user with a view of the web page and ask them to select the elements that they want
    • This would serve the dual purpose of making this user-friendly for regular folks by abstracting away the technicalities, while also allowing the Download tool to eliminate all the other parts of the page that are not wanted, like scripts, interstitial adverts, images, headers & footers, etc.
  • Improved post / parse capability: Sometimes the purpose of a URL is to generate a download (like the Google Finance API). Again, it would be good to observe the user using the target site to record and interpret what they are looking for and what they get (e.g. the file from Google).
  • HTML & XML types: why not an explicit type in Alteryx for web content?
  • Finally - HTML awareness. The Browse tools are not currently HTML-aware, so all the useful formatting needed to see what's going on, expand nodes, find patterns, etc. has to be copied out of Alteryx into Notepad++. Given the ubiquity of HTML parsers, pretty printers, and editors, it should be reasonably easy to get an inexpensive component that provides this capability.
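
To illustrate the Parameters point, here is a minimal Python sketch of what the proposed named-parameter support could do under the hood: build the ?key=value query string from named fields so the user never hand-encodes spaces or other non-URL-friendly characters. The site, path, and parameter names below are illustrative only.

```python
from urllib.parse import urlencode, urlunsplit

base = "www.example.com"          # hypothetical site
path = "/search"
params = {
    "query": "quarterly results 2024",   # contains spaces - no manual %20 scrubbing needed
    "region": "EMEA & APAC",             # contains & - would break naive string concatenation
}

# Encode the named parameters and assemble the full URL in one step
url = urlunsplit(("https", base, path, urlencode(params), ""))
print(url)
# https://www.example.com/search?query=quarterly+results+2024&region=EMEA+%26+APAC
```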
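And to illustrate the Content point, a minimal sketch of what "select the elements you want" looks like when done by hand today, assuming the raw page source is already available (e.g. in the Download tool's DownloadedData field) and using Python's BeautifulSoup as a stand-in for the proposed capability. The table id and field names are made up for the example.

```python
from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

downloaded_data = """
<html><body>
  <script>/* noise we don't want */</script>
  <table id="prices">
    <tr><th>Ticker</th><th>Close</th></tr>
    <tr><td>ABC</td><td>101.2</td></tr>
    <tr><td>XYZ</td><td>55.7</td></tr>
  </table>
</body></html>
"""

# Select the table rows by CSS selector instead of regexing the raw source
soup = BeautifulSoup(downloaded_data, "html.parser")
rows = []
for tr in soup.select("table#prices tr")[1:]:   # skip the header row
    rows.append([td.get_text(strip=True) for td in tr.find_all("td")])

print(rows)   # [['ABC', '101.2'], ['XYZ', '55.7']]
```

A point-and-click version of this inside the Download tool would remove the scripts, adverts, and other unwanted parts of the page automatically, exactly as described above.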

 

4 Comments
SeanAdams
17 - Castor

Hey @TashaA

 

 

This is the web scraping discussion that we talked through at Inspire. It would be VERY useful if we could pull web scraping into a brand-new tool - a specific website connector with the rich web-scraping functionality of Excel.

andyuttley
11 - Bolide

I love this idea! 

Might also be good to get some additional HTML parsing functionality alongside this (similar to Python's BeautifulSoup package); I know this can be recreated manually, but it would be great to have it out of the box.

cgoodman3
14 - Magnetar

It would be great to have something as simple as the IMPORTHTML function in Google Sheets for scraping tables from websites. This could either be a function in the Formula tool or within the Download tool (more sensible?) - see the sketch below.

 

When they demonstrated copy and paste from a website into the Text Input tool, I assumed this was what it would do.
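
For reference, a rough sketch of the kind of one-liner this comment is asking for, using Python's pandas.read_html (which requires lxml or html5lib to be installed); the URL is just a placeholder for any page containing an HTML <table>.

```python
import pandas as pd

# read_html returns one DataFrame per <table> found on the page
tables = pd.read_html("https://www.example.com/some-page-with-a-table")
first_table = tables[0]
print(first_table.head())
```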

AlteryxCommunityTeam
Alteryx Community Team
Status changed to: Accepting Votes