We currently use Regex and Text to Columns to parse raw HTML, read in as text, into the appropriate format when web scraping. A tool that could at least parse tables would be hugely beneficial.
This functionality exists within Qlik so it would be nice to have this replicated in Alteryx.
Obviously, we need to retain the ability to scrape raw HTML, but automatically parsing data using the <td>, <th> and <tr> tags would be nice.
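To illustrate the kind of tag-driven parsing meant here, below is a minimal sketch in Python using only the standard library's `html.parser`. It is not how Alteryx or Qlik does it. The class name `TableRowParser`, the helper `parse_rows`, and the sample HTML are all made up for illustration. It assumes a single, non-nested table and ignores `colspan`/`rowspan`.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical helper: collects <th>/<td> cell text, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row under construction (None outside a <tr>)
        self._cell = None  # text fragments of the open cell (None outside a cell)

    def _close_cell(self):
        # Store the open cell, if any, in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Store the open row, if any, closing a dangling cell first.
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> or </th>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def parse_rows(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample input, standing in for the raw HTML a scrape returns.
sample = """
<table>
  <tr><th>State</th><th>Capital</th></tr>
  <tr><td>Alabama</td><td>Montgomery</td></tr>
  <tr><td>Alaska</td><td>Juneau</td></tr>
</table>
"""
rows = parse_rows(sample)
```

The first row holds the `<th>` headers and the rest hold the data, so the result maps directly onto a tabular output with field names taken from the first row.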
In the following page there is a table showing the states and territories of the US:
With Qlik, you can input the URL and it will return the available tables in tabular format:
As this functionality exists elsewhere it would be nice to incorporate this into Alteryx.
+1. I think a macro could be created and utilized in the meantime, but an out of the box tool would be even better.
Yeah, macros are great for repetitive ad hoc tasks that are pretty much unique to the situation, but repetitive tasks that are generic across all users are something that I feel should be developed as part of the core functionality. I mean, who doesn't parse HTML tables?
Thanks for the request. This is something that we have seen a need for both from customer requests as well as internal use of Alteryx. Some work has been done to try and create a tool for this, but it still needs more work in order to finish it up. There are a lot of edge cases with HTML tables that are taking some work. We will continue to look into it.
This is one of a few areas that I think that we can improve the download tool - the other is to add native support within Alteryx for HTML and for XML.
We talked about this with @Ned and @AdamR and @NickJ at Inspire. Essentially the idea would be to implement a new type within Alteryx for XML / HTML - and this would allow you to parse this kind of data using an object model.
One of the common functions in parsing HTML is to spot a table, and then pull this out into data - as you say above - and this would be one of the first capabilities that we could look to implement on this new type.
Fully support your thinking here - trying to unpick tables out of a text field in a data stream is more pain than it needs to be currently.
It's one of those things that should be fairly straightforward to implement (the quality of the HTML notwithstanding), and I think it is aligned with removing the need for technical intervention when users don't have the Regex skills required.
In the short term I would suggest implementing functionality similar to the ImportHTML function in Google Sheets, and dealing with fringe cases at a later date if ever.
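As a rough idea of what an ImportHTML-style interface could look like, here is a hedged Python sketch built on the standard library's `html.parser`. Google Sheets' `IMPORTHTML` takes a URL, a query ("table" or "list") and a 1-based index. The function below, `import_html`, is a hypothetical stand-in: it takes an HTML string instead of a URL (the download step is left out), supports only `"table"`, and deliberately skips fringe cases such as nested tables and `colspan`/`rowspan`.

```python
from html.parser import HTMLParser


class _TableCollector(HTMLParser):
    """Hypothetical helper: gathers every <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per <table> seen
        self._cell = None  # text fragments of the open cell, if any

    def _flush_cell(self):
        # Append the open cell's text to the latest row of the latest table.
        if self._cell is not None:
            self.tables[-1][-1].append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._flush_cell()
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._flush_cell()
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._flush_cell()
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._flush_cell()


def import_html(html, query="table", index=1):
    """Return the index-th table (1-based) in html as a list of rows."""
    if query != "table":
        raise ValueError("this sketch only supports query='table'")
    collector = _TableCollector()
    collector.feed(html)
    collector.close()
    if not 1 <= index <= len(collector.tables):
        raise IndexError(f"page has {len(collector.tables)} table(s)")
    return collector.tables[index - 1]


# Made-up page with two tables, standing in for downloaded HTML.
page = """
<p>Intro</p>
<table><tr><th>Rank</th></tr><tr><td>1</td></tr></table>
<table><tr><th>Code</th><th>Name</th></tr><tr><td>AK</td><td>Alaska</td></tr></table>
"""
second = import_html(page, "table", 2)
```

Selecting a table by position keeps the interface small. The fringe cases can then be layered on later without changing how it is called.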