Alteryx Designer Desktop Discussions

SOLVED

Download Tool - HTML parsing to retrieve just the Text (Questions/Answers) from a Website

taxguy33
8 - Asteroid

Hello,

 

I admittedly don't have much experience parsing HTML, but I have been tasked with scraping the text for the various countries from this website:

https://www.omnipresent.com/global-employment-solutions/peo-albania

For example, I am trying to get all of the text you see on the screen, like "Employee income Taxes in Albania - the rate of personal taxation varies depending on the income tax bracket the individual belongs to. This ranges between 0% - 23%". I want to create a matrix of these answers across countries, but figured I'd need to correctly parse the text (the answers to these questions) for one country before applying it to all the countries.

Is anyone able to take a stab at this using the Download tool, removing the various HTML tags so that only the text shown on the website is left in the Alteryx workflow?

 

Thanks! 

 

acarter881
12 - Quasar

Hi, @taxguy33.

 

My post here may help: https://community.alteryx.com/t5/Alteryx-Designer-Desktop-Discussions/Web-Scrape-Branch-Details/m-p/...

 

One reason I might not use the Download tool is that it has some constraints, one of which is that it is synchronous: it sends requests one at a time and waits for each response before sending the next. Python has asynchronous functions that let you send requests to many websites at once, and it has robust libraries for exactly this use case.
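
To illustrate the point about concurrent requests, here is a minimal sketch using only the Python standard library. It is not taken from the linked post. The names fetch and fetch_all are my own, and the User-Agent header is an assumption (some sites reject the default one). It needs Python 3.9 or later for asyncio.to_thread.

```python
import asyncio
import urllib.request


def fetch(url: str, timeout: float = 30.0) -> str:
    """Blocking download of one page, returned as decoded text."""
    # Assumption: a browser-like User-Agent; the site may not require it.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


async def fetch_all(urls, fetcher=fetch):
    """Run the blocking fetcher for every URL concurrently in worker threads.

    Returns a dict of url -> page text (or the exception raised for that URL).
    """
    tasks = [asyncio.to_thread(fetcher, url) for url in urls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return dict(zip(urls, results))


# Example call (performs real network requests, so it is left commented out):
# pages = asyncio.run(fetch_all([
#     "https://www.omnipresent.com/global-employment-solutions/peo-albania",
# ]))
```

The fetcher parameter is only there so the download step can be swapped out, for example for a stub while testing. A third-party async HTTP client would be the more usual choice for large numbers of pages.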

ArnaldoSandoval
12 - Quasar

Hi @taxguy33 

 

I am attaching a workflow that scrapes the information you need from the Albanian Global Employment Solutions & PEO web page you supplied. I will detail how some of the rules and formulas in the workflow were derived:

 

Albanian Global Employment Solutions & PEO inspection: (F12 in Google Chrome)

Albanian-GES-Inspect-01.png

  • Based on the description supplied, the page has several questions with answers, organized vertically.
  • We inspect the page and select the question section "Employer Costs in Albania", as shown in the screenshot above.
  • In the inspection panel on the right, we identify where the text "Employer Costs in Albania" is rendered in the HTML, specifically this tag: <h2 class="h2 atlas-heading">. This tag appears on nearly every question.
  • Almost every question section has this HTML tag, except near the end of the page, where the markup is a bit different.
  • The different tags found are: <h2 class="h2 atlas-heading">; <p>; <h2 class="h2">; <h2>; <h2 class="margin_top margin_zero">; and <p class="mt20">
  • We need to split the HTML returned by the Download tool at these tags. We do that by replacing each of them with an unusual character such as ¬, which is seldom used in HTML pages and is therefore safe to use as a delimiter.
  • Once the HTML tags have been replaced, we split the page into rows using that character as the delimiter.
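
For readers who prefer to see the rules above as code, here is a rough Python sketch of the same replace-and-split idea. It is not the formula from the attached workflow. The function name split_sections is my own, and the final clean-up (dropping leftover tags and collapsing whitespace) is an extra step I am assuming you would want.

```python
import html
import re

DELIM = "¬"

# Opening tags listed above that mark the start of a question or an answer.
SECTION_TAGS = [
    '<h2 class="h2 atlas-heading">',
    '<h2 class="h2">',
    '<h2 class="margin_top margin_zero">',
    '<h2>',
    '<p class="mt20">',
    '<p>',
]


def split_sections(page: str) -> list[str]:
    """Replace each section tag with the delimiter, split, and clean each row."""
    for tag in SECTION_TAGS:
        page = page.replace(tag, DELIM)
    cleaned = []
    for row in page.split(DELIM):
        text = re.sub(r"<[^>]+>", " ", row)        # drop any remaining tags
        text = html.unescape(text)                  # e.g. &amp; becomes &
        text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
        if text:                                    # skip rows with no text
            cleaned.append(text)
    return cleaned
```

On a real page the rows will also include navigation and footer text, so some filtering is still needed afterwards.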

Albanian page processing, Alteryx Workflow:

Albanian-GES-Inspect-02.png

  • The workflow applies the parsing and scraping rules described above to the Albanian page, producing a table.

The 0% - 23% range is returned by the workflow:
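
If you would rather get the table as question/answer pairs directly, an alternative to the delimiter approach is an HTML parser. The sketch below is a different technique from the one in the workflow: it uses Python's built-in html.parser and assumes each question is an <h2> and its answer is the <p> text that follows. The class name QAParser and function parse_qa are made up for illustration.

```python
from html.parser import HTMLParser


class QAParser(HTMLParser):
    """Collects [question, answer] rows: each <h2> starts a new question,
    and text in the <p> elements after it is appended to its answer."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []        # list of [question, answer]
        self._current = None  # "h2", "p", or None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.rows.append(["", ""])
            self._current = "h2"
        elif tag == "p" and self.rows:
            self.rows[-1][1] += " "   # keep separate paragraphs apart
            self._current = "p"

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "h2":
            self.rows[-1][0] += data
        elif self._current == "p":
            self.rows[-1][1] += data


def parse_qa(page: str):
    """Return a list of (question, answer) tuples with whitespace normalized."""
    parser = QAParser()
    parser.feed(page)
    parser.close()
    return [(" ".join(q.split()), " ".join(a.split())) for q, a in parser.rows]
```

Paragraphs that appear before the first heading are ignored, and inline tags inside a paragraph (bold, links) are kept as plain text.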

Albanian-GES-Inspect-03.png

Comments and Conclusions:

  • The workflow scrapes data from Omnipresent's page for Albania.
  • I think you want to scrape the Omnipresent pages for each European country. You can do that by replacing the country's URL in the Text Input at the beginning of the workflow.
  • Or by adding the URL and country to the starting Text Input (I just added this feature, and it works).

Albanian-GES-Inspect-04.png

  • The result table now includes the country.
  • I did not invest much time cleansing the Answers data, as I was not fully clear on your requirements, but that should not be complex to achieve.
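
Once every row carries its country, the matrix asked for in the original question is a pivot of (country, question, answer) rows. Below is a small Python sketch of that step, not part of the attached workflow. The function name build_matrix is hypothetical, and it assumes headings end with " in <Country>", as in "Employer Costs in Albania", so that the suffix can be dropped and the same question lines up across countries.

```python
def build_matrix(rows):
    """Pivot (country, question, answer) rows into question -> {country: answer}."""
    matrix = {}
    for country, question, answer in rows:
        # Assumption: headings end with " in <Country>"; strip that suffix so
        # the same question shares one key across countries.
        suffix = f" in {country}"
        key = question[: -len(suffix)] if question.endswith(suffix) else question
        matrix.setdefault(key, {})[country] = answer
    return matrix
```

Headings that do not follow that pattern are kept unchanged as their own keys, so they would need to be mapped by hand.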

 

Hope this helps,

Arnaldo

 

taxguy33
8 - Asteroid

@ArnaldoSandoval  This is awesome! Thank you so much - you spelled out everything you were doing perfectly, and it was easy to learn from.
