Hi all,
I need help with web scraping.
I am seeing some missing values (data) when web scraping with Alteryx.
The table below shows the original page source next to the Alteryx web scraping output. The output is missing entire row values (highlighted in blue).
Please advise.
Alteryx flow -
Web URL Source | Alteryx Web Scraping |
<td class="rowLabel" style="width: 210px;">Project:</td> | <td class="rowLabel" style="width: 210px;">Project:</td> |
<td> | <td> |
<span class="drop_hilite">August 2017</span> | <span class="drop_hilite">August 2017</span> <span class="add_hilite">December 2019</span> </td> |
</td> | </tr> |
<td> | <tr> |
<span class="add_hilite">December 2019</span> | <td class="rowLabel" style="width: 210px;">Status:</td> |
</td> | <td> |
</tr> | <span class="add_hilite">Active, </span>not <span class="drop_hilite">yet </span> recruiting </td> |
<tr> | </tr> |
<td class="rowLabel" style="width: 210px;">Status:</td> | <tr> |
<td> | <td class="rowLabel" style="width: 210px;">Start:</td> |
Not <span class="drop_hilite">yet </span>recruiting | <td> |
</td> | |
<td> | |
<span class="add_hilite">Active, </span>not recruiting | |
</td> | |
</tr> | |
<tr> | |
<td class="rowLabel" style="width: 210px;">Start:</td> | |
This webpage likely generates some of its content dynamically after the initial page load, which would mean the Download tool does not capture all of the information. If that is the case, you may want to use the Python tool with Selenium instead. Here is a helpful article that walks you through the process: https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Python-Code-Tool-Web-Scraping-Dynamic-...
Selenium will scrape the HTML just like the Download tool does, but you can have it wait for the extra content to load. It is also very powerful: you can use it to click buttons and pass values into text boxes. I have used it in quite a few cases where I needed to interact with a webpage beyond just scraping the HTML.
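To give a rough idea of what that could look like in the Python tool, here is a minimal sketch, assuming Selenium 4 and Chrome are available on the machine running the workflow. The function names, the 20-second timeout, and the `td.rowLabel` selector (borrowed from your snippet) are my own placeholders. In practice you would wait on an element that only appears once the late content has loaded. The small `cells_per_row` helper is just a quick way to check whether a row came back with both value cells or with a single merged one.

```python
from html.parser import HTMLParser


def fetch_rendered_html(url, wait_css="td.rowLabel", timeout=20):
    """Open `url` in headless Chrome and return the HTML once `wait_css` exists."""
    # Imported lazily so the checker below can run without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until at least one matching element is in the DOM, or time out.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()


class RowCellCounter(HTMLParser):
    """Count the <td> cells inside each <tr> of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.counts = []  # one entry per <tr>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.counts.append(0)
        elif tag == "td" and self.counts:
            self.counts[-1] += 1


def cells_per_row(html):
    """Return a list with the number of <td> cells found in each <tr>."""
    parser = RowCellCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

You could call `cells_per_row(fetch_rendered_html(your_url))` and compare the counts with what the Download tool returns. In your example, the browser source has three cells in the Project row (label plus two values), while the scraped version has only two.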
As a final note, you may want to see if the webpage has an API available. An API is always preferable to web scraping, because the structure of an API response is more stable than a page's HTML. You can imagine a scenario where someone changes the layout or content of a webpage, and your workflow can no longer find the tags it was using previously. Here is a helpful article on API calls: https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/APIs-in-Alteryx-cURL-and-Download-T...
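For comparison, here is a minimal sketch of the API route using only the Python standard library. I do not know whether this particular site offers an API, so the endpoint, the query parameters, and the field names are all hypothetical. The point is only that you read values by name from JSON instead of relying on the position of `<td>` tags.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def fetch_json(base_url, params=None, timeout=30):
    """GET `base_url` with optional query parameters and decode the JSON body."""
    url = f"{base_url}?{urlencode(params)}" if params else base_url
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pick_fields(record, names):
    """Return only the named keys, with None for any the API did not send."""
    return {name: record.get(name) for name in names}
```

Usage would look something like `pick_fields(fetch_json("https://example.org/api/records", {"id": "123"}), ["status", "start"])`, where the URL and keys are placeholders. A missing field comes back as `None` rather than silently shifting the other values, which is the resilience I mentioned above.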