I'm trying to get Alteryx to pull a certain hyperlink from this source code view-source:https://www.fda.gov/drugs/drug-approvals-and-databases/approved-drug-products-therapeuti... (specifically line 484).
I think I could download the source code as text (although I'm not sure how) and change the text to columns and from there filter to the hyperlink I need.
I'm wondering if it's possible to direct Alteryx to pull the data straight from that location, so that even if the hyperlink changes, the data will be updated. If not, is it possible for Alteryx to automatically pull text data from the source code?
Hi @helenjin1
Here's a workflow that (eventually) gets your PDF. It demonstrates the basics of web scraping.
It starts by building the absolute path from the site address and the relative path. The Download tool gets the HTML from this path, and the Text to Columns tool splits it into rows on the newline character. After a Record ID is added, the Filter tool pulls out line 484 and the RegEx tool (in Parse mode) extracts the relative path.
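For anyone who finds it easier to read these steps as code, here is a rough Python sketch of the same idea. It is not the attached workflow, only an illustration: the function name `link_on_line`, the sample HTML, and the path `/drugs/example-page` are made up, and the pattern assumes the href value is wrapped in double quotes.

```python
import re
from urllib.parse import urljoin


def link_on_line(html, line_number, base_url):
    """Return the absolute URL of the first href on a 1-based line, or None."""
    # Equivalent of Text to Columns: split the page into rows on newlines.
    lines = html.splitlines()
    if not 1 <= line_number <= len(lines):
        return None
    # Equivalent of Record ID + Filter: keep only the requested line.
    line = lines[line_number - 1]
    # Equivalent of the RegEx parse step: capture the href value
    # (assumes double-quoted attribute values).
    match = re.search(r'href="([^"]+)"', line, re.IGNORECASE)
    if match is None:
        return None
    # Equivalent of building the absolute path from site + relative path.
    return urljoin(base_url, match.group(1))


# Hypothetical sample page; the real page and line 484 are not reproduced here.
sample = '<html>\n<body>\n<a href="/drugs/example-page">Patents</a>\n</body>\n</html>'
print(link_on_line(sample, 3, "https://www.fda.gov/"))
```

In the real case the HTML would come from a download of the page rather than from a hard-coded string.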
Since there's a Patents page between the home page and the file you're looking for, the process repeats for this page.
Finally, in the PDF container, the PDF file is downloaded and saved to disk in the same directory as the workflow.
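As a loose Python analogue of that last step (again, not what the workflow does internally), the sketch below works out a local file name from the URL and writes the downloaded bytes next to the script. The names `target_path` and `save_download`, the fallback name `download.pdf`, and the example URL are all assumptions for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def target_path(url, directory):
    """Pick a local file path for a URL, falling back to a default name."""
    name = Path(urlparse(url).path).name or "download.pdf"  # assumed fallback
    return Path(directory) / name


def save_download(url, directory="."):
    """Download the URL and write its bytes into the given directory."""
    destination = target_path(url, directory)
    with urlopen(url) as response:
        destination.write_bytes(response.read())
    return destination


# Hypothetical URL, used only to show how the file name is derived.
print(target_path("https://www.fda.gov/media/12345/example.pdf", "."))
```

Calling `save_download(url)` with the link found in the previous step would then write the file to the current directory.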
This isn't the most robust way to find the file, since the process relies on line numbers to locate the links. You'll want to modify it to search for the href attributes of the link tags instead of using line numbers. That will future-proof the workflow against any HTML changes that move the lines around.
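To show what searching for the links could look like, here is a minimal Python sketch that collects every href on a page and keeps the ones containing a keyword, so the result does not depend on which line the link sits on. The class `LinkCollector`, the function `find_links`, the sample HTML, and the path `/media/12345/download-example.pdf` are hypothetical. In Alteryx, the closest equivalent is probably a RegEx tool that matches the href pattern across the whole page, followed by a Filter on the link text or path.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> matches too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_links(html, base_url, keyword):
    """Return absolute URLs of links whose href contains the keyword."""
    collector = LinkCollector()
    collector.feed(html)
    keyword = keyword.lower()
    return [urljoin(base_url, href)
            for href in collector.links
            if keyword in href.lower()]


# Hypothetical sample page; the line a link sits on no longer matters.
sample = ('<p><a href="/about">About</a></p>\n<div>\n'
          '<A HREF="/media/12345/download-example.pdf">List</A>\n</div>')
print(find_links(sample, "https://www.fda.gov/", ".pdf"))
```

Filtering on something stable, such as part of the path or the link text, is what makes this less brittle than a fixed line number.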
Dan
Thank you! That was really helpful