I need help scraping data from https://www.ffiec.gov/npw/Institution/TopHoldings. Each top holding company has a set of subsidiaries that I’d like to download. Please help, as I’m new to this. I get a 403 message when I try. Thank you.
If it's 403 Forbidden, then nothing else can be done with the Download tool! You'll have to find another way - maybe an RPA tool can replicate the clicks to download the CSV from that site
I checked briefly, and it looks like the site is rejecting requests from Alteryx, likely because it blocks automated tools or requires a browser-based user agent.
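If the rejection really is about the user agent, one thing worth testing outside the Download tool is a request that sends a browser-like User-Agent header. Below is a minimal sketch using only Python's standard library. The header string is just an example value I picked, and there is no guarantee the site accepts it, since it may apply other bot checks as well.

```python
import urllib.error
import urllib.request

URL = "https://www.ffiec.gov/npw/Institution/TopHoldings"

# Assumption: an example browser-style User-Agent string, not a required value.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Return a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str) -> int:
    """Send the request and return the HTTP status code (for example 200 or 403)."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; report their code instead.
        return err.code


# Usage (needs network access):
# print(fetch_status(URL))
```

If this returns 200 where the plain request returned 403, the same header could be tried in the Download tool's header settings.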
I think it's difficult to achieve this using Alteryx.
https://www.ffiec.gov/npw/Institution/TopHolderList
Having said that, they have a data download. Check for any APIs or data downloads BEFORE trying any web scraping.
sample csv---> https://www.ffiec.gov/npw/FinancialReport/ReturnFinancialReportCSV?rpt=BHCPR&id=1039502&dt=20241231
Hey @KD82
I've put together a solution using Alteryx's Python tool (found in the Developer Tool Palette) to scrape the table from the url https://www.ffiec.gov/npw/Institution/TopHoldings.
The attached workflow utilizes the Python libraries selenium & pandas to extract the data to pre-process in Alteryx. If you need to scrape a different url, the script may require minor adjustments to accommodate the new page structure.
While this solution is slightly more complex than other download methods, it automates the data extraction and preprocessing within Alteryx, eliminating the need for manual intervention 😊!
To use this solution:
Install Required Libraries: If not already installed, you'll need to add selenium & websockets to your miniconda environment’s site-packages, as this is where Alteryx executes Python commands. (I used pip install from the command line, then copied and pasted the packages into the desired folder.)
Import and Run the Workflow: Export the attached workflow in Alteryx, import the provided Python script into the Jupyter notebook within the Python tool, and run the workflow.
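Before running the workflow, a quick check inside the Python tool can confirm that the libraries from the first step are visible to the interpreter Alteryx is using. This is only a sketch; `missing_packages` is a helper name made up for illustration, and the package names are the ones mentioned in this post.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the package names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Shows which Python executable is running, and so whose site-packages is used.
print(sys.executable)

# An empty list means every package listed here was found.
print(missing_packages(["selenium", "pandas", "websockets"]))
```

Any name that shows up in the printed list still needs to be installed into that environment.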
I’d really like to know where to search for the APIs, as this could allow use of the Download tool. Much appreciated.
This is really helpful, but I have restrictions on Python library usage. I did learn a lot from what you provided…🙏🏾👏🏾
Hey, in Chrome use Ctrl+Shift+J to open the developer console. Explore the Network tab to see what's running and identify the backend APIs. I tend to stick to that method over Selenium unless I'm doing something that needs browser automation.
One more thing: the CSV link https://www.ffiec.gov/npw/FinancialReport/ReturnFinancialReportCSV?rpt=BHCPR&id=1039502&dt=20241231 is a combination of the report date (20241231) and the specific ID for that entity ("RssdId": 1039502, for example, for JP Morgan). You can try to chain these calls if you want to access the sub records.
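To make the link pattern concrete, here is a small sketch that assembles the CSV URL from an RSSD ID and a date. The function name `report_csv_url` is made up for illustration, and the parameter meanings (`rpt`, `id`, `dt`) are inferred from the sample link only, not from any official documentation.

```python
from datetime import date
from urllib.parse import urlencode

BASE = "https://www.ffiec.gov/npw/FinancialReport/ReturnFinancialReportCSV"


def report_csv_url(rssd_id: int, report_date: date, rpt: str = "BHCPR") -> str:
    """Build the CSV link from an RSSD ID and a report date (sent as YYYYMMDD)."""
    params = {"rpt": rpt, "id": rssd_id, "dt": report_date.strftime("%Y%m%d")}
    return f"{BASE}?{urlencode(params)}"


# RssdId 1039502 and the date 2024-12-31 come from the sample link in this thread.
print(report_csv_url(1039502, date(2024, 12, 31)))
```

Changing the date is then just a matter of passing a different `date(...)` value, and other entities can be requested by swapping in their RssdId. I have not checked which dates or IDs the site actually returns data for. In Alteryx, the same string could be assembled in a Formula tool and passed to the Download tool.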
Thank you. I’m having a little trouble with this part: I want the hierarchy for, say, JP Morgan, and I’m not sure how to modify the date. Please see attached.