Scraping tables from webpage using BeautifulSoup, error message: list index out of range
Hi everyone,
I am trying to scrape all the tables from the webpage https://support.f5.com/csp/article/K4309
Please see the attached Python code.
My code is giving the following error message:
A screenshot of the first table that I am trying to scrape (head and tail) is given below. The tables' columns are consistent except for the very last row, and I am wondering if this is the reason for the error message. Either way, can someone please help me correct this mistake?
..........
..........
..........
Thank you for your help
Rouche
- Labels:
- Download
- Parse
- Python
- Regex
- Tips and Tricks
I think the reason for your error is that the request returns the cookie notification instead of the actual webpage, so the HTML you receive does not contain any of the tables. You might be able to solve it by adding cookies to the request headers, by using Selenium, or by calling an API as illustrated in the attached workflow.
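To illustrate why a cookie notice would produce "list index out of range", here is a minimal sketch. It assumes the original code indexes into the list of tables it finds (for example `tables[0]`), which I can't confirm without seeing the attachment. It uses the standard library's `html.parser` rather than BeautifulSoup so it runs without extra installs, and the two HTML strings are made-up stand-ins, not the real responses from the site.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.count += 1


def count_tables(html):
    """Return the number of <table> tags found in the given HTML string."""
    parser = TableCounter()
    parser.feed(html)
    return parser.count


# Hypothetical stand-ins for what the request might return.
cookie_notice = "<html><body><div>Please accept cookies</div></body></html>"
real_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"

for name, html in [("cookie_notice", cookie_notice), ("real_page", real_page)]:
    n = count_tables(html)
    if n == 0:
        # An empty list of tables is what makes tables[0] raise IndexError.
        print(f"{name}: no tables found, indexing the first table would fail")
    else:
        print(f"{name}: {n} table(s) found")
```

Checking the count before indexing at least turns the error into a clear message that the page you fetched is not the page you see in the browser.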
Hi Philip,
Thank you so much for the flow and also extracting the first table for me.
Can I ask - how did you know that content.0.value has table 1?
Thanks,
Rouche
No problem. I had a look at the results in the Browse Tool and it matched what was on the webpage.
OK, I am trying to figure out how to map back from the JSON names, but I cannot find a pattern that would let me locate the other tables. Would you be willing to help me do this?
Where did you put the Browse tool? If I put the Browse tool after the Select tool, it shows only the JSON_Name and JSON_ValueString information.
Hi Philip,
I have not heard from you for a couple of days. I am wondering if you will be able to help me identify which JSON_Name is associated with which table? I tried a few things and also added a Browse tool after the Select / JSON tool, but I do not see the page's source code reflected there.
Would appreciate it if you can help.
Rouche
Hi @Roche. I'm sorry, I didn't see your reply. It looks like all the tables are in that one cell.
I've parsed out each table into its own row. Take a look at the attached workflow.
Let me know if there's anything else.
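For anyone who wants to do the same step in Python, a rough sketch of splitting one HTML string into one table per row could look like the code below. The regular expression and the sample HTML are my own illustration, not what the workflow uses, and the pattern assumes tables are not nested inside each other.

```python
import re

# Non-greedy match from each opening <table ...> to the next </table>.
# Assumption: tables are not nested; a nested table would be cut short.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.DOTALL | re.IGNORECASE)


def split_tables(html):
    """Return a list with one string per <table>...</table> block."""
    return TABLE_RE.findall(html)


# Made-up sample standing in for the single cell that holds the whole page.
sample_html = """
<p>Intro text</p>
<table class="a"><tr><td>first</td></tr></table>
<p>Between tables</p>
<table class="b">
  <tr><td>second</td></tr>
</table>
"""

tables = split_tables(sample_html)
for i, table in enumerate(tables, start=1):
    print(f"table {i}: {len(table)} characters")
```

Each element of `tables` can then be parsed on its own, which is the same idea as having one table per record.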
Hi Philip,
Thank you so much for the flow and your time! Greatly appreciate your help :)
Can I ask another question - how did you know that content.0.value is the JSON_Name with table 1? I understand that tokenizing let you see that all the tables are contained in content.0.value, but I do not know how you knew that this was the value holding a table. I have tried adding a Browse tool at the Select tool, but that did not give me any additional information.
Thanks,
Rouche
If you double click on a cell in the results window (a), then you can see what's in it (b). You can close the "cell viewer" by clicking Cell Viewer (c). You can do all this without a Browse Tool (just using the output anchor); however, the text will be truncated if it is too long.
So I double clicked on content.0.value, and saw that the html for the whole web page was in there. That's why I filtered to, and started parsing, that particular record/cell.
Thinking about it, a more dynamic (and obvious) way of doing it would be to just use the Filter tool with the Contains() function. Filter to the records that contain some text I know will be in the table, maybe even the table tag `<table` ...
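The same idea in Python, as a small sketch: keep only the records whose value contains `<table`. The field names JSON_Name and JSON_ValueString follow the JSON Parse output discussed in this thread, but the records themselves are invented for the example, and the helper name `records_containing` is just something I made up.

```python
# Invented records shaped like the JSON Parse output (name/value pairs).
records = [
    {"JSON_Name": "title", "JSON_ValueString": "Example article"},
    {
        "JSON_Name": "content.0.value",
        "JSON_ValueString": "<p>Intro</p><table><tr><td>1</td></tr></table>",
    },
    {"JSON_Name": "content.1.value", "JSON_ValueString": "<p>No tables here</p>"},
    {"JSON_Name": "content.2.value", "JSON_ValueString": None},
]


def records_containing(records, needle="<table"):
    """Return the records whose value contains `needle`, ignoring case."""
    needle = needle.lower()
    return [
        r for r in records
        # Treat missing/None values as empty strings so they never match.
        if needle in (r.get("JSON_ValueString") or "").lower()
    ]


matches = records_containing(records)
print([r["JSON_Name"] for r in matches])
```

This avoids hard-coding content.0.value, so it should keep working if the position of the HTML in the response changes.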
