Alteryx Designer Desktop Discussions


Scraping tables from webpage using BeautifulSoup, error message: list index out of range

Roche
8 - Asteroid

Hi everyone, 

 

I am trying to scrape all the tables from the webpage https://support.f5.com/csp/article/K4309

Please see the attached Python code.

My code is giving the following error message:

Roche_2-1647523396609.png

A screenshot of the first table that I am trying to scrape (head and tail) is given below. The tables' columns are consistent except for the very last row, and I am wondering whether this is the reason for the error message. Either way, can someone please help me correct the mistake?
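For readers hitting the same symptom: the attached code and the screenshot are not reproduced here, so the following is only a sketch of one common cause. If each row is turned into a list of cell texts and then indexed by column position, a final row with fewer cells (for example a merged footnote row) raises "list index out of range". The helper name `pad_row` and the sample rows are hypothetical.

```python
def pad_row(cells, width, fill=""):
    """Return exactly `width` cells: pad short rows, truncate long ones."""
    return (list(cells) + [fill] * width)[:width]


# Hypothetical rows as they might come out of a table parser.
# The last row is a single merged cell, like a footnote row.
rows = [
    ["Version", "Released", "Status"],
    ["1.0", "2020-01-01", "Supported"],
    ["Note: dates are illustrative"],
]

width = len(rows[0])                        # use the header row as the expected width
table = [pad_row(r, width) for r in rows]   # every row can now be indexed safely
```

After padding, `table[2][1]` is an empty string rather than an error, so the odd row can be kept or filtered out afterwards.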

 

Roche_0-1647523072338.png

..........

..........

..........

Roche_1-1647523108320.png

 

Thank you for your help

 

Rouche

 

PhilipMannering
16 - Nebula

I think the reason for your error is that you get the cookie notification instead of the actual webpage, so none of the tables can be found. You might be able to solve it by adding cookies to the request headers, by using Selenium, or by calling an API, as illustrated in the attached workflow.
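A quick way to check this diagnosis is to count the `<table>` tags in the HTML that was actually downloaded before indexing into any list of tables. The sketch below uses the standard-library `html.parser` rather than BeautifulSoup so that it runs without extra packages; the two sample pages are made up for illustration and are not the real responses from the site.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "table":
            self.count += 1


def count_tables(html):
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical responses: a consent page without tables, and a page with one table.
cookie_page = "<html><body><div id='consent'>We use cookies</div></body></html>"
real_page = "<html><body><table><tr><td>a</td></tr></table></body></html>"

for name, page in [("cookie_page", cookie_page), ("real_page", real_page)]:
    print(name, "contains", count_tables(page), "table(s)")
```

If the count is zero, indexing the first table (for example `tables[0]`) fails with "list index out of range", which would match the error described above.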

PhilipMannering_0-1647603159839.png

 

Roche
8 - Asteroid

Hi Philip, 

 

Thank you so much for the flow and also extracting the first table for me.  

Can I ask - how did you know that content.0.value has table 1?  

 

Thanks, 

Rouche

PhilipMannering
16 - Nebula

No problem. I had a look at the results in the Browse Tool and it matched what was on the webpage.

Roche
8 - Asteroid

OK, I am trying to figure out how to map back from the JSON names, but I cannot find a pattern that lets me locate the next tables. Would you be willing to help me do this?

Where did you put the Browse tool? If I put the Browse tool after the Select tool, it shows only the JSON_Name and JSON_ValueString information.

Roche
8 - Asteroid

Hi Philip, 

 

I have not heard from you for a couple of days. I am wondering if you will be able to help me identify which JSON_Name is associated with which table? I tried a few things and also added a Browse tool after the Select / JSON tool, but I do not see anything from the page's source code reflected there.

 

Would appreciate it if you can help.

 

Rouche

PhilipMannering
16 - Nebula

Hi @Roche. I'm sorry, I didn't see your reply. It looks like all the tables are in that one cell.

 

I've parsed out each table into its own row. Take a look at the attached workflow.

PhilipMannering_0-1648113663394.png
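The workflow itself is only available as an attachment, so the following is a rough Python equivalent of the idea, not a copy of it: take one long HTML string and split it into one entry per table. The regular expression and the sample string are assumptions made for illustration, and the pattern does not handle nested tables.

```python
import re

# Non-greedy match from each opening <table ...> to the next </table>.
# DOTALL lets "." span line breaks; IGNORECASE accepts <TABLE> as well.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def split_tables(html):
    """Return a list with the HTML of each table, in document order."""
    return TABLE_RE.findall(html)


# Hypothetical cell content holding two tables and some surrounding text.
cell = (
    "<p>Intro</p>"
    "<table id='a'><tr><td>1</td></tr></table>"
    "<p>More text</p>"
    "<TABLE><tr><td>2</td></tr></TABLE>"
)

tables = split_tables(cell)
for number, table_html in enumerate(tables, start=1):
    print(number, table_html)
```

Each element of `tables` can then be parsed on its own, which corresponds to having one table per row.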

Let me know if there's anything else.

 

Roche
8 - Asteroid

Hi Philip,

 

Thank you so much for the flow and your time!  Greatly appreciate your help :)

 

Can I ask another question - how did you know that content.0.value is the JSON_Name with table 1?  I understand that tokenizing let you see that all the tables are contained in content.0.value, but I do not know how you knew that this is the value that holds a table.  I have tried adding a Browse tool at the Select tool, but that did not give me any additional information.

 

Thanks,

 

Rouche

PhilipMannering
16 - Nebula

If you double-click on a cell in the results window (a), you can see what's in it (b). You can close the cell viewer by clicking Cell Viewer (c). You can do all this without a Browse Tool (just using the output anchor); however, the text will be truncated if it is too long.

 

So I double clicked on content.0.value, and saw that the html for the whole web page was in there. That's why I filtered to, and started parsing, that particular record/cell.

 

PhilipMannering_0-1648224035922.png
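On the question of how a name such as content.0.value maps back to the JSON: names of that shape look like a path through the document, with dictionary keys and list positions joined by dots. The sketch below flattens a nested structure that way. It is an assumption about the naming pattern seen in this thread, not a description of how the JSON Parse tool is implemented, and the sample document is invented.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts/lists into (dotted_name, leaf_value) pairs."""
    rows = []
    if isinstance(value, dict):
        for key, child in value.items():
            name = f"{prefix}.{key}" if prefix else str(key)
            rows.extend(flatten(child, name))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            name = f"{prefix}.{index}" if prefix else str(index)
            rows.extend(flatten(child, name))
    else:
        rows.append((prefix, value))
    return rows


# Hypothetical response: "content" is a list whose first item has a "value" field.
doc = {
    "title": "Example article",
    "content": [{"type": "html", "value": "<table>...</table>"}],
}

for name, leaf in flatten(doc):
    print(name, "=", leaf)
```

Read this way, content.0.value is the `value` field of the first item in the `content` list, which is why the whole page can sit in a single cell.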

 

PhilipMannering
16 - Nebula

Thinking about it, a more dynamic (and obvious) way of doing it would be to just use the Filter tool with the Contains() function. Filter to the records that contain some text I know will be in the table, maybe even the table tag `<table` ...
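The same filter idea, sketched in Python for anyone working in the Python tool instead: keep only the flattened records whose value contains `<table`. The field names JSON_Name and JSON_ValueString come from this thread; the sample records and the `contains` helper are hypothetical, and the comparison is made case-insensitive on the assumption that this is the behaviour wanted here.

```python
# Hypothetical flattened records, one per JSON leaf value.
records = [
    {"JSON_Name": "title", "JSON_ValueString": "Some article"},
    {
        "JSON_Name": "content.0.value",
        "JSON_ValueString": "<p>Intro</p><table><tr><td>1</td></tr></table>",
    },
    {"JSON_Name": "content.0.type", "JSON_ValueString": "text/html"},
]


def contains(text, target):
    """Case-insensitive substring test that treats None as empty text."""
    return target.lower() in (text or "").lower()


# Keep only the records whose value holds at least one table tag.
matches = [r for r in records if contains(r["JSON_ValueString"], "<table")]

for record in matches:
    print(record["JSON_Name"])
```

This avoids hard-coding content.0.value, so the filter still works if the table HTML turns up under a different name.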

 

 
