Alteryx Designer Desktop Discussions
SOLVED

web scraping

BRRLL99
11 - Bolide

Hello,

 

I would like to download all 2,000 rows and 7 columns from the following website:

 

https://www.forbes.com/lists/global2000/?sh=1a3b071b5ac0

 

 

2 REPLIES
binuacs
20 - Arcturus
NeilR
Alteryx Alumni (Retired)

Use the Download tool to scrape the HTML, then comb through the DownloadData to find what you're interested in. Once you find it, it's a parsing exercise. 

 


 

To "comb through" the DownloadData, I wrote the data out to CSV and opened it in Notepad++, then searched for something unique in the G2000 table (like Microsoft's "2,054" market value).
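If you would rather not leave for a text editor, the same search can be sketched in a few lines of Python. This is only an illustration: `find_context` and the `download_data` string are made-up stand-ins for the real DownloadData field, and the sample markup is shortened.

```python
def find_context(text, needle, width=40):
    """Return the text surrounding each occurrence of needle."""
    snippets = []
    start = text.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(needle) + width)
        snippets.append(text[lo:hi])
        start = text.find(needle, start + 1)
    return snippets


# Hypothetical stand-in for the HTML returned by the Download tool.
download_data = (
    '<div class="rank table-cell rank">4</div>'
    '<div class="marketValue table-cell market value">$2,054.37 B</div>'
)

# Search for something unique in the table, then look at the markup around it.
for snippet in find_context(download_data, "2,054"):
    print(snippet)
```

The printed snippet shows which tags and class names wrap the value, which is what you need before writing any parsing logic.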

 


 

Now we can see that the data we need is encased in DIVs of the following structure:

 

<div class=""marketValue  table-cell  market value "">$2,054.37 B</div>

 

Now the only tricky thing left is to write a regex to capture this. I'm sure there's more than one way to do it, but I ended up with:

 

<div class=(?:.*?)table-cell(.*?)<\/div>
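To see what this pattern actually captures, here is a minimal Python sketch that runs it against the sample DIV from above. The helper name `extract_cells` is hypothetical, and the doubled quotes in the sample are presumably just CSV escaping from writing the data out. Note that the capture group also picks up the rest of the class attribute, so the sketch trims everything up to the last `>` as an extra step that is not part of the regex itself.

```python
import re

# The regex from the post, unchanged.
PATTERN = re.compile(r'<div class=(?:.*?)table-cell(.*?)<\/div>')

# The sample DIV as it appeared in the CSV dump (doubled quotes included).
sample = '<div class=""marketValue  table-cell  market value "">$2,054.37 B</div>'


def extract_cells(html):
    """Return the visible text of every table-cell DIV in html."""
    cells = []
    for captured in PATTERN.findall(html):
        # captured still contains the tail of the class attribute,
        # e.g. '  market value "">$2,054.37 B', so keep only what
        # follows the last '>'.
        cells.append(captured.rsplit(">", 1)[-1].strip())
    return cells


print(PATTERN.search(sample).group(1))
print(extract_cells(sample))
```

Because `.` does not match newlines by default, this assumes each DIV sits on a single line, as it does in the sample.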

 

Some resources that helped me build my regex:

regex101: build, test, and debug regex

Greedy and lazy quantifiers (javascript.info)

regex - What is a non-capturing group in regular expressions? - Stack Overflow

 
