Alteryx Designer Desktop Discussions

SOLVED

web scraping

BRRLL99
11 - Bolide

Hello,

 

I would like to download all 2,000 rows and 7 columns from the following website:

 

https://www.forbes.com/lists/global2000/?sh=1a3b071b5ac0

 

 

2 REPLIES
binuacs
21 - Polaris
NeilR
Alteryx Alumni (Retired)

Use the Download tool to scrape the HTML, then comb through the DownloadData to find what you're interested in. Once you find it, it's a parsing exercise. 

 

NeilR_0-1659453961816.png

 

To "comb through" the DownloadData, I wrote the data out to CSV and opened it in Notepad++, then searched for something unique in the G2000 table (like Microsoft's "2,054" market value).
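If you'd rather not leave the workflow for a text editor, the same "search for something unique" step can be sketched in Python. This is only an illustration: `find_context` is a hypothetical helper, and `sample` is a tiny stand-in for the real DownloadData value, built around the DIV quoted later in this post.

```python
from typing import List


def find_context(text: str, needle: str, width: int = 60) -> List[str]:
    """Return a snippet of `text` around every occurrence of `needle`."""
    snippets = []
    start = text.find(needle)
    while start != -1:
        left = max(0, start - width)
        right = min(len(text), start + len(needle) + width)
        snippets.append(text[left:right])
        # Continue searching just past the current hit.
        start = text.find(needle, start + 1)
    return snippets


# Stand-in for the downloaded HTML (assumption: the real page is far larger).
sample = (
    "<html><body>"
    '<div class="marketValue  table-cell  market value ">$2,054.37 B</div>'
    "</body></html>"
)

for snippet in find_context(sample, "2,054", width=70):
    print(snippet)
```

The printed snippet shows the markup that surrounds the value, which is what you need in order to write a pattern for it.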

 

NeilR_0-1659455627227.png

 

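Now we can see that the data we need is encased in DIVs of the following structure (the doubled quotes are most likely CSV escaping from the export rather than part of the page's HTML):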

 

<div class=""marketValue  table-cell  market value "">$2,054.37 B</div>

 

Now the only tricky thing left is to write a regex to capture this. I'm sure there's more than one way to do it, but I ended up with:

 

<div class=(?:.*?)table-cell(.*?)<\/div>
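To see what this pattern actually captures, here is a minimal sketch in Python rather than the Alteryx RegEx tool. The regex is unchanged from above. The `sample` string and the `extract_cells` helper are assumptions for illustration: the market value DIV comes from this post, while the class name on the "Microsoft" cell is made up.

```python
import re
from typing import List

# The regex from the post, unchanged. Python accepts the "\/" escape
# even though it does not require it.
PATTERN = re.compile(r'<div class=(?:.*?)table-cell(.*?)<\/div>')

# Hypothetical two-cell fragment; only the second DIV is taken from the page.
sample = (
    '<div class="name table-cell name">Microsoft</div>'
    '<div class="marketValue  table-cell  market value ">$2,054.37 B</div>'
)


def extract_cells(html: str) -> List[str]:
    """Return the text of every table-cell DIV found on a single line."""
    cells = []
    for raw in PATTERN.findall(html):
        # The capture group still holds the tail of the class attribute,
        # e.g. '  market value ">$2,054.37 B', so keep only what follows
        # the last '>'.
        cells.append(raw.rsplit(">", 1)[-1].strip())
    return cells


print(extract_cells(sample))  # ['Microsoft', '$2,054.37 B']
```

Note that the capture group picks up the rest of the class attribute along with the value, so a small clean-up step is still needed after the match. Also, `.` does not match newlines by default, so this sketch assumes each DIV sits on a single line.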

 

Some resources that helped me build my regex:

regex101: build, test, and debug regex

Greedy and lazy quantifiers (javascript.info)

regex - What is a non-capturing group in regular expressions? - Stack Overflow

 
