
Alteryx Designer Desktop Discussions

SOLVED

web scraping

BRRLL99
11 - Bolide

Hello,

 

I would like to download all 2,000 rows and 7 columns of the table on the following website:

 

https://www.forbes.com/lists/global2000/?sh=1a3b071b5ac0

 

 

binu_acs
21 - Polaris
NeilR
Alteryx Alumni (Retired)

Use the Download tool to scrape the HTML, then comb through the DownloadData to find what you're interested in. Once you find it, it's a parsing exercise. 
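
For readers who want to try the same first step outside Alteryx, here is a minimal Python sketch of fetching a page's raw HTML. It is not what the Download tool does internally; the function name `fetch_html`, the User-Agent value, and the timeout are my own assumptions. Whether a plain request returns the table markup at all depends on how the site serves the page.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 30.0) -> str:
    """Fetch a URL and return the response body as text (hypothetical helper)."""
    # Assumption: some sites reject requests without a browser-like User-Agent.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Use the charset declared by the server, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (not executed here because it needs network access):
# html = fetch_html("https://www.forbes.com/lists/global2000/?sh=1a3b071b5ac0")
```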

 


 

To "comb through" the DownloadData, I wrote the data out to CSV and opened it in Notepad++, then searched for something unique in the G2000 table (like Microsoft's "2,054" market value).
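
The same "search for something unique" trick can be sketched in Python instead of Notepad++. The helper name `find_context`, the `width` parameter, and the outer `row` wrapper in the sample are made up for illustration; only the inner market-value DIV comes from the page.

```python
def find_context(text: str, needle: str, width: int = 60) -> list[str]:
    """Return a snippet of surrounding text for every occurrence of needle."""
    snippets = []
    start = text.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(needle) + width)
        snippets.append(text[lo:hi])
        # Continue searching just past the current hit.
        start = text.find(needle, start + 1)
    return snippets


# Stand-in for the downloaded HTML; the wrapper DIV is a hypothetical placeholder.
sample_html = (
    '<div class="row">'
    '<div class="marketValue  table-cell  market value ">$2,054.37 B</div>'
    '</div>'
)

# Search for a value known to appear in the table and show the markup around it.
for snippet in find_context(sample_html, "2,054", width=50):
    print(snippet)
```

The printed snippet shows which tags and class names wrap the value, which is what the regex in the next step needs to target.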

 


 

Now we can see that the data we need is encased in DIVs of the following structure:

 

<div class=""marketValue  table-cell  market value "">$2,054.37 B</div>

 

Now the only tricky part left is writing a regex to capture this. I'm sure there's more than one way to do it, but I ended up with:

 

<div class=(?:.*?)table-cell(.*?)<\/div>
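
To see what this pattern actually captures, here is a small Python sketch that runs it against the sample DIV above (with the doubled quotes exactly as shown). The helper name `extract_cells` and the final clean-up step are my own additions, not part of the workflow. Note that the capture group still contains the tail of the class attribute, so one more trim is needed to get the cell text.

```python
import re

# The pattern from the post, unchanged.
CELL_PATTERN = re.compile(r'<div class=(?:.*?)table-cell(.*?)<\/div>')


def extract_cells(html: str) -> list[str]:
    """Return the text of every table-cell DIV found in html (hypothetical helper)."""
    cells = []
    for captured in CELL_PATTERN.findall(html):
        # captured looks like '  market value "">$2,054.37 B';
        # keep only what follows the last '>' and trim whitespace.
        cells.append(captured.rsplit(">", 1)[-1].strip())
    return cells


row = '<div class=""marketValue  table-cell  market value "">$2,054.37 B</div>'
print(extract_cells(row))  # ['$2,054.37 B']
```

The lazy quantifiers (`.*?`) are what stop each match at the first closing `</div>` rather than running on to the last one in the document.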

 

Some resources that helped me build my regex:

regex101: build, test, and debug regex

Greedy and lazy quantifiers (javascript.info)

regex - What is a non-capturing group in regular expressions? - Stack Overflow

 
