Alteryx Designer Desktop Discussions

BRRLL99 · ‎08-01-2022

Hello,

I would like to download all 2,000 rows and 7 columns from the following website:

https://www.forbes.com/lists/global2000/?sh=1a3b071b5ac0

binuacs · ‎08-01-2022

@BRRLL99 This post might be useful for your use case:

https://www.thedataschool.co.uk/robbin-vernooij/web-scraping-html-tables-an-alteryx-workflow-and-r-s...

NeilR · ‎08-02-2022

Use the Download tool to scrape the HTML, then comb through the DownloadData to find what you're interested in. Once you find it, it's a parsing exercise.

To "comb through" the DownloadData, I wrote the data out to CSV and opened it in Notepad++, then searched for something unique in the G2000 table (like Microsoft's "2,054" market value).
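If you would rather not leave the workflow for Notepad++, the same "search for something unique" step can be sketched in a few lines of Python. This is only an illustration: `show_context` is a hypothetical helper, and the `download_data` string is a made-up stand-in for the real DownloadData field.

```python
def show_context(text, needle, width=60):
    """Return the text surrounding each occurrence of needle."""
    snippets = []
    start = text.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(needle) + width)
        snippets.append(text[lo:hi])
        start = text.find(needle, start + 1)
    return snippets


# Made-up stand-in for the HTML returned by the Download tool.
download_data = (
    '<div class="rank">5</div>'
    '<div class="marketValue  table-cell  market value ">$2,054.37 B</div>'
)

# Search for a value known to be in the table and look at the markup around it.
for snippet in show_context(download_data, "2,054"):
    print(snippet)
```

Reading the markup around a known value is what reveals the tag and class names to target in the parsing step.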

Now we can see that the data we need is encased in DIVs of the following structure:

<div class=""marketValue  table-cell  market value "">$2,054.37 B</div>

Now the only tricky thing left is to write a regex to capture this. I'm sure there's more than one way to do it, but I ended up with:

<div class=(?:.*?)table-cell(.*?)<\/div>
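To see what this pattern actually captures, here is a minimal sketch that runs it against the sample DIV above, using Python's `re` module as a stand-in for the RegEx tool. The `extract_cells` helper and the extra cleanup step are my own assumptions, not part of the workflow: the capture group still contains the tail of the class attribute, so the sketch keeps only the text after the first `>`.

```python
import re

# The pattern from the post. Both .*? groups are lazy, so each match
# stops at the first closing </div> it can reach.
PATTERN = re.compile(r'<div class=(?:.*?)table-cell(.*?)<\/div>')

# Sample cell as it appears in the CSV export (the doubled quotes are CSV escaping).
sample = '<div class=""marketValue  table-cell  market value "">$2,054.37 B</div>'


def extract_cells(html):
    """Return the visible text of every table-cell DIV found in html."""
    cells = []
    for raw in PATTERN.findall(html):
        # raw still holds the rest of the class attribute,
        # e.g. '  market value "">$2,054.37 B'; keep what follows the first '>'.
        _, _, value = raw.partition('>')
        cells.append(value.strip())
    return cells


print(extract_cells(sample))  # ['$2,054.37 B']
```

Note that `.` does not match newlines by default, so this sketch assumes each cell's DIV sits on a single line of the downloaded HTML.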

Some resources that helped me build my regex:

regex101: build, test, and debug regex

Greedy and lazy quantifiers (javascript.info)

regex - What is a non-capturing group in regular expressions? - Stack Overflow
