web scraping
Hello,
I would like to download all 2,000 rows and 7 columns from the following website:
https://www.forbes.com/lists/global2000/?sh=1a3b071b5ac0
- Labels:
- Data Investigation
Use the Download tool to scrape the HTML, then comb through the DownloadData to find what you're interested in. Once you find it, it's a parsing exercise.
To "comb through" the DownloadData, I wrote the data out to CSV and opened it in Notepad++, then searched for something unique in the G2000 table (like Microsoft's "2,054" market value).
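For anyone who wants to do the same "download, then search for a known value" step outside of Alteryx, here is a minimal Python sketch. It is not the workflow described above, only a rough equivalent: `fetch_html` and `find_context` are hypothetical helper names, the `User-Agent` header is an assumption, and whether the site returns the full table in its raw HTML to a plain request is not something this sketch can guarantee. The sample string reuses the market value DIV from this thread; the `rank` cell next to it is made up for illustration.

```python
import urllib.request


def fetch_html(url, timeout=30):
    """Download a page and return its body as text (defined here, not called)."""
    # Assumption: a browser-like User-Agent; some sites reject the default one.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def find_context(text, needle, width=60):
    """Return the text surrounding every occurrence of `needle`."""
    hits = []
    start = text.find(needle)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(needle) + width)
        hits.append(text[lo:hi])
        start = text.find(needle, start + 1)
    return hits


# Stand-in for the downloaded HTML; the "rank" cell is a made-up example.
sample = (
    '<div class="rank table-cell">1</div>'
    '<div class="marketValue table-cell market value ">$2,054.37 B</div>'
)

for snippet in find_context(sample, "2,054"):
    print(snippet)
```

Printing the surrounding text is the scripted version of searching in Notepad++: it shows which tags and class names wrap the value you recognise.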
Now we can see that the data we need is encased in DIVs of the following structure:
<div class=""marketValue table-cell market value "">$2,054.37 B</div>
Now the only tricky part left is writing a regex to capture this. I'm sure there's more than one way to do it, but I ended up with:
<div class=(?:.*?)table-cell(.*?)<\/div>
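To see what this pattern actually captures, here is a small Python sketch using the standard `re` module. It assumes the doubled quotes in the DIV above come from CSV escaping, so the sample uses ordinary single double-quotes; the `organizationName` cell is a made-up example, and `TIGHT_PATTERN` is my own alternative rather than the pattern from the post. The point it illustrates: the original pattern's capture group starts right after `table-cell`, so it also picks up the tail of the `class` attribute, which would need a second parsing step to strip.

```python
import re

# The pattern from the post, unchanged.
AUTHOR_PATTERN = re.compile(r'<div class=(?:.*?)table-cell(.*?)<\/div>')

# Hypothetical tighter variant: consume the rest of the class attribute
# and the closing '">' before starting the capture group.
TIGHT_PATTERN = re.compile(r'<div class="[^"]*\btable-cell\b[^"]*">(.*?)</div>')

# Sample HTML; the first cell's class name is invented for illustration.
html = (
    '<div class="organizationName table-cell name ">Microsoft</div>'
    '<div class="marketValue table-cell market value ">$2,054.37 B</div>'
)

author_hits = AUTHOR_PATTERN.findall(html)
tight_hits = TIGHT_PATTERN.findall(html)

print(author_hits)  # captures include the remainder of the class attribute
print(tight_hits)   # captures contain only the cell text

# One way to clean the original captures: keep what follows the first '>'.
cleaned = [hit.split(">", 1)[-1] for hit in author_hits]
print(cleaned)
```

Either route ends with one captured value per table cell; the cells then still have to be regrouped into rows of 7 columns.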
Some resources that helped me build my regex:
regex101: build, test, and debug regex
Greedy and lazy quantifiers (javascript.info)
regex - What is a non-capturing group in regular expressions? - Stack Overflow