Prevent HUGE amount of data when scraping HTML
Good morning
We are collecting HTML data from this web page using a Download tool. When we run the workflow, the Download tool reports that the downloaded data is a huge 1.1TB - which is obviously not good 😧 - but when we look at the raw data there are only about 50,000 records.
Has anyone had to overcome this kind of thing before? Perhaps there are some tricks to reduce the data volume in the download, or a way to stop it happening in the first place.
Here's hoping 🤞
Thanks
ianjonna
Hi @ianjonnaCAA,
I have absolutely no idea how you've got to 1.1TB!
Do you have an example of your workflow to share?
When I try it I get about 1.3MB!
Regards,
Ben
Hi Ben
Most grateful for you getting back to me. Having seen your workflow, it looks like the way I configured the tool matches yours, but it turns out my issue actually originates in the XML Parse tool that follows. A copy of my workflow is attached - any ideas?
Thanks again
Cheers
Hi @ianjonnaCAA
Ah, I see! It's due to the "Include in Output" checkbox.
With it turned on, you've ended up with the full download data duplicated 100,000+ times, hence the massive size.
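To see why it adds up so quickly, here's a rough back-of-the-envelope sketch in plain Python (outside Alteryx; the 11MB payload and 200-byte record sizes are made-up assumptions, just to show the multiplication):

```python
# Made-up sizes, purely to illustrate the multiplication effect.
full_download_mb = 11       # assumed size of the raw HTML/XML payload
parsed_records = 100_000    # roughly what the XML Parse tool emits

# With the checkbox on, every record carries its own copy of the
# entire download:
duplicated_tb = full_download_mb * parsed_records / 1_000_000
print(f"~{duplicated_tb:.1f} TB with the full payload on every record")

# With it off, you keep one copy of the payload plus the parsed fields:
parsed_field_bytes = 200    # assumed average size of one parsed record
deduplicated_mb = full_download_mb + parsed_records * parsed_field_bytes / 1_000_000
print(f"~{deduplicated_mb:.0f} MB keeping only the parsed fields")
```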
Regards,
Ben
Thanks @Ben_H, will look now 😀
Hi @ianjonnaCAA
If you need more examples, check out Weekly Challenge 116 - A Symphony of Parsing Tools. Its input is a large, complex XML file - large enough that a straight linear expansion causes your memory use to explode to the point where the workflow would take days to run. The submitted solutions demonstrate how to parse just the required elements and remove the resulting xxx_Outer_XML fields to keep memory use in check.
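For comparison, the same "parse only what you need" idea outside Alteryx looks roughly like this in Python - a sketch only, where records.xml and the record/id/name tags are hypothetical stand-ins:

```python
import xml.etree.ElementTree as ET

rows = []
# iterparse streams the file instead of loading it all into memory
for event, elem in ET.iterparse("records.xml", events=("end",)):
    if elem.tag == "record":
        # keep only the parsed fields, not the element's raw XML
        rows.append((elem.findtext("id"), elem.findtext("name")))
        elem.clear()  # free the finished subtree so memory stays flat

print(f"parsed {len(rows)} records")
```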
Dan
Brilliant, thanks @danilang 😊
@Ben_H - The checkbox doesn't appear for me in the config panel.
I am on 2021.3 - is it possible that the checkbox is only in a more recent version?
(NB - I can't upgrade; it's not yet scheduled by the organisation 😟)
Hi @ianjonnaCAA
The Return Outer XML checkbox has been there since I've been using Alteryx (2018), and according to the Help documentation it was there in 2021.3 as well. Can you post a screenshot?
Also, to reduce memory use, add a Select tool after each XML Parse to deselect any XML fields that you no longer need.
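As a rough stand-in for what that Select tool does, here's the same step in Python/pandas terms (column names are hypothetical, echoing the xxx_Outer_XML naming above):

```python
import pandas as pd

# Toy frame with a heavy raw-XML column alongside the parsed fields
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["a", "b"],
    "record_Outer_XML": ["<record>...</record>"] * 2,  # the heavy field
})

# The equivalent of a Select tool: drop the raw XML once it's parsed
df = df.drop(columns=["record_Outer_XML"])
```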
Dan
Here's the screenshot:
Am I doing something wrong here?
