
SOLVED

Prevent HUGE amount of data when scraping HTML

ianjohnston
8 - Asteroid

Good morning

 

We are collecting HTML data from this web page using a Download tool. When we run the workflow, the Download tool tells us the data we have downloaded is a huge 1.1TB, which is obviously not good 😧, but when we look at the raw data there are only about 50,000 records.

 

Has anyone had to overcome this kind of thing before? Perhaps there is a trick to reduce the data volume in the download, or a way to avoid it happening in the first place.

 

Here's hoping 🤞

 

Thanks 

ianjonna

12 REPLIES
Ben_H
11 - Bolide

Hi @ianjohnston,

 

I have absolutely no idea how you've got to 1.1TB!

 

Do you have an example of your workflow to share?

 

When I try it I get about 1.3MB!

 

Regards,

 

Ben

ianjohnston
8 - Asteroid

Hi Ben

 

Most grateful to you for getting back to me. Having seen your workflow, it looks like my Download tool configuration matches yours, but my issue actually seems to originate in the XML Parse tool that follows it. A copy of my workflow is attached, if you have any ideas.

 

Thanks again 

 

Cheers

Ben_H
11 - Bolide

Hi @ianjohnston

 

Ah, I see! It's due to the "Include in Output" checkbox.

 

With it turned on you've ended up with the full download data duplicated 100,000+ times, hence the massive size.
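
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It assumes the source field is about 1.3MB (the size I saw when I ran the download) and uses 100,000 rows as a lower bound; `estimated_bytes` is just an illustrative helper, and the real row count and field size in your workflow may differ, so this will not reproduce the 1.1TB figure exactly.

```python
def estimated_bytes(source_bytes: int, parsed_rows: int, keep_source: bool) -> int:
    """Rough size of the duplicated source field only (parsed values are ignored)."""
    # When the source field is kept, every parsed row carries its own copy of it.
    copies = parsed_rows if keep_source else 0
    return source_bytes * copies


source = 1_300_000  # assumed: ~1.3MB download
rows = 100_000      # assumed: lower bound on the number of parsed rows

with_copy = estimated_bytes(source, rows, keep_source=True)
without_copy = estimated_bytes(source, rows, keep_source=False)

print(f"{with_copy / 1e9:.0f} GB with the source field kept on every row")
print(f"{without_copy} bytes of duplicated source once the field is dropped")
```

Even at the lower bound that is already well over 100GB, so a larger row count or field gets you into terabytes quickly.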

 

Ben_H_0-1665398534532.png

Regards,

 

Ben

 

ianjohnston
8 - Asteroid

thanks @Ben_H , will look now 😀

danilang
19 - Altair

Hi @ianjohnston 

 

If you need more examples, check out Weekly Challenge-116 A Symphony of Parsing Tools. Its input is a large, complex XML file. The file is large enough that a straight linear expansion causes your memory use to explode, to the point where the workflow would take days to run. The submitted solutions demonstrate how to parse the required elements and also remove the resulting xxx_Outer_XML fields to keep your memory use in check.

 

Dan

ianjohnston
8 - Asteroid

Brilliant. thanks @danilang😊

ianjohnston
8 - Asteroid

@ben - The checkbox does not appear for me in the config panel.

 

I am on 2021.3. Is it possible that the checkbox is only in a more recent version?

 

(NB: I can't upgrade; it's not yet scheduled by the organisation 😟)

 

danilang
19 - Altair

Hi @ianjohnston 

 

The Return Outer XML checkbox has been there for as long as I've been using Alteryx (since 2018), and according to the Help documentation it was there in 2021.3 as well. Can you post a screenshot?

 

Also, to reduce memory use, add a Select tool after each XML Parse to deselect any XML fields that you no longer need.
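
If it helps to see the idea outside Designer, here is a small Python analogue using the standard library's `xml.etree.ElementTree`. The document, the element names (`items`, `item`, `name`, `price`) and the `total_chars` helper are all made up for illustration; it is a sketch of the principle, not of how the XML Parse or Select tools work internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document with 1,000 child elements.
doc = "<items>" + "".join(
    f"<item><name>n{i}</name><price>{i}</price></item>" for i in range(1000)
) + "</items>"

root = ET.fromstring(doc)

# Analogue of keeping the outer XML and the source field on every parsed row.
wide_rows = [
    {
        "name": item.findtext("name"),
        "outer_xml": ET.tostring(item, encoding="unicode"),
        "source": doc,
    }
    for item in root.iter("item")
]

# Analogue of a Select tool after the parse: keep only the fields still needed.
keep = ("name",)
narrow_rows = [{field: row[field] for field in keep} for row in wide_rows]


def total_chars(rows):
    # Character count as a proxy for what a row-based engine would have to
    # write out (Python itself shares the one `doc` string between rows).
    return sum(len(value) for row in rows for value in row.values())


print(total_chars(wide_rows), "characters with the XML fields kept")
print(total_chars(narrow_rows), "characters after deselecting them")
```

The row count is the same in both cases; only the width of each row changes, which is where the saving comes from.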

 

Dan  

ianjohnston
8 - Asteroid

here's the screen shot: 

ianjohnston_0-1665404474087.png

 

Am I doing something wrong here?
