Prevent HUGE amount of data when scraping HTML
Good morning
We are collecting HTML data from this web page using a Download tool. When we run the workflow, the Download tool reports that the downloaded data is a huge 1.1TB - which is obviously not good 😧 - but when we look at the raw data there are only about 50,000 records.
Has anyone had to overcome this kind of thing before? Perhaps there are some tricks to reduce the data volume in the download, or a way to stop it happening in the first place.
Here's hoping 🤞
Thanks
ianjonna
Hi @ianjonnaCAA,
I have absolutely no idea how you've got to 1.1TB!
Do you have an example of your workflow to share?
When I try it I get about 1.3MB!
Regards,
Ben
Hi Ben
Most grateful for you getting back to me. Having seen your workflow, it looks like the way I configured the tool matches yours, but it turns out my issue actually originates in the XML Parse tool that follows. A copy of my workflow is attached - any ideas?
Thanks again
Cheers
Hi @ianjonnaCAA
Ah, I see! It's due to the "Include in Output" checkbox.
With it turned on, you've ended up with the full download data duplicated 100,000+ times, hence the massive size.
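To see why it adds up so quickly, here's a rough back-of-the-envelope sketch in plain Python (outside Alteryx; the 11MB payload and 200-byte record sizes are made-up assumptions, just to show the multiplication):

```python
# Made-up sizes, purely to illustrate the multiplication effect.
full_download_mb = 11       # assumed size of the raw HTML/XML payload
parsed_records = 100_000    # roughly what the XML Parse tool emits

# With the checkbox on, every record carries its own copy of the
# entire download:
duplicated_tb = full_download_mb * parsed_records / 1_000_000
print(f"~{duplicated_tb:.1f} TB with the full payload on every record")

# With it off, you keep one copy of the payload plus the parsed fields:
parsed_field_bytes = 200    # assumed average size of one parsed record
deduplicated_mb = full_download_mb + parsed_records * parsed_field_bytes / 1_000_000
print(f"~{deduplicated_mb:.0f} MB keeping only the parsed fields")
```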
Regards,
Ben
Thanks @Ben_H, will look now 😀
Hi @ianjonnaCAA
If you need more examples, check out Weekly Challenge 116 - A Symphony of Parsing Tools. Its input is a large, complex XML file - large enough that a straight linear expansion causes your memory use to explode to the point where the workflow would take days to run. The submitted solutions demonstrate how to parse just the required elements and remove the resulting xxx_Outer_XML fields to keep memory use in check.
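For comparison, the same "parse only what you need" idea outside Alteryx looks roughly like this in Python - a sketch only, where records.xml and the record/id/name tags are hypothetical stand-ins:

```python
import xml.etree.ElementTree as ET

rows = []
# iterparse streams the file instead of loading it all into memory
for event, elem in ET.iterparse("records.xml", events=("end",)):
    if elem.tag == "record":
        # keep only the parsed fields, not the element's raw XML
        rows.append((elem.findtext("id"), elem.findtext("name")))
        elem.clear()  # free the finished subtree so memory stays flat

print(f"parsed {len(rows)} records")
```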
Dan
Brilliant, thanks @danilang 😊
@Ben_H - The checkbox doesn't appear for me in the config panel.
I am on 2021.3 - is it possible that the checkbox is only in a more recent version?
(NB - I can't upgrade; it's not yet scheduled by the organisation 😟)
Hi @ianjonnaCAA
The Return Outer XML checkbox has been there since I've been using Alteryx (2018), and according to the Help documentation it was there in 2021.3 as well. Can you post a screenshot?
Also, to reduce memory use, add a Select tool after each XML Parse to deselect any XML fields that you no longer need.
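As a rough stand-in for what that Select tool does, here's the same step in Python/pandas terms (column names are hypothetical, echoing the xxx_Outer_XML naming above):

```python
import pandas as pd

# Toy frame with a heavy raw-XML column alongside the parsed fields
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["a", "b"],
    "record_Outer_XML": ["<record>...</record>"] * 2,  # the heavy field
})

# The equivalent of a Select tool: drop the raw XML once it's parsed
df = df.drop(columns=["record_Outer_XML"])
```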
Dan
Here's the screenshot:
Am I doing something wrong here?
