
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
SOLVED

Parsing XML input files

Meteoroid

I have designed a workflow to extract all levels of XML, but it is taking a lot of time. Any suggestions on how to improve performance?

Inactive User
Not applicable

There may not be much you can do if this is just the parsing. What you can do, though, is cache the workflow after all the parsing to speed up development, and once it's ready for scheduling it will be a moot point anyway.

Aurora

Hi @dexter90 

 

Parsing multilevel XML can generate very large intermediate data sets. This is because for each new row you create at each level, you're duplicating the Outer_XML of all the previous levels. As an example, if you have a 1 MB XML file that's only one level deep but generates 1,000 rows, you end up with a 1 GB data set. Add in a few more levels and your in-memory data can easily exceed the amount of available RAM on your system. At that point your system will start swapping to disk and your workflow will slow to a crawl. The way to get around this is to prune the Outer_XML of all levels as soon as you can. Once you have level 2 parsed, use a Select tool to remove the Outer_XML column for level 1. Continue this process as you proceed through the levels and you should be able to keep memory use in check.
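The pruning idea above can be sketched outside of Designer as well. Here is a minimal Python illustration using the standard library's `xml.etree.ElementTree`, with a hypothetical two-level `<orders>/<order>/<item>` document (the tag names and fields are made up for the example): each level carries only a key forward, and the parent's serialized XML is dropped as soon as its children have been extracted.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-level document for illustration.
doc = """
<orders>
  <order id="1"><item sku="A">10</item><item sku="B">20</item></order>
  <order id="2"><item sku="C">30</item></order>
</orders>
"""

root = ET.fromstring(doc)

# Level 1: one row per <order>. Keep a key plus the order's XML for now.
level1 = [{"order_id": o.get("id"), "outer_xml": ET.tostring(o)} for o in root]

# Level 2: one row per <item>. Carry the parent KEY forward, but not the
# parent's outer_xml -- duplicating it per child row is what blows up memory.
level2 = []
for row in level1:
    order = ET.fromstring(row["outer_xml"])
    for item in order.iter("item"):
        level2.append({
            "order_id": row["order_id"],   # key only, no duplicated XML
            "sku": item.get("sku"),
            "qty": int(item.text),
        })

# Once level 2 is parsed, prune level 1's outer_xml entirely
# (the equivalent of the Select tool step described above).
for row in level1:
    del row["outer_xml"]
```

In Designer the same move is the Select tool after each XML Parse: drop the previous level's Outer_XML column the moment the next level is extracted.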

 

Another thing to look out for is the situation where you have multiple independent collections all at the same level. Attempting to process all of these collections sequentially in a single stream is another good way to explode the size of your data set. Instead, branch each of these collections into its own stream and include only the Outer_XML for that collection, along with a key column for the parent. Process the collection completely, drop the Outer_XML column, and join back to the parent object on the key.
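To see why branching matters, here is a small Python sketch (hypothetical `<items>` and `<payments>` collections under one parent): processing the two sibling collections in a single stream produces a cross product, while keeping them in separate streams keyed to the parent keeps row counts additive.

```python
import xml.etree.ElementTree as ET

# Hypothetical parent with two independent collections at the same level.
doc = """
<order id="42">
  <items><item sku="A"/><item sku="B"/></items>
  <payments><payment method="card"/></payments>
</order>
"""

order = ET.fromstring(doc)
parent_key = order.get("id")

# Branch 1: items stream -- only this collection's rows, keyed to the parent.
items = [{"order_id": parent_key, "sku": i.get("sku")}
         for i in order.find("items")]

# Branch 2: payments stream, processed independently.
payments = [{"order_id": parent_key, "method": p.get("method")}
            for p in order.find("payments")]

# Single-stream processing yields the cross product of the collections:
# every item row is duplicated for every payment row.
cross = [{"order_id": parent_key, "sku": i["sku"], "method": p["method"]}
         for i in items for p in payments]

# 2 x 1 here, but with real data the blow-up is multiplicative per collection,
# whereas the branched streams stay at len(items) + len(payments) rows and
# can be rejoined to the parent on order_id only when needed.
```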

 

Check out the solutions to Weekly Challenge 116, A Symphony of Parsing Tools. This challenge was actually built to show how to get around these kinds of memory issues. If you attempt to parse it all in one stream with no cleanup, you end up with a data set that exceeds 100 GB.

 

Dan

 

      
