There may not be, if this is just the parsing. What you can do, though, is cache the workflow after all the parsing to speed up development. Once the workflow is ready for scheduling, it will be a moot point anyway.
Parsing multilevel XML can generate very large intermediary data sets. This is because every new row you create at each level duplicates the Outer_XML of all the previous levels. As an example, if you have a 1 MB XML that's only one level deep but generates 1000 rows, you end up with a 1 GB data set. Add in a few more levels and your in-memory data can easily exceed the amount of available RAM on your system. At this point your system will start swapping to disk and your workflow will slow to a crawl. The way to get around this is to prune the Outer_XML of each level as soon as you can. Once you have level 2 parsed, use a Select tool to remove the Outer_XML column for level 1. Continue this process as you proceed through the levels and you should be able to keep memory use in check.
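To make the duplication concrete, here is a rough sketch in Python rather than in Alteryx itself. It is only an illustration of the idea: the sample document, the `parse_level` helper, and the column names (`l0_xml`, `l1_xml`, `l2_xml`) are all made up for this example, and dropping the parent column stands in for the Select tool step.

```python
import xml.etree.ElementTree as ET

def parse_level(rows, xml_col, child_tag, out_col, keep_parent_xml=False):
    """Emit one row per <child_tag> found directly under the XML in rows[xml_col]."""
    out = []
    for row in rows:
        parent = ET.fromstring(row[xml_col])
        for child in parent.findall(child_tag):
            new_row = dict(row)  # every child row copies all parent columns
            new_row[out_col] = ET.tostring(child, encoding="unicode")
            if not keep_parent_xml:
                # Stand-in for the Select tool: drop the parent's outer XML
                del new_row[xml_col]
            out.append(new_row)
    return out

def total_chars(rows):
    """Crude size measure: total characters held across all columns of all rows."""
    return sum(len(value) for row in rows for value in row.values())

# Hypothetical sample: 3 orders, each with 2 items
doc = "<orders>" + "".join(
    f"<order id='{i}'><item>a</item><item>b</item></order>" for i in range(3)
) + "</orders>"
level0 = [{"l0_xml": doc}]

# Keep every level's outer XML all the way down
kept = parse_level(
    parse_level(level0, "l0_xml", "order", "l1_xml", keep_parent_xml=True),
    "l1_xml", "item", "l2_xml", keep_parent_xml=True,
)

# Prune each level's outer XML as soon as the next level is parsed
pruned = parse_level(
    parse_level(level0, "l0_xml", "order", "l1_xml"),
    "l1_xml", "item", "l2_xml",
)

print(len(kept), total_chars(kept))
print(len(pruned), total_chars(pruned))
```

Both versions produce the same number of rows, but the unpruned one carries a full copy of the whole document and of the parent order on every item row. With a real 1 MB document and thousands of rows, that carried copy is what pushes you into swap.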
Another thing to look out for is the situation where you have multiple independent collections all at the same level. Attempting to process all of these collections sequentially in a single stream is another good way to explode the size of your data set. Instead, branch each of these collections into its own stream and include only the Outer_XML for that collection as well as a key column for the parent. Process the collection completely, drop the Outer_XML column and join back to the parent object on the key.
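The same point can be sketched in Python (again, not Alteryx, and only as an illustration). The customer document, the `parse_collection` helper, and the `customer_id` key column are assumptions invented for this example. The single-stream list shows what happens when one collection is parsed on top of the other; the branched lists keep each collection separate with only the parent key attached.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one parent with two independent collections at the same level
doc = """<customers>
  <customer id="1">
    <phones><phone>111</phone><phone>222</phone><phone>333</phone></phones>
    <emails><email>a@x.test</email><email>b@x.test</email></emails>
  </customer>
</customers>"""

def parse_collection(customer, path, column):
    """Parse one collection into rows that carry only the parent key and the value."""
    key = customer.get("id")
    return [{"customer_id": key, column: el.text} for el in customer.findall(path)]

root = ET.fromstring(doc)
phones, emails, single_stream = [], [], []
for customer in root.findall("customer"):
    p = parse_collection(customer, "phones/phone", "phone")
    e = parse_collection(customer, "emails/email", "email")
    phones += p   # branch 1
    emails += e   # branch 2
    # Sequential parsing in one stream: every phone row is repeated for every email
    single_stream += [{**phone_row, **email_row} for phone_row in p for email_row in e]

# Join each branch back to its parent on the key
by_customer = {}
for row in phones:
    by_customer.setdefault(row["customer_id"], {"phones": [], "emails": []})["phones"].append(row["phone"])
for row in emails:
    by_customer.setdefault(row["customer_id"], {"phones": [], "emails": []})["emails"].append(row["email"])

print(len(phones) + len(emails), "rows when branched")
print(len(single_stream), "rows in a single stream")
```

Branched, the row count grows as the sum of the collection sizes. In a single stream it grows as their product, and each of those rows would also be dragging the Outer_XML along if it has not been pruned.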
Check out the solutions to Weekly Challenge 116, A Symphony of Parsing tools. This challenge was actually built to show you how to get around these kinds of memory issues. If you attempt to parse it all in one stream with no cleanup, you end up with a dataset that exceeds 100 GB.