XML Parse - performance strategies
Hi -
I have a large XML file (1.7 million records, 14 GB) with a lot of nested node structures. I've set up a parsing routine that does most of the work, but it takes ~3 hours to run on a pretty decent machine. I'll be running this daily, and I'm looking for strategies to improve performance. I was thinking of trying some or all of the following, but wanted to see if anyone has some good advice first:
1. split the file and run several smaller but identical jobs
2. since I've added a record ID, after the main parse split the resulting major OuterXML sections into their own jobs
3. somehow inspect the modified_time element within the XML and then (on day 2+) parse only the records that have been modified
Before I do this, I wonder if I'm missing a much more basic / fundamental approach change.
Also for #3, any tips on how to inspect and use an element like that would be appreciated.
Thanks!
Pete
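For anyone approaching #3 outside of Alteryx, here is a minimal Python sketch of one way to inspect a `modified_time` element while streaming the file, so the whole 14 GB never has to sit in memory. The tag names `record` and `id`, and the ISO 8601 timestamp format, are assumptions made for illustration; the real file's structure is not shown in this thread.

```python
import io
import xml.etree.ElementTree as ET
from datetime import datetime


def iter_modified_records(source, last_max, record_tag="record",
                          id_tag="id", time_tag="modified_time"):
    """Yield (record_id, modified, xml_text) for records changed after last_max.

    Assumes (hypothetically) that each record is a direct child of the root,
    and that the timestamp is ISO 8601 text such as 2024-01-03T12:00:00.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != record_tag:
            continue
        stamp = elem.findtext(time_tag)
        if stamp:
            modified = datetime.fromisoformat(stamp.strip())
            if modified > last_max:
                elem.tail = None  # drop trailing whitespace before serializing
                yield (elem.findtext(id_tag), modified,
                       ET.tostring(elem, encoding="unicode"))
        root.clear()  # release finished records so memory stays flat


# Small in-memory stand-in for the real file.
sample = b"""<records>
  <record><id>1</id><modified_time>2024-01-01T00:00:00</modified_time></record>
  <record><id>2</id><modified_time>2024-01-03T12:00:00</modified_time></record>
</records>"""

changed = list(iter_modified_records(io.BytesIO(sample), datetime(2024, 1, 2)))
```

Only the records that pass the date filter are serialized, so the expensive nested parsing can be limited to that subset on day 2 and later.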
- Labels:
- Best Practices
- Preparation
- Transformation
As an update: the original method I was using, the parsing you started, and a couple of other approaches all work, but I think it's simply easier to handle complex XML (lots of nested nodes, lots of non-required elements, lots of records, big file size) in other tools and then use the command line tool. I haven't set up anything final yet, but that's what I'm looking at.
Just throwing this out there, but have you tried the Parse XML app up on the Gallery? One of our Engineers built it out (as both a standalone Analytic app, and a downloadable macro) and he's actually been looking for some complex XML files to break it :)
We have used it internally as a good way to start doing some data discovery on workflows and understanding the overall file XML structure, so I would be curious to see how others rate its performance.
Senior Solutions Architect
Alteryx, Inc.

Hi Sophia,
Thank you for responding!
I tried the app and it doesn't seem to work on the file. Do you think Alex would be willing to try to run it and provide some insight? If so, how would I get a sample XML to him?
Thanks!
Pete
@PeterGoldey I'll PM you about the app.
I know the file itself is huge (and I'm sure the parsing makes it larger) but were you able to get the macro version of Alex's app working at all?
Senior Solutions Architect
Alteryx, Inc.

sure about tree analytics (new topic for me) and whether that tool would be
useful to my project.
As an FYI, I haven't found anything easier or quicker. The XML is actually more complex than indicated in the original post. A couple of strategies that I am using are:
Transposing child nodes: Some of the parent nodes (photos, for one) contain n child nodes (photos1, photos2, etc.), each of which contains the info for a single image, in this case. By transposing the parent node parse results and extracting the number (1, 2, 3, etc.) from the name field (the child XML node name), I now have a record for each image for each original parent node record. I then filter out NULL "values" to reduce the record count and XML parse this vertical table in a single step.
Without the transpose step, there are as many as 140 child nodes each of which would need its own XML Parse tool.
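The same reshaping idea can be sketched in Python for readers working outside of Alteryx. This is only an illustration of the transpose-and-filter approach described above, not the actual workflow: the fields inside each child (`url`, `caption`) and the record ID value are hypothetical, since the thread does not show the file's contents.

```python
import re
import xml.etree.ElementTree as ET

TRAILING_NUMBER = re.compile(r"(\d+)$")


def transpose_children(record_id, parent):
    """Turn numbered child nodes (photos1, photos2, ...) into one row each."""
    rows = []
    for child in parent:
        match = TRAILING_NUMBER.search(child.tag)
        if match is None:
            continue  # not a numbered child node
        # Keep only fields that actually carry a value.
        fields = {f.tag: f.text.strip() for f in child
                  if f.text and f.text.strip()}
        if not fields:
            continue  # same effect as filtering out NULL values
        rows.append({"record_id": record_id,
                     "index": int(match.group(1)),
                     **fields})
    return rows


# Hypothetical parent node with two populated children and one empty one.
photos = ET.fromstring(
    "<photos>"
    "<photos1><url>a.jpg</url><caption>Front</caption></photos1>"
    "<photos2><url>b.jpg</url></photos2>"
    "<photos3/>"
    "</photos>"
)
rows = transpose_children("42", photos)
```

One generic function handles any number of numbered children, which mirrors the point above about avoiding a separate parse step per child node.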
Inspecting the main node and filtering: Other than the first full run, each successive run only needs to:
a) identify and parse new records
b) identify and parse changed records
c) identify "missing" records so I can delete them from the master file
How I do this:
1) New and changed records: filter for a modification time stamp in the main node > [last max date].
2) Deleted records: On a file this size, it's much faster to first do an "inventory" parse to pull out the unique keys and modification time stamps for all the records. I can then easily identify which of the step 1 records are adds vs. changes, and also which IDs no longer exist in the file so I can delete them from the master.
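As a rough sketch of the inventory comparison described above, the adds, changes, and deletes fall out of a few set operations once the keys and time stamps are in hand. The function name, the dictionary layout, and the sample IDs are all assumptions made for illustration.

```python
from datetime import datetime


def classify_records(master_ids, inventory, last_max):
    """Split today's inventory into adds, changes, and deletes.

    master_ids: set of record IDs already in the master file.
    inventory:  dict of record ID -> modification time from today's file.
    last_max:   latest modification time seen in the previous run.
    """
    # Candidates are records modified since the last run.
    candidates = {rid for rid, ts in inventory.items() if ts > last_max}
    adds = sorted(candidates - master_ids)
    changes = sorted(candidates & master_ids)
    # Anything in the master that is no longer in the file gets deleted.
    deletes = sorted(master_ids - inventory.keys())
    return adds, changes, deletes


# Hypothetical example data.
master = {"1", "2", "3"}
todays_inventory = {
    "1": datetime(2024, 1, 1),
    "2": datetime(2024, 1, 5),
    "4": datetime(2024, 1, 6),
}
adds, changes, deletes = classify_records(master, todays_inventory,
                                          datetime(2024, 1, 2))
```

Note that a record missing from the master but carrying an old time stamp would not be picked up as an add here, because the sketch follows the time stamp filter first, as in the steps above.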
