XML Parse - performance strategies

Hi -

I have a large (1.7 million records, 14GB) XML file. It has a lot of nested node structures. I've set up a parsing routine that does most of the work, but it takes ~3 hours to run on a pretty decent machine. I'll be running this daily, and I'm looking for strategies to improve performance. I was thinking of trying some or all of the following, but wanted to see if anyone has some good advice first:

1. split the file and run several smaller but identical jobs

2. since I've added a record id, after the main parse split the resulting major OuterXML sections into their own jobs

3. somehow inspect the modified_time element within the XML and then (on day 2+) parse only the records that have been modified

Before I do this, I wonder if I'm missing a much more basic / fundamental approach change.

Also for #3, any tips on how to inspect and use an element like that would be appreciated.

Thanks!

Pete

Accepted answers

PeterGoldey

As an FYI, I haven't found anything easier / quicker. The XML is actually more complex than indicated in the original post. A couple strategies that I am using are:

Transposing child nodes: Some of the parent nodes (photos, for one) contain n child nodes (photos1, photos2, etc.), each of which contains the info for a single item, in this case an image. By transposing the parent node parse results and extracting the number (1, 2, 3, etc.) from the name field (the child XML node name), I now have a record for each image for each original parent node record. I then filter out NULL "values" to reduce the record count and XML parse this vertical table in a single step.

Without the transpose step, there are as many as 140 child nodes each of which would need its own XML Parse tool.
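For anyone doing this step outside the workflow, here is a minimal Python sketch of the same transpose idea. It is not the actual workflow: the sample XML, the `record`, `id` and `url` names, and the `transpose_children` helper are all hypothetical, and only the numbered `photos1`, `photos2` naming comes from the description above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's element and field names will differ.
SAMPLE = """
<record id="101">
  <photos>
    <photos1><url>a.jpg</url></photos1>
    <photos2><url>b.jpg</url></photos2>
    <photos3/>
  </photos>
</record>
"""

def transpose_children(record, parent_tag="photos"):
    """Return one row per non-empty numbered child of parent_tag."""
    rows = []
    parent = record.find(parent_tag)
    if parent is None:
        return rows
    pattern = re.compile(re.escape(parent_tag) + r"(\d+)")
    for child in parent:
        match = pattern.fullmatch(child.tag)
        if match is None:
            continue  # not a numbered child such as photos1
        fields = {f.tag: f.text for f in child if f.text and f.text.strip()}
        if not fields:
            continue  # drop empty children (the NULL filter)
        rows.append({
            "record_id": record.get("id"),
            "seq": int(match.group(1)),  # number taken from the node name
            **fields,
        })
    return rows

rows = transpose_children(ET.fromstring(SAMPLE.strip()))
for row in rows:
    print(row)
```

Each numbered child becomes one row keyed by the parent record and its sequence number, so a single parsing step handles every child regardless of how many there are.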

Inspecting the main node and filtering

Other than the first full run, each successive run will only need to:

a) identify and parse new records

b) identify and parse change records

c) identify "missing" records so I can delete them from the master file

a) new and change records: filter for modification time stamp in the main node > [last max date]

b) delete records: On a file this size, it's much faster to first do an "inventory" parse to pull out the unique keys and modification time stamps for all the records. I can then easily identify which of the step a) records are adds vs. changes, and also which IDs no longer exist in the file so I can delete them from the master.
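The inventory parse and the add / change / delete split can be sketched in Python roughly as follows. This is an illustration under assumptions, not the workflow itself: the `record`, `id` and `modified_time` tag names, the sample data, and the `inventory` and `classify` helpers are hypothetical, and the time stamps are assumed to be ISO 8601 strings so that plain string comparison orders them correctly.

```python
import io
import xml.etree.ElementTree as ET

def inventory(source, record_tag="record", id_tag="id", time_tag="modified_time"):
    """Stream the file and return {record id: modification time stamp}."""
    found = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            found[elem.findtext(id_tag)] = elem.findtext(time_tag)
            elem.clear()  # discard the record's children once they are read
    return found

def classify(current, master, last_max):
    """Split ids into adds, changes and deletes relative to the master."""
    # Assumes ISO 8601 strings, which sort correctly as plain text.
    touched = {k for k, ts in current.items() if ts is not None and ts > last_max}
    adds = touched - set(master)
    changes = touched & set(master)
    deletes = set(master) - set(current)  # ids no longer present in the file
    return adds, changes, deletes

# Hypothetical data: today's file and yesterday's master inventory.
SAMPLE = b"""<feed>
  <record><id>1</id><modified_time>2016-01-01T00:00:00</modified_time></record>
  <record><id>2</id><modified_time>2016-01-05T00:00:00</modified_time></record>
  <record><id>4</id><modified_time>2016-01-06T00:00:00</modified_time></record>
</feed>"""
master = {
    "1": "2016-01-01T00:00:00",
    "2": "2016-01-02T00:00:00",
    "3": "2016-01-01T00:00:00",
}

current = inventory(io.BytesIO(SAMPLE))
adds, changes, deletes = classify(current, master, last_max="2016-01-02T00:00:00")
print(adds, changes, deletes)
```

Because `iterparse` reads the file as a stream and only two small fields are kept per record, the inventory pass avoids building the full nested structure, which is where most of the parsing time goes.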

All comments

MarqueeCrew

PM me with your email and I will set up a WebEx to discuss.

PeterGoldey

As an update, while the original method I was using, the parsing you started, and a couple of other approaches all work, I think that it's simply easier to handle complex XML (lots of nested nodes, lots of non-required elements, lots of records, big file size) in other tools and then use the command line tool. Haven't set up anything final yet, but that's what I'm looking at.

SophiaF

Just throwing this out there, but have you tried the Parse XML app up on the Gallery? One of our Engineers built it out (as both a standalone Analytic app and a downloadable macro) and he's actually been looking for some complex XML files to break it.

XML Parse App

Parse XML Macro

We have used it internally as a good way to start doing some data discovery on workflows and understanding the overall file XML structure, so I would be curious to see how others rate its performance.

PeterGoldey

Hi Sophia,

Thank you for responding!

I tried the app and it doesn't seem to work on the file. Do you think Alex would be willing to try to run it and provide some insight? If so, how would I get a sample XML to him?

Thanks!

Pete

SophiaF

@PeterGoldey I'll PM you about the app.

I know the file itself is huge (and I'm sure the parsing makes it larger) but were you able to get the macro version of Alex's app working at all?

PeterGoldey

Yes, I am able to get the same output as posted in that image. I'm not sure about tree analytics (new topic for me) and whether that tool would be useful to my project.
