Identify all tags used in XML file

I've been reading some posts on XML parsing, but just not finding what I'm looking for.

Basic use case: I receive XML files from several sources, they all contain the following pieces of data at a minimum:

Standard GUID

Grade Level

Subject

DOK

However, each file uses a different tag for the above data and each file has the tag in a different place, with different numbers of child elements involved. And from one vendor, this can change from package to package.

My immediate main concern was the Standad GUID, and just tonight it dawned on me to use RegEx Parse tool to extract all GUIDs. While this is definitely useful and can get me to end-of-job, the problem is here that the label isn't included. And there are some cases where they use GUIDs to identify things other than a Standard, or they have multiple standards and the label would be of help for me to determine the source that GUID came from.

But - the real question here is - given all the other data points in the XML file listed above, and the fact that the files will contain many additional data points that I'm not even aware of, but would like to consume. Is there a method in Alteryx to first supply a list of ALL tags used, no matter how deeply embedded, that I could then pass through some other tool like XML Pars to use that list of selected tags to pull out the actual data?

@andrewkim80916 - this was part of what we were discussing tonight, thought I'd throw it out there in writing in case you had any additional thoughts.

Parse

Accepted answers

KaneG

Hi @serendipitytech,

An easy way to get all the tags in an XML file, will be to bring it in as a Text File (CSV or Flat-file) and then use a Regex tool to Tokenise to new lines on '<.*?>'. You can then use a summarize tool to group by. That could then be effected to remove the closing tags by StartsWith([Field1],'</') and a lot more.....

REGEX XML Tags.PNG

All comments

KaneG

Hi @serendipitytech,

REGEX XML Tags.PNG

serendipitytech

Hi @KaneG

This sounds like a very interesting approach. But a coupel roadblocks I've run into (I shoule mention I've only used the program for a few weeks now):

1. Since the input tool has no option for TXT, I'm trying to just label the xml file as a CSV in the options there. While that does seem to work to some degree, this leads to the next issue:

2. The RegEx tool reqquires a field to parse, which the renamed file has no fields that can be identified.

I tried renaming the file to .txt, but having trouble delsecting options that would allow it to be read as it doesn't actually match any expected delimiator.

I'm going to keep tweaking the idea, as I think there's something here I just need to put my finger on.

serendipitytech

I Think I got it!

So, I kept the manifest file as a xml file, since each "item" is wrapped witha. "resource" tag, I set the file input to return outer XML, then set the regex parse tool to parse the resource_OuterXML field with the regular expression:

<(\w+)

The one you suggested didn't seem to pick anything up, so a little googlie mooglie helped find a different one.

From here I chose to Tokenize (I must admit I don't fully understand what that does/means) and then realized I had to use the "Spliot to Rows" option, since I had no idea how many tags I'd be dealing with.

Then feed that to a Summarize tool, grouped on the newly created resource_OuterXML field, and PRESTO , a list of all tags used in the XML file!

I think now I can feed that list back into a parse tool to identify the specific fields I want. and extract the actual values of the fields.

Quick Links

This months top contributors

atcodedog05 19598

Qiu 15867

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703