How to parse a specific element out of an HTML string when the element might appear 0 to N times

Hello,

I am streaming a large HTML document into the workflow as a string value.

The HTML string contains an unknown number of occurrences of a specific element. Every occurrence I want to find will look like this:

<div>Endorsements: <span class='gen-ai-file-name'>XXXXXX</span></div>

The XXXXXX will not always be the same length: it might be a few characters, a long sentence, or another embedded div or span.

What I really want to get out of this is the XXXXXX value, but I'd settle for getting the entire <div></div> substring.

This could appear once in the HTML, 100 times, or not at all.

Ideally, the parse would return 1 row per occurrence, containing either the XXXXXX value or the entire "<div>Endorsements: <span class='gen-ai-file-name'>XXXXXX</span></div>" value. So if it shows up 1 time there will be 1 row, if it shows up 100 times there will be 100 rows, and if it's not in there, there won't be any rows.

I'm sure I can use XMLparse to do this, but I'm not very skilled in it. And this particular <div> element may be a parent element, a child element, or buried seventeen layers deep in stacked divs and spans and whatnot.

It's proprietary so I can't post a sample, but hopefully I've been clear enough that someone who understands parsing and text mining can help.

Thanks in advance.

Developer

Parse

Text Mining

Data Investigation

Accepted answers

Yoshiro_Fujimori

@rfoster7

I would use the RegEx tool as below.

You may want to try the expression with your data (it may need a few more tweaks).

I hope this helps. Good luck.

Input Data

Field1

abc<div>Endorsements: <span class='gen-ai-file-name'>abc</span></div>xyz

abcdefg

abc<div>Endorsements: <span class='gen-ai-file-name'>def</span></div>xyz

abcdefg

abc<div>Endorsements: <span class='gen-ai-file-name'>ghi</span></div>xyz

abcdefg

abc<div>Endorsements: <span class='gen-ai-file-name'>jkl</span></div>xyz

abcdefg

abc<div>Endorsements: <span class='gen-ai-file-name'>mno</span></div>xyz

abcdefg

RegEx Tool configuration

Regular Expression: <div>Endorsements: <span class='gen-ai-file-name'>.*?</span></div>

Output Method: Tokenize, Split to Rows
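For readers who want to test the expression outside the workflow, here is a minimal Python sketch of the same tokenize-to-rows idea. It is not part of the attached workflow. The capture group around `.*?`, the `re.DOTALL` flag, and the function name `tokenize_endorsements` are my additions for illustration, so that each match yields both the whole <div> and just the XXXXXX value.

```python
import re

# Expression from the answer, plus one capture group around the span body
# (the group and the DOTALL flag are illustrative additions).
PATTERN = re.compile(
    r"<div>Endorsements: <span class='gen-ai-file-name'>(.*?)</span></div>",
    re.DOTALL,  # let '.' cross line breaks inside the HTML string
)

def tokenize_endorsements(html: str) -> list[tuple[str, str]]:
    """Return one (full_div, inner_value) pair per occurrence, in document order."""
    return [(m.group(0), m.group(1)) for m in PATTERN.finditer(html)]

sample = (
    "abc<div>Endorsements: <span class='gen-ai-file-name'>abc</span></div>xyz\n"
    "abcdefg\n"
    "abc<div>Endorsements: <span class='gen-ai-file-name'>def</span></div>xyz\n"
)

rows = tokenize_endorsements(sample)
for full_div, value in rows:
    print(value, "|", full_div)
```

An input with no occurrences returns an empty list, which corresponds to the "no rows" case in the question.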

Output Data

Field1

<div>Endorsements: <span class='gen-ai-file-name'>abc</span></div>

<div>Endorsements: <span class='gen-ai-file-name'>def</span></div>

<div>Endorsements: <span class='gen-ai-file-name'>ghi</span></div>

<div>Endorsements: <span class='gen-ai-file-name'>jkl</span></div>

<div>Endorsements: <span class='gen-ai-file-name'>mno</span></div>
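One case where the expression may need the "tweaks" mentioned above: the question says XXXXXX can itself contain embedded divs or spans. If the value contains its own `<div><span>...</span></div>`, the lazy `.*?` stops at the inner `</span></div>` and truncates the match. A parser that tracks nesting avoids that. The following is only a sketch using Python's standard `html.parser`, not something from the attached workflow. It assumes the class name `gen-ai-file-name` alone identifies the target span (it does not check the "Endorsements:" label), and `EndorsementParser` and `extract_values` are hypothetical names.

```python
from html.parser import HTMLParser

class EndorsementParser(HTMLParser):
    """Collect the text inside every <span class='gen-ai-file-name'>, at any depth."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.values = []    # one entry per matching span
        self._depth = 0     # > 0 while inside a matching span
        self._buffer = []   # text fragments of the span being read

    def handle_starttag(self, tag, attrs):
        if self._depth:
            if tag == "span":
                self._depth += 1  # nested span inside the target span
        elif tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if "gen-ai-file-name" in classes:
                self._depth = 1
                self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:  # the target span just closed
                self.values.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

def extract_values(html: str) -> list[str]:
    """Return the text of each matching span, one list item per occurrence."""
    parser = EndorsementParser()
    parser.feed(html)
    parser.close()
    return parser.values
```

This returns only the text content, with any inner markup dropped. If the inner tags themselves are needed, the regex approach or a full DOM library would be a better fit.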

rfoster7.yxmd

All comments

rfoster7

Never mind. I figured it out using a simple RegEx tokenize.

There doesn't seem to be a delete option, so my shame will live here forever.

apathetichell

@rfoster7 --- we've all been there.

