Hello,
I am streaming a large HTML document into the workflow as a single string value.
The HTML string has an unknown number of a specific element embedded in it. Every element I want to find will look like this:
<div>Endorsements: <span class='gen-ai-file-name'>XXXXXX</span></div>
The XXXXXX will not always be the same length. It might be a few characters, it might be a long sentence, or it might be another embedded div or span.
What I really want to get out of this is the XXXXXX value, but I'd settle for getting the entire <div></div> substring.
This could appear 1 time in the HTML, it might appear 100 times in the HTML. It might not show up in the HTML at all.
Ideally, what I'd like to return out of the parse is 1 row per occurrence with either the XXXXXX value or the entire "<div>Endorsements: <span class='gen-ai-file-name'>XXXXXX</span></div>" value. So if it shows up 1 time there will be 1 row, if it shows up 100 times there will be 100 rows, and if it's not in there there won't be any rows.
I'm sure I can use XMLparse to do this, but I'm not very skilled in it. And this particular <div> element may be a parent element, a child element or seventeen layers deep buried in stacked divs and spans and whatnot.
It's proprietary so I can't post a sample, but hopefully I've been clear enough that someone who does understand parsing and text mining can help.
Thanks in advance.
Never mind. I figured it out using a simple regex tokenize.
There doesn't seem to be a delete option, so my shame will live here forever.
@rfoster7 --- we've all been there.
I would use RegEx tool as below.
You may want to try the expression with your own data (it may need a few more tweaks).
I hope this helps. Good luck.
Input Data
Field1 |
abc<div>Endorsements: <span class='gen-ai-file-name'>abc</span></div>xyz abcdefg abc<div>Endorsements: <span class='gen-ai-file-name'>def</span></div>xyz abcdefg abc<div>Endorsements: <span class='gen-ai-file-name'>ghi</span></div>xyz abcdefg abc<div>Endorsements: <span class='gen-ai-file-name'>jkl</span></div>xyz abcdefg abc<div>Endorsements: <span class='gen-ai-file-name'>mno</span></div>xyz abcdefg |
RegEx Tool configuration
Regular Expression | <div>Endorsements: <span class='gen-ai-file-name'>.*?</span></div> |
Output Method | Tokenize Split to Rows |
Output Data
Field1 |
<div>Endorsements: <span class='gen-ai-file-name'>abc</span></div> |
<div>Endorsements: <span class='gen-ai-file-name'>def</span></div> |
<div>Endorsements: <span class='gen-ai-file-name'>ghi</span></div> |
<div>Endorsements: <span class='gen-ai-file-name'>jkl</span></div> |
<div>Endorsements: <span class='gen-ai-file-name'>mno</span></div> |
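For anyone who wants to test the expression outside the workflow, here is a rough Python sketch of the same tokenize idea. It is only an illustration, not the RegEx tool itself: the sample string is a shortened copy of the Input Data above, the `tokenize` helper name is made up for this example, and `re.DOTALL` is my own addition so that a value spanning line breaks would still match. The parentheses around `.*?` are also an addition, so that the XXXXXX value can be pulled out on its own alongside the whole div.

```python
import re

# Shortened copy of the Input Data above (the real HTML is proprietary).
html = (
    "abc<div>Endorsements: <span class='gen-ai-file-name'>abc</span></div>xyz abcdefg "
    "abc<div>Endorsements: <span class='gen-ai-file-name'>def</span></div>xyz abcdefg "
    "abc<div>Endorsements: <span class='gen-ai-file-name'>ghi</span></div>xyz abcdefg"
)

# Same expression as in the RegEx tool configuration, plus a capture group
# around the lazy match. re.DOTALL is an assumption: it lets "." cross newlines.
PATTERN = re.compile(
    r"<div>Endorsements: <span class='gen-ai-file-name'>(.*?)</span></div>",
    re.DOTALL,
)


def tokenize(text):
    """Return one (whole_div, inner_value) pair per occurrence, in order."""
    return [(m.group(0), m.group(1)) for m in PATTERN.finditer(text)]


rows = tokenize(html)
for whole_div, inner_value in rows:
    print(inner_value, "|", whole_div)
```

One row comes back per occurrence, and a string with no occurrences gives an empty list, which matches the "no rows" case in the question. One caveat: the lazy `.*?` stops at the first `</span></div>` it finds, so if the XXXXXX value itself contains a nested span inside a div, the match may be cut short. In the RegEx tool, wrapping `.*?` in parentheses may similarly return just the inner value in Tokenize mode, but that is worth checking against the tool's documentation.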