Alteryx Designer Desktop Discussions

hellyars · ‎03-08-2019

I am trying to parse paragraphs of text that appear in a larger HTML document.

The target paragraphs are embedded in the middle of an HTML document. See example below.
The target paragraphs are always preceded by a record that only contains .
The target paragraphs always end in

I want to extract the target paragraphs as rows. I assume I need a Multi-Row Formula tool, but I don't know how to write the expression and then parse the result.

other html.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

other html

Thableaus · ‎03-08-2019

Hi @hellyars

Are the rows coming in this format?

Like in one row,

and the other the whole paragraph?

Could you share a part of your original html document? Where are you getting it from?

Cheers,

hellyars · ‎03-08-2019

OBE

Thableaus · ‎03-08-2019

@hellyars

If it's only this:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

You could use a Replace function ( to "") and then filter out the empty records, and you'd have your paragraphs as rows. Plain and simple.
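A minimal sketch of this replace-then-filter idea, written in Python rather than as an Alteryx expression. The actual tag being replaced did not survive in this thread, so `MARKER` is a hypothetical placeholder, and the function name is made up for illustration.

```python
from typing import Iterable, List

# Hypothetical placeholder: the real tag is not shown in the thread.
MARKER = "<marker>"


def strip_marker_and_drop_empty(rows: Iterable[str], marker: str = MARKER) -> List[str]:
    """Remove the marker text from every row, then keep only non-empty rows."""
    cleaned = (row.replace(marker, "").strip() for row in rows)
    return [row for row in cleaned if row]


rows = ["<marker>", "Lorem ipsum", "<marker>", "", "Duis aute"]
paragraphs = strip_marker_and_drop_empty(rows)
```

This only works when the input holds nothing but the marker rows and the paragraphs; any other rows would pass through the filter as well.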

Cheers,

hellyars · ‎03-08-2019

@Thableaus

But it is not only that. There is other HTML before and after the target paragraphs. That's what I tried to represent in the sample table with the first and last records, which are labeled "other html". They stand in for the hundreds of lines left over after the initial post-download parse.

I need to isolate the paragraphs. The problem is that the paragraphs don't start with a tag. It's just straight text. I need something that plays off the fact that the target paragraphs are always preceded by a record that only contains . That's how I can isolate the target paragraphs from all the HTML and random text.

Thableaus · ‎03-08-2019

@hellyars

Would that work?

Create a Flag that the row before is

Filter paragraphs with that Flag.
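The two steps above can be sketched in Python as follows. This is an illustration under assumptions, not the workflow actually used: `MARKER` is a hypothetical stand-in for the tag that precedes each paragraph (it is not shown in this thread), and the function names are invented. In Designer the equivalent would presumably be a Multi-Row Formula expression that compares the previous row (the `[Row-1:...]` reference) to the marker, followed by a Filter tool.

```python
from typing import List, Sequence

# Hypothetical placeholder: the real preceding tag is not shown in the thread.
MARKER = "<marker>"


def flag_rows_after_marker(rows: Sequence[str], marker: str = MARKER) -> List[bool]:
    """Return one flag per row: True when the previous row contains only the marker."""
    flags = []
    previous = None  # the first row has no predecessor
    for row in rows:
        flags.append(previous is not None and previous.strip() == marker)
        previous = row
    return flags


def extract_paragraphs(rows: Sequence[str], marker: str = MARKER) -> List[str]:
    """Keep only the rows that were flagged as following a marker row."""
    flags = flag_rows_after_marker(rows, marker)
    return [row for row, flagged in zip(rows, flags) if flagged]


rows = ["other html", "<marker>", "Lorem ipsum", "<marker>", "Duis aute", "other html"]
paragraphs = extract_paragraphs(rows)
```

Because the flag depends only on the preceding row, the surrounding HTML is dropped regardless of how many unrelated lines come before or after the paragraphs.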

Cheers,

hellyars · ‎03-08-2019

@Thableaus

Cool. I have to remember this little trick. I paired it with another if statement so that I can capture both the standard and non-standard constructs. Thanks!


Find and extract paragraphs of text from HTML based on preceding html