
Alteryx Designer Desktop Discussions

SOLVED

Find and extract paragraphs of text from HTML based on preceding html

hellyars
13 - Pulsar

I am trying to parse paragraphs of text that appear in a larger HTML document.

 

  • The target paragraphs are embedded in the middle of an HTML document. See example below.
  • The target paragraphs are always preceded by a record that only contains <br />.
  • The target paragraphs always end in <br />

I want to extract the target paragraphs as rows.  I assume I need a multi-row formula tool.  But, I don't know how to write the expression and then parse.

 

other html.
<br />
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <br />
<br />
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. <br />
<br />
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. <br />
other html
 

 

 

6 REPLIES
Thableaus
17 - Castor

Hi @hellyars 

 

Are the rows coming in this format, with <br /> in one row and the whole paragraph in the next?

Could you share a part of your original html document? Where are you getting it from?

Cheers,

hellyars
13 - Pulsar

 

OBE

 

Thableaus
17 - Castor

@hellyars 

 

If it's only this:

<br />
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. <br />
<br />
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. <br />
<br />
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. <br />

 

You could use a Replace function (<br /> to "") and then filter out the empty records, and you'd have your paragraphs as rows. Plain and simple.
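
For anyone reading along outside of Alteryx, the replace-then-filter idea can be sketched in Python. This is only an illustration under assumptions: the function name `paragraphs_by_replace` and the shortened sample rows are made up for the example, and it is not the Alteryx workflow itself.

```python
BR = "<br />"

def paragraphs_by_replace(rows):
    """Remove every <br /> tag, then drop the rows that end up empty."""
    # Hypothetical helper: mirrors a Replace step followed by a Filter step.
    cleaned = (row.replace(BR, "").strip() for row in rows)
    return [text for text in cleaned if text]

# Shortened sample rows (assumed: the input holds only the target rows).
rows = [
    "<br />",
    "Lorem ipsum dolor sit amet. <br />",
    "<br />",
    "Duis aute irure dolor. <br />",
]
result = paragraphs_by_replace(rows)
```

Note that this only isolates the paragraphs when the input holds nothing but the <br /> rows and the paragraphs; any other rows would pass through the filter as well.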

 

[Image: Snip.PNG]

Cheers,

hellyars
13 - Pulsar

@Thableaus 

 

But it is not only that. There is other HTML before and after the target paragraphs. That's what I tried to represent in the sample table with the first and last records, which are labeled "other html". Those records stand in for the hundreds of lines I have after the initial post-download parse.

 

I need to isolate the paragraphs. The problem is the paragraphs don't start with a tag. It's just straight text. I need something that plays off the fact that the target paragraphs are always preceded by a record that only contains <br />. That's how I can isolate the target paragraphs from all the HTML and random text.

Thableaus
17 - Castor

@hellyars 

 

Would that work?

 

[Image: Flag.PNG]

 

 

Create a flag that marks each row whose previous row is only <br />.

Then filter the paragraphs using that flag.
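
The flag-and-filter logic can also be sketched in Python for readers who want to see it spelled out. This is a rough illustration, not the workflow from the screenshot: the names `flag_after_br` and `extract_paragraphs` are hypothetical, and stripping the trailing <br /> and skipping rows that are themselves only <br /> are assumptions added for the example.

```python
BR = "<br />"

def flag_after_br(rows):
    """Pair each row with a flag that is True when the previous row is only <br />."""
    flagged = []
    previous = None
    for row in rows:
        flagged.append((row, previous is not None and previous.strip() == BR))
        previous = row
    return flagged

def extract_paragraphs(rows):
    """Keep flagged rows and strip the trailing <br /> from each one."""
    paragraphs = []
    for row, flag in flag_after_br(rows):
        text = row.strip()
        # Assumption: a flagged row that is itself only <br /> is not a paragraph.
        if not flag or text == BR:
            continue
        if text.endswith(BR):
            text = text[: -len(BR)].rstrip()
        paragraphs.append(text)
    return paragraphs

# Shortened version of the sample from the question.
sample = [
    "other html.",
    "<br />",
    "Lorem ipsum. <br />",
    "<br />",
    "Duis aute. <br />",
    "other html",
]
paragraphs = extract_paragraphs(sample)
```

Because the flag depends only on the row before, the surrounding "other html" rows are dropped without needing to know what they contain.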

 

Cheers,

hellyars
13 - Pulsar

@Thableaus 

 

Cool. I have to remember this little trick.  I paired it with another if statement so that I can capture both the standard and non-standard constructs.  Thanks!
