SOLVED

Extract Information from HTML tags via RegEx

Ahmad_S
7 - Meteor

Hi Everyone,

 

I have HTML in a single column, one snippet per row. I want to extract the text between the <p> tags (the HTML also contains inline CSS, so class attributes are present):

 

Column 1

<p class="abc">Something to extract</p>

<p class="xyz">Something extra </p>

Plain text - want to ignore this

<DOCTYPE!.....><p>Something</p>

 

I tried to use RegEx, but I am a beginner and could not get anywhere.

I would appreciate any help.

 

Regards

6 REPLIES
Thableaus
17 - Castor

Hi @Ahmad_S 

 

You can try something like this

 

REGEX_Replace([Field], ".*<p[^>]*>(.*)</p>", "$1") 
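For readers who want to see what this pattern does outside Alteryx, here is a rough Python sketch run against the sample rows from the question. It is only an approximation: Python's `re` module is not the engine Alteryx uses, `\1` stands in for `$1`, and the `re.IGNORECASE` flag is an assumption meant to mirror case-insensitive matching. The function name `extract_p` is made up for illustration.

```python
import re

# Same pattern as the expression above; "\1" is Python's spelling of "$1".
PATTERN = re.compile(r".*<p[^>]*>(.*)</p>", re.IGNORECASE)

def extract_p(value: str) -> str:
    """Replace the matched span with the text captured inside the <p> tag.

    Rows that do not match the pattern are returned unchanged.
    """
    return PATTERN.sub(r"\1", value)

rows = [
    '<p class="abc">Something to extract</p>',
    '<p class="xyz">Something extra </p>',
    'Plain text - want to ignore this',
    '<DOCTYPE!.....><p>Something</p>',
]
results = [extract_p(r) for r in rows]
for original, result in zip(rows, results):
    print(repr(original), "->", repr(result))
```

Two things to keep in mind with this pattern: rows without a <p> tag pass through untouched, and because the leading `.*` is greedy, a row holding several <p> tags keeps only the text of the last one.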

 

Cheers,

wdavis
Alteryx

Hi @Ahmad_S 

 

If you want to achieve this without using RegEx, you could use a Filter tool to pull only the rows you require, with the condition [Field] Contains "p class="

 

Then use a Text to Columns tool with a delimiter of '>'. This will then separate out the text you are looking to parse.
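The two steps can be imitated in Python to see what the split would produce on the sample rows. This is a sketch only: a list comprehension stands in for the Filter tool and `str.split` for Text to Columns, and the real tools may treat empty trailing pieces differently.

```python
rows = [
    '<p class="abc">Something to extract</p>',
    '<p class="xyz">Something extra </p>',
    'Plain text - want to ignore this',
    '<DOCTYPE!.....><p>Something</p>',
]

# Filter step: keep only rows that contain the literal text 'p class='.
kept = [r for r in rows if "p class=" in r]

# Text to Columns step: split each kept row on the '>' delimiter.
columns = [r.split(">") for r in kept]

for row, parts in zip(kept, columns):
    print(row, "->", parts)
```

In this sketch the second piece of each row still ends with `</p`, so a further cleanup step would likely be needed, and rows without `p class=` (including the plain <p> in the last sample row) are dropped by the filter.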

 

Let me know if that makes sense and works for you!

 

Thanks

Will

Ahmad_S
7 - Meteor

Hi @Thableaus.

 

It does not seem to work:

 

[Screenshot: Ahmad_S_0-1574952442575.png]

Ahmad_S
7 - Meteor

Hi @wdavis 

 

Unfortunately, there are too many HTML tags in a single row, and if I use Text to Columns, I am pretty sure I would easily end up with 20+ columns to deal with.

 

Also, I want to keep the rows that have no HTML tags as they are. If I use a Filter, it will exclude those rows.
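Both requirements (many tags per row, and untouched rows when there is no HTML) can be illustrated with a small Python sketch. This is not something proposed in the thread: the non-greedy pattern, the `extract_all_p` name, and the `" | "` separator are all assumptions chosen for the example.

```python
import re

# Non-greedy capture so each <p>...</p> pair is matched separately.
# \b keeps tags such as <pre> from being treated as <p>.
P_TAG = re.compile(r"<p\b[^>]*>(.*?)</p>", re.IGNORECASE | re.DOTALL)

def extract_all_p(value: str, sep: str = " | ") -> str:
    """Join the text of every <p> tag in the row.

    Rows with no <p> tag are returned exactly as they came in.
    """
    parts = [text.strip() for text in P_TAG.findall(value)]
    return sep.join(parts) if parts else value

print(extract_all_p('<p class="a">One</p><div>x</div><p>Two</p>'))
print(extract_all_p("Plain text - want to ignore this"))
```

Joining into a single string keeps the result in one column, which avoids the 20+ column problem described above.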

 

Regards

Ahmad_S
7 - Meteor

@Thableaus 

 

This worked after a bit of tweaking. Thanks a lot. Life saver!

Thableaus
17 - Castor

@Ahmad_S 

 

Use this in a Formula Tool, not in the REGEX tool.


Cheers,
