
SOLVED

Extract Information from HTML tags via RegEx

Ahmad_S
7 - Meteor

Hi Everyone,

 

I have HTML in a single column, one snippet per row. I want to extract the text between the <p> tags (the HTML also contains inline CSS, so class attributes are present):

 

Column 1

<p class="abc">Something to extract</p>

<p class="xyz">Something extra </p>

Plain text - want to ignore this

<DOCTYPE!.....><p>Something</p>

 

I tried using RegEx, but I am a beginner with it and have not been able to get anywhere.

 

I would appreciate any help.

 

Regards

Thableaus
17 - Castor

Hi @Ahmad_S 

 

You can try something like this

 

REGEX_Replace([Field], ".*<p[^>]*>(.*)</p>", "$1") 
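
To see what the pattern does outside Alteryx, here is a rough Python sketch of the same replacement. It is only an approximation: `extract_p_text` is a made-up helper name, Python writes the backreference as `\1` rather than `$1`, and `re.IGNORECASE` is passed on the assumption that Alteryx's REGEX functions ignore case by default.

```python
import re

# Same pattern as the REGEX_Replace expression above.
# re.IGNORECASE is an assumption about Alteryx's default behaviour.
P_TAG = re.compile(r".*<p[^>]*>(.*)</p>", re.IGNORECASE)


def extract_p_text(value: str) -> str:
    """Replace everything up to and including a <p>...</p> pair with its inner text.

    Rows without a <p> tag do not match, so they come back unchanged.
    """
    return P_TAG.sub(r"\1", value)


rows = [
    '<p class="abc">Something to extract</p>',
    '<p class="xyz">Something extra </p>',
    'Plain text - want to ignore this',
]

for row in rows:
    print(repr(extract_p_text(row)))
```

Two things worth noting: rows with no <p> tag pass through untouched, and because the leading `.*` is greedy, a row with several <p> tags keeps only the text of the last one.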

 

Cheers,

wdavis
Alteryx Alumni (Retired)

Hi @Ahmad_S 

 

If you want to achieve this without using RegEx, you could use a Filter tool to pull just the rows you need: [Field] Contains "p class="

 

Then use a Text to Columns tool with a delimiter of '>'. This will separate out the text you are looking to parse.
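
As a rough illustration of these two steps, the Python sketch below imitates the filter and the split on '>'. It is not how the Alteryx tools are implemented, and `filter_and_split` is a hypothetical helper name; the real Text to Columns tool may treat empty trailing pieces differently.

```python
def filter_and_split(rows):
    """Keep rows containing 'p class=' and split each kept row on '>'."""
    kept = [row for row in rows if "p class=" in row]  # Filter step
    return [row.split(">") for row in kept]            # Text to Columns step


rows = [
    '<p class="abc">Something to extract</p>',
    'Plain text - want to ignore this',
]

for pieces in filter_and_split(rows):
    print(pieces)
```

In this sketch each '>' in a row produces another piece, the piece holding the text still ends with the `</p` fragment, and rows without the tag are dropped by the filter.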

 

Let me know if that makes sense and works for you!

 

Thanks

Will

Ahmad_S
7 - Meteor

Hi @Thableaus.

 

It does not seem to work:

 

[Screenshot: Ahmad_S_0-1574952442575.png]

Ahmad_S
7 - Meteor

Hi @wdavis 

 

Unfortunately, there are too many HTML tags in a single row, and if I use Text to Columns, I am pretty sure I would easily end up with 20+ columns to deal with.

 

Also, I want to keep the rows that contain no HTML tags as they are. If I use a Filter, it will exclude those rows.

 

Regards

Ahmad_S
7 - Meteor

@Thableaus 

 

This worked after a bit of tweaking. Thanks a lot. Lifesaver!

Thableaus
17 - Castor

@Ahmad_S 

 

Use the expression in a Formula tool, not in the RegEx tool.


Cheers,
