Alteryx Designer Desktop Discussions

Ahmad_S · ‎11-28-2019

Hi Everyone,

I have HTML tags in a single column (one value per row). I want to extract the text between the tags (the HTML also contains inline CSS, so class references are present):

Column 1

Something to extract

Something extra

Plain text - want to ignore this

<DOCTYPE!.....>Something

I tried to use RegEx, but I am a beginner in RegEx, so I could not get anywhere.

I would appreciate it if you could help.

Regards

Thableaus · ‎11-28-2019

Hi @Ahmad_S

You can try something like this:

REGEX_Replace([Field], ".*<p[^>]*>(.*)", "$1")

Cheers,
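For readers who want to see what the formula above does outside Alteryx, here is a minimal Python sketch of the same pattern. The sample rows are hypothetical (the original HTML from the question was not preserved), and the function name `after_last_p` is made up for illustration. The case-insensitive flag is an assumption about how the formula behaves by default.

```python
import re

# Hypothetical sample rows; the thread's original HTML is not available.
rows = [
    '<p class="a">Something to extract</p>',
    "Plain text - want to ignore this",
]

# Same pattern as in the formula above. The leading greedy ".*" pushes the
# match to the LAST "<p...>" on the line, and "(.*)" captures what follows.
# IGNORECASE is an assumption about the formula's default behaviour.
PATTERN = re.compile(r".*<p[^>]*>(.*)", re.IGNORECASE)


def after_last_p(text: str) -> str:
    """Return everything after the last opening <p ...> tag, if there is one."""
    return PATTERN.sub(r"\1", text)


for row in rows:
    print(repr(after_last_p(row)))
```

Two things are worth noticing in this sketch: the closing `</p>` stays in the captured text, and a row with no `<p ...>` tag is returned unchanged. Also, `<p[^>]*>` matches any tag whose name merely starts with "p" (for example `<pre>`), so the pattern may need tightening for real data.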

wdavis · ‎11-28-2019

Hi @Ahmad_S

If you want to achieve this without using RegEx, you could use a Filter tool to pull just the rows you require: [Field] Contains "p class="

Then use a Text to Columns tool with '>' as the delimiter; this will then separate out just the text you are looking to parse.

Let me know if that makes sense and works for you!

Thanks

Will
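The filter-then-split idea above can be sketched in Python to show what the resulting columns would look like. This is only an illustration: the sample rows and the function name `filter_and_split` are hypothetical, and the case-insensitive match is an assumption about how the filter compares text.

```python
# Hypothetical sample rows; the thread's original HTML is not available.
rows = [
    '<p class="a">Something to extract</p>',
    "Plain text - want to ignore this",
]


def filter_and_split(values, marker="p class=", delimiter=">"):
    """Keep only values containing the marker, then split each on the delimiter."""
    # Case-insensitive containment check (an assumption, see the lead-in).
    kept = [v for v in values if marker.lower() in v.lower()]
    # Each kept value becomes a list of pieces, one piece per output column.
    return [v.split(delimiter) for v in kept]


for columns in filter_and_split(rows):
    print(columns)
```

Every '>' in a row produces another piece, so a row with many tags produces many columns, and rows without the marker are dropped entirely.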

Ahmad_S · ‎11-28-2019

Hi @Thableaus.

It does not seem to work.

Ahmad_S · ‎11-28-2019

Hi @wdavis

Unfortunately, there are too many HTML tags in a single row, and if I use Text to Columns, I am pretty sure I would easily have 20+ columns to deal with.

Similarly, I want to keep the data that has no HTML tags as it is. If I use a Filter, it will exclude those rows.

Regards
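One way to meet both constraints above (many tags per row, and untagged rows kept as they are) is to delete every tag in place instead of filtering or splitting. The Python sketch below shows that idea; it is a different technique from the ones proposed in this thread, not the solution the poster ended up with. The sample rows and the name `strip_tags` are hypothetical.

```python
import re

# Matches anything that looks like a tag: "<", one or more non-">" chars, ">".
# This is a rough heuristic, not an HTML parser; it can misfire on content
# such as scripts or comments that contain ">" characters.
TAG = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove tag-like spans and leave all other text untouched."""
    return TAG.sub("", text)


# Hypothetical sample rows; the thread's original HTML is not available.
rows = [
    '<div><p class="a">Something</p> <b>extra</b></div>',
    "Plain text - want to ignore this",
]

for row in rows:
    print(repr(strip_tags(row)))
```

Because the substitution only touches tag-like spans, a row with no tags passes through unchanged, and the result stays in a single column regardless of how many tags the row held.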

Ahmad_S · ‎11-28-2019

@Thableaus

This worked after a bit of tweaking. Thanks a lot. Lifesaver!

Thableaus · ‎11-28-2019

@Ahmad_S

Use this in a Formula Tool, not in the REGEX tool.

Cheers,

Extract Information from HTML tags via RegEx