SOLVED

Need help stripping out (parsing) text from HTML

csh8428
10 - Fireball

I've tried the approaches from a myriad of similar posts, but I'm not at all familiar with HTML and am certainly no expert with RegEx, so I'm at a loss trying to figure this out.

I have 2 fields, prjct.Objective and prjct.Background, that contain HTML (see the samples below), and I only need the actual text that would be displayed in a browser.

In the samples, I have replaced that text with the placeholder "I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS" wherever it occurs in each field.

The text can appear numerous times in each field. I think the overall HTML structure is the same throughout the data set, but it's hard to tell given its size.

 

prjct.Objective

"<HTML><head><meta http-equiv=""Content-Type"" content=""text/html; charset=utf-8"" /><title>Untitled</title><style type=""text/css""> 
p { margin-top: 0px;margin-bottom: 0px;line-height: 1.15; } 
body { font-family: 'Segoe UI';font-style: Normal;font-weight: normal;font-size: 13.3333333333333px; } 
.Normal { telerik-style-type: paragraph;telerik-style-name: Normal;border-collapse: collapse; } 
.TableNormal { telerik-style-type: table;telerik-style-name: TableNormal;border-collapse: collapse; } 
.NormalWeb { telerik-style-type: paragraph;telerik-style-name: NormalWeb;margin-top: 6.66px;margin-bottom: 6.66px;border-collapse: collapse; } 
.p_A43897F6 { telerik-style-type: local;text-align: left; } 
.s_1858219 { telerik-style-type: local;font-family: 'Arial';font-style: Normal;font-weight: normal;font-size: 16px;color: #222222;background-color: #FFFFFF; } 
.s_4D7243C3 { telerik-style-type: local;font-family: 'Segoe UI';font-size: 13.3333333333333px;color: #000000; } </style></head><body><p class=""NormalWeb p_A43897F6""><span class=""s_1858219"">I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS</span></p><p class=""Normal ""><span class=""s_4D7243C3"">&nbsp;</span></p></body></HTML>"

 

prjct.Background

"<HTML><head><meta http-equiv=""Content-Type"" content=""text/html; charset=utf-8"" /><title>Untitled</title><style type=""text/css""> 
p { margin-top: 0px;margin-bottom: 0px;line-height: 1.15; } 
body { font-family: 'Segoe UI';font-style: Normal;font-weight: normal;font-size: 14.6666666666667px; } 
.Normal { telerik-style-type: paragraph;telerik-style-name: Normal;border-collapse: collapse; } 
.TableNormal { telerik-style-type: table;telerik-style-name: TableNormal;border-collapse: collapse; } 
.p_3207D3C4 { telerik-style-type: local;font-family: 'Verdana';font-style: Normal;font-weight: normal;font-size: 16px;color: #000000; } 
.li_8F34398 { telerik-style-type: local;margin-left: 24px;text-indent: 0px;font-family: 'Symbol';font-style: Normal;font-weight: normal;font-size: 14.6666666666667px;color: #000000; } 
.s_2CC9B3CB { telerik-style-type: local;font-family: 'Segoe UI';font-size: 14.6666666666667px;color: #000000; } </style></head><body><ul style=""list-style-type:disc""><li value=""1"" class=""li_8F34398""><p class=""Normal p_3207D3C4"">1. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS</p></li></ul><p class=""Normal "">2. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS.</p><p class=""Normal "">3. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS.</p><p class=""Normal "">4. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS</p><p class=""Normal "">5. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS</p><p class=""Normal ""><span class=""s_2CC9B3CB""></span></p></body></HTML>"

 

Based on the snippets above, the output record should look like this:

prjct.ID: 1779
prjct.Title: Alpha
prjct.Code: 20200182
prjct.ActStartDate: 8/31/2020
prjct.ActEndDate: 12/3/2020
prjct.Objective: I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS
prjct.Background: 1. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS. 2. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS. 3. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS. 4. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS. 5. I NEED THIS PART AND IT CAN BE ANY TEXT OR CHARACTERS.

Any help is greatly appreciated!

 

-Craig

5 REPLIES
patrick_mcauliffe
14 - Magnetar

For something like this, I usually look for the pattern and then adjust or make macros around finding the specific tags.

First, I'll split all tags into new lines.  Then figure out what HTML text came before and after the one I want.  Usually if you're going to crawl a whole site it stays somewhat consistent.
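The "split all tags into new lines" step can be sketched outside Alteryx as well. The Python below is only an illustration, not the attached flow; the function name `split_tags` and the sample string are made up for the example.

```python
import re


def split_tags(html: str) -> list[str]:
    """Return tags and text runs as separate items, in document order."""
    # The capturing group makes re.split keep the tags themselves in the output.
    parts = re.split(r"(<[^>]*>)", html)
    # Drop the empty strings that appear between adjacent tags.
    return [p.strip() for p in parts if p.strip()]


# Hypothetical sample, shaped like the body of the fields in the question.
sample = '<body><p class="Normal"><span class="s_1">Some text</span></p></body>'

for line in split_tags(sample):
    print(line)
```

With one item per line, it becomes easy to see which tags sit directly before and after the text you want to keep.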

You may also get things like Unicode markup that requires some conversion to get rid of (assuming you don't want it).
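As an illustration of that conversion, here is a minimal Python sketch, assuming the markup in question is HTML character references such as the `&nbsp;` in the sample data. The helper name `decode_entities` is hypothetical; `html.unescape` is from the Python standard library.

```python
import html


def decode_entities(text: str) -> str:
    """Convert HTML character references to plain characters."""
    # html.unescape handles named (&amp;) and numeric (&#8211;) references.
    decoded = html.unescape(text)
    # &nbsp; decodes to U+00A0 (no-break space); map it to a regular space.
    return decoded.replace("\u00a0", " ").strip()


print(decode_entities("Tom &amp; Jerry&nbsp;"))
```

Decoding keeps the characters the entities stand for, whereas deleting the entities outright would also drop things like ampersands from the visible text.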

Let me know if this attached flow makes sense.

csh8428
10 - Fireball

AARRGGH.. I forgot, my company hasn't deployed the most recent version of Designer Desktop. I don't suppose you could save/export it as V 2020.2.3?

patrick_mcauliffe
14 - Magnetar

Sure.  But technically there's no need.

To downgrade a workflow version (works as long as all tools used in the workflow are available in both the source and target version), unzip that package with WinZip, 7Zip, etc.

Open the yxmd in Notepad and change the version:

[Screenshot: patrick_mcauliffe_0-1612386139989.png]

 

Then, click save and it should open just fine.

 

[Screenshot: patrick_mcauliffe_1-1612386170147.png]
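For anyone who would rather script the Notepad edit, here is a rough Python sketch. It assumes the version is stored in a `yxmdVer` attribute on the root `AlteryxDocument` element of the yxmd file, and that a major.minor value such as "2020.2" is what the target install expects; check both against your own file before relying on it. The function name `set_yxmd_version` is made up for this example.

```python
import re


def set_yxmd_version(xml_text: str, target: str) -> str:
    """Rewrite the first yxmdVer="..." attribute to the target version."""
    # Assumption: the workflow version lives in a yxmdVer attribute.
    new_text, count = re.subn(
        r'(yxmdVer=")[^"]*(")',
        lambda m: m.group(1) + target + m.group(2),
        xml_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no yxmdVer attribute found")
    return new_text


# Hypothetical, heavily trimmed workflow file content.
doc = '<?xml version="1.0"?>\n<AlteryxDocument yxmdVer="2020.4">\n  <Nodes />\n</AlteryxDocument>'
print(set_yxmd_version(doc, "2020.2"))
```

As noted above, this only helps when every tool in the workflow exists in both versions.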

 

csh8428
10 - Fireball

Thanks for the Version trick! That was very helpful. The parsing doesn't quite work though.

 

csh8428
10 - Fireball

A co-worker was able to help me out.

This worked:

REGEX_Replace(REGEX_Replace(REGEX_Replace(REGEX_Replace([FIELD],'&[^&]*;',''),'p[^*]*; \}',''),'Untitled',''),'<[^>]*>','')
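To make it easier to see what each nested call contributes, here is a rough Python translation of the same four replacements, applied innermost first. This is a sketch, not Alteryx code: it assumes REGEX_Replace matches case-insensitively by default (hence `re.IGNORECASE`), and the sample string is a shortened, made-up version of the prjct.Objective field.

```python
import re


def strip_html_regex(field: str) -> str:
    """Apply the four replacements from the formula, innermost first."""
    flags = re.IGNORECASE  # assumption: mirrors Alteryx's default matching
    # 1. Remove character entities such as &nbsp;
    text = re.sub(r"&[^&]*;", "", field, flags=flags)
    # 2. Greedy match from the first 'p' to the last '; }'. In this sample
    #    the first 'p' is inside 'http-equiv', so this removes most of the
    #    head, including the whole style block.
    text = re.sub(r"p[^*]*; \}", "", text, flags=flags)
    # 3. Remove the default page title if it survived step 2.
    text = re.sub(r"Untitled", "", text, flags=flags)
    # 4. Remove every remaining tag.
    return re.sub(r"<[^>]*>", "", text, flags=flags)


# Hypothetical sample, shortened from the field in the question.
sample = (
    '<HTML><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
    '<title>Untitled</title><style type="text/css"> \n'
    "p { margin-top: 0px;margin-bottom: 0px;line-height: 1.15; } \n"
    ".s_1858219 { telerik-style-type: local;color: #222222; } </style></head>"
    '<body><p class="NormalWeb"><span class="s_1858219">Reduce costs</span></p>'
    '<p class="Normal "><span class="s_4D7243C3">&nbsp;</span></p></body></HTML>'
)

print(strip_html_regex(sample))
```

The patterns are tuned to this particular HTML layout. Body text that contains an ampersand followed later by a semicolon, an asterisk inside the style block, or the word "Untitled" would be handled differently, so it is worth spot-checking the output on real records.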