Alteryx Designer Desktop Discussions

SOLVED

Isolating string (html)

Ikebonamin
7 - Meteor

Hi, people. Help needed and appreciated! :)

 

I have a column named HTML, with many rows like this:

 

<div class="_2b06"><div class="_2b05"><a href="/luiz.bonamin?fref=nf&amp;rc=p&amp;__tn__=R-R">Fernanda Borges</a></div><div data-commentid="776884669184957_779428782263879" data-sigil="comment-body">Em qualquer raia do Paraná ou só para Unimed Curitiba? Uma fez fui ver em uma farmácia raia, e disseram que era só para Unimed Ctba...</div></div>

 

I need to isolate /luiz.bonamin into one column, and the text after "comment-body"> (the comment itself) into another.

Any thoughts?

 

 

5 REPLIES 5
danrh
13 - Pulsar

I'd advise a RegEx tool with ...

 

.*<a href="(.*?)\?.*"comment-body">(.*?)<.*

... as your expression and Parse as your Output Method.  This will create two additional fields with the strings you're after.

 

Hope it helps!
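
For anyone who wants to check the expression outside Alteryx, here is a minimal Python sketch. It assumes Python's re module treats this particular pattern the same way the RegEx tool does, which I have not verified inside Designer. The name parse_row is made up for illustration, and the comment text in the sample is shortened.

```python
import re

# Same expression as above. Group 1 is the profile path, group 2 the comment text.
PATTERN = re.compile(r'.*<a href="(.*?)\?.*"comment-body">(.*?)<.*')

def parse_row(html):
    """Return (profile_path, comment_text), or (None, None) if nothing matches."""
    m = PATTERN.match(html)
    return m.groups() if m else (None, None)

# Sample row from the question, with the comment text shortened.
sample = (
    '<div class="_2b06"><div class="_2b05">'
    '<a href="/luiz.bonamin?fref=nf&amp;rc=p&amp;__tn__=R-R">Fernanda Borges</a></div>'
    '<div data-commentid="776884669184957_779428782263879" data-sigil="comment-body">'
    'Em qualquer raia do Paraná?</div></div>'
)

path, comment = parse_row(sample)
print(path)     # /luiz.bonamin
print(comment)  # Em qualquer raia do Paraná?
```

The lazy (.*?) groups are what keep each capture short: the first stops at the first "?" in the link, the second at the first "<" after the comment.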

Ikebonamin
7 - Meteor

Thank you so much! :)

It worked.

Ikebonamin
7 - Meteor

Dear danrh,

 

Although the solution you provided worked, I failed to describe the full scenario.

Here's another example from the Excel file that needs to be cleaned:

 

1.<div class="_2b06"><div class="_2b05"><a href="/ferborges.enf?fref=nf&amp;rc=p&amp;__tn__=R-R">Fernanda Borges</a></div><div data-commentid="776884669184957_779428782263879" data-sigil="comment-body">Esse desconto na Droga Raia é em todas as cidades do Paraná? ?</div></div>

 

In this case, I need to capture the profile path (/ferborges.enf) and the comment text into two separate columns. That worked fine using:

.*<a href="(.*?)\?.*"comment-body">(.*?)<.*

BUT, in the same Excel file, we have a second type of syntax:

 

2. <div class="_2b06"><div class="_2b05"><a href="/profile.php?id=100008871755952&amp;fref=nf&amp;rc=p&amp;__tn__=R-R">Celina Baraviera</a></div><div data-commentid="776884669184957_781634255376665" data-sigil="comment-body">Esse desconto na Droga Raia é em todas as cidades do Paraná? ?</div></div>

 

These two kinds of lines alternate hundreds of times...

 

As you can see, I can't clean both types of lines at the same time. I tried, but as a newbie it seems impossible! If you could help once more, it would be great. And I promise this is the last time...

 

Thanks!

 

 

danrh
13 - Pulsar

This is getting pretty specific, so this might not work for all your cases, but try:

 

.*<a href="(.*?)(?:\?|&amp;)fref.*"comment-body">(.*?)<.*

The second part is the same, but the first part is now anchored to the "fref" parameter in the link. If that parameter changes position this won't work, but for the 3 examples you provided it gets the job done. Give it a go and see if it works! If not, we might want to split the rows based on which way they need to be parsed. Is there a distinction between the two formats? For instance, is "id=..." never in the first format and always in the second?
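
Here is a rough Python sketch of how the revised expression behaves on both formats. It again assumes Python's re module handles this pattern like the RegEx tool does. The name parse_row is made up for illustration, and the two rows are trimmed versions of the examples in the question.

```python
import re

# Group 1 now runs up to whichever separator ("?" or "&amp;") comes right before "fref".
PATTERN = re.compile(r'.*<a href="(.*?)(?:\?|&amp;)fref.*"comment-body">(.*?)<.*')

def parse_row(html):
    """Return (profile_path, comment_text), or (None, None) if nothing matches."""
    m = PATTERN.match(html)
    return m.groups() if m else (None, None)

# Format 1: "fref" is the first query parameter.
row_1 = (
    '1.<div class="_2b05"><a href="/ferborges.enf?fref=nf&amp;rc=p">Fernanda Borges</a></div>'
    '<div data-sigil="comment-body">Esse desconto?</div>'
)
# Format 2: an "id=..." parameter comes before "fref".
row_2 = (
    '2. <div class="_2b05"><a href="/profile.php?id=100008871755952&amp;fref=nf">'
    'Celina Baraviera</a></div><div data-sigil="comment-body">Esse desconto?</div>'
)

print(parse_row(row_1)[0])  # /ferborges.enf
print(parse_row(row_2)[0])  # /profile.php?id=100008871755952
```

In the second format the "?" after profile.php is not followed by "fref", so the lazy group keeps growing until it reaches "&amp;fref", which is why the id stays in the capture.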

Ikebonamin
7 - Meteor

It did an excellent job.

I attached the Excel file, if you wish to check it for yourself.

The last thing I need to do is delete the Null fields that appeared after running the expression.

But, I feel I have annoyed you too much already.

Thank you. And for sure, as soon as I get some practice, I'll help others, just as you did.

Cheers!!!!
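
On the Null fields mentioned above: they are presumably rows where the expression did not match at all (blank rows, for example). A minimal Python sketch of dropping those rows follows. The name parse_rows and the sample rows are made up for illustration, and this is not how the cleanup would be done inside Alteryx itself.

```python
import re

PATTERN = re.compile(r'.*<a href="(.*?)(?:\?|&amp;)fref.*"comment-body">(.*?)<.*')

def parse_rows(rows):
    """Parse every row and keep only the ones where the expression matched."""
    parsed = []
    for html in rows:
        m = PATTERN.match(html)
        if m is None:
            continue  # no match: this row would otherwise end up with empty fields
        parsed.append(m.groups())
    return parsed

rows = [
    '<a href="/ferborges.enf?fref=nf">Fernanda Borges</a>'
    '<div data-sigil="comment-body">Esse desconto?</div>',
    '',                       # blank row
    'some unrelated text',    # row without the expected HTML
]

print(parse_rows(rows))  # [('/ferborges.enf', 'Esse desconto?')]
```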

 

 
