Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Data clean td HTML, REPLACECHAR

LindaD
7 - Meteor

Can anyone help on how to get rid of these <td>. I am not familiar with Regex how would I use that. But will take anything that will help. I AM NEW to all of this..UPDATE I added the workflow

 

Alteryx1.JPG

16 REPLIES 16
LindaD
7 - Meteor

Sure I'm sorry, does this help?

echuong1
Alteryx Alumni (Retired)

You can actually make this even simpler using the text to columns. Use both < and > as delimiters.

 

echuong1_0-1606945745032.png

 

joshbennett
11 - Bolide
11 - Bolide

@echuong1 has already given some great recommendations, so while I may not be able to add much here, I figured I could share a quick method I use to get rid of such unwanted characters with a single formula expression. It may seem more complex initially than it actually is, so I will try to add some contextual explanation as well - but it is a very useful method that you can rinse and repeat quickly on other datasets as applicable once you've used it once or twice.

 

Using the Multi-Field Formula, you can use the REGEX_REPLACE and REPLACE functions (see here for syntax help) to accomplish what you need quickly and easily. After selecting each of your target fields within the tool's configuration, you can chain these expressions together to accomplish what you want with a single formula.

 

  1. First, use REGEX_REPLACE with '[[:punct:]]' specified as the pattern parameter, as shown below. '[[:punct:]]' is what is referred to in RegEx as a "POSIX Bracket Expression", which you can read more about here if desired (though not required for your purpose). You can essentially think of this as a variable or placeholder that looks for any punctuation characters, all of which you are trying to remove in your use case. Thus, you can start with the following expression:
regex_replace(tostring([_CurrentField_], '[[:punct:]]', '')

 

In the expression above, 

  • The first parameter [_CurrentField_] is the variable used by the Multi-Field Formula tool reference each of your selected fields against which it applies the expression. We have to encapsulate it in a simple TOSTRING function to let the Multi-Tool Formula know to treat the data as a text string.
  • The second parameter '[[:punct:]]' tells the tool to look for any punctuation characters in the selected fields.
  • The third parameter '' just signifies an empty quoted string, indicating you want to replace the punctuation characters with nothing (effectively removing the punctuation)

 

With your particular dataset, this formula by itself would still leave your "td" characters, so we can wrap the above formula in a REPLACE function, with the initial formula specified as the first parameter, like so:

 

replace(regex_replace(tostring([_CurrentField_]), '[[:punct:]]', ''),'td', '')

 

Doing so removes all of your unwanted characters, leaving only numeric characters which you can convert to a numeric data type with a subsequent Select tool or simply by using the "Change Output Type to" setting in the Multi-Field Formula (which would just require wrapping the above formula in a TONUMBER function, similar to how the first formula was wrapped in the second in the above steps).

 

joshbennett_0-1606961949178.png

 

 

Let me know if that helps or if you have any questions.

 

LindaD
7 - Meteor

OMG is that all I had to do. Thanks so much I hope to become a great Analyst I enjoy what I am learning.

LindaD
7 - Meteor

Thanks for the detail explanation. I will read about this and from my understanding this is advance but I am up for the challenge. I have a long way to go but I am moving forward. Thanks again!

Thomas
5 - Atom

Thanks @ joshbennett for your detailed answer! I'm working on the same problem, however after using the regex replace function I still end up with unwanted characters in two cells (see picture) starting with sup..

 

Could you please advise how to solve this?  

 

Thanks Thomas

joshbennett
11 - Bolide
11 - Bolide

Hi @Thomas 

 

Chaining another REGEX_REPLACE should do it, this time removing any non-digit character:

REGEX_REPLACE(tostring([_CurrentField_], '[^\d]', '')

 

So, chaining together with the previous two would be something like:

REPLACE(REGEX_REPLACE(REGEX_REPLACE(tostring([_CurrentField_]), '[[:punct:]]', ''), '[^\d]', ''),'td', '')

 

Example attached - hope that helps.

 

Josh

Labels