Alteryx Designer Ideas

Share your Designer product ideas - we're listening!

Robustly remove invisible characters and other whitespace

Often as I am scraping web sites, some clever developer has put an invisible character (ASCII or Unicode) in the data which causes terrible trouble.

I've identified 89 instances of zero-width or non-zero-width glyphs that are not visible and/or Alteryx does not classify as whitespace. There are probably more, but Unicode is big y'all.

Unfortunately, the Trim() string function only removes 4 of these characters (Tab, Newline, Carriage Feed, and Space).
REGEX_REPLACE with the \s option (which is what the Cleanse macro uses) is a little better but still only removes 20. And it removes all instances, not just leading and trailing.

I've attached a workflow which proves this issue.


@APolly: this is what I mentioned at GKO.

And I did see this post (https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Elegantly-remove-all-ASCII-characters-...), but it's too brute force. Especially as Alteryx is localized and more users need those Unicode characters.

8 Comments
Alteryx Certified Partner
11 - Bolide
11 - Bolide

Thanks for the link, @Ruud 

Alteryx Certified Partner

I've given you a star and linked back as this is a broader solution that'd solve my idea here 🙂 

https://community.alteryx.com/t5/Alteryx-Designer-Ideas/Remove-Zero-Width-Spaces-with-the-Data-Clean...

Alteryx Alumni (Retired)
Status changed to: Under Review

Thanks for submitting this idea. I'll investigate! 

Alteryx Partner

Heartily agree with this. I had a dataset that I had to bring into Excel and compare using the EXACT formula to see something was wrong which non of the trimming tools in Alteryx would correct, causing misleading values when categorizing data with the offending defect.

16 - Nebula
16 - Nebula

This one is super-important - thank you for raising @Hollingsworth  - we cannot limit to ASCII only because we have an international user base; and at the same time we do need to deal with non-visible characters more robustly because we constantly work with platforms like Mainframe or various flavours of Unix/Linux that use different non-visible characters.

 

Super important update to a tool like the Data Cleanse tool and to the Trim formula!

Moderator
Moderator
Status changed to: Not Planned

Thank you for your post! This idea is interesting to us, however we've determined that we're unable to include this idea on our road map for the product at this point due to several outstanding issues and factors. However, should we be able to return to your idea in the future we will update the status back to Under Review.

16 - Nebula
16 - Nebula

@Hollingsworth - we may have to do a shared project instead to build this out in the Python SDK since it's not planned for the product.