Hi, We have a data feed which contains a free text field which members of the public can type comments into. The feed contains thousands of records. We need to remove any personally identifiable information (PII) from this field before it is loaded into a database table. This includes email addresses, phone numbers and UK postcodes. We have three regex expressions to identify these and replace the PII with a placeholder. The email and phone regexes are working fine. However, the UK postcode regex throws some false positive matches which we don't want to replace. We also have a list of valid UK postcodes held in a database table. UK postcodes can have a few different formats and our regex is pretty good at identifying all these formats but inevitably, some false positives are cropping up. We would like to * Iterate over any regex matches in the field and for every match:- * Lookup the regex match against the UK postcode table and if it is a valid UK postcode :- * Replace the regex match with the placeholder "**postcode removed**" in the input field. For example, the input field could look like: "V34ry happy that the news is now back on C4 9pm. Viewer in SW1P 2TX" We want to change it to the following before loading into the destination table: "V34ry happy that the news is now back on C4 9pm. Viewer in **postcode removed**" N.B. "SW1P 2TX" is a valid UK postcode. "V34ry" (mistyped "very") and "C4 9pm" are false positives from our UK postcode regex but are not valid UK postcodes. I'm new to Alteryx so am not sure how to achieve what we want here. Do I need to write a custom component to do this (which I am happy to do) or can this be achieved out of the box? Alternatively we could probably do this inside the DB with a stored procedure or similar but we'd prefer to perform the redaction before hitting the DB. Thanks in advance for your assistance! Tim

Redacting UK Postcodes within a free text field

Hi,

We have a data feed which contains a free text field which members of the public can type comments into. The feed contains thousands of records. We need to remove any personally identifiable information (PII) from this field before it is loaded into a database table. This includes email addresses, phone numbers and UK postcodes. We have three regex expressions to identify these and replace the PII with a placeholder. The email and phone regexes are working fine.

However, the UK postcode regex throws some false positive matches which we don't want to replace. We also have a list of valid UK postcodes held in a database table. UK postcodes can have a few different formats and our regex is pretty good at identifying all these formats but inevitably, some false positives are cropping up.

We would like to

Iterate over any regex matches in the field and for every match:-
Lookup the regex match against the UK postcode table and if it is a valid UK postcode :-
Replace the regex match with the placeholder "**postcode removed**" in the input field.

For example, the input field could look like:

"V34ry happy that the news is now back on C4 9pm. Viewer in SW1P 2TX"

We want to change it to the following before loading into the destination table:

"V34ry happy that the news is now back on C4 9pm. Viewer in **postcode removed**"

N.B. "SW1P 2TX" is a valid UK postcode. "V34ry" (mistyped "very") and "C4 9pm" are false positives from our UK postcode regex but are not valid UK postcodes.

I'm new to Alteryx so am not sure how to achieve what we want here. Do I need to write a custom component to do this (which I am happy to do) or can this be achieved out of the box? Alternatively we could probably do this inside the DB with a stored procedure or similar but we'd prefer to perform the redaction before hitting the DB.

Thanks in advance for your assistance!

Tim

Workflow

Accepted answers

Philip

Instead of using a RegEx, what about using the Find Replace tool under the Join tab directly and feed the list of actual postal codes into the right side? This should minimize false positives. The only problem would occur if the zip code is not in its official format.

UK Zip Code Replacement.png

UK Zip Code Replacement.yxmd

All comments

Philip

UK Zip Code Replacement.png

UK Zip Code Replacement.yxmd

MarqueeCrew

@TimMott,

Regex_Replace([Text],"\u\d\s\d\u\u|\u\d\d\s\d\u\u|\u\u\d\s\d\u\u|\u\u\d\d\s\d\u\u|\u\d\u\d\u\u|\u\u\d\u\s\d\u\u",'**Postcode Removed**',0)

According to wikipedia, these are your postcode formats:

Format	Coverage	Example
AA9A 9AA	WC postcode area; EC1–EC4, NW1W, SE1P, SW1	EC1A 1BB
A9A 9AA	E1W, N1C, N1P	W1A 0AX
A9 9AA	B, E, G, L, M, N, S, W	M1 1AE
A99 9AA	B, E, G, L, M, N, S, W	B33 8TH
AA9 9AA	All other postcodes	CR2 6XH
AA99 9AA	All other postcodes	DN55 1PT

I converted each postcode into a regex option. The formula checks for any of the formats and REQUIRES case sensitivity. Using that formula, I get:

Output
V34ry happy that the news is now back on C4 9pm. Viewer in **Postcode Removed**

No database required.

Cheers,

Mark

TimMott

Thanks Philip,

I have tested your approach by creating a macro which encapsulates the redaction functionality using Find & Replace. I was initially concerned that performance would not be great (the postcode table has 1m+ entries) but in fact, performance seems to be OK - the postcode table is read once and presumably cached locally for the duration.

Thanks again for your input.

Tim

Quick Links

This months top contributors

atcodedog05 19458

Qiu 15866

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703