I need a way to find ALL matches in ONE string field that MAY contain personal data.
Ive got to redact personal data from a field that may contain names, postcodes, phone nums etc
The workflow checks for each of these using regex and a file of first names.
If theres a potential match, then a field is flagged with the content matched eg "Susan" and the message "check this" for a person to then check that the data is in fact personal and to be redacted.
This is because there are so many first names that are also things like "Autumn" or "St" which are not names but references to streets or seasons etc and these shouldnt be redacted.
Anyway, the match for first names uses find/replace tools that append the match as a column
But it only finds and appends on the FIRST match. I need a way to find ALL matches in a field..
for example "provide support to Susan and Mary" should find Susan and Mary, but at moment only finds the very first match, Susan.
Its got to be able to find a match for thousands of potential first names
Any ideas?
Solved! Go to Solution.
Rather than simply using Regex to flag up potential sensitive data you could replace it with 'XXX' for example:
REGEX_Replace([String], '[A-Z][a-z]+', 'XXX', 0)
This would convert "provide support to Susan and Mary" to "provide support to XXX and XXX"
If this solves your issue please mark the answer as correct, if not let me know! I've attached my workbook for you to download if needed.
Regards,
Jonathan
Hi @markstephens here's what I came up with - use the find and replace tool to replace the relevant part of the string with something distinctive, which you can then tokenise to rows. Then cross tab it to count how many flags appear in each string.
Hope that helps, Ollie
hi, thanks for that, its raising some good approaches i hadnt thought of.
problem is ive got a column for every single match now! so if one record out of the 15,000 matches "Carol" than the column is there in every record.
i should have said, the redaction is so the personal data can be identified and then redacted before publishing online in the original format but with personal data redacted.
thanks for that, but i should have been clearer, the string could contain ANYTHING, theres no fixed format, its free-form text designed to describe briefly what an invoice is for
Ah @markstephens, how about this then, which gives you a row for each match rather than a column. This means that your original data could be duplicated, but you won't have the extra columns with information in them.
edit: attached the correct file this time
THATS THE ONE!
ive added a summarise tool at the end to group by record ID, and concatenate the matches, so theres one record with field with all the matches in, this is perfect, thanks a lot!
@markstephens Nice! Glad I could help 🙂