In addition to the existing functionality, it would be good if the functionality below could also be provided.
1) Pattern Analysis
This will help profile the data in a better way, help confirm that data conforms to a standard or particular pattern, and help identify outliers so that the necessary corrective action can be taken.
A sample would be emails: 'abc@gmail.com' translates to 'nnn@nnnnn.nnn', so the outliers might be values where '@' or '.' is not present.
Another example might be phone numbers: 12345-678910 translates to 99999-999999, 123-456-78910 translates to 999-999-99999, (123)-(456):78910 translates to (999)-(999):99999, etc.
It would also help to have the Pattern Frequency Distribution alongside.
So from the above example we can see that phone numbers exist in 3 different patterns, which might call for relevant standardization rules.
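A minimal sketch of the pattern translation and frequency distribution described above, in Python. The names `to_pattern` and `pattern_frequencies` are hypothetical, and the fourth phone number is made-up sample data; the mapping (letters to 'n', digits to '9', everything else kept as-is) follows the examples in this idea and is not an existing feature of the tool.

```python
from collections import Counter

def to_pattern(value: str) -> str:
    """Map each letter to 'n' and each digit to '9'; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("n")
        else:
            out.append(ch)  # punctuation such as '@', '.', '-' is preserved
    return "".join(out)

def pattern_frequencies(values):
    """Count how often each pattern occurs in a column of values."""
    return Counter(to_pattern(v) for v in values)

# Sample data: three phone numbers from the post plus one made-up value.
phones = ["12345-678910", "123-456-78910", "(123)-(456):78910", "98765-432101"]
freq = pattern_frequencies(phones)

for pattern, count in freq.most_common():
    print(pattern, count)
```

Rare patterns at the bottom of the distribution are the natural candidates to review as outliers.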
2) More granular control of profiling
It would be good if the profiling options in the tool (like Unique, Histogram, Percentile25 etc.) could be selected differently for each field.
A sub-idea here might be to check data against external third-party data providers, e.g. USPS Zip validation. That would be meaningful only for selected address fields, so it makes sense only if there is granular control to select the type of profiling for individual fields.
Note - When implementing the granular control, we would also need to figure out how to present the final report in a more user-friendly format, as it might not conform to a standard table-like definition.
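To illustrate what per-field selection could look like, here is a rough Python sketch. Everything in it is an assumption for illustration: the `PROFILERS` registry, the `profile` function, the `plan` mapping and the sample field names are hypothetical and do not describe how the tool works today. The report is a nested mapping rather than a flat table, since different fields carry different metrics.

```python
from collections import Counter
from statistics import quantiles

def _non_null(values):
    return [v for v in values if v is not None]

# Hypothetical registry of profiling options, keyed by option name.
PROFILERS = {
    "unique": lambda values: len(set(_non_null(values))),
    "histogram": lambda values: Counter(_non_null(values)),
    # First quartile, using the default (exclusive) method of statistics.quantiles.
    "percentile25": lambda values: quantiles(_non_null(values), n=4)[0],
}

def profile(rows, plan):
    """Run only the options requested for each field in `plan`."""
    report = {}
    for field, options in plan.items():
        column = [row.get(field) for row in rows]
        report[field] = {name: PROFILERS[name](column) for name in options}
    return report

# Made-up sample rows and a per-field plan.
rows = [
    {"zip": "10001", "age": 10},
    {"zip": "10001", "age": 20},
    {"zip": None, "age": 30},
    {"zip": "94105", "age": 40},
]
plan = {"zip": ["unique", "histogram"], "age": ["percentile25"]}
report = profile(rows, plan)
print(report)
```

An external check such as Zip validation could be registered as one more entry in the same registry and requested only for address fields.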
3) Uniqueness
Since identifying duplicates is increasingly important for analytic results to be valid, some more uniqueness profiling could be added.
For example - Soundex, which is based on how similar or different two values sound.
Or Distance, which is based on how many changes are needed to turn one value into another, etc.
So alongside the Unique counts, we could also have counts where uniqueness factors in Soundex, Distance and other related algorithms.
For example, if the First Name field has the following data -
Jerry
Jery
Nick
Greg
Gregg
The number of Unique records would be 5, whereas the number of Soundex-unique records might be only 3. That would open up more data exploration opportunities, to see whether Jerry/Jery and Greg/Gregg are really the same person/customer etc.
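The counts above can be checked with a short Python sketch. It assumes the standard American Soundex rules, and it uses Levenshtein edit distance as one possible reading of "Distance" (the idea itself does not name a specific algorithm). The function names are hypothetical.

```python
# Soundex digit for each consonant group; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

def edit_distance(a, b):
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

names = ["Jerry", "Jery", "Nick", "Greg", "Gregg"]
unique_count = len(set(names))
soundex_unique_count = len({soundex(n) for n in names})
print(unique_count, soundex_unique_count)
print(edit_distance("Jerry", "Jery"), edit_distance("Greg", "Gregg"))
```

A distance-based count would additionally need a threshold (for example, treating values within distance 1 as the same), which is a design choice left open here.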
4) Custom Rule Conformance
I think it would also be good if some functionality similar to the multi-row formula could be provided, so that we can check conformance to custom business rules.
For example, it might be more helpful to check how many Age Units (Days/Months/Years) are blank/null where the related Age Number (1, 10, 50 etc.) is populated, rather than having a vanilla count of null and not null for individual (but related) columns.
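As a rough illustration of the Age Units rule, here is a Python sketch. The field names `age_number` and `age_unit`, the helper names and the sample rows are all hypothetical; the point is only that the rule looks at related columns together instead of counting nulls per column.

```python
def is_blank(value):
    """Treat None and empty/whitespace-only strings as blank."""
    return value is None or str(value).strip() == ""

def age_unit_missing(row):
    """Rule: Age Number is populated but the related Age Unit is blank."""
    return not is_blank(row.get("age_number")) and is_blank(row.get("age_unit"))

def count_violations(rows, rule):
    """Count rows for which a custom rule reports a violation."""
    return sum(1 for row in rows if rule(row))

# Made-up sample rows.
rows = [
    {"age_number": 1, "age_unit": "Years"},
    {"age_number": 10, "age_unit": None},
    {"age_number": 50, "age_unit": ""},
    {"age_number": None, "age_unit": None},
]

violations = count_violations(rows, age_unit_missing)
plain_null_units = sum(1 for row in rows if is_blank(row.get("age_unit")))
print(violations, plain_null_units)
```

In this sample the plain blank count for the unit column is 3, while only 2 rows actually break the rule, which is the distinction this idea is asking for.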
Thanks,
Rohit