Working on Healthcare Import data annually, and most of the columns are comparably in good state such as date, total amount, price, origin etc... Beside those values, the main column consisting the important information such as dosage, package No., type, trademark, manufacturer, usage sector etc... are all written in 1 cell of a column for relating rows and the great problem is each row is different than others around 300-500 paths respectively. Example:
Column A
Row 1 - Dosage, Package No, Type, Trademark, Manufacturer, Usage
Row 2 - Trademark, Size, Type, Dosage, Usage
Row 3 - Package No, Usage, Dosage, Type
Row 4 - Manufacturer, Usage, Size, Dosage
Alternative version:
Row 1 - A, B, C, D, E
Row 2 - B, E, D, C, J
Row 3 - L, M, P, A, .....
Most of the row details are delimited by colon ",". And when spliting by delimitor the details are splited but you can see that the column which i want name Dosage has trademark, package No, Manufacturer details in it.
Is there a solution to make them in correct column positions?
Solved! Go to Solution.
Hi Tsolmon, Do you have a way of identifying what each row value is, whether it's a pattern, lookup table, etc.? For example, if each component of a row was a unique cell value, do you have a way of identifying that a dosage is a dosage, a package number is a package number and not a trademark, a manufacturer is a manufacturer and not a type, etc.?
If you could provide an example row, that would help me as well!
Hello, Mr. David
Thanks for reply,
In attached please find the sample data for reference.
In sample file, there are English and Local versions plus the expected result when using Trifacta.
Please kindly note that data entrying source is man made, not a scanned bar code. Because of it, there are many peoples making data entry to the main system by each of their perspectives and output data is similar to this or more complex.
There is no identifier for noticing patterns and dosage is not a type or package etc.. But when seeing it, it's identifiable.
My process after split and extract process:
Am i need to export file to Excel etc to need a manual wrangle and again to import to Trifacta and so on?
Any reply would be appreciated.
Hi Tsolmon, thanks for sending this!
I'll use your product column as an example, and then hopefully it will give you an idea of how you might apply this to the rest of your data
Here are the steps I would recommend (there are quite possibly many solutions out there, this is the one that comes to mind for me).
Step 1: Create a lookup table mapping each possible value to a number representing the order you would like the resulting columns to be in.
Step 2: In Trifacta, Extract all of the values in the product column between the commas as a list.
This step will look like this:
Transformation: Extractlist
Col: Product
On: `{any}+`
Delimiter: ','
As: Product_List
This will create a list of all of the values present in each cell.
Step 3: Flatten the list
This step will take each value in the Product_List array and create a new unique row for it, duplicating the values from the other columns
Transformation: Flatten
Col: Product_List
Step 4: Now, you want to do a lookup on the now Flattened Product_List column. Before doing the lookup I trimmed leading and trailing whitespace, then made both the extractlist_product column and the product column in the lookup table uppercase.
This will return a number for each value in the product_list column listed under the Mapping column
Step 5: Pivot the Mapping column, use the Any() function to return the Product values, group by all of the other columns in your dataset to return the correct number of rows.
Transformation: Pivot
Column: Mapping
Functions: Any(Product_List)
Group by: Product, columnA, columnB, etc.
which will give the following result:
This there may need to be some extra work done, and the lookup table may need to be more complex (especially for the indication column), but hopefully this will serve as a guide for how you could possibly accomplish this.
Let me know if you have any additional questions!
David
Hello, David
Thanks for the very detailed guide.
Followed the all steps and managed.
It gives me more options and insights for manipulating in wrangle process. I would like ask one more question about the lookup table manually created, in Pro or Enterprise versions is manually created lookup table necessary or there is other solutions to manipulate it in the wrangling process?
Thanks for your kind support on guiding,
Hi Tsolmon,
There aren't any differences in the functions or transformations available between the free, pro and enterprise versions. For more info on how they differ you can look at Triafcta's products page on the website https://www.trifacta.com/products/.
Another way to accomplish the above without having to create a lookup table outside of Trifacta would be the following:
Transformation: Derive
Formula: CASE([IN(extractlist_Product, ["Medicine"]), 1, IN(extractlist_Product, ["injection", "solution for injection", "drops"]), 2, ...., 5)
As: Mapping
What the above says is if the current row in extractlist_Product is in the array containing "Medicine", output 1, if it's in the array containing "injection", "solution for injection", or "drops", output 2 and so on.
It might be easier for you to do it this way, it is a matter of personal preference.
But to your original question, this would be available in all versions.
Best,
David
Hello, David
Thanks for the alternate option.
Please check the following issues:
On above figure i couldn't find the function. I have tried to change the types of column to String, Date/Time, # etc... to find the STANDARDIZE function but no success.
But on the youtube video there is STANDARDIZE function below the LOOKUP function
2. The PATTERNS option is not available on Free Wrangler
It's also confusing, and that is the reason i thought that Free, PRO, Enterprise versions are different in functions. And some hypothesis that this pattern is locked and is it somehow influencing the notifying patterns in the process?
Hi Tsolmon, the video you are referencing is from a much older version of the product. The standardize feature was deprecated after that version. We have tried to hide all the videos that make reference to that feature but apparently we have missed some. Apologies for that being confusing!
As for Patterns, yes, I forgot that feature did not make it into the free version this past release. So yes, there is actually a slight difference in the wrangling capabilities of Free vs Pro vs. Enterprise for this current release. Again, apologies for the confusion!
Hello, David
Thanks for the clarification.
Now it's clear.
Thanks for your support on guided steps.