Hi everyone!
I'm wrangling a dataset that most columns are divided in multiple lines. In the image I posted, for example, the number 13.7500 and 13.9024 should be in the same column, but in the first and second row respectively. And the values between these numbers should be on separated columns. So, each blue box in the image represents a single row.
Anyone knows how to wrangle this?
Thanks in advance!
Solved! Go to Solution.
Hi @Daniel Sacramento? , thanks for posting this question.
If you could elaborate a little on the issue.
In general - it looks like selecting the multiple spaces between 2 values, then hitting the suggestion option of "split upon" should solve this, just wanting to make sure I understand
Thanks,
Amit.
Hi Amit, thanks for your answer!
Your first point is correct. All the records are in "column1". After the wrangling, the blue box should be "flattened", so each value corresponds to a column.
I will post another screenshot, I hope this helps!
Thanks for the clarification @Daniel Sacramento? .
So now we see that, not only all the values are in 1 column - but what you refer to as "1 record" is spread across multiple records.
Is this how Trifacta shows the data to you, the moment you open the file?
The reason I'm asking that is because defaultly - Trifacta create a new line each time it's bumped into "newline" (\n).
So, you'd like it to drop to new line only... after "a few newLines".
In order to understand better how your file is structured, you can -
Thanks,
Amit.
Thanks again @Amit Miller? .
When I import a new file into Trifacta, by default I uncheck the "Detect file structure", or something along those lines. I will send you the file here. When you open it, you will see that the first rows are informations about the measurement and the data itself starts at row 54.
Best regards.
Thanks for the file @Daniel Sacramento?
You don't necessarily need to "unstructure" the file. Simply create a rownum() column (out of the $sourcerownum metadata parameter) and then remove all records which are of rownum=53 and lower (where the numbers start).
Now, this has been very interesting - there are 4 possible solutions (at least). I will present you with all of them. Guessing the first one is the most straight-forward.
The main goal here is to make sure all relevant numbers are in the same record.
In order to do that, we observe how many records actually belong to the same original row.
As per your explanation, each "original row" is composed out of 5 records, presented in Trifacta.
In order to emphasis that each 5 consequent records are "1 row", we can create this column:
ceiling(rownumber() / 5)
ordered by: $sourcerownum
so the first 5 records will have value of 1, next 5 records will have value of 2, etc.
So now, each 5 records have same value in rownum column.
Now we can use the pivot() transformation to change the structure of the table.
now for each record, we have all the numbers we need! (in an array)
option A
Now, do
Option B
instead of mergingArrayElements - create a new array, with each element representing a number.
so we'll do "extract matches to array" from the array we've created, while the element pattern to extract will be `{number}`. Using Trifacta Pattern to our aid!
Now that we have an array of 25 numbers - we can use "unnest", and you can manually write all the elements you want to extract (so elements 0-24).
Options C+D are a bit less straightforward (revolving around range() function, creating objects, and unpivoting the table))
If you're curious, I can elaborate on those aswell, either through here or through a short zoom session.
Hope this makes sense, feel free to ask questions!
Thanks,
Amit.
Sorry for the delayed response, I was working on another project. But I can't thank you enough @Amit Miller? , very helpful and clever solution! I used the option A and worked like a charm!
Great, glad to hear that @Daniel Sacramento? !
Happy wrangling.