Hi all,
I have a dataset with relationships that are best described with the following example:
When a user installs an app, a unique user-id is generated.
Each time the user uses the app, a unique session-id is generated.
For each session of the user, every screen the user visits is recorded together with some other information.
So the raw data could look something like so:
User      | Session | Screen
Alpha_123 | Ses_1   | f1
Alpha_123 | Ses_1   | f2
Alpha_123 | Ses_1   | b1
Alpha_123 | Ses_1   | b2
Alpha_123 | Ses_1   | e1
Alpha_123 | Ses_2   | f1
Alpha_123 | Ses_2   | e1
In the dataset, all information besides what I showed in the table above is initially at the "Screen" level; this information could, for instance, be time of day, date, etc.
Now, I have made a flow where, in Power BI, I can view information at a user level (for instance, how many purchases a specific user made), a session level (for instance, at which time of day most sessions take place) and a screen level (for instance, how a user moved through screens and how long they spent on each screen).
My problem is that the current model was made over many iterations and is therefore quite fragile, so my question is:
What are the best practices when designing a flow for multiple levels (here, three) where the detail of all three levels must be preserved?
My initial idea was to simply split my input file into 3, and then export 3 tables to PowerBI.
I would like to hear your thoughts, perhaps explained with a couple of examples.
BR Mikkel 🙂
Hi @miqlliot
I would suggest moving as much of the data prep/analysis as possible into Alteryx. As an example, this workflow calculates summary data and appends the results to each input row while retaining the detail of the input data.
The results look like this:
Every input row has the summary data appended to it. In the screenshot above you can see the User summary data and some of the Session summary. The workflow produces some Screen-level summary data as well. Within Power BI, you have the aggregate data available at the row level, so it's a question of displaying the fields you need based on the grouping level you want to display. If you're at the session level, filter for a distinct list of session IDs and display the Session-prefixed fields.
This method moves the calculations for known groupings out of Power BI, and the resulting data set can be visualized in almost any tool. Since you still have the original data, you can also build any custom transformations in the visualization platform.
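Outside of Alteryx, the same "append summaries to the detail rows" pattern can be sketched in pandas with grouped window calculations. This is a minimal illustration of the idea, not the actual workflow; the `Purchases` column and the summary field names are hypothetical, chosen to match the prefix convention discussed in this thread.

```python
import pandas as pd

# Sample data shaped like the thread's example (User / Session / Screen),
# with a hypothetical Purchases count recorded at the Screen level.
df = pd.DataFrame({
    "User":      ["Alpha_123"] * 7,
    "Session":   ["Ses_1"] * 5 + ["Ses_2"] * 2,
    "Screen":    ["f1", "f2", "b1", "b2", "e1", "f1", "e1"],
    "Purchases": [0, 1, 0, 0, 2, 0, 1],
})

# Append summary columns to every detail row; the prefix signals which
# grouping level each field belongs to.
df["UserTotalPurchases"] = df.groupby("User")["Purchases"].transform("sum")
df["UserSessionCount"] = df.groupby("User")["Session"].transform("nunique")
df["SessionScreenCount"] = (
    df.groupby(["User", "Session"])["Screen"].transform("count")
)

print(df)
```

Every row keeps its Screen-level detail, while `UserTotalPurchases` repeats the same value across all of a user's rows and `SessionScreenCount` repeats within each session, mirroring the one-big-table layout described above.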
Dan
Hi @danilang,
Thank you so much for the detailed answer, it looks exactly like the type of help I was looking for.
Let me ask some clarifying questions to make sure I understand exactly what you mean, so I can improve how I think about Alteryx.
What you are essentially proposing is to keep everything "in one big table", instead of splitting streams into user level, session level and screen level.
What I see happening when you do this is that some columns (for instance, Purchases) will contain a lot of "Null" fields.
Similarly, you will have some columns (for instance UserTotalPurchases) where there will be a lot of duplicated values.
Here, of course, one would have to be careful to use "UserTotalPurchases" only together with the "User" field, and not with the "Session" field, for instance. I see you have made this quite obvious with naming conventions: everything relating to the "Screen" level has no prefix, everything relating to the "Session" level has a "Session" prefix, and everything relating to the "User" level has a "User" prefix.
Do you think this is better than, for instance, splitting it into three separate tables, so the user table only has user-level data, the session table only session-level data, etc.? Or is it better to keep it all in one big master table? What do you see as the pros/cons here?
Again, thank you so much for the knowledge sharing 🙂
BR Mikkel
Hi @miqlliot
The amount of detail you decide to put in the master file, and the decision to split the file by grouping, are influenced by two main factors: the size of the data file and what you want to do with the data in your visualization platform. If you have billions of input records, then appending the summary info to your input data will increase the amount of RAM needed in your visualization platform and generally slow down the process. So a large input data set would benefit from having only the summary information and forgoing the detail. Once you decide to go down this route, there's no reason to keep the summary info in one big file. The summary data for each of the groupings is independent of the others. Indeed, you'd have to add some kind of artificial key to merge all the data into one file, and your final file would be sparse as well, since the defined columns would have little overlap.
If, on the other hand, you want a dynamic dashboard with the ability to drill down all the way to the detail level, you'll need either one file with the summary data attached, as in my example, or four files: the raw detail and the three summary files. Having one file makes the logic in the visualization a little simpler to implement, for the reasons outlined before. Having four files decreases the overall size of the transferred files, since you don't duplicate data in the summary files, but it increases the complexity and slows down the visualization program. Indeed, if in the visualization program you create intermediate tables joining the four files, using the detail info as a key to link the various summary tables, your memory requirements will be similar to those for the one large file.
It's always a trade-off. The final implementation method you choose will depend on the size of the data and what you want to do with it in the final platform.
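The four-file option can be sketched the same way: keep the raw detail and build each summary table independently, joined back by its natural key only when the dashboard needs it. Again a pandas illustration with the same hypothetical columns, not the actual workflow:

```python
import pandas as pd

# Screen-level detail, same hypothetical shape as before.
detail = pd.DataFrame({
    "User":      ["Alpha_123"] * 7,
    "Session":   ["Ses_1"] * 5 + ["Ses_2"] * 2,
    "Screen":    ["f1", "f2", "b1", "b2", "e1", "f1", "e1"],
    "Purchases": [0, 1, 0, 0, 2, 0, 1],
})

# Independent summary tables, one per grouping level.
user_summary = (
    detail.groupby("User")
          .agg(UserTotalPurchases=("Purchases", "sum"),
               UserSessionCount=("Session", "nunique"))
          .reset_index()
)
session_summary = (
    detail.groupby(["User", "Session"])
          .agg(SessionScreenCount=("Screen", "count"))
          .reset_index()
)

# Re-joining everything on the natural keys reproduces the
# "one big table" layout, with the memory cost that implies.
merged = (detail.merge(session_summary, on=["User", "Session"])
                .merge(user_summary, on="User"))
```

The summary tables are small and free of duplicated values, while `merged` shows why joining them back in the visualization layer ends up costing roughly the same memory as shipping one large file.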
Dan
@danilang You are a saint. You are answering my questions spot on! I don't know how to thank you.
Based on what you say, I think I will follow your example and create "one big table" with all information on all levels, and do all my calculations and build my features there. Then, based on what is wanted in the visualization program, I can always modify the output format down the line.
For instance, if detailed drilldowns are wanted, I can output it to fit that, or if only a top-level view of user behavior is wanted, I can output it to fit that.
To me this seems like a best-practice way of going about it, as the dataset is manipulated only once, and then, depending on the output needed, only small modifications will need to be made at the end. Would you agree?
BR Mikkel 🙂
I do agree with this approach in your case, but evaluate all future situations using the criteria we discussed here. There is no one-size-fits-all solution.
On an unrelated note, if you mark one of my answers as a solution, it will help the Alteryx Community search engine bubble this thread to the top of the results when anyone asks a similar question.
Dan