Engine Works

awrangler · 07-20-2023

Performance tuning is a constant work in progress as we continue to design, develop, and deploy our flows. We have compiled three scenarios of complex flows, each with a possible solution. The aim is to make our flows simpler, better organized for troubleshooting, and more efficient. Each scenario introduces a way to restructure your recipes that can reduce computational costs and cut redundancy in developing your data pipeline!

Scenario 1: Lots of Complex Transformation Steps in a Recipe Object Slowing Job Runs

Diagram 1

In our first complex flow scenario conceptually depicted above in Diagram 1, we have a recipe object with lots of different transformations. Below in Figure 1 is an example image of this flow scenario in Designer Cloud Trifacta Classic (DCTC).

Figure 1

And below is an example image, Figure 2, of what the Transformer page may look like in this first complex scenario.

Figure 2

Generally, having 20-30 simple transformation steps is fine. In this scenario, however, a fair number of the steps are complex transformations. Having multiple different complex transformations in one recipe object can slow down job run performance. To show the different levels of transformation complexity, we have created a Complexity of Transformations figure, depicted below, that ranks transformations by the resources needed to execute them, from least complex (in green) to most complex (in red).

Diagram 2

A possible solution is to break up the different transformation steps into their respective logic pieces, as shown conceptually below in Diagram 3.

Diagram 3

We can break the original recipe object out into 5 smaller recipe objects like in Figure 3.

Figure 3

Breaking the recipe up into smaller recipe objects based on the different logic pieces and levels of transformation complexity will make the flow more understandable for any user and will optimize job run performance.
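The idea of splitting one large recipe into smaller logic pieces can be illustrated outside of DCTC. The Python sketch below is only an analogy and is not DCTC code: the stage names (`clean`, `enrich`, `summarize`), the `run_stages` helper, and the sample rows are all hypothetical. Each stage plays the role of one small recipe object, and running them in order records a timing per stage, which is what makes a slow step easy to spot.

```python
import time

# Hypothetical sample rows; in DCTC these would come from a dataset.
ROWS = [
    {"name": " alice ", "amount": "10"},
    {"name": "BOB", "amount": "5"},
    {"name": "alice", "amount": "7"},
]


def clean(rows):
    """Simple transformations: trim and lowercase names, cast amounts."""
    return [
        {"name": r["name"].strip().lower(), "amount": int(r["amount"])}
        for r in rows
    ]


def enrich(rows):
    """Add a derived column flagging large amounts."""
    return [{**r, "large": r["amount"] >= 10} for r in rows]


def summarize(rows):
    """Heavier step: aggregate the amounts per name."""
    totals = {}
    for r in rows:
        totals[r["name"]] = totals.get(r["name"], 0) + r["amount"]
    return totals


def run_stages(rows, stages):
    """Run each stage in order and record how long each one took."""
    timings = {}
    data = rows
    for stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[stage.__name__] = time.perf_counter() - start
    return data, timings


result, timings = run_stages(ROWS, [clean, enrich, summarize])
print(result)          # aggregated totals per name
print(list(timings))   # one timing entry per stage, in run order
```

Because every stage has its own name and timing, a problem can be traced to one logic piece instead of to a single long list of steps.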

Scenario 2: Direct Unions from Data Sources Slowing Job Runs

Diagram 4

In our second complex flow scenario conceptually depicted above in Diagram 4, there are multiple data sources directly unioned with other data sources. Below in Figure 4 is an example image of this flow scenario in DCTC.

Figure 4

Data sources that are unioned without an intermediary recipe will slow down job run performance. Whenever a recipe is created, DCTC generates an initial cached sample prior to any transformation steps. This initial cached sample can be leveraged every time a job runs. Without an intermediary recipe to provide that cached sample, DCTC will dynamically collect a sample to perform the transformation steps on every job run. Yes, that means every job run, including those after the first.

A possible solution to reduce that job run time is to add a blank recipe as an intermediary before a union between data sources, as shown conceptually below in Diagram 5.

Diagram 5

A blank recipe is a recipe with no transformation steps. Adding one is a quick solution, as in Figure 5.

Figure 5

A recipe with simple transformations (extracts, renames, basic math operators, or cleansing transformations) also works as an intermediary recipe.
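The difference between collecting a sample on every run and reusing a cached one can be sketched in a few lines of Python. This is a conceptual analogy only and does not reflect how DCTC is implemented internally: the `Source` and `BlankRecipe` classes, the `union` helper, and the sample size are all hypothetical.

```python
class Source:
    """Hypothetical data source that counts how often it is sampled."""

    def __init__(self, rows):
        self.rows = rows
        self.sample_calls = 0

    def collect_sample(self, n=2):
        # Stands in for the work of collecting a sample from the source.
        self.sample_calls += 1
        return self.rows[:n]


class BlankRecipe:
    """Pass-through step: collects a sample once, then reuses it."""

    def __init__(self, source):
        self.source = source
        self._cached = None

    def sample(self):
        if self._cached is None:
            self._cached = self.source.collect_sample()
        return self._cached


def union(*samples):
    """Concatenate samples in the order given."""
    combined = []
    for sample in samples:
        combined.extend(sample)
    return combined


# Direct union: each of the three runs samples both sources again.
direct_a, direct_b = Source([1, 2, 3]), Source([4, 5, 6])
for _ in range(3):
    union(direct_a.collect_sample(), direct_b.collect_sample())

# Union through blank recipes: each source is sampled only once.
cached_a, cached_b = Source([1, 2, 3]), Source([4, 5, 6])
recipe_a, recipe_b = BlankRecipe(cached_a), BlankRecipe(cached_b)
for _ in range(3):
    unioned = union(recipe_a.sample(), recipe_b.sample())

print(direct_a.sample_calls, cached_a.sample_calls)
```

In this toy model the direct path does the sampling work on each of the three runs, while the path through the pass-through step does it once and reuses the result.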

Resource(s):

See Enriching Data documentation for descriptions of unions, joins, lookups, and aggregations.
See Union Similar Datasets documentation for more information on using a union to combine two or more similar datasets.

Scenario 3: Redundant Manual Unions from Data Sources Slowing Development

Diagram 6

In our third complex flow scenario, conceptually depicted above in Diagram 6, there are lots of unions with an intermediary recipe between data sources of the same schema. Below in Figure 6 is an example image of this flow scenario in DCTC.

Figure 6

Manually unioning lots of data sources with the same table schema can be redundant as you develop your flow.

A possible solution would be to leverage dataset parameters as shown conceptually below in Diagram 7.

Diagram 7

Dataset parameters filter for the specified files and concatenate them, as if unioned, into a single input on your flow, as in Figure 7. This frees up time otherwise spent on redundant tasks.

Figure 7

When using dataset parameters, it is ideal to standardize a naming convention for filtering purposes, like in Figure 8.

Figure 8

If you do find yourself working with data sources that share the same table schema but lack a standardized naming convention, you can place all the data sources in one folder and add the folder instead of the individual files. You may find yourself in this scenario when working with legacy files.
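To see why a standardized naming convention matters for this kind of filtering, here is a small Python sketch of the general idea: a wildcard pattern selects the matching files, and their rows are concatenated into one input. It is an illustration only, not DCTC's parameter syntax. The file names, their contents, and the `load_matching` helper are hypothetical.

```python
import csv
import fnmatch
import io

# Hypothetical files keyed by name; the sales files share one schema.
FILES = {
    "sales_2023_01.csv": "region,amount\nwest,10\n",
    "sales_2023_02.csv": "region,amount\neast,20\n",
    "inventory_2023_01.csv": "sku,count\nA1,3\n",
}


def load_matching(files, pattern):
    """Concatenate the rows of every file whose name matches the pattern."""
    rows = []
    for name in sorted(files):
        if fnmatch.fnmatch(name, pattern):
            rows.extend(csv.DictReader(io.StringIO(files[name])))
    return rows


# One pattern replaces a manual union per file.
rows = load_matching(FILES, "sales_2023_*.csv")
print(rows)
```

With consistent names, a single pattern picks up both sales files and skips the inventory file. Without a convention there is no pattern to write, which is why grouping the files in one folder is the fallback.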

Resource(s):

See Overview of Parameterization documentation for more information on limitations, parameter types, ways of overriding parameters, hierarchy of overrides, and a general overview.
See Feature Deep Dive - Parameters interactive lesson for a guided walkthrough covering an introduction to parameter types, managing parameters, ways of overriding parameters, the hierarchy of overrides, limitations, and best practices.

Performance Tuning with Recipes in Designer Cloud, Trifacta Classic
