Engine Works

awrangler · 07-27-2023

Performance tuning is an ongoing effort as we design, develop, and deploy our flows. We have assembled 3 complex flow scenarios along with potential solutions. The objective is to streamline our flows, make troubleshooting easier, and improve overall efficiency. Each scenario demonstrates how you can restructure your flows and leverage the Plans feature to orchestrate your job runs. By exploring these potential solutions, you can discover ways to reduce computational expenses and eliminate redundancy when developing your data pipeline!

Scenario 1: Direct Unions from Data Sources Slowing Job Runs

Diagram 1

In our first complex flow scenario, conceptually depicted above in Diagram 1, a single flow contains several different logic pieces. Below in Figure 1 is an example image of this flow scenario in Designer Cloud, Trifacta Classic (DCTC).

Figure 1

These different logic pieces can be broken into smaller components or flows for a smoother and more orchestrated execution. Additionally, breaking up one monolithic complex flow into its different logic pieces will make the logic easier to understand and ease future troubleshooting.

A possible solution is to break this complex flow into 3 smaller flows, one per logic piece, as shown conceptually below in Diagram 2.

Diagram 2

Flow 1 can contain all logic up to the join recipe, as in Figure 2.

Figure 2

Flow 2 can contain all logic after the join recipe up to the common transformation recipe, by creating a Reference Dataset, as in Figure 3.

Figure 3

Flow 3 can contain all logic after the common transformation recipe up to publishing, by using an intermediate file. The intermediate file would be the output metadata of Flow 2. It would be helpful to leverage parameters in Flow 3 to specify which metadata to pull in dynamically for each run, as shown below in Figure 4.

Figure 4

With the different logic pieces split across those 3 flows, we can leverage the Plans feature to orchestrate the execution, as conceptually reflected in Diagram 3 below.

Diagram 3

For a more concrete DCTC plan example, see Figure 5 below.

Figure 5
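To make the orchestration idea concrete, here is a minimal Python sketch of a plan that runs three flow tasks in sequence. It is only a conceptual model, not the DCTC API: plans are built in the Plans page, and `run_plan`, `fake_run`, and the rule that later tasks are skipped after a failure are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def run_plan(tasks: List[str], run_flow: Callable[[str], bool]) -> Dict[str, str]:
    """Run flow tasks in order and skip everything after the first failure."""
    status: Dict[str, str] = {}
    failed = False
    for name in tasks:
        if failed:
            # Assumption: each task is configured to run only on upstream success.
            status[name] = "skipped"
            continue
        ok = run_flow(name)
        status[name] = "succeeded" if ok else "failed"
        failed = not ok
    return status


def fake_run(name: str) -> bool:
    """Hypothetical stand-in for a job run; pretends that Flow 2 fails."""
    return name != "Flow 2"


result = run_plan(["Flow 1", "Flow 2", "Flow 3"], fake_run)
for task, state in result.items():
    print(task, state)
```

Because each flow is a separate task, a failure points at one logic piece rather than at the whole monolithic flow, which is the troubleshooting benefit described above.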

Note: You can also use a reference dataset for Flow 1 if you plan to reuse the first portion of logic, now separated into Flow 1, for other use cases. As for Flow 2, if you do not plan on dynamically replacing the data sources, you can use an intermediary file instead of rerunning the Flow 2 logic. In this scenario, the intermediary file would be the published output of Flow 2, and reusing it can help reduce computational costs.

Resource(s):

See Build Sequence of Datasets documentation for more information on how to chain recipes in the same flow for creating reference objects and imported datasets from outputs.
See View of Reference Datasets documentation for more information on creating and adding reference datasets to another flow.
See References Page documentation for more information.
See Plans documentation for a general overview.
See Plans Page documentation for more information on the Plans page.
See Create a Plan documentation for more information on creating a plan.

Scenario 2: Redundancy in Flows Sharing Same Initial Logic Slowing Development

Diagram 4

In our second complex flow scenario, depicted above in Diagram 4, there are multiple flows that share the same initial logic piece but differ downstream with customer-specific transformations.

Below in Figure 6 is an example image of this flow scenario in DCTC.

Figure 6

Running 3 flows in this scenario presents an opportunity to save on computational costs by reducing the number of flows, and an opportunity to ease troubleshooting. With an easier troubleshooting process, less time is spent in development.

As shown conceptually below in Diagram 5, a possible solution would be to separate the shared initial logic piece in all 3 flows here into Flow 1.

Diagram 5

The different downstream logic pieces with customer-specific transformations can then go into Flow 2, reducing our 3 flows down to 2 flows, as in Figure 7.

Figure 7

In Flow 2, as in Figure 8, leveraging parameters would be helpful for specifying which metadata to pull in dynamically for each run.

Figure 8
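The idea behind parameterizing Flow 2 can be sketched in a few lines of Python. This is an illustration only: in DCTC the parameters are defined in the application, and the path layout, the `customer` and `run_date` names, and the `resolve_source` helper below are hypothetical.

```python
from string import Template

# Hypothetical layout for the shared output that Flow 1 produces per customer.
SOURCE_PATTERN = Template("shared_output/${customer}/${run_date}.csv")


def resolve_source(customer: str, run_date: str) -> str:
    """Fill the parameter placeholders to get the input for one run of Flow 2."""
    return SOURCE_PATTERN.substitute(customer=customer, run_date=run_date)


# One parameterized flow serves every customer instead of one flow per customer.
customers = ("customer_a", "customer_b", "customer_c")
paths = [resolve_source(name, "2023-07-27") for name in customers]
for path in paths:
    print(path)
```

The design point is that the flow logic stays the same and only the parameter values change between runs.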

Additionally, we can leverage the Plans feature to orchestrate the execution, as conceptually reflected in Diagram 6 below.

Diagram 6

For a more concrete DCTC plan example, see Figure 9 below.

Figure 9

Scenario 3: Redundant Manual Unions from Data Sources Slowing Development

Diagram 7

In our third complex flow scenario, conceptually depicted above in Diagram 7, there are various complex unions between data sources with different table schemas. Below in Figure 10 is an example image of this flow scenario in DCTC.

Figure 10

Each union between a pair of data sources differs in the number of columns and in what data are present in each column. When replacing data sources or troubleshooting, it can be frustrating to navigate the different logic pieces in our monolithic complex flow. Less time will be spent in development if we can ease the troubleshooting process. Here we have 3 complex unions in our complex flow that we want to split up and organize.
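To see why each pair of sources needs its own handling, consider this small Python sketch of a union between two sources whose schemas differ. It is not how DCTC implements its union; the `union_by_name` helper and the sample rows are hypothetical and only illustrate aligning columns by name and filling the gaps.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def union_by_name(left: List[Row], right: List[Row]) -> List[Row]:
    """Stack two row lists, aligning columns by name and filling gaps with None."""
    columns: List[str] = []
    for row in left + right:
        for col in row:
            if col not in columns:
                columns.append(col)  # keep first-seen column order
    return [{col: row.get(col) for col in columns} for row in left + right]


# Hypothetical sources: the second one has an extra "region" column.
source_a = [{"id": "1", "amount": "10"}]
source_b = [{"id": "2", "amount": "12", "region": "west"}]

combined = union_by_name(source_a, source_b)
for row in combined:
    print(row)
```

Each additional schema adds its own alignment rules, which is why keeping every complex union in one flow becomes hard to navigate.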

As shown conceptually below in Diagram 8, a possible solution would be to break up the 3 complex unions into Flow 1, Flow 2, and Flow 3 based on their respective table schema.

Diagram 8

Flow 4 can contain all logic after the complex unions up to publishing. A more concrete DCTC flow example of Flow 1 would look like the image in Figure 11.

Figure 11

A more concrete DCTC flow example of Flow 2 would look like this image in Figure 12.

Figure 12

As a more concrete DCTC flow example of Flow 3, here is Figure 13.

Figure 13

Finally, here is a more concrete DCTC flow example of Flow 4 provided below as Figure 14.

Figure 14

In Flow 4, leveraging parameters would be helpful for specifying which metadata to pull in dynamically for each run.

Then we can leverage the Plans feature to orchestrate the execution, as conceptually reflected in Diagram 9 below.

Diagram 9

Flow 4 starts only after the last of Flows 1 to 3 has executed successfully, which in this scenario is our slowest flow, Flow 3. For a more concrete DCTC plan example, see Figure 15 below.

Figure 15
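The timing of this plan can be sketched as a small dependency graph in Python. This is a conceptual model, not the DCTC API: the durations are made-up numbers, and it assumes that Flows 1 to 3 have no dependencies on each other and can run side by side, with Flow 4 waiting on all three.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set, Tuple

# Hypothetical run times in minutes; Flow 3 is the slowest of the union flows.
durations: Dict[str, int] = {"Flow 1": 5, "Flow 2": 8, "Flow 3": 12, "Flow 4": 4}

# Each task maps to the tasks it must wait for.
depends_on: Dict[str, Set[str]] = {
    "Flow 1": set(),
    "Flow 2": set(),
    "Flow 3": set(),
    "Flow 4": {"Flow 1", "Flow 2", "Flow 3"},
}


def schedule(deps: Dict[str, Set[str]], mins: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Compute the earliest start and finish time of every task."""
    start: Dict[str, int] = {}
    finish: Dict[str, int] = {}
    for task in TopologicalSorter(deps).static_order():
        # A task starts when its slowest upstream task has finished.
        start[task] = max((finish[d] for d in deps[task]), default=0)
        finish[task] = start[task] + mins[task]
    return start, finish


start, finish = schedule(depends_on, durations)
print("Flow 4 starts at minute", start["Flow 4"])
print("Plan finishes at minute", finish["Flow 4"])
```

Under these assumed numbers, Flow 4 starts when Flow 3 finishes, so the plan takes 16 minutes rather than the 29 minutes it would take to run all four flows back to back.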

Performance Tuning With Plans in Designer Cloud, Trifacta Classic