Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
MikeSp
Alteryx



Alteryx Architectures - Introduction

Alteryx Architectures - Starter Architectures

Alteryx Architectures - SAML SSO Authentication

Alteryx Architectures - Workload Management

Alteryx Architectures - Resiliency and High Availability (you are here)


Welcome to the next installment in the Alteryx Architectures blog series. In this edition, we’ll look at resiliency and high availability techniques from an architecture perspective that help keep Alteryx Server environments online for your users with the least disruption to your business processes. As mentioned in previous posts, we recommend working with your sales representative on a sizing exercise and review to ensure your Alteryx Server environment is set up to meet your specific resiliency or high availability needs. This post covers the basics along with some best practices. 

 

Definitions 

 

Let’s start by defining some terminology around high availability and resiliency that will help us understand the scenarios discussed in this post. These terms can be difficult to distinguish from one another, and in some cases they overlap. 

 

  • High Availability - Maintaining acceptable continuous performance despite temporary load fluctuations or failures in services, hardware, or data centers. Availability means the system or application is accessible and usable by the end user, so "high availability" simply means that the system or application is almost always available. It is typically measured in uptime and achieved through built-in redundancy. 
  • Resiliency - The ability of a system or application to recover and continue operations even during hardware or software failures. 
  • Fault Tolerance – A system’s ability to continue operating properly when one or more of its components fails. 
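To make "measured in uptime" and "built-in redundancy" concrete: availability is usually quoted as a percentage, and independent redundant instances compound it. If each of n instances is up a fraction a of the time, at least one is up 1 - (1 - a)^n of the time. A quick sketch (the 99% figure is illustrative only, not an Alteryx SLA):

```python
def combined_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent
    instances (each available `single` of the time) is up."""
    return 1 - (1 - single) ** replicas

# A single machine that is up 99% of the time...
print(f"{combined_availability(0.99, 1):.4%}")  # 99.0000%
# ...paired with one redundant peer reaches "four nines":
print(f"{combined_availability(0.99, 2):.4%}")  # 99.9900%
```

This is why the architectures below keep adding machines per component: each redundant peer multiplies away another slice of the downtime.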

 

Resiliency in Alteryx Server 

 

Let’s go into some of the common deployments and how they react to failures in the environment. 

 

No Fault Tolerance / Resiliency 

 

We’ll start with the most basic starter deployment: a single Alteryx Server with all components built in. This deployment has no resiliency and cannot tolerate a failure, whether of the machine’s hardware, the software on the machine, or the data center the machine resides in. If any of these fails, the environment becomes unavailable. When Alteryx Server is critical to business operations, this can be a serious issue for an organization, resulting in hours or more of downtime until the single machine becomes operational again. 

 

[Diagram: a single-node Alteryx Server deployment with no fault tolerance]

  

Partial Fault Tolerance / Resiliency 

 

Partial fault tolerance provides some resiliency when a machine or service fails in an Alteryx Server deployment. A typical setup for many customers that achieves partial fault tolerance is shown below, with multiple Worker machines in a single data center. In this scenario, if one or two Workers fail, overall Worker capacity is reduced and workflows may sit in the queue longer, but workflows continue to run as long as at least one Worker remains online. This resiliency comes from the hub-and-spoke model of the Worker component. 

 

[Diagram: multiple Worker machines in a single data center]
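The effect of losing Workers is back-of-the-envelope arithmetic, not an Alteryx API: with fewer Workers, a queued backlog still drains, just more slowly. A rough sketch:

```python
import math

def drain_time(queued_jobs: int, workers: int, minutes_per_job: float) -> float:
    """Time to clear a backlog when each Worker runs one job at a time."""
    if workers < 1:
        raise ValueError("no Workers online: queue cannot drain")
    # Each "wave" runs up to `workers` jobs in parallel.
    return math.ceil(queued_jobs / workers) * minutes_per_job

# 12 queued workflows at 10 minutes each:
print(drain_time(12, 4, 10.0))  # 30.0  -> all four Workers healthy
print(drain_time(12, 2, 10.0))  # 60.0  -> two Workers lost; queue drains slower
print(drain_time(12, 1, 10.0))  # 120.0 -> still completes on a single Worker
```

The error case is the point of the diagram above: capacity degrades gracefully down to one Worker, and only at zero Workers does workflow execution stop entirely.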

           

In another typical setup, pictured below, if multiple Gallery machines are also set up behind a load balancer, a single Gallery machine failure results in no downtime for end users, provided the load balancer is configured with appropriate health checks to determine which Gallery machine(s) are online. Some of the basics of setting up a load balancer are covered in the Alteryx Server help documentation. 

    

[Diagram: multiple Gallery machines behind a load balancer]
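As a rough sketch of what "appropriate health checks" can look like, here is a minimal HAProxy-style fragment. The host names, ports, check path, and cookie persistence are illustrative placeholders, not Alteryx-mandated values; your load balancer's syntax and the Gallery URL you probe will vary, so treat this as a starting point against your own environment:

```
frontend gallery_front
    bind *:80
    default_backend gallery_nodes

backend gallery_nodes
    balance roundrobin
    # Sticky sessions, if your Gallery deployment requires session affinity
    cookie SERVERID insert indirect nocache
    # Mark a Gallery node down after 3 failed HTTP checks, up again after 2 passes
    option httpchk GET /gallery
    server gallery1 gallery1.example.com:80 check cookie g1 fall 3 rise 2
    server gallery2 gallery2.example.com:80 check cookie g2 fall 3 rise 2
```

The `fall`/`rise` thresholds are the piece that delivers "no downtime for end users": traffic is routed away from a failed Gallery node automatically, and routed back only after it passes consecutive health checks.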

 

However, even with these types of architectures, some scenarios can still result in a failure. If any of the other components fails, such as the Controller or MongoDB, the environment becomes unavailable for end users. A data center failure would likewise take the environment completely offline, since all the machines would become unavailable at once. 

 

Highly Available 

 

An ideal baseline high availability configuration is shown in the image below. This type of deployment can accommodate at least one failure of each component in the environment, and resiliency grows as machines are added to the individual data centers or to additional data centers. A large benefit of a highly available environment as pictured is that all the components, except for the Controller, can run simultaneously to increase the capacity, performance, and/or resiliency of the environment. 

 

Notably, in this scenario, MongoDB is scaled out to a replica set of at least three machines, which ensures the database remains available even if one of the MongoDB machines or a data center fails. This type of setup requires a user-managed MongoDB deployment and therefore some MongoDB expertise in your organization. As an alternative, MongoDB Atlas – a cloud-based service hosted by MongoDB directly – is also supported by Alteryx Server and provides a turn-key replica set option that requires much less MongoDB expertise. 
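For a sense of what a user-managed three-member replica set involves, here is a minimal initiation sketch in the MongoDB shell. The host names and replica set name are placeholders; consult the MongoDB documentation and the Alteryx Server system requirements for the supported MongoDB version and production settings such as authentication and keyfiles:

```
// Run once against one of the three mongod instances
rs.initiate({
  _id: "alteryx",
  members: [
    { _id: 0, host: "mongo1.example.com:27017" },
    { _id: 1, host: "mongo2.example.com:27017" },
    { _id: 2, host: "mongo3.example.com:27017" }
  ]
})

// Alteryx Server then connects with a replica-set connection string, e.g.:
// mongodb://mongo1.example.com:27017,mongo2.example.com:27017,mongo3.example.com:27017/?replicaSet=alteryx
```

Three members (rather than two) matters because replica set elections require a majority: with three voting members, the set can elect a new primary after losing any single machine or data center.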

 

The pictured setup can accommodate a single complete data center failure, and in a cloud deployment the data centers pictured below could be placed in multiple Availability Zones (or the equivalent) to survive the loss of an entire zone; surviving a full regional outage would require spanning multiple regions. 

 

Also new in this setup are additional Controller machines labeled “HA Controller.” These are passive instances of the Controller component, set up through Microsoft Failover Clustering. This automation is included with Windows Server and has been tested by Alteryx as a viable option. Other automation options also exist, such as custom PowerShell scripts or manual failover. Any of these solutions achieves our end goal: starting another Controller when the primary Controller machine fails or otherwise becomes unavailable. One very important consideration: only one Controller machine can be active in a single Alteryx Server environment at any given time. The image below shows the HA Controllers “greyed out,” meaning they are in a “warm” standby state. 

 

[Diagram: highly available deployment across multiple data centers with a load balancer, HA Controllers, and a three-member MongoDB replica set]
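The custom-script failover route boils down to a small watchdog loop. The sketch below is a hypothetical illustration, not an Alteryx API: the health-check and service-start callables are stand-ins for whatever your environment uses (a ping of the Controller, a Windows service start on the standby). The crucial invariant, per the note above, is that the standby starts only after the primary is confirmed down, so two Controllers are never active at once:

```python
import time
from typing import Callable

def controller_watchdog(primary_up: Callable[[], bool],
                        start_standby: Callable[[], None],
                        checks: int = 3,
                        interval: float = 30.0) -> None:
    """Promote the warm standby only after `checks` consecutive failed
    health checks, so only one Controller is ever active at a time."""
    failures = 0
    while failures < checks:
        if primary_up():
            failures = 0          # primary recovered; reset the count
        else:
            failures += 1         # require consecutive failures
        if failures < checks:
            time.sleep(interval)
    start_standby()               # primary confirmed down: fail over once
```

Requiring consecutive failures guards against a transient network blip promoting the standby while the primary Controller is still running, which would violate the single-active-Controller rule.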

 

Choosing a Setup Type 

 

When working with your sales representative to choose a model for your specific use case, there are a few questions you may wish to consider:

 

  1. How long is it acceptable for the Alteryx Server environment to be offline in the event of a failure, or is it acceptable for the environment to be offline at all? 
  2. Should your Alteryx Server environment be able to survive one or more failures? Which component(s) do you want to ensure will survive an outage? 
  3. Should you have an environment that can survive a potential outage of a data center, availability zone, or both? 

 

Summary 

 

In this post, we’ve introduced some of the typical high availability and resilient architecture types that Alteryx Server can support. Plenty of factors may influence the number of nodes or the configuration that works best for you. If you need help deciding on an environment configuration that best suits your organization’s needs, reach out to your sales representative, who can connect you with the right resources to design the right Alteryx Server environment for you.