Engine Works

SteveA · 03-28-2016

Motivation

There’s nothing better than a fresh Server installation. Savoring that New-Server smell, the distinct crinkle of cellophane unboxing a shiny-new AWS instance, the anticipation as the installer initializes, the building excitement as R packages are given a new home. Waiting anxiously for the smooth hum of the Alteryx Engine as the first completed jobs start rolling off the queue...

But, not everything is unicorns and roses, and sometimes things get a bit sideways. Perhaps the Server doesn’t initialize correctly, or throughput isn’t what you expect, or certain modules cause issues with a particular dataset. These things happen, and I’d be remiss if I claimed they didn’t. Luckily, however, most issues leave a trail of breadcrumbs behind and are therefore resolvable.

There’s no magic bullet for diagnosing a troubled Server, but knowing where to look for clues and how to interpret those clues is a critical first step. What follows is a general set of guidelines to lean on in the event of a Server failure to facilitate diagnosis and bring the Server online as quickly as possible.

Tip 1: Believe the evidence and keep it simple

When a Server misbehaves, it typically leaves a trail behind it that can be used to reconstruct and understand the root cause(s) leading to the failure event. The Server usually knows why it failed and it will do its best to explain to you why it’s unhappy, so the single most important diagnostic tip is to believe what the Server is telling you.

If it complains about permissions on a particular directory, double-check that first before digging in deeper. If it mentions a possible resource conflict with a particular port or file, ask yourself what other programs might also be trying to access those resources. And, if you see a clear explanation such as this, don’t try to explain it away:

"Unable to start Queue service, insufficient license"

In the spirit of Occam’s Razor, most Server failures are due to simple environmental and/or configuration issues, so consider those factors first. Collect the hints the Server gave you and move forward with a simplest-first mantra, and avoid going down the proverbial rabbit hole unless it’s absolutely necessary.

Tip 2: Evaluate brokenness

Before digging too deeply into diagnosing a system, it’s important to establish just “how” broken the system actually is and to determine the best course of action.

1. Is the AlteryxService responding?

To start, open a browser and point to http://localhost/AlteryxService/status. A properly configured system should respond, showing the currently running version as follows:

2. Is the Gallery responding?

Similarly, a Server configured with the Gallery should respond on the Gallery root URL, which by default is http://localhost/gallery. When loaded into a browser, a properly-configured system should serve the Gallery home page:

3. Does the Scheduler connect?

The Scheduler is an excellent diagnostic tool for a Server configured with the Scheduler, the Gallery, or both. From the Alteryx Designer main menu, navigate to the Options menu, choose View Schedules, and connect to your Server. A properly-configured and running Server should display uploaded modules, current schedules, queued modules and results from previous module executions:

4. What does Task Manager tell us?

In the event there is no Server response to one of the above operations, check to see if the AlteryxService [service] is running in the Windows Task Manager (taskmgr.exe). To start, launch taskmgr.exe, select the Processes or Details tab (Windows 7 and 8, respectively), and select Show processes from all users if the option is available. If the Server is running correctly, you should see the primary Server process called AlteryxService.exe running along with one or more other related processes prefaced with “Alteryx”.

For example, a standalone Server configured as a Gallery will likely have at least four types of Server-related processes running including the AlteryxService.exe (Controller), AlteryxCloudCmd.exe (Gallery), map renderers and the Mongo Controller:

If the diagnostic checks fail and/or you fail to see any Server-related processes, try flipping to the Services tab. The AlteryxService service should be listed in the collection of available services. A non-responsive Server may have the AlteryxService service in a stopped state:

As you’re looking at taskmgr, are there one or more AlteryxService-related processes that are taking a large amount of CPU time? Or memory? If so, that might indicate a failure condition. Oftentimes, just restarting the AlteryxService [service] will bring the Server back up.
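For repeated checks, the first and last steps above can be scripted. What follows is a minimal Python sketch, not an official Alteryx tool: it probes the two default URLs quoted in this post and picks Alteryx-prefixed process names out of the text produced by the Windows `tasklist /FO CSV /NH` command. The function names `probe` and `alteryx_processes` are my own, and the host and port are assumptions you should adjust for your Server.

```python
import csv
import io
import urllib.error
import urllib.request

# Default URLs quoted in the post; adjust host/port for your own Server.
STATUS_URL = "http://localhost/AlteryxService/status"
GALLERY_URL = "http://localhost/gallery"


def probe(url, timeout=5.0):
    """Return (responded_ok, detail) for a single HTTP GET against url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The web server answered, but with an error status code.
        return False, f"HTTP {exc.code}"
    except OSError as exc:
        # Connection refused, timeout, DNS failure, and similar.
        return False, str(exc)


def alteryx_processes(tasklist_csv):
    """Return image names starting with 'Alteryx' from `tasklist /FO CSV /NH` text."""
    names = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        # The first CSV column of tasklist output is the image name.
        if row and row[0].lower().startswith("alteryx"):
            names.append(row[0])
    return names
```

On the Server itself, `probe(STATUS_URL)` and `probe(GALLERY_URL)` give a quick yes/no answer, and the output of `subprocess.run(["tasklist", "/FO", "CSV", "/NH"], capture_output=True, text=True).stdout` can be passed to `alteryx_processes`. An empty list is a hint to look at the Services tab next.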

Tip 3: Know where to find evidence

In the event of Server failure, there are several places on the system to look for evidence. These include the Server’s startup error file, log files and the System Event Viewer.

1. Server startup error file

If Server startup fails, it will attempt to create a file called LastStartupError.txt containing the final error message associated with a failure event. By default, the file will be written to C:\ProgramData\Alteryx\Service\LastStartupError.txt:

This file will contain the exact reason the Server refused to start correctly, so it should be an early stop on your diagnostic journey. And, heeding the advice given previously about believing what the Server is trying to tell you, this file may also be the last stop on the diagnostic train.
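If you check this file often, a few lines of Python can do it for you. This is only a sketch: the path is the default quoted above and may differ on a customized installation, the file encoding is an assumption (undecodable bytes are replaced rather than raising), and `read_last_startup_error` is a name I made up for illustration.

```python
from pathlib import Path

# Default location quoted in the post; it may differ on a customized install.
DEFAULT_ERROR_FILE = Path(r"C:\ProgramData\Alteryx\Service\LastStartupError.txt")


def read_last_startup_error(path=DEFAULT_ERROR_FILE):
    """Return the text of the startup error file, or None if it is absent or empty."""
    path = Path(path)
    if not path.is_file():
        return None
    # Encoding is assumed; errors="replace" avoids failing on unexpected bytes.
    text = path.read_text(encoding="utf-8", errors="replace").strip()
    return text or None
```

A return value of `None` means no startup error file was found at that path, which is itself useful information: the failure may have happened after startup, so the logs are the next stop.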

For example, consider the following Server error in LastStartupError.txt. If you find an error such as this, would you know what to do next?

2. AlteryxService log(s)

By default, the Server is configured with logging enabled. The default log level is sufficient to capture warnings and errors, and should be suitable for diagnosing most failures.

Logs for the AlteryxService.exe process will be located in the Logging folder specified on the Controller | General page in the Alteryx System Settings. By default, the most recent log file will be C:\ProgramData\Alteryx\Service\AlteryxServiceLog.log and others will be named with a timestamp after log rotation:

3. Gallery log(s)

Logs for the Gallery service (the AlteryxCloudCmd.exe process) will be located in the Logging Directory folder specified on the Gallery | General page in the Alteryx System Settings. By default, the path is C:\ProgramData\Alteryx\Gallery\Logs, which contains individual log files named by the current date:

4. System Event Viewer

Logs from the System Event Viewer in Windows may be useful in some scenarios, especially hard-to-diagnose startup issues. To access the logs, run the eventvwr.exe process as an Administrator. In the main tree view, choose Windows Logs | Application and search for Server-related processes, such as “AlteryxService” and “AlteryxCloud” to find events related to the Server:

In addition to events related to the Server itself, pay attention to System alerts such as Windows Updates, application crashes and system restarts. Correlating events from the System Event Viewer temporally with the AlteryxService and Gallery logs is a worthwhile exercise.
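To make the idea of temporal correlation concrete, here is a small Python sketch. It assumes you have already extracted events from each source as (timestamp, message) pairs; how you parse the timestamps depends on your log and export formats, which I deliberately leave out. The function name `correlate` and the five-minute default window are illustrative choices, not anything prescribed by the Server.

```python
from datetime import datetime, timedelta


def correlate(server_events, system_events, window=timedelta(minutes=5)):
    """Pair each server event with the system events that occurred within `window` of it.

    Both inputs are lists of (timestamp, message) tuples with datetime timestamps.
    Returns a list of ((timestamp, message), [nearby system events]) entries.
    """
    pairs = []
    for s_time, s_msg in server_events:
        nearby = [
            (t, msg) for t, msg in system_events
            if abs(t - s_time) <= window
        ]
        if nearby:
            pairs.append(((s_time, s_msg), nearby))
    return pairs
```

For instance, a Server error logged a few minutes after a Windows Update or system restart event would show up as a pair here, which is exactly the kind of clue worth following first.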

Tip 4: Know what to look for in the logs

The Server logs are a vital diagnostic resource, but sifting through chaff can be challenging. The log(s) most likely hold "The Answer," and the key to efficient, successful log mining is keeping focus on the task at hand, which is root-cause failure analysis.

The first scan through a log should be just that, a scan. Start by temporarily putting your blinders on and looking for log entries with the highest signal-to-noise ratio. For both the AlteryxService.exe (Controller) and AlteryxCloudCmd.exe (Gallery) logs, this means searching for log entries whose severity is “Error” or above (so “Error”, “Alert” and “Critical”).

Once we’ve identified possible high-value events in the log, it’s time to remove the blinders and focus on establishing context for the failure. Scanning up in the log to a time before the first error, are there other events that foreshadow the failure? Or, scanning down in the log after the failure event, what are downstream consequences? If other resources are available (for example, System Event Viewer logs), are there any temporal correlations that might point toward a root cause?

Detailed log parsing and correlation are beyond the scope of this post, but let’s discuss some general tips for first-pass log parsing of the two primary Server logs.

1. AlteryxService log

When scanning an AlteryxService log, I prefer a lightweight text editor, so examples here are from Notepad.

Since a single log file may contain events from multiple AlteryxService runs, the first step is to identify the point in the log where the Server last started, which is marked by a log entry containing the phrase “AlteryxService starting”. Finding that entry is simple: After opening the log, move the cursor to the bottom of the log, and search up for “AlteryxService starting”:

Now that we have a starting point in the log, let’s search down for the first error entry with the level moniker of “,ERROR,” and see what we find:

In this particular example, the Mongo service is not accepting a connection, and the next step would be to establish context around the error to establish root cause.
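The same search-up, search-down routine can be sketched in Python for when the log is too long to scan by hand. The two marker strings come straight from this post; the function name, the `context` parameter and the return shape are my own choices, and the sketch makes no assumption about the rest of the line format.

```python
START_MARKER = "AlteryxService starting"
ERROR_MARKER = ",ERROR,"


def first_error_since_last_start(lines, context=3):
    """Locate the first error entry after the most recent Server start.

    Returns None if no start marker exists. Otherwise returns a tuple
    (start_index, error_index, context_lines); error_index is None and
    context_lines is empty when no error follows the last start.
    """
    start = None
    for i in range(len(lines) - 1, -1, -1):   # search up from the bottom
        if START_MARKER in lines[i]:
            start = i
            break
    if start is None:
        return None
    for j in range(start + 1, len(lines)):    # search down for the first error
        if ERROR_MARKER in lines[j]:
            lo = max(start, j - context)      # never reach back past the start
            hi = min(len(lines), j + context + 1)
            return start, j, lines[lo:hi]
    return start, None, []
```

Feeding it `open(path).read().splitlines()` for the current log returns the first error of the latest run together with a few surrounding lines, which is a reasonable starting point for establishing context.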

2. AlteryxCloudCmd (Gallery) log

In addition to Notepad, Excel is also a good choice for looking at Gallery logs, so the examples here are from Excel.

As with the AlteryxService log, the first step in scanning the AlteryxCloudCmd log is to identify the point in the log where the Server last started, which is marked by a log entry containing the log header monikers. After opening the log in Excel, I will search down for the last occurrence of “LogLevel”, which is a unique and easy-to-find marker:

Now that we have our starting point in the log, search down for “,ERROR,” to see what the first failure event is. Note how I’ve deliberately hidden columns C-N inclusive in Excel to increase signal during this first-pass log scan:

In this example, the Gallery experienced an error retrieving map data from the AlteryxService and a quick log correlation confirmed the suspicion that datasets were not correctly installed:
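If Excel isn't handy, the first-pass Gallery scan can be approximated in Python. This sketch assumes the Gallery log is comma-separated and that the header row containing “LogLevel” repeats at each start, as described above; the other column names in a real log are not known to me, so the code only relies on the “LogLevel” column. The function name `gallery_errors` is hypothetical.

```python
import csv
import io

# Severities treated as "Error or above", per the post.
SEVERE = {"ERROR", "ALERT", "CRITICAL"}


def gallery_errors(log_text, level_column="LogLevel"):
    """Return rows (as dicts) at Error severity or above since the last header row."""
    rows = list(csv.reader(io.StringIO(log_text)))
    header_idx = None
    for i, row in enumerate(rows):
        if level_column in row:
            header_idx = i          # keep overwriting: we want the last header
    if header_idx is None:
        return []
    header = rows[header_idx]
    col = header.index(level_column)
    hits = []
    for row in rows[header_idx + 1:]:
        if len(row) > col and row[col].strip().upper() in SEVERE:
            hits.append(dict(zip(header, row)))
    return hits
```

Because each hit is a dictionary keyed by the header names, printing only the two or three keys you care about plays the same role as hiding columns in Excel.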

Tip 5: Practice

When it comes to diagnosing Server failures, an exceedingly useful tool is practice. The more familiar you are with log locations and contents, the easier it will be to use them in a time of need. Similarly, learning more complex tools such as the System Event Viewer (including how to filter by process name and how to export/consume the log) will save time.

Many Server issues are related in some way to its configuration, and becoming familiar with the Server configuration is critical. Recognizing potentially invalid (or “less valid”) configurations often short-circuits the need for advanced diagnosis.

If you are setting up a Server for the first time, or you have access to a sandboxed environment, consider experimenting with different configurations to explore how the system behaves and what the log signatures are. For example, how would your Server behave when configured with 20 logical workers and 32GB of sort/join memory? And what errors might you expect in the logs?
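As a back-of-the-envelope aid for that thought experiment, here is a tiny Python sketch. It rests on an assumption I want to flag loudly: that each worker may claim up to the configured sort/join memory on its own, so the worst case is simply the product of the two settings. The 64GB of physical RAM in the usage note is a made-up figure, and the function names are hypothetical.

```python
def worst_case_sort_join_gb(workers, sort_join_gb_per_worker):
    """Worst-case sort/join memory demand, ASSUMING every worker uses its full allowance."""
    return workers * sort_join_gb_per_worker


def is_overcommitted(workers, sort_join_gb_per_worker, physical_ram_gb):
    """True if the assumed worst-case demand exceeds the machine's physical RAM."""
    return worst_case_sort_join_gb(workers, sort_join_gb_per_worker) > physical_ram_gb
```

Under that assumption, `worst_case_sort_join_gb(20, 32)` is 640, so `is_overcommitted(20, 32, 64)` is true for a hypothetical 64GB machine. Whether and how that pressure surfaces in the logs is exactly what a sandbox experiment would tell you.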

Tip 6: Know when to ask for help

If you find yourself at a dead end diagnosing a Server failure, take advantage of other available help resources. Clearly, you’re already aware of the Alteryx Community site, where no question goes unanswered and helpful resources such as our knowledgebase abound.

Harder-to-solve issues may require help from our support team, who are always available at clientservices@alteryx.com. Your Alteryx contact will likely use some of the diagnostic aids discussed here.

Final thoughts

Each failure case is unique and some are more challenging to root out than others, but hopefully the general guidelines described help to streamline diagnosis. Thank you kindly for reading and stay tuned for more Server-related posts.