Welcome to part 3 of the Supporting Promote series. In this series, we will tackle some common issues and questions, and provide best practices for troubleshooting. This article will step through the process of restoring the Promote web app.
The Promote web app has crashed, and users can't interact with the UI.
First, SSH into the Leader node (this is where the promote_app service runs):
ssh user@IP_address_of_leader
Confirm you're on the Leader node by running the following command:
docker node ls
In the results, confirm that the * (which marks the node you're currently connected to) and the Leader tag appear on the same line:
ID                          HOSTNAME         STATUS   AVAILABILITY   MANAGER STATUS
6iscs5yrjgo1qa8ty1gk1otp7   den-cs-cent-01   Ready    Active         Reachable
dzys60hss1jgcfy13qu3lbgxs * den-cs-cent-02   Ready    Active         Leader
qx4cu0k4hnfjdppn5739doizj   den-cs-cent-03   Ready    Active         Reachable
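If you'd rather not eyeball the table, you can also ask Docker directly whether the node you're on is the Leader (a quick sketch; this works on manager nodes, and the template path assumes a reasonably recent Docker version):
docker node inspect self --format '{{ .ManagerStatus.Leader }}'
This prints true on the Leader node and false on other managers.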
Now, check the status of the services by running this command:
docker service ls
The results of the command show us that there are 0/1 promote_app replicas running!
ID             NAME             MODE         REPLICAS   IMAGE                             PORTS
md7is7ptrduj   promote_app      replicated   0/1        quay.io/yhat/promote-app:latest   *:3001->3001/tcp
t5rdsnnfjava   promote_consul   replicated   1/1        quay.io/yhat/consul:latest        *:8500->8500/tcp
6bi0mwz1v8bn   promote_db       replicated   1/1        quay.io/yhat/promote-db:latest    *:5432->5432/tcp
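If your deployment runs many services, a quick shell-only way to surface anything with zero running replicas (assuming the default column layout, where REPLICAS is the fourth column) is:
docker service ls | awk 'NR>1 && $4 ~ /^0\//'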
Next, see if you can view the logs of the promote_app service:
docker service logs promote_app
In this example, the command comes back blank because logs from crashed services are trimmed after a period of time.
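Depending on your Docker version, you can also ask for a wider time window with the --since flag, though entries from tasks that crashed long ago may already be gone:
docker service logs promote_app --since 24h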
As a next step, run a command to list the tasks of the promote_app service:
docker service ps promote_app
This shows us that the service tried to start several times and crashed each time:
ID             NAME               IMAGE                             NODE                   DESIRED STATE   CURRENT STATE        ERROR
j23ds8741rjm   promote_app.1      quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 hours ago   "task: non-zero exit (1)"
sohhecv8jro3   \_ promote_app.1   quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 hours ago   "task: non-zero exit (1)"
u017otkjuupf   \_ promote_app.1   quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 hours ago   "task: non-zero exit (1)"
ncjcpb5dqakp   \_ promote_app.1   quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 days ago    "task: non-zero exit (1)"
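If the ERROR column is cut off, the same command with the --no-trunc flag shows the full task IDs and error messages:
docker service ps promote_app --no-trunc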
Now run the following command to force the service to update; this spawns a new container and gives you fresh logs to tail:
docker service update promote_app --detach=false --force
promote_app
overall progress: 0 out of 1 tasks
1/1: running   [=======================================>          ]
verify: Service failed to converge
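While the update runs, you can watch the restart attempts from a second terminal (using the standard watch utility, available on most Linux hosts):
watch -n 2 docker service ps promote_app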
In this case, the results of the following command show that the app is failing to start because of a database password error:
docker service logs promote_app --tail 300
Results:
promote_app.1.q7n@machine-01 | wait-for-it-linux.sh: waiting 15 seconds for promote_db:5432
promote_app.1.gnj@machine-01 | wait-for-it-linux.sh: waiting 15 seconds for promote_db:5432
promote_app.1.gnj@machine-01 | wait-for-it-linux.sh: promote_db:5432 is available after 0 seconds
promote_app.1.q7n@machine-01 | wait-for-it-linux.sh: promote_db:5432 is available after 0 seconds
promote_app.1.q7n@machine-01 |
promote_app.1.gnj@machine-01 |
promote_app.1.gnj@machine-01 | > promote-app@4.1.0 db:migrate:latest /promote-app
promote_app.1.q7n@machine-01 | > promote-app@4.1.0 db:migrate:latest /promote-app
promote_app.1.q7n@machine-01 | > knex migrate:latest
promote_app.1.gnj@machine-01 | > knex migrate:latest
promote_app.1.q7n@machine-01 |
promote_app.1.gnj@machine-01 |
promote_app.1.gnj@machine-01 | Using environment: production
promote_app.1.q7n@machine-01 | Using environment: production
promote_app.1.gnj@machine-01 | error: password authentication failed for user "promote"
promote_app.1.gnj@machine-01 | at Connection.parseE (/promote-app/node_modules/pg/lib/connection.js:545:11)
promote_app.1.q7n@machine-01 | error: password authentication failed for user "promote"
The specific error: password authentication failed for user "promote"
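As an aside, in a noisier log you can pipe the output through grep to surface the failure faster (2>&1 merges in stderr, where some messages land):
docker service logs promote_app --tail 300 2>&1 | grep -i error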
Aha! For this, we will need to connect to the Promote PostgreSQL database. This lives on the Leader node (the same node as the web app).
To connect, first find the ID of the promote_db container:
docker ps | grep promote-db
8737823734
Now, copy the database password:
cat /var/promote/credentials/db.txt
b0e9eaa549ecghety7904f04314a6a7e
Connect to the container and log in to the database:
docker exec -it 8737823734 bash
root@8011dc63b950:/# psql postgres://promote:b0e9eaa549ecghety7904f04314a6a7e@promote_db/promote
psql (9.6.7)
Type "help" for help.
promote=# \dt
List of relations
Schema | Name | Type | Owner
--------+-----------------------------+-------+---------
public | migrations | table | promote
public | migrations_lock | table | promote
public | model_environment_variables | table | promote
public | model_versions | table | promote
public | models | table | promote
public | system_settings | table | promote
public | users | table | promote
(7 rows)
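If you prefer to do all of this in one step, here is a convenience sketch; it assumes the container name contains the service name promote_db, which is Swarm's default naming scheme:
DB_PASS=$(cat /var/promote/credentials/db.txt)
DB_CONTAINER=$(docker ps -qf "name=promote_db")
docker exec -it "$DB_CONTAINER" psql "postgres://promote:${DB_PASS}@promote_db/promote"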
If you can connect and list the tables, the password authenticates correctly for the promote user and the PostgreSQL database. If authentication still fails, run:
promote.stop
Then, on the node that was the Leader when the swarm cluster was originally created, run:
sudo promote.start
This should then fix the problem.
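Once the services are back up, confirm the app replica is running again:
docker service ls
The REPLICAS column for promote_app should now read 1/1.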
To back up or restore your database, please see part 4 of this series. Keep calm and Promote on!
Hey guys,
Great post ;-)
Just wanted to share this experience with Promote - I spent tons of time on it today, probably because of my inexperience.
While testing my Promote installation, models published from pretty much anywhere (RStudio, Designer, …) were successfully built but always stayed offline.
When I published the same models to a different Promote server, they were all fine.
On my Promote server, all services looked good over SSH when running all kinds of status commands.
In the end, simply running promote.stop and then sudo promote.start fixed it.
I guess the lesson is: don't always trust the status commands.