Welcome to part 3 of the Supporting Promote series. In this series, we will tackle some common issues and questions, and provide best practices for troubleshooting. This article will step through the process of restoring the Promote web app.
The Promote web app has crashed, and users can't interact with the UI.
First, SSH into the Leader node (this is where the promote_app service runs):
ssh user@IP_address_of_leader
Confirm you're on the Leader node by running the following command:
docker node ls
In the results, confirm that the * (which marks the node you're currently connected to) and the Leader tag appear on the same line:
ID                          HOSTNAME         STATUS   AVAILABILITY   MANAGER STATUS
6iscs5yrjgo1qa8ty1gk1otp7   den-cs-cent-01   Ready    Active         Reachable
dzys60hss1jgcfy13qu3lbgxs * den-cs-cent-02   Ready    Active         Leader
qx4cu0k4hnfjdppn5739doizj   den-cs-cent-03   Ready    Active         Reachable
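If you'd rather not eyeball the table, you can also ask Docker directly whether the node you're on is the Leader (a quick sketch; this works on manager nodes, and the template path assumes a reasonably recent Docker version):
docker node inspect self --format '{{ .ManagerStatus.Leader }}'
This prints true on the Leader node and false on other managers.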
Now, check the status of the services by running this command:
docker service ls
The results of the command show us that there are 0/1 promote_app replicas running!
ID             NAME             MODE         REPLICAS   IMAGE                             PORTS
md7is7ptrduj   promote_app      replicated   0/1        quay.io/yhat/promote-app:latest   *:3001->3001/tcp
t5rdsnnfjava   promote_consul   replicated   1/1        quay.io/yhat/consul:latest        *:8500->8500/tcp
6bi0mwz1v8bn   promote_db       replicated   1/1        quay.io/yhat/promote-db:latest    *:5432->5432/tcp
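If your deployment runs many services, a quick shell-only way to surface anything with zero running replicas (assuming the default column layout, where REPLICAS is the fourth column) is:
docker service ls | awk 'NR>1 && $4 ~ /^0\//'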
Next, see if you can view the logs of the promote_app service:
docker service logs promote_app
In this example, the command comes back blank because logs from crashed services are trimmed after a period of time.
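Depending on your Docker version, you can also ask for a wider time window with the --since flag, though entries from tasks that crashed long ago may already be gone:
docker service logs promote_app --since 24h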
As a next step, run a command to list the tasks of the promote_app service:
docker service ps promote_app
This shows us that the service tried to start several times and crashed each time:
ID             NAME               IMAGE                             NODE                   DESIRED STATE   CURRENT STATE        ERROR
j23ds8741rjm   promote_app.1      quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 hours ago   "task: non-zero exit (1)"
sohhecv8jro3   \_ promote_app.1   quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 hours ago   "task: non-zero exit (1)"
u017otkjuupf   \_ promote_app.1   quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 hours ago   "task: non-zero exit (1)"
ncjcpb5dqakp   \_ promote_app.1   quay.io/yhat/promote-app:latest   ip-172-198.us-west-2   Shutdown        Failed 2 days ago    "task: non-zero exit (1)"
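If the ERROR column is cut off, the same command with the --no-trunc flag shows the full task IDs and error messages:
docker service ps promote_app --no-trunc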
Now run the following command to force the service to update; this spawns a new container and gives you fresh logs to tail:
docker service update promote_app --detach=false --force
promote_app
overall progress: 0 out of 1 tasks
1/1: running   [=======================================>          ]
verify: Service failed to converge
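While the update runs, you can watch the restart attempts from a second terminal (using the standard watch utility, available on most Linux hosts):
watch -n 2 docker service ps promote_app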
In this case, the results of the following command show that the app is failing to start because of a database password error:
docker service logs promote_app --tail 300
Results:
promote_app.1.q7n@machine-01 | wait-for-it-linux.sh: waiting 15 seconds for promote_db:5432
promote_app.1.gnj@machine-01 | wait-for-it-linux.sh: waiting 15 seconds for promote_db:5432
promote_app.1.gnj@machine-01 | wait-for-it-linux.sh: promote_db:5432 is available after 0 seconds
promote_app.1.q7n@machine-01 | wait-for-it-linux.sh: promote_db:5432 is available after 0 seconds
promote_app.1.q7n@machine-01 |
promote_app.1.gnj@machine-01 |
promote_app.1.gnj@machine-01 | > promote-app@4.1.0 db:migrate:latest /promote-app
promote_app.1.q7n@machine-01 | > promote-app@4.1.0 db:migrate:latest /promote-app
promote_app.1.q7n@machine-01 | > knex migrate:latest
promote_app.1.gnj@machine-01 | > knex migrate:latest
promote_app.1.q7n@machine-01 |
promote_app.1.gnj@machine-01 |
promote_app.1.gnj@machine-01 | Using environment: production
promote_app.1.q7n@machine-01 | Using environment: production
promote_app.1.gnj@machine-01 | error: password authentication failed for user "promote"
promote_app.1.gnj@machine-01 | at Connection.parseE (/promote-app/node_modules/pg/lib/connection.js:545:11)
promote_app.1.q7n@machine-01 | error: password authentication failed for user "promote"
The specific error: password authentication failed for user "promote"
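As an aside, in a noisier log you can pipe the output through grep to surface the failure faster (2>&1 merges in stderr, where some messages land):
docker service logs promote_app --tail 300 2>&1 | grep -i error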
Aha! For this, we will need to connect to the Promote PostgreSQL database. This lives on the Leader node (the same node as the web app).
To connect, first find the ID of the promote_db container:
docker ps | grep promote-db
8737823734
Now, copy the database password:
cat /var/promote/credentials/db.txt
b0e9eaa549ecghety7904f04314a6a7e
Connect to the container and log in to the database:
docker exec -it 8737823734 bash
root@8011dc63b950:/# psql postgres://promote:b0e9eaa549ecghety7904f04314a6a7e@promote_db/promote
psql (9.6.7)
Type "help" for help.
promote=# \dt
List of relations
Schema | Name | Type | Owner
--------+-----------------------------+-------+---------
public | migrations | table | promote
public | migrations_lock | table | promote
public | model_environment_variables | table | promote
public | model_versions | table | promote
public | models | table | promote
public | system_settings | table | promote
public | users | table | promote
(7 rows)
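If you prefer to do all of this in one step, here is a convenience sketch; it assumes the container name contains the service name promote_db, which is Swarm's default naming scheme:
DB_PASS=$(cat /var/promote/credentials/db.txt)
DB_CONTAINER=$(docker ps -qf "name=promote_db")
docker exec -it "$DB_CONTAINER" psql "postgres://promote:${DB_PASS}@promote_db/promote"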
If you can connect and list the tables, the password authenticates correctly for the promote user and the PostgreSQL database. If authentication still fails, run:
promote.stop
Then, on the node that was the Leader when the swarm cluster was originally created, run:
sudo promote.start
This should then fix the problem.
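Once the services are back up, confirm the app replica is running again:
docker service ls
The REPLICAS column for promote_app should now read 1/1.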
To back up or restore your database, please see part 4 of this series. Keep calm and Promote on!
Hey guys,
Great post ;-)
Just wanted to share this experience with Promote - I spent tons of time on it today, probably because of my inexperience.
While testing my Promote installation, models published from pretty much anywhere (RStudio, Designer, …) were successfully built but always stayed offline.
When I published the same models to a different Promote server, they were all fine.
On my Promote server, all services looked good over SSH when running all kinds of status commands.
In the end, simply running promote.stop and then sudo promote.start fixed it.
I guess the lesson is: don't always trust the status commands.