psql: FATAL: the database system is starting up

Started by Tom K · almost 7 years ago · 37 messages · general
#1 Tom K
tomkcpr@gmail.com

Hey Guys,

I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
RHEL 7.6 VMs. Every now and then the underlying storage crashes, taking
out the cluster. On recovery, PostgreSQL tends to come up while other
databases just blow up. That is until now.

After the most recent crash 2-3 weeks ago, the cluster is now running into
this message but I'm not able to make heads or tails out of why it's
throwing this:

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
2019-05-23 08:26:34.561 EDT [10101] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:34.565 EDT [10101] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:34.591 EDT [10101] LOG: redirecting log output to logging collector process
2019-05-23 08:26:34.591 EDT [10101] HINT: Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.108 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
2019-05-23 08:26:45.707 EDT [10105] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:45.715 EDT [10105] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:45.740 EDT [10105] LOG: redirecting log output to logging collector process
2019-05-23 08:26:45.740 EDT [10105] HINT: Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.1 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
2019-05-23 08:29:36.014 EDT [10188] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:29:36.017 EDT [10188] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:29:36.040 EDT [10188] LOG: redirecting log output to logging collector process
2019-05-23 08:29:36.040 EDT [10188] HINT: Future log output will appear in directory "log".

The database logs only print out this:

2019-05-23 08:25:44.476 EDT [10088] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:25:44.477 EDT [10088] LOG: entering standby mode
2019-05-23 08:25:50.209 EDT [10086] LOG: received fast shutdown request
2019-05-23 08:25:50.215 EDT [10089] LOG: shutting down
2019-05-23 08:25:50.221 EDT [10086] LOG: database system is shut down
2019-05-23 08:26:11.560 EDT [10094] LOG: database system was shut down in recovery at 2019-05-23 08:25:50 EDT
2019-05-23 08:26:11.560 EDT [10094] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:11.560 EDT [10094] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:11.562 EDT [10094] LOG: entering standby mode
2019-05-23 08:26:17.817 EDT [10092] LOG: received fast shutdown request
2019-05-23 08:26:17.824 EDT [10095] LOG: shutting down
2019-05-23 08:26:17.831 EDT [10092] LOG: database system is shut down
2019-05-23 08:26:34.596 EDT [10103] LOG: database system was shut down in recovery at 2019-05-23 08:26:17 EDT
2019-05-23 08:26:34.596 EDT [10103] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:34.596 EDT [10103] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:34.598 EDT [10103] LOG: entering standby mode
2019-05-23 08:26:43.025 EDT [10101] LOG: received fast shutdown request
2019-05-23 08:26:43.032 EDT [10104] LOG: shutting down
2019-05-23 08:26:43.038 EDT [10101] LOG: database system is shut down
2019-05-23 08:26:45.748 EDT [10107] LOG: database system was shut down in recovery at 2019-05-23 08:26:43 EDT
2019-05-23 08:26:45.748 EDT [10107] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:45.748 EDT [10107] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.

and when connecting, I get this cryptic message:

[root@psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL: the database system is starting up
[root@psql01 log]#

There's no space issue here. The cluster stays in recovery. This state
doesn't change and the DB is inaccessible.
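
For anyone wondering how to see what state the cluster thinks it is in
without being able to connect: pg_controldata still works while the server
is stuck like this. A node in this condition should report "in archive
recovery" on its "Database cluster state" line:

-bash-4.2$ /usr/pgsql-10/bin/pg_controldata /data/patroni | grep 'cluster state'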

So I ran the above in debug mode ( -d 5 ) and got this:

DEBUG: mmap(148897792) with MAP_HUGETLB failed, huge pages disabled:
Cannot allocate memory

so I freed memory on the underlying physical host this VM sits on, but no
luck. Same memory error and same result as above.
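
From what I can tell, that mmap DEBUG line on its own is harmless: with the
default huge_pages = try, postgres simply falls back to regular pages when
the MAP_HUGETLB mmap fails. Turning huge pages off in postgresql.conf would
silence the message, but it shouldn't be what's blocking startup:

-bash-4.2$ echo 'huge_pages = off' >> /data/patroni/postgresql.conf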

When patroni is started on just this one node (troubleshooting purposes) I
get the message:

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root 2217 2188 0 00:38 pts/1 00:00:00 tail -f postgresql-Thu.log
postgres 2512 1 4 00:42 ? 00:00:01 /usr/bin/python2
/bin/patroni /etc/patroni.yml
postgres 2533 1 0 00:42 ? 00:00:00
/usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.108 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
postgres 2535 2533 0 00:42 ? 00:00:00 postgres: postgres:
logger process
postgres 2536 2533 0 00:42 ? 00:00:00 postgres: postgres:
startup process waiting for 000000010000000000000008
root 2664 2039 0 00:42 pts/0 00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

The system prints "postgres: postgres: startup process waiting for
000000010000000000000008"

Seems it's trying to replay the history log but can't? Running an strace on
the file reveals that it's looking for the following:

read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC) = 3

But this file doesn't exist in the directory on this particular cluster
node (psql01). I copy the missing history files off another cluster node
to this one and it does move forward but just gets stuck on another
history file. But same error message as above on psql01.
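
For reference, the copy was along these lines, with the file name changing
each time it stalled on the next timeline:

[root@psql02 ~]# scp /data/patroni/pg_wal/00000098.history psql01:/data/patroni/pg_wal/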

So I give up and try to recover the secondary node (psql02). But this
yields this error:

PANIC: replication checkpoint has wrong magic 0 instead of 307747550

and I'm not able to find very much to move me forward on this one either.

So again I give up after a few hours of trying to recover psql02 and move
on to the third node. On the third cluster node (psql03) I get this error
which I'm also not able to find much info on:

FATAL: syntax error in history file: f2W

Hoping I'm able to recover the data from at least one of the nodes, make it
primary, and get it to replicate to the others.

Thx, TK

#2 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#1)
Re: psql: FATAL: the database system is starting up

On 5/27/19 9:59 PM, Tom K wrote:

Hey Guys,

I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
RHEL 7.6 VMs.  Every now and then the underlying storage crashes, taking
out the cluster.  On recovery, PostgreSQL tends to come up while other
databases just blow up.  That is until now.

The above is one master and two standby servers connected via streaming
replication, correct?

Is there WAL archiving set up?

After the most recent crash 2-3 weeks ago, the cluster is now running
into this message but I'm not able to make heads or tails out of why
it's throwing this:

So you have not been able to run the cluster the past 2-3 weeks or is
that more recent?

When you refer to history files below are you talking about WAL files or
something else?

Is this:

"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"

true?

-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
2019-05-23 08:26:34.561 EDT [10101] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:34.565 EDT [10101] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:34.591 EDT [10101] LOG: redirecting log output to logging collector process
2019-05-23 08:26:34.591 EDT [10101] HINT: Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.108 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
2019-05-23 08:26:45.707 EDT [10105] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:45.715 EDT [10105] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:45.740 EDT [10105] LOG: redirecting log output to logging collector process
2019-05-23 08:26:45.740 EDT [10105] HINT: Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.1 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
2019-05-23 08:29:36.014 EDT [10188] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:29:36.017 EDT [10188] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:29:36.040 EDT [10188] LOG: redirecting log output to logging collector process
2019-05-23 08:29:36.040 EDT [10188] HINT: Future log output will appear in directory "log".

The database logs only print out this:

2019-05-23 08:25:44.476 EDT [10088] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:25:44.477 EDT [10088] LOG: entering standby mode
2019-05-23 08:25:50.209 EDT [10086] LOG: received fast shutdown request
2019-05-23 08:25:50.215 EDT [10089] LOG: shutting down
2019-05-23 08:25:50.221 EDT [10086] LOG: database system is shut down
2019-05-23 08:26:11.560 EDT [10094] LOG: database system was shut down in recovery at 2019-05-23 08:25:50 EDT
2019-05-23 08:26:11.560 EDT [10094] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:11.560 EDT [10094] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:11.562 EDT [10094] LOG: entering standby mode
2019-05-23 08:26:17.817 EDT [10092] LOG: received fast shutdown request
2019-05-23 08:26:17.824 EDT [10095] LOG: shutting down
2019-05-23 08:26:17.831 EDT [10092] LOG: database system is shut down
2019-05-23 08:26:34.596 EDT [10103] LOG: database system was shut down in recovery at 2019-05-23 08:26:17 EDT
2019-05-23 08:26:34.596 EDT [10103] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:34.596 EDT [10103] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:34.598 EDT [10103] LOG: entering standby mode
2019-05-23 08:26:43.025 EDT [10101] LOG: received fast shutdown request
2019-05-23 08:26:43.032 EDT [10104] LOG: shutting down
2019-05-23 08:26:43.038 EDT [10101] LOG: database system is shut down
2019-05-23 08:26:45.748 EDT [10107] LOG: database system was shut down in recovery at 2019-05-23 08:26:43 EDT
2019-05-23 08:26:45.748 EDT [10107] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:45.748 EDT [10107] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.

and when connecting, I get this cryptic message:

[root@psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL: the database system is starting up
[root@psql01 log]#

There's no space issue here.  The cluster stays in recovery. This state
doesn't change and the DB is inaccessible.

So I ran the above in debug mode ( -d 5 ) and got this:

DEBUG: mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory

so I freed memory on the underlying physical host this VM sits on but no
luck. Same memory error and same result as above.

When patroni is started on just this one node (troubleshooting purposes)
I get the message:

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root 2217 2188 0 00:38 pts/1 00:00:00 tail -f postgresql-Thu.log
postgres 2512 1 4 00:42 ? 00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres 2533 1 0 00:42 ? 00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres 2535 2533 0 00:42 ? 00:00:00 postgres: postgres: logger process
postgres 2536 2533 0 00:42 ? 00:00:00 postgres: postgres: startup process waiting for 000000010000000000000008
root 2664 2039 0 00:42 pts/0 00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

The system prints "postgres: postgres: startup process waiting for
000000010000000000000008"

Seems it's trying to replay the history log but can't? Running an strace
on the file reveals that it's looking for the following:

read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC) = 3

But this file doesn't exist in the directory on this particular cluster
node (psql01).  I copy the missing history files off another cluster
node to this one and it does move forward but just gets stuck on
another history file.  But same error message as above on psql01.

So I give up and try to recover the secondary node (psql02).  But this
yields this error:

PANIC: replication checkpoint has wrong magic 0 instead of 307747550

and I'm not able to find very much to move me forward on this one either.

So again I give up after a few hours of trying to recover psql02 and
move on to the third node.  On the third cluster node (psql03) I get
this error which I'm also not able to find much info on:

FATAL: syntax error in history file: f2W

Hoping I'm able to recover the data from at least one of the nodes, make
it primary, and get it to replicate to the others.

Thx, TK

--
Adrian Klaver
adrian.klaver@aklaver.com

#3 Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#2)
Re: psql: FATAL: the database system is starting up

On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 5/27/19 9:59 PM, Tom K wrote:

Hey Guys,

I'm running Patroni w/ PostgreSQL 10, ETCD, Haproxy and Keepalived on 3
RHEL 7.6 VMs. Every now and then the underlying storage crashes, taking
out the cluster. On recovery, PostgreSQL tends to come up while other
databases just blow up. That is until now.

The above is one master and two standby servers connected via streaming
replication, correct?

Is there WAL archiving set up?

Correct. Master election occurs through Patroni. WAL level is set to:

wal_level = 'replica'

So no archiving.

After the most recent crash 2-3 weeks ago, the cluster is now running
into this message but I'm not able to make heads or tails out of why
it's throwing this:

So you have not been able to run the cluster the past 2-3 weeks or is
that more recent?

Haven't been able to bring this PostgreSQL cluster up ( run the cluster )
since 2-3 weeks ago. Tried quite a few combinations of options to recover
this. No luck. Had storage failures earlier, even with corrupted OS
files, but this PostgreSQL cluster w/ Patroni was able to come up each time
without any recovery effort on my part.

When you refer to history files below are you talking about WAL files or
something else?

Is this:

"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"

true?

True. recovery.conf is controlled by Patroni. Contents of this file
remained the same for all the cluster nodes with the exception of the
primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#

I've made a copy of the root postgres directory over to another location so
when troubleshooting, I can always revert to the first state the cluster
was in when it failed.

Thx,
TK


-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
2019-05-23 08:26:34.561 EDT [10101] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:34.565 EDT [10101] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:34.591 EDT [10101] LOG: redirecting log output to logging collector process
2019-05-23 08:26:34.591 EDT [10101] HINT: Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.108 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
2019-05-23 08:26:45.707 EDT [10105] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:26:45.715 EDT [10105] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:26:45.740 EDT [10105] LOG: redirecting log output to logging collector process
2019-05-23 08:26:45.740 EDT [10105] HINT: Future log output will appear in directory "log".
^C
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf --hot_standby=on
--listen_addresses=192.168.0.1 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100
2019-05-23 08:29:36.014 EDT [10188] LOG: listening on IPv4 address "192.168.0.108", port 5432
2019-05-23 08:29:36.017 EDT [10188] LOG: listening on Unix socket "./.s.PGSQL.5432"
2019-05-23 08:29:36.040 EDT [10188] LOG: redirecting log output to logging collector process
2019-05-23 08:29:36.040 EDT [10188] HINT: Future log output will appear in directory "log".

The database logs only print out this:

2019-05-23 08:25:44.476 EDT [10088] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:25:44.477 EDT [10088] LOG: entering standby mode
2019-05-23 08:25:50.209 EDT [10086] LOG: received fast shutdown request
2019-05-23 08:25:50.215 EDT [10089] LOG: shutting down
2019-05-23 08:25:50.221 EDT [10086] LOG: database system is shut down
2019-05-23 08:26:11.560 EDT [10094] LOG: database system was shut down in recovery at 2019-05-23 08:25:50 EDT
2019-05-23 08:26:11.560 EDT [10094] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:11.560 EDT [10094] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:11.562 EDT [10094] LOG: entering standby mode
2019-05-23 08:26:17.817 EDT [10092] LOG: received fast shutdown request
2019-05-23 08:26:17.824 EDT [10095] LOG: shutting down
2019-05-23 08:26:17.831 EDT [10092] LOG: database system is shut down
2019-05-23 08:26:34.596 EDT [10103] LOG: database system was shut down in recovery at 2019-05-23 08:26:17 EDT
2019-05-23 08:26:34.596 EDT [10103] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:34.596 EDT [10103] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-05-23 08:26:34.598 EDT [10103] LOG: entering standby mode
2019-05-23 08:26:43.025 EDT [10101] LOG: received fast shutdown request
2019-05-23 08:26:43.032 EDT [10104] LOG: shutting down
2019-05-23 08:26:43.038 EDT [10101] LOG: database system is shut down
2019-05-23 08:26:45.748 EDT [10107] LOG: database system was shut down in recovery at 2019-05-23 08:26:43 EDT
2019-05-23 08:26:45.748 EDT [10107] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-05-23 08:26:45.748 EDT [10107] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.

and when connecting, I get this cryptic message:

[root@psql01 log]# psql -U postgres -h 192.168.0.108
psql: FATAL: the database system is starting up
[root@psql01 log]#

There's no space issue here. The cluster stays in recovery. This state
doesn't change and the DB is inaccessible.

So I ran the above in debug mode ( -d 5 ) and got this:

DEBUG: mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory

so I freed memory on the underlying physical host this VM sits on but no
luck. Same memory error and same result as above.

When patroni is started on just this one node (troubleshooting purposes)
I get the message:

[root@psql01 ~]# ps -ef|grep -Ei "patroni|postgres"
root 2217 2188 0 00:38 pts/1 00:00:00 tail -f postgresql-Thu.log
postgres 2512 1 4 00:42 ? 00:00:01 /usr/bin/python2 /bin/patroni /etc/patroni.yml
postgres 2533 1 0 00:42 ? 00:00:00 /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf --hot_standby=on --listen_addresses=192.168.0.108 --max_worker_processes=8 --max_locks_per_transaction=64 --wal_level=replica --cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 --track_commit_timestamp=off --max_prepared_transactions=0 --port=5432 --max_replication_slots=10 --max_connections=100
postgres 2535 2533 0 00:42 ? 00:00:00 postgres: postgres: logger process
postgres 2536 2533 0 00:42 ? 00:00:00 postgres: postgres: startup process waiting for 000000010000000000000008
root 2664 2039 0 00:42 pts/0 00:00:00 grep --color=auto -Ei patroni|postgres
[root@psql01 ~]#

The system prints "postgres: postgres: startup process waiting for
000000010000000000000008"

Seems it's trying to replay the history log but can't? Running an strace
on the file reveals that it's looking for the following:

read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
read(5, 0x7fff9cb4eb87, 1) = -1 EAGAIN (Resource temporarily unavailable)
open("pg_wal/00000098.history", O_RDONLY) = -1 ENOENT (No such file or directory)
epoll_create1(EPOLL_CLOEXEC) = 3

But this file doesn't exist in the directory on this particular cluster
node (psql01). I copy the missing history files off another cluster
node to this one and it does move forward but just gets stuck on
another history file. But same error message as above on psql01.

So I give up and try to recover the secondary node (psql02). But this
yields this error:

PANIC: replication checkpoint has wrong magic 0 instead of 307747550

and I'm not able to find very much to move me forward on this one either.

So again I give up after a few hours of trying to recover psql02 and
move on to the third node. On the third cluster node (psql03) I get
this error which I'm also not able to find much info on:

FATAL: syntax error in history file: f2W

Hoping I'm able to recover the data from at least one of the nodes, make
it primary, and get it to replicate to the others.

Thx, TK

--
Adrian Klaver
adrian.klaver@aklaver.com

#4 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#3)
Re: psql: FATAL: the database system is starting up

On 5/28/19 6:59 PM, Tom K wrote:

On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Correct.  Master election occurs through Patroni.  WAL level is set to:

wal_level = 'replica'

So no archiving.

After the most recent crash 2-3 weeks ago, the cluster is now

running

into this message but I'm not able to make heads or tails out of why
it's throwing this:

So you have not been able to run the cluster the past 2-3 weeks or is
that  more recent?

Haven't been able to bring this PostgreSQL cluster up ( run the cluster
) since 2-3 weeks ago.  Tried quite a few combinations of options to
recover this.  No luck.  Had storage failures earlier, even with
corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
come up each time without any recovery effort on my part.

When you refer to history files below are you talking about WAL
files or
something else?

Is this:

"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"

true?

True. recovery.conf is controlled by Patroni.  Contents of this file
remained the same for all the cluster nodes with the exception of the
primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#

I've made a copy of the root postgres directory over to another location
so when troubleshooting, I can always revert to the first state the
cluster was in when it failed.

I have no experience with Patroni so I will be of no help there. You
might get more useful information from:

https://github.com/zalando/patroni
Community

There are two places to connect with the Patroni community: on github,
via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
you're using Patroni, or just interested, please join us.

That being said, can you start the copied Postgres instance without
using the Patroni instrumentation?

Thx,
TK

--
Adrian Klaver
adrian.klaver@aklaver.com

#5 Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#4)
Re: psql: FATAL: the database system is starting up

On Wed, May 29, 2019 at 10:28 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 5/28/19 6:59 PM, Tom K wrote:

On Tue, May 28, 2019 at 9:53 AM Adrian Klaver <adrian.klaver@aklaver.com> wrote:

Correct. Master election occurs through Patroni. WAL level is set to:

wal_level = 'replica'

So no archiving.

After the most recent crash 2-3 weeks ago, the cluster is now

running

into this message but I'm not able to make heads or tails out of

why

it's throwing this:

So you have not been able to run the cluster the past 2-3 weeks or is
that more recent?

Haven't been able to bring this PostgreSQL cluster up ( run the cluster
) since 2-3 weeks ago. Tried quite a few combinations of options to
recover this. No luck. Had storage failures earlier, even with
corrupted OS files, but this PostgreSQL cluster w/ Patroni was able to
come up each time without any recovery effort on my part.

When you refer to history files below are you talking about WAL
files or
something else?

Is this:

"recovery command file "recovery.conf" specified neither
primary_conninfo nor restore_command"

true?

True. recovery.conf is controlled by Patroni. Contents of this file
remained the same for all the cluster nodes with the exception of the
primary_slot_name:

[root@psql01 postgresql-patroni-etcd]# cat recovery.conf
primary_slot_name = 'postgresql0'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql01 postgresql-patroni-etcd]#

[root@psql02 postgres-backup]# cat recovery.conf
primary_slot_name = 'postgresql1'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql02 postgres-backup]#

[root@psql03 postgresql-patroni-backup]# cat recovery.conf
primary_slot_name = 'postgresql2'
standby_mode = 'on'
recovery_target_timeline = 'latest'
[root@psql03 postgresql-patroni-backup]#

I've made a copy of the root postgres directory over to another location
so when troubleshooting, I can always revert to the first state the
cluster was in when it failed.

I have no experience with Patroni so I will be of no help there. You
might get more useful information from:

https://github.com/zalando/patroni
Community

There are two places to connect with the Patroni community: on github,
via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
you're using Patroni, or just interested, please join us.

Will post there as well. Thank you. My thinking was to post here first
since I suspect the Patroni community will simply refer me back here given
that the PostgreSQL errors are originating directly from PostgreSQL.

That being said, can you start the copied Postgres instance without
using the Patroni instrumentation?

Yes, that is something I have been trying to do actually. But I hit a dead
end with the three errors above.

So what I did is copy a single node's backed-up data files to
*/data/patroni* on the same node ( this is the psql data directory as
defined through patroni ), then ran this ( psql03 = 192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

This resulted in one of the 3 messages above. Hence the post here. If I
can start a single instance, I should be fine since I could then 1)
replicate over to the other two or 2) simply take a dump, reinitialize all
the databases then restore the dump.
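
For option 2, the rough sequence would be something like this once any one
copy starts ( paths and the new data directory here are just placeholders ):

$ /usr/pgsql-10/bin/pg_dumpall -h 192.168.0.118 -U postgres -f /tmp/all-databases.sql
$ /usr/pgsql-10/bin/initdb -D /data/patroni-new
$ # start postgres on /data/patroni-new, then:
$ /usr/pgsql-10/bin/psql -U postgres -f /tmp/all-databases.sql postgres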

Using the above procedure I get one of three error messages when using the
data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.


Thx,
TK

--
Adrian Klaver
adrian.klaver@aklaver.com

#6 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#5)
Re: psql: FATAL: the database system is starting up

On 5/31/19 7:53 PM, Tom K wrote:

There are two places to connect with the Patroni community: on github,
via Issues and PRs, and on channel #patroni in the PostgreSQL Slack. If
you're using Patroni, or just interested, please join us.

Will post there as well.  Thank you.  My thinking was to post here first
since I suspect the Patroni community will simply refer me back here
given that the PostgreSQL errors are originating directly from PostgreSQL.

That being said, can you start the copied Postgres instance without
using the Patroni instrumentation?

Yes, that is something I have been trying to do actually.  But I hit a
dead end with the three errors above.

So what I did is copy a single node's backed-up data files to
*/data/patroni* on the same node ( this is the psql data directory as
defined through patroni ), then ran this ( psql03 = 192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

Why all the options?
That should be covered in postgresql.conf, no?

This resulted in one of the 3 messages above.  Hence the post here.  If
I can start a single instance, I should be fine since I could then 1)
replicate over to the other two or 2) simply take a dump, reinitialize
all the databases then restore the dump.

What if you move the recovery.conf file out?
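
i.e. something like this, with Patroni stopped so it does not put the file
back:

$ mv /data/patroni/recovery.conf /data/patroni/recovery.conf.orig
$ /usr/pgsql-10/bin/postgres -D /data/patroni --config-file=/data/patroni/postgresql.conf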

The below looks like missing/corrupted/incorrect files. Hard to tell
without knowing what Patroni did?
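
One observation on the wrong-magic PANIC: 307747550 is 0x1257DADE, the
magic number Postgres expects at the start of
pg_logical/replorigin_checkpoint, so a magic of 0 suggests that particular
file was zeroed in the crash. The server treats a missing checkpoint file
as a clean slate, so with a backup of the data directory in hand, moving it
aside might be worth a try:

$ mv /data/patroni/pg_logical/replorigin_checkpoint /data/patroni/pg_logical/replorigin_checkpoint.bad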

Using the above procedure I get one of three error messages when using
the data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.

Thx,
TK

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Adrian Klaver
adrian.klaver@aklaver.com

#7 Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#6)
Re: psql: FATAL: the database system is starting up

On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 5/31/19 7:53 PM, Tom K wrote:

There are two places to connect with the Patroni community: on

github,

via Issues and PRs, and on channel #patroni in the PostgreSQL Slack.

If

you're using Patroni, or just interested, please join us.

Will post there as well. Thank you. My thinking was to post here first
since I suspect the Patroni community will simply refer me back here
given that the PostgreSQL errors are originating directly from

PostgreSQL.

That being said, can you start the copied Postgres instance without
using the Patroni instrumentation?

Yes, that is something I have been trying to do actually. But I hit a
dead end with the three errors above.

So what I did is copy a single node's backed-up data files to
*/data/patroni* on the same node ( this is the psql data directory as
defined through patroni ), then ran this ( psql03 = 192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

Why all the options?
That should be covered in postgresql.conf, no?

This resulted in one of the 3 messages above. Hence the post here. If
I can start a single instance, I should be fine since I could then 1)
replicate over to the other two or 2) simply take a dump, reinitialize
all the databases then restore the dump.

What if you move the recovery.conf file out?

Will try.

The below looks like missing/corrupted/incorrect files. Hard to tell
without knowing what Patroni did?

Storage disappeared from underneath these clusters. The OS was of course
still in memory making futile attempts to write to disk, which would never
complete.

My best guess is that Patroni or postgres was in the middle of some writes
across the clusters when the failure occurred.


Using the above procedure I get one of three error messages when using
the data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.

Thx,
TK

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Adrian Klaver
adrian.klaver@aklaver.com

#8 Tom K
tomkcpr@gmail.com
In reply to: Tom K (#7)
Re: psql: FATAL: the database system is starting up

On Sat, Jun 1, 2019 at 3:32 PM Tom K <tomkcpr@gmail.com> wrote:

On Sat, Jun 1, 2019 at 9:55 AM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 5/31/19 7:53 PM, Tom K wrote:

There are two places to connect with the Patroni community: on

github,

via Issues and PRs, and on channel #patroni in the PostgreSQL

Slack. If

you're using Patroni, or just interested, please join us.

Will post there as well. Thank you. My thinking was to post here

first

since I suspect the Patroni community will simply refer me back here
given that the PostgreSQL errors are originating directly from

PostgreSQL.

That being said, can you start the copied Postgres instance without
using the Patroni instrumentation?

Yes, that is something I have been trying to do actually. But I hit a
dead end with the three errors above.

So what I did is copy a single node's backed-up data files to
*/data/patroni* on the same node ( this is the psql data directory as
defined through patroni ), then ran this ( psql03 = 192.168.0.118 ):

# sudo su - postgres
$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5

Why all the options?
That should be covered in postgresql.conf, no?

This resulted in one of the 3 messages above. Hence the post here. If
I can start a single instance, I should be fine since I could then 1)
replicate over to the other two or 2) simply take a dump, reinitialize
all the databases then restore the dump.

What if you move the recovery.conf file out?

Will try.

The below looks like missing/corrupted/incorrect files. Hard to tell
without knowing what Patroni did?

Storage disappeared from underneath these clusters. The OS was of course
still in memory making futile attempts to write to disk, which would never
complete.

My best guess is that Patroni or postgres was in the middle of some
writes across the clusters when the failure occurred.

Of note are the characters f2W below. I see nothing in the postgres source
code to indicate this is any recognizable postgres message. A part of me
suspects that the postgres binaries got corrupted. Had this case occur with
glib-common once, and a reinstall fixed it. However, the postgres binaries'
checksum matches a standalone install perfectly, so that should not be an
issue.
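
A quick way to double-check that from the RPM database side; rpm -V prints
nothing when the on-disk files still match the package:

[root@psql03 ~]# rpm -V postgresql10-server postgresql10-libs
[root@psql03 ~]#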


Using the above procedure I get one of three error messages when using
the data files of each node:

[ PSQL01 ]
postgres: postgres: startup process waiting for 000000010000000000000008

[ PSQL02 ]
PANIC: replication checkpoint has wrong magic 0 instead of 307747550

[ PSQL03 ]
FATAL: syntax error in history file: f2W

And I can't start any one of them.

Thx,
TK

--
Adrian Klaver
adrian.klaver@aklaver.com

--
Adrian Klaver
adrian.klaver@aklaver.com

#9 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#7)
Re: psql: FATAL: the database system is starting up

On 6/1/19 12:32 PM, Tom K wrote:

What if you move the recovery.conf file out?

Will try.

The below looks like missing/corrupted/incorrect files. Hard to tell
without knowing what Patroni did?

Storage disappeared from underneath these clusters.  The OS was of
course still in memory making futile attempts to write to disk, which
would never complete.

My best guess is that Patroni or postgres was in the middle of some
writes across the clusters when the failure occurred.

So to be clear, all three clusters were writing to the same storage
medium and there was no WAL archiving to some other storage?

--
Adrian Klaver
adrian.klaver@aklaver.com

#10 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#8)
Re: psql: FATAL: the database system is starting up

On 6/1/19 12:42 PM, Tom K wrote:

Of note are the characters f2W below.  I see nothing in the postgres
source code to indicate this is any recognizable postgres message.  A
part of me suspects that the postgres binaries got corrupted.   Had this
case occur with glib-common and a reinstall fixed it.  However the
postgres binaries csum matches a standalone install perfectly so that
should not be an issue.

It comes from timeline.c:

https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html

pg_log_error("syntax error in history file: %s", fline);

...

There should be another error message after the above.
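
It might also be worth dumping the raw bytes of whichever history file it
trips over, to see how much of it is garbage (the file name below is just
an example; use the one from your logs):

$ od -c /data/patroni/pg_wal/00000002.history | head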

--
Adrian Klaver
adrian.klaver@aklaver.com

#11 Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#9)
Re: psql: FATAL: the database system is starting up

On Sat, Jun 1, 2019 at 4:11 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 6/1/19 12:32 PM, Tom K wrote:

What if you move the recovery.conf file out?

Will try.

The below looks like missing/corrupted/incorrect files. Hard to tell
without knowing what Patroni did?

Storage disappeared from underneath these clusters. The OS was of
course still in memory making futile attempts to write to disk, which
would never complete.

My best guess is that Patroni or postgres was in the middle of some
writes across the clusters when the failure occurred.

So to be clear, all three clusters were writing to the same storage
medium and there was no WAL archiving to some other storage?

Yep, cheap LAB hardware with no power redundancy ( yet ) .

Show quoted text

--
Adrian Klaver
adrian.klaver@aklaver.com

#12 Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#11)
Re: psql: FATAL: the database system is starting up

On 6/1/19 2:08 PM, Tom K wrote:

Yep, cheap LAB hardware with no power redundancy ( yet ) .

I don't suppose a pg_dump was done anytime recently?

--
Adrian Klaver
adrian.klaver@aklaver.com

#13 Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#10)
Re: psql: FATAL: the database system is starting up

On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 6/1/19 12:42 PM, Tom K wrote:

Of note are the characters f2W below. I see nothing in the postgres
source code to indicate this is any recognizable postgres message. A
part of me suspects that the postgres binaries got corrupted. Had this
case occur with glib-common and a reinstall fixed it. However the
postgres binaries csum matches a standalone install perfectly so that
should not be an issue.

It comes from timeline.c:

https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html

pg_log_error("syntax error in history file: %s", fline);

...

There should be another error message after the above.

Nope. Here's the full set of lines in the postgres logs when running the
above line:

2019-06-01 17:13:03.261 EDT [14909] LOG: database system was shut down at
2019-05-22 01:55:13 EDT
2019-06-01 17:13:03.261 EDT [14909] DEBUG: primary_slot_name =
'postgresql2'
2019-06-01 17:13:03.261 EDT [14909] DEBUG: standby_mode = 'on'
2019-06-01 17:13:03.261 EDT [14909] DEBUG: recovery_target_timeline =
latest
2019-06-01 17:13:03.261 EDT [14909] WARNING: recovery command file
"recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 17:13:03.261 EDT [14909] HINT: The database server will
regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 17:13:03.263 EDT [14909] LOG: entering standby mode
2019-06-01 17:13:03.263 EDT [14909] FATAL: syntax error in history file:
f2W
2019-06-01 17:13:03.263 EDT [14909] HINT: Expected a numeric timeline ID.
2019-06-01 17:13:03.263 EDT [14909] DEBUG: shmem_exit(1): 1
before_shmem_exit callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG: shmem_exit(1): 4 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.263 EDT [14909] DEBUG: proc_exit(1): 2 callbacks to
make
2019-06-01 17:13:03.263 EDT [14909] DEBUG: exit(1)
2019-06-01 17:13:03.264 EDT [14909] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.264 EDT [14909] DEBUG: proc_exit(-1): 0 callbacks to
make
2019-06-01 17:13:03.265 EDT [14907] DEBUG: reaping dead processes
2019-06-01 17:13:03.265 EDT [14907] LOG: startup process (PID 14909)
exited with exit code 1
2019-06-01 17:13:03.265 EDT [14907] LOG: aborting startup due to startup
process failure
2019-06-01 17:13:03.265 EDT [14907] DEBUG: shmem_exit(1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG: shmem_exit(1): 5 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.266 EDT [14907] DEBUG: cleaning up dynamic shared
memory control segment with ID 1669075945
2019-06-01 17:13:03.268 EDT [14907] DEBUG: proc_exit(1): 2 callbacks to
make
2019-06-01 17:13:03.268 EDT [14907] LOG: database system is shut down
2019-06-01 17:13:03.268 EDT [14907] DEBUG: exit(1)
2019-06-01 17:13:03.268 EDT [14907] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.268 EDT [14907] DEBUG: proc_exit(-1): 0 callbacks to
make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: logger shutting down
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(0): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(0): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: proc_exit(0): 0 callbacks to
make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: exit(0)
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:13:03.270 EDT [14908] DEBUG: proc_exit(-1): 0 callbacks to
make
^C
-bash-4.2$

What's interesting is that f2W isn't a string you'd expect to be printed
as part of the code logic (I mean, what is f2W?).
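
For context, a timeline history file in pg_wal is plain text that the
startup process parses line by line: each line carries a numeric parent
timeline ID, a switchpoint LSN, and a free-form reason. A healthy one
might look like this (a sketch; the file name and LSN are made up for
illustration, not taken from this cluster):

-bash-4.2$ cat /data/patroni/pg_wal/00000002.history
1	0/9000098	no recovery target specified

Since "f2W" sits where the numeric timeline ID should be, the first token
of the file is garbage, which is exactly what the "Expected a numeric
timeline ID." hint is complaining about.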

postgresql10-server-10.8-1PGDG.rhel7.x86_64
postgresql10-libs-10.8-1PGDG.rhel7.x86_64
postgresql10-contrib-10.8-1PGDG.rhel7.x86_64
postgresql10-10.8-1PGDG.rhel7.x86_64

Without the recovery.conf file it actually started up. I'm going to try
to get a dump, then continue trying to recover in place.
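
For reference, the WARNING in the log above is about recovery.conf
carrying neither primary_conninfo nor restore_command. On PostgreSQL 10,
a minimal standby recovery.conf built from the settings the DEBUG lines
report would look roughly like this (a sketch; the primary_conninfo
values are assumptions, not this cluster's actual settings):

standby_mode = 'on'
recovery_target_timeline = 'latest'
primary_slot_name = 'postgresql2'
primary_conninfo = 'host=psql-primary port=5432 user=replicator'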

2019-06-01 17:17:29.777 EDT [14936] DEBUG: StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.777 EDT [14936] DEBUG: CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.778 EDT [14937] DEBUG: received inquiry for database 0
2019-06-01 17:17:29.778 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/global.stat"
2019-06-01 17:17:29.779 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/db_0.stat"
2019-06-01 17:17:29.790 EDT [15013] DEBUG: InitPostgres
2019-06-01 17:17:29.790 EDT [15013] DEBUG: my backend ID is 3
2019-06-01 17:17:29.791 EDT [15013] DEBUG: StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.801 EDT [15013] DEBUG: CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.801 EDT [15013] DEBUG: autovacuum: processing database
"postgres"
2019-06-01 17:17:29.801 EDT [14937] DEBUG: received inquiry for database
13806
2019-06-01 17:17:29.801 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/global.stat"
2019-06-01 17:17:29.801 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/db_13806.stat"
2019-06-01 17:17:29.801 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/db_0.stat"
2019-06-01 17:17:29.811 EDT [15013] DEBUG: StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.811 EDT [15013] DEBUG: pg_statistic: vac: 0 (threshold
129), anl: 0 (threshold 89)
2019-06-01 17:17:29.811 EDT [15013] DEBUG: pg_type: vac: 0 (threshold
125), anl: 0 (threshold 88)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_authid: vac: 0 (threshold
51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_attribute: vac: 0 (threshold
566), anl: 0 (threshold 308)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_proc: vac: 0 (threshold
629), anl: 0 (threshold 339)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_class: vac: 0 (threshold
118), anl: 0 (threshold 84)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_index: vac: 0 (threshold
77), anl: 0 (threshold 63)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_operator: vac: 0 (threshold
207), anl: 0 (threshold 129)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_opclass: vac: 0 (threshold
77), anl: 0 (threshold 63)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_am: vac: 0 (threshold 51),
anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_amop: vac: 0 (threshold
192), anl: 0 (threshold 121)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_amproc: vac: 0 (threshold
132), anl: 0 (threshold 91)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_aggregate: vac: 0 (threshold
78), anl: 0 (threshold 64)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_rewrite: vac: 0 (threshold
74), anl: 0 (threshold 62)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_cast: vac: 0 (threshold 94),
anl: 0 (threshold 72)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_namespace: vac: 0 (threshold
51), anl: 0 (threshold 51)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_database: vac: 0 (threshold
50), anl: 0 (threshold 50)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_tablespace: vac: 0
(threshold 50), anl: 0 (threshold 50)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: pg_toast_2618: vac: 0
(threshold 95), anl: 0 (threshold 73)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:17:29.812 EDT [15013] DEBUG: shmem_exit(0): 1
before_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG: shmem_exit(0): 7 on_shmem_exit
callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG: proc_exit(0): 2 callbacks to
make
2019-06-01 17:17:29.812 EDT [15013] DEBUG: exit(0)
2019-06-01 17:17:29.812 EDT [15013] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:17:29.812 EDT [15013] DEBUG: proc_exit(-1): 0 callbacks to
make
2019-06-01 17:17:29.814 EDT [14930] DEBUG: reaping dead processes
2019-06-01 17:17:29.814 EDT [14930] DEBUG: server process (PID 15013)
exited with exit code 0
2019-06-01 17:18:29.785 EDT [14936] DEBUG: StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.785 EDT [14936] DEBUG: CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.785 EDT [14937] DEBUG: received inquiry for database 0
2019-06-01 17:18:29.785 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/global.stat"
2019-06-01 17:18:29.786 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/db_0.stat"
2019-06-01 17:18:29.797 EDT [15020] DEBUG: InitPostgres
2019-06-01 17:18:29.797 EDT [15020] DEBUG: my backend ID is 3
2019-06-01 17:18:29.798 EDT [15020] DEBUG: StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.800 EDT [15020] DEBUG: CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.800 EDT [15020] DEBUG: autovacuum: processing database
"postgres"
2019-06-01 17:18:29.800 EDT [14937] DEBUG: received inquiry for database
13806
2019-06-01 17:18:29.800 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/global.stat"
2019-06-01 17:18:29.800 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/db_13806.stat"
2019-06-01 17:18:29.801 EDT [14937] DEBUG: writing stats file
"pg_stat_tmp/db_0.stat"
2019-06-01 17:18:29.811 EDT [15020] DEBUG: StartTransaction(1) name:
unnamed; blockState: DEFAULT; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_statistic: vac: 0 (threshold
129), anl: 0 (threshold 89)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_type: vac: 0 (threshold
125), anl: 0 (threshold 88)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_authid: vac: 0 (threshold
51), anl: 0 (threshold 51)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_attribute: vac: 0 (threshold
566), anl: 0 (threshold 308)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_proc: vac: 0 (threshold
629), anl: 0 (threshold 339)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_class: vac: 0 (threshold
118), anl: 0 (threshold 84)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_index: vac: 0 (threshold
77), anl: 0 (threshold 63)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_operator: vac: 0 (threshold
207), anl: 0 (threshold 129)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_opclass: vac: 0 (threshold
77), anl: 0 (threshold 63)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_am: vac: 0 (threshold 51),
anl: 0 (threshold 51)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_amop: vac: 0 (threshold
192), anl: 0 (threshold 121)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_amproc: vac: 0 (threshold
132), anl: 0 (threshold 91)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_aggregate: vac: 0 (threshold
78), anl: 0 (threshold 64)
2019-06-01 17:18:29.811 EDT [15020] DEBUG: pg_rewrite: vac: 0 (threshold
74), anl: 0 (threshold 62)
2019-06-01 17:18:29.812 EDT [15020] DEBUG: pg_cast: vac: 0 (threshold 94),
anl: 0 (threshold 72)
2019-06-01 17:18:29.812 EDT [15020] DEBUG: pg_namespace: vac: 0 (threshold
51), anl: 0 (threshold 51)
2019-06-01 17:18:29.812 EDT [15020] DEBUG: pg_database: vac: 0 (threshold
50), anl: 0 (threshold 50)
2019-06-01 17:18:29.812 EDT [15020] DEBUG: pg_tablespace: vac: 0
(threshold 50), anl: 0 (threshold 50)
2019-06-01 17:18:29.812 EDT [15020] DEBUG: pg_toast_2618: vac: 0
(threshold 95), anl: 0 (threshold 73)
2019-06-01 17:18:29.812 EDT [15020] DEBUG: CommitTransaction(1) name:
unnamed; blockState: STARTED; state: INPROGR, xid/subid/cid: 0/1/0
2019-06-01 17:18:29.812 EDT [15020] DEBUG: shmem_exit(0): 1
before_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG: shmem_exit(0): 7 on_shmem_exit
callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG: proc_exit(0): 2 callbacks to
make
2019-06-01 17:18:29.812 EDT [15020] DEBUG: exit(0)
2019-06-01 17:18:29.812 EDT [15020] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:18:29.812 EDT [15020] DEBUG: proc_exit(-1): 0 callbacks to
make
2019-06-01 17:18:29.814 EDT [14930] DEBUG: reaping dead processes
2019-06-01 17:18:29.814 EDT [14930] DEBUG: server process (PID 15020)
exited with exit code 0
^C
[root@psql03 log]#

The point of the POC and the LAB is to test these things across failures
as well as various configurations. To that end, I'm just as curious about
how to recover from these error conditions as I am about just getting
things to work.

Show quoted text

--
Adrian Klaver
adrian.klaver@aklaver.com

#14Tom K
tomkcpr@gmail.com
In reply to: Tom K (#13)
Re: psql: FATAL: the database system is starting up

On Sat, Jun 1, 2019 at 5:30 PM Tom K <tomkcpr@gmail.com> wrote:

Show quoted text

Spoke too soon. There's no tables when it's started without the
recovery.conf file.
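
One quick way to see what actually survived is to list databases and
tables from psql (a sketch; the host here is an assumption, so adjust it
to whatever address the server was started with):

-bash-4.2$ psql -h 192.168.0.118 -p 5432 -U postgres postgres
postgres=# \l
postgres=# \dt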

Show quoted text

--
Adrian Klaver
adrian.klaver@aklaver.com

#15Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#13)
Re: psql: FATAL: the database system is starting up

On 6/1/19 2:30 PM, Tom K wrote:

On Sat, Jun 1, 2019 at 4:52 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 6/1/19 12:42 PM, Tom K wrote:

Of note are the characters f2W below.  I see nothing in the postgres
source code to indicate this is any recognizable postgres message.  A
part of me suspects that the postgres binaries got corrupted; I had a
similar case occur with glibc-common, and a reinstall fixed it.  However,
the postgres binaries' checksum matches a standalone install perfectly,
so that should not be an issue.

It comes from timeline.c:

https://doxygen.postgresql.org/bin_2pg__rewind_2timeline_8c.html

pg_log_error("syntax error in history file: %s", fline);

...

There should be another error message after the above.

Nope.  Here's the full set of lines in the postgres logs when running
the above line:

2019-06-01 17:13:03.263 EDT [14909] FATAL:  syntax error in history
file: f2W
2019-06-01 17:13:03.263 EDT [14909] HINT:  Expected a numeric timeline ID.

Actually the above HINT is what I was looking for.

^C
-bash-4.2$

What's interesting is that f2W isn't a string you'd expect to be printed
as part of the code logic (I mean, what is f2W?).

As the HINT said, Postgres wasn't expecting it either. That is probably
down to file corruption.

The point of the POC and the LAB is to test these things across failures
as well as various configurations.  To that end, I'm just as curious
about how to recover from these error conditions as I am about just
getting things to work.

I think what it proved was that a single point of failure is not good
and that steps need to be taken to prevent or deal with it, e.g. a
second-location backup of some sort.
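
For example, even a nightly logical dump shipped off the box would have
helped here (a sketch; the database name and target path are made up):

-bash-4.2$ pg_dump -U postgres -Fc somedb > /mnt/offbox-backup/somedb.dump
-bash-4.2$ pg_dumpall -U postgres > /mnt/offbox-backup/all.sql

plus, ideally, WAL archiving via archive_command to storage that does not
share the cluster's single point of failure.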


--
Adrian Klaver
adrian.klaver@aklaver.com

#16Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#14)
Re: psql: FATAL: the database system is starting up

On 6/1/19 2:31 PM, Tom K wrote:

Spoke too soon.  There's no tables when it's started without the
recovery.conf file.

Were there any errors in the startup?

Are there databases in the cluster, either system (template1, postgres,
etc.) or user?

Did you start against the correct PGDATA directory?

--
Adrian Klaver
adrian.klaver@aklaver.com

#17Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#16)
Re: psql: FATAL: the database system is starting up

On Sat, Jun 1, 2019 at 5:51 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

On 6/1/19 2:31 PM, Tom K wrote:

Spoke too soon. There's no tables when it's started without the
recovery.conf file.

Were there any errors in the startup?

Nothing I would discern as a startup error:

[root@psql03 patroni]# sudo su - postgres
Last login: Sat Jun 1 17:30:31 EDT 2019 on pts/1
-bash-4.2$ /usr/pgsql-10/bin/postgres -D /data/patroni
--config-file=/data/patroni/postgresql.conf
--listen_addresses=192.168.0.118 --max_worker_processes=8
--max_locks_per_transaction=64 --wal_level=replica
--track_commit_timestamp=off --max_prepared_transactions=0 --port=5432
--max_replication_slots=10 --max_connections=100 --hot_standby=on
--cluster_name=postgres --wal_log_hints=on --max_wal_senders=10 -d 5
2019-06-01 17:41:27.477 EDT [15209] DEBUG: postgres: PostmasterMain:
initial environment dump:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
-----------------------------------------
2019-06-01 17:41:27.477 EDT [15209] DEBUG: XDG_SESSION_ID=219
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HOSTNAME=psql03.nix.mds.xyz
2019-06-01 17:41:27.477 EDT [15209] DEBUG: SHELL=/bin/bash
2019-06-01 17:41:27.477 EDT [15209] DEBUG: TERM=xterm
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HISTSIZE=1000
2019-06-01 17:41:27.477 EDT [15209] DEBUG: USER=postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.jpg=01;35:*.jpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.axv=01;35:*.anx=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.axa=01;36:*.oga=01;36:*.spx=01;36:*.xspf=01;36:
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
MAIL=/var/spool/mail/postgres
2019-06-01 17:41:27.477 EDT [15209] DEBUG:
PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/pgsql-10/bin/
2019-06-01 17:41:27.477 EDT [15209] DEBUG: PWD=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG: LANG=en_US.UTF-8
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HISTCONTROL=ignoredups
2019-06-01 17:41:27.477 EDT [15209] DEBUG: SHLVL=1
2019-06-01 17:41:27.477 EDT [15209] DEBUG: HOME=/var/lib/pgsql
2019-06-01 17:41:27.477 EDT [15209] DEBUG: LOGNAME=postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
PGDATA=/var/lib/pgsql/10/data
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
LESSOPEN=||/usr/bin/lesspipe.sh %s
2019-06-01 17:41:27.478 EDT [15209] DEBUG: _=/usr/pgsql-10/bin/postgres
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
PGLOCALEDIR=/usr/pgsql-10/share/locale
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
PGSYSCONFDIR=/etc/sysconfig/pgsql
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_COLLATE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_CTYPE=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_MESSAGES=en_US.UTF-8
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_MONETARY=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_NUMERIC=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG: LC_TIME=C
2019-06-01 17:41:27.478 EDT [15209] DEBUG:
-----------------------------------------
2019-06-01 17:41:27.494 EDT [15209] DEBUG: registering background worker
"logical replication launcher"
2019-06-01 17:41:27.495 EDT [15209] LOG: listening on IPv4 address
"192.168.0.118", port 5432
2019-06-01 17:41:27.498 EDT [15209] LOG: listening on Unix socket
"./.s.PGSQL.5432"
2019-06-01 17:41:27.499 EDT [15209] DEBUG: invoking
IpcMemoryCreate(size=148545536)
2019-06-01 17:41:27.499 EDT [15209] DEBUG: mmap(148897792) with
MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2019-06-01 17:41:27.514 EDT [15209] DEBUG: SlruScanDirectory invoking
callback on pg_notify/0000
2019-06-01 17:41:27.514 EDT [15209] DEBUG: removing file "pg_notify/0000"
2019-06-01 17:41:27.515 EDT [15209] DEBUG: dynamic shared memory system
will support 288 segments
2019-06-01 17:41:27.515 EDT [15209] DEBUG: created dynamic shared memory
control segment 834455595 (6928 bytes)
2019-06-01 17:41:27.518 EDT [15209] DEBUG: max_safe_fds = 985, usable_fds
= 1000, already_open = 5
2019-06-01 17:41:27.519 EDT [15209] LOG: redirecting log output to logging
collector process
2019-06-01 17:41:27.519 EDT [15209] HINT: Future log output will appear in
directory "log".
2019-06-01 17:41:27.550 EDT [15210] DEBUG: logger shutting down
2019-06-01 17:41:27.550 EDT [15210] DEBUG: shmem_exit(0): 0
before_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG: shmem_exit(0): 0 on_shmem_exit
callbacks to make
-bash-4.2$ 2019-06-01 17:41:27.550 EDT [15210] DEBUG: proc_exit(0): 0
callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG: exit(0)
2019-06-01 17:41:27.550 EDT [15210] DEBUG: shmem_exit(-1): 0
before_shmem_exit callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG: shmem_exit(-1): 0 on_shmem_exit
callbacks to make
2019-06-01 17:41:27.550 EDT [15210] DEBUG: proc_exit(-1): 0 callbacks to
make

-bash-4.2$
-bash-4.2$

This message:

mmap(148897792) with MAP_HUGETLB failed, huge pages disabled: Cannot
allocate memory

doesn't seem to have been a problem. I doubled the memory on the VM but
the message remained.
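
That DEBUG line is expected behavior rather than an error: the default
setting huge_pages = try makes Postgres attempt a MAP_HUGETLB allocation
and silently fall back to normal pages when the kernel has none reserved.
For reference, the relevant postgresql.conf line (try is the default):

huge_pages = try

Reserving huge pages at the OS level (vm.nr_hugepages) would make the
message go away, but it is harmless as-is.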

Are there databases in the cluster, either system (template1, postgres,
etc.) or user?

No.

Did you start against the correct PGDATA directory?

Yes.

Show quoted text

--
Adrian Klaver
adrian.klaver@aklaver.com

#18Tom K
tomkcpr@gmail.com
In reply to: Tom K (#17)
Re: psql: FATAL: the database system is starting up

On Sat, Jun 1, 2019 at 6:05 PM Tom K <tomkcpr@gmail.com> wrote:

Show quoted text

Are there databases in the cluster, either system (template1, postgres,
etc.) or user?

No.

** Correction. There is postgres, template1 and template2 but none of the
other databases we had.

Show quoted text

Did you start against the correct PGDATA directory?

Yes.

--
Adrian Klaver
adrian.klaver@aklaver.com

#19Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#18)
Re: psql: FATAL: the database system is starting up

On 6/1/19 3:14 PM, Tom K wrote:

** Correction.  There is postgres, template1 and template2 but none of
the other databases we had.

In a psql session do:

select oid, datname from pg_database ;

Then go to /data/patroni and drill down to the base directory. In that
directory there should be sub-directories with numbers equal to the oid
values from the query. If not then something is wrong. That could be:

1) Wrong data directory.

2) Copied over incorrect data directory.

Basically if the only sub-directories are for the system databases then
the user databases do not exist. You might want to look at some of the
other clusters to see what is there.
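
For illustration, the cross-check might look like this (a sketch; the
oids are typical PostgreSQL 10 system values plus one made-up user
database, not output from this cluster):

postgres=# select oid, datname from pg_database;
  oid  |  datname
-------+-----------
     1 | template1
 13805 | template0
 13806 | postgres
 16384 | somedb
(4 rows)

-bash-4.2$ ls /data/patroni/base
1  13805  13806  16384

If base only holds the low-numbered system oids, the user databases were
never in this data directory to begin with.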

--
Adrian Klaver
adrian.klaver@aklaver.com

#20Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#18)
Re: psql: FATAL: the database system is starting up

On 6/1/19 3:14 PM, Tom K wrote:

** Correction.  There is postgres, template1 and template2 but none of
the other databases we had.

Just noticed, is that really template2 or is it actually template0?

--
Adrian Klaver
adrian.klaver@aklaver.com

#21Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#20)
#22Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#19)
#23Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#22)
#24Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#23)
#25Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#24)
#26Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#24)
#27Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#26)
#28Tom K
tomkcpr@gmail.com
In reply to: Tom K (#27)
#29Tom K
tomkcpr@gmail.com
In reply to: Tom K (#28)
#30Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#27)
#31Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#30)
#32Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#29)
#33Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#32)
#34Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#33)
#35Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#34)
#36Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Tom K (#35)
#37Tom K
tomkcpr@gmail.com
In reply to: Adrian Klaver (#36)