BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
The following bug has been logged on the website:
Bug reference: 17392
Logged by: Alexander Ulaev
Email address: alexander.ulaev@rtlabs.ru
PostgreSQL version: Unsupported/Unknown
Operating system: CentOS Linux release 7.9.2009 (Core)
Description:
We have several shards with a Patroni cluster over PG 9.6 installed on VMs.
A problem on the SAN side caused I/O on our KVM VMs to stall for approximately
one minute. Most of the DB shards with relatively low load had 40-60 second
commits but survived; however, two shard masters with high application load
(TPS = 2-3x the average among the shard DBs) were unexpectedly shut down with
the same errors:
2022-02-01 16:12:24 MSK [16959] LOG: received immediate shutdown request
and
2022-02-01 16:12:25 MSK [16959] LOG: archiver process (PID 117615) exited
with exit code 2 (I suppose the timestamp of this LOG entry is misleading,
and by meaning this record really belongs after the "shutdown" record)
among a huge number of messages for user processes, "terminating connection
because of crash of another server process", like these:
2022-02-01 16:12:24 MSK [151045] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 HINT: In a moment you should be able to reconnect to the
database and repeat your command.
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 WARNING: terminating connection because of crash of another
server process
2022-02-01 16:12:24 MSK [152240] 127.0.0.1 PostgreSQL JDBC Driver
queue2@queue2 DETAIL: The postmaster has commanded this server process to
roll back the current transaction and exit, because another server process
exited abnormally and possibly corrupted shared memory.
I can't find anywhere what this exit code 2 stands for, and as far as I know
about the behavior of the ARCHIVER process on abnormal termination, it should
simply be restarted by the postmaster, not cause the entire instance to
terminate abnormally or all postgres processes to halt.
I found in the source code that all functions related to the archiver live in
pgarch.c (initial author: Simon Riggs simon@2ndquadrant.com), but I can't
find any information there about "exit code 2".
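For context, as far as I can tell from 9.6-era sources, the archiver's SIGQUIT handler in pgarch.c simply exits with status 2; the postmaster sends SIGQUIT to all of its children on an immediate shutdown, so the "exit code 2" is a consequence of the shutdown request rather than its cause. A toy Python simulation of that interaction (this is not PostgreSQL source code; the function name is illustrative):

```python
import os
import signal
import time

def simulated_archiver_exit_code():
    """Fork a child 'archiver', send it SIGQUIT the way the postmaster
    does on immediate shutdown, and return the child's exit status."""
    pid = os.fork()
    if pid == 0:
        # Child: install an archiver-style SIGQUIT handler that just
        # exits with status 2, then idle until signaled.
        signal.signal(signal.SIGQUIT, lambda signo, frame: os._exit(2))
        time.sleep(30)
        os._exit(0)
    time.sleep(0.5)                  # let the child install its handler
    os.kill(pid, signal.SIGQUIT)     # the postmaster's immediate-shutdown signal
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)    # -> 2, matching the server log
```

In other words, the log line "archiver process (PID ...) exited with exit code 2" is the expected trace of an immediate shutdown already in progress.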
Later, when the instance was starting and recovering WAL since the last
checkpoint, an "invalid record length" message appeared:
2022-02-01 16:13:17 MSK [153401] LOG: invalid record length at
1CEC/C3BEBB50: wanted 24, got 0
2022-02-01 16:13:17 MSK [153401] LOG: consistent recovery state reached at
1CEC/C3BEBB50
2022-02-01 16:13:17 MSK [153397] LOG: database system is ready to accept
read only connections
but the instance started, and Patroni returned it to the master role, because
the sync replica had also been shut down with the same "invalid record
length" message while it was applying WAL:
2022-02-01 16:12:25 MSK [16563] 127.0.0.1 [unknown] patroni@postgres LOG:
connection authorized: user=patroni database=postgres
WARNING: terminating connection because of crash of another server
process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.
2022-02-01 16:12:25 MSK [89015] FATAL: could not receive data from WAL
stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2022-02-01 16:12:25 MSK [89009] LOG: invalid record length at
1CEC/C3BEBA90: wanted 24, got 0
2022-02-01 16:12:25 MSK [16564] FATAL: could not connect to the primary
server: FATAL: the database system is shutting down
2022-02-01 16:12:26 MSK [89006] LOG: received fast shutdown request
2022-02-01 16:12:26 MSK [89006] LOG: aborting any active transactions
Hello
2022-02-01 16:12:24 MSK [16959] LOG: received immediate shutdown request
Please confirm from the Patroni logs that it was not Patroni that commanded the immediate shutdown. As far as I remember, there should be a log entry about the request to stop the database immediately.
This seems unrelated to the archiver. Patroni is able to command an immediate shutdown, and an immediate shutdown is a kind of "terminated abnormally". The sequence of log messages looks completely correct.
regards, Sergei
Hi, thanks for your cooperation
2022-02-01 16:11:34,964 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:11:44,950 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:11:54,677 INFO: Lock owner: db02.local; I am db02.local
2022-02-01 16:11:56,145 ERROR: Request to server http://IP_ETCD1:2379 failed: MaxRetryError('HTTPConnectionPool(host=\'IP_ETCD1\', port=2379): Max retries exceeded with url: /v2/keys/postgresql/db/leader (Caused by Read
TimeoutError("HTTPConnectionPool(host=\'IP_ETCD1\', port=2379): Read timed out. (read timeout=1.4283105651182788)",))',)
2022-02-01 16:11:56,146 INFO: Reconnection allowed, looking for another server.
2022-02-01 16:11:56,146 INFO: Retrying on http://IP_ETCD2:2379
2022-02-01 16:11:56,262 INFO: Selected new etcd server http://IP_ETCD2:2379
2022-02-01 16:11:56,475 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:04,950 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:14,960 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:12:24,669 INFO: Lock owner: db02.local; I am db02.local
2022-02-01 16:12:24,738 ERROR:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 562, in wrapper
retval = func(self, *args, **kwargs) is not None
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 692, in _update_leader
return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 443, in retry
return retry(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/patroni/utils.py", line 334, in __call__
return func(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/etcd/client.py", line 500, in write
response = self.api_execute(path, method, params=params)
File "/usr/lib/python3.6/site-packages/patroni/dcs/etcd.py", line 256, in api_execute
return self._handle_server_response(response)
File "/usr/lib/python3.6/site-packages/etcd/client.py", line 987, in _handle_server_response
etcd.EtcdError.handle(r)
File "/usr/lib/python3.6/site-packages/etcd/__init__.py", line 306, in handle
raise exc(msg, payload)
etcd.EtcdKeyNotFound: Key not found : /postgresql/db/leader
2022-02-01 16:12:24,928 ERROR: failed to update leader lock
2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS
2022-02-01 16:12:27,079 INFO: closed patroni connection to the postgresql cluster
2022-02-01 16:12:28,685 INFO: postmaster pid=153397
2022-02-01 16:12:34,670 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:34,671 INFO: not healthy enough for leader race
2022-02-01 16:12:34,738 INFO: starting after demotion in progress
2022-02-01 16:12:44,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:44,672 INFO: not healthy enough for leader race
2022-02-01 16:12:44,741 INFO: starting after demotion in progress
2022-02-01 16:12:54,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:12:54,672 INFO: not healthy enough for leader race
2022-02-01 16:12:54,739 INFO: starting after demotion in progress
2022-02-01 16:13:04,672 INFO: Lock owner: None; I am db02.local
2022-02-01 16:13:04,672 INFO: not healthy enough for leader race
2022-02-01 16:13:04,746 INFO: starting after demotion in progress
2022-02-01 16:13:14,673 INFO: Lock owner: None; I am db02.local
2022-02-01 16:13:14,674 INFO: not healthy enough for leader race
2022-02-01 16:13:14,743 INFO: starting after demotion in progress
2022-02-01 16:13:18,748 INFO: establishing a new patroni connection to the postgres cluster
2022-02-01 16:13:19,961 INFO: Got response from db01.local http://IP_DB01:8008/patroni: {"state": "starting", "role": "replica", "database_system_identifier": "9879384539485923423", "patroni": {"version": "2.1.0", "scope": "db"}}
2022-02-01 16:13:19,961 WARNING: Request failed to db01.local: GET http://IP_DB01:8008/patroni ('xlog')
2022-02-01 16:13:20,365 INFO: promoted self to leader by acquiring session lock
2022-02-01 16:13:20,380 INFO: cleared rewind state after becoming the leader
2022-02-01 16:13:21,816 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:13:24,830 INFO: no action. I am (db02.local) the leader with the lock
2022-02-01 16:13:34,826 INFO: no action. I am (db02.local) the leader with the lock
Best regards,
Ulaev Alexander
-----Original Message-----
From: Sergei Kornilov [mailto:sk@zsrv.org]
Sent: Thursday, February 3, 2022 11:52 AM
To: PG Bug reporting form <noreply@postgresql.org>
Cc: Ulaev Alexander Sergeevich <alexander.ulaev@rtlabs.ru>; pgsql-bugs@lists.postgresql.org
Subject: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request
Hello
2022-02-01 16:12:24,928 ERROR: failed to update leader lock
2022-02-01 16:12:27,063 INFO: demoted self because failed to update leader lock in DCS
Between these two messages, an immediate shutdown is called: https://github.com/zalando/patroni/blob/v2.1.2/patroni/ha.py#L1045
Regards, Sergei
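The decision behind the linked ha.py line can be sketched as follows (a toy model, not Patroni's actual code; function and message names are illustrative): once the node fails to refresh its leader lock in the DCS, it can no longer prove it is the leader, so it demotes itself with an *immediate* PostgreSQL shutdown to rule out a split-brain, which is exactly the "received immediate shutdown request" seen in the server log.

```python
def ha_loop_step(update_leader_lock):
    """One HA-loop iteration; returns the log line the step would emit."""
    if update_leader_lock():
        return "no action. I am the leader with the lock"
    # Cannot prove leadership any more: stop accepting writes as fast as
    # possible ("pg_ctl stop -m immediate" rather than "-m fast").
    return "demoted self because failed to update leader lock in DCS"

# Usage: a healthy iteration vs. the etcd timeout seen in the logs above.
healthy = ha_loop_step(lambda: True)
demoted = ha_loop_step(lambda: False)
```

The "immediate" stop mode is chosen deliberately: a fast shutdown waits for a checkpoint, during which another node could acquire the lock and start accepting writes concurrently.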
We found the following errors in the syslog on ETCD1:
Feb 1 16:12:01 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb 1 16:12:02 etcd1 etcd: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1.21s]
Feb 1 16:12:03 etcd1 etcd: got unexpected response error (etcdserver: request timed out) [merged 1 repeated lines in 1s]
Feb 1 16:12:20 etcd1 etcd: sync duration of 29.69857369s, expected less than 1s
Feb 1 16:26:55 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb 1 16:27:03 etcd1 etcd: got unexpected response error (etcdserver: request timed out)
Feb 1 16:27:17 etcd1 etcd: sync duration of 1m0.745329542s, expected less than 1s
So, the problem originated on the SAN side => the I/O freeze affected many VMs, including the DB and etcd nodes => long etcd I/O delays => Patroni reacted by demoting its own PostgreSQL instance.
Sergei, thank you very much for your support!
Best regards,
Ulaev Alexander
-----Original Message-----
From: Sergei Kornilov [mailto:sk@zsrv.org]
Sent: Thursday, February 3, 2022 1:06 PM
To: Ulaev Alexander Sergeevich <alexander.ulaev@rtlabs.ru>
Cc: pgsql-bugs@lists.postgresql.org; PG Bug reporting form <noreply@postgresql.org>
Subject: Re:BUG #17392: archiver process exited with exit code 2 was unexpectedly cause for immediate shutdown request