walreceiver termination
We've seen unexpected termination of the WAL receiver process. This
stops streaming replication, but the replica stays available --
restarting the server resumes streaming replication where it left off.
We've seen this across nearly every recent version of PG (9.4, 9.5,
11.x, 12.x) -- anything omitted is a version we haven't used.
I don't have an explanation for the cause, but I was able to set
logging to "debug5" and run an strace of the walreceiver PID when it
eventually happened. It appears as if the SIGTERM is coming from the
"postgres: startup" process. The only potentially relevant detail I've
seen discussed in the past is that these replicas all reside on a ZFS
filesystem (on Linux). In fact, everywhere we've seen this issue, the
database sits on a ZFS pool.
We also have another PG cluster running on non-ZFS (ext4) and have
never experienced this problem.
Any suggestions?
Logging snippets are below:
postgresql.log
Apr 23 09:09:13 royce postgres[2229]: [43141904-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG: record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141904-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A6E0 for
Btree/INSERT_LEAF: off 88
Apr 23 09:09:13 royce postgres[2229]: [43141905-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG: record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141905-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A720 for
Heap/HEAP_CONFIRM: off 6
Apr 23 09:09:13 royce postgres[2229]: [43141906-1] 2020-04-23
09:09:13.663 GMT [2229] DEBUG: record known xact 8677704
latestObservedXid 8677704
Apr 23 09:09:13 royce postgres[2229]: [43141906-2] 2020-04-23
09:09:13.663 GMT [2229] CONTEXT: WAL redo at 2D6/C259A750 for
Heap/INSERT: off 7 flags 0x0C
Apr 23 09:09:13 royce postgres[2229]: [43141907-1] 2020-04-23
09:09:13.663 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448555-1] 2020-04-23
09:09:13.663 GMT [2299] FATAL: terminating walreceiver process due to
administrator command
Apr 23 09:09:13 royce postgres[2299]: [1448556-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 1 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2299]: [1448557-1] 2020-04-23
09:09:13.663 GMT [2299] DEBUG: shmem_exit(1): 6 on_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141908-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141909-1] 2020-04-23
09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448558-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG: proc_exit(1): 2 callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141910-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2229]: [43141911-1] 2020-04-23
09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2299]: [1448559-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG: exit(1)
Apr 23 09:09:13 royce postgres[2229]: [43141912-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from stream to
archive after failure
Apr 23 09:09:13 royce postgres[2229]: [43141913-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: incorrect resource manager data
checksum in record at 2D6/C259AB90
Apr 23 09:09:13 royce postgres[2229]: [43141914-1] 2020-04-23
09:09:13.664 GMT [2229] DEBUG: switched WAL source from archive to
stream after failure
Apr 23 09:09:13 royce postgres[2299]: [1448560-1] 2020-04-23
09:09:13.664 GMT [2299] DEBUG: shmem_exit(-1): 0 before_shmem_exit
callbacks to make
Apr 23 09:09:13 royce postgres[2229]: [43141915-1] 2020-04-23
09:09:13.664 GMT [2229] LOG: incorrect resource manager data checksum
in record at 2D
process id list:
2224 ? Ss 0:24 /usr/lib/postgresql/12/bin/postgres -D
/var/lib/postgresql/12/main -c
unix_socket_directories=/var/run/postgresql -c
config_file=/etc/postgresql/12/main/postgresql.conf -c
hba_file=/etc/postgresql/12/main/pg_hba.conf -c
ident_file=/etc/postgresql/12/main/pg_ident.conf -c
external_pid_file=/var/run/postgresql/12-main.pid
2229 ? Ss 7:02 postgres: startup
2230 ? Ss 6:09 postgres: checkpointer
2231 ? Ss 0:08 postgres: background writer
2298 ? Ss 1:29 postgres: stats collector
2299 ? Ss 6:11 postgres: walreceiver
strace of walreceiver:
kill(2229, SIGUSR1) = 0
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448554-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: sending write
2D6/C25A0000 flush 2D6/C25A0000 apply 2D6/C25915F0", 156,
MSG_NOSIGNAL, NULL, 0) = 156
sendto(4<socket:[208548975]>,
"\x64\x00\x00\x00\x26\x72\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x5a\x00\x00\x00\x00\x02\xd6\xc2\x59\x15\xf0\x00\x02\x46\xf0\x5f\x04\xbe\x9e\x00",
39, MSG_NOSIGNAL, NULL, 0) = 39
epoll_create1(EPOLL_CLOEXEC) = 6<anon_inode:[eventpoll]>
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
10<pipe:[208548970]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640776,
u64=94863290307528}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
7<pipe:[208548681]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640800,
u64=94863290307552}}) = 0
epoll_ctl(6<anon_inode:[eventpoll]>, EPOLL_CTL_ADD,
4<socket:[208548975]>, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=347640824,
u64=94863290307576}}) = 0
epoll_wait(6<anon_inode:[eventpoll]>, 0x564714b89410, 1, 100) = -1
EINTR (Interrupted system call)
--- SIGTERM {si_signo=SIGTERM, si_code=SI_USER, si_pid=2229, si_uid=112} ---
write(11<pipe:[208548970]>, "\x00", 1) = 1
rt_sigreturn({mask=[]}) = -1 EINTR (Interrupted system call)
close(6<anon_inode:[eventpoll]>) = 0
getpid() = 2299
sendto(3<socket:[208548671]>, "<131>Apr 23 09:09:13 postgres[2299]:
[1448555-1] 2020-04-23 09:09:13.663 GMT [2299] FATAL: terminating
walreceiver process due to administrator command", 152, MSG_NOSIGNAL,
NULL, 0) = 152
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448556-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1):
1 before_shmem_exit callbacks to make", 144, MSG_NOSIGNAL, NULL, 0) =
144
getpid() = 2299
sendto(3<socket:[208548671]>, "<135>Apr 23 09:09:13 postgres[2299]:
[1448557-1] 2020-04-23 09:09:13.663 GMT [2299] DEBUG: shmem_exit(1):
6 on_shmem_exit callbacks to make", 140, MSG_NOSIGNAL, NULL, 0) = 140
sendto(4<socket:[208548975]>, "\x58\x00\x00\x00\x04", 5, MSG_NOSIGNAL,
NULL, 0) = 5
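The key strace line is the SIGTERM delivery: with si_code=SI_USER, the
kernel fills in si_pid with the sender's PID (2229, which the process
list above shows is "postgres: startup"). A minimal, Linux-only Python
sketch of how that sender identification works -- nothing
PostgreSQL-specific, just the siginfo mechanism strace is reporting:

```python
import os
import signal

# Block SIGTERM so it stays pending until we collect it with sigwaitinfo().
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})

pid = os.fork()
if pid == 0:
    # The child plays the role of the "startup" process (PID 2229 above)
    # and SIGTERMs its parent, the stand-in for the walreceiver.
    os.kill(os.getppid(), signal.SIGTERM)
    os._exit(0)

# As in the strace line "SI_USER, si_pid=2229", the siginfo delivered
# by the kernel records exactly which process sent the signal.
info = signal.sigwaitinfo({signal.SIGTERM})
os.waitpid(pid, 0)
print(info.si_pid == pid)
```

This is how one can confirm from an strace alone that the startup
process, not an external administrator, issued the kill.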
Any suggestions?
Thanks-
Justin
Justin King <kingpin867@gmail.com> writes:
We've seen unexpected termination of the WAL receiver process. This
stops streaming replication, but the replica stays available --
restarting the server resumes streaming replication where it left off.
We've seen this across nearly every recent version of PG (9.4, 9.5,
11.x, 12.x) -- anything omitted is a version we haven't used.
I don't have an explanation for the cause, but I was able to set
logging to "debug5" and run an strace of the walreceiver PID when it
eventually happened. It appears as if the SIGTERM is coming from the
"postgres: startup" process.
The startup process intentionally SIGTERMs the walreceiver under
various circumstances, so I'm not sure that there's any surprise
here. Have you checked the postmaster log?
regards, tom lane
On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Justin King <kingpin867@gmail.com> writes:
We've seen unexpected termination of the WAL receiver process. This
stops streaming replication, but the replica stays available --
restarting the server resumes streaming replication where it left off.
We've seen this across nearly every recent version of PG (9.4, 9.5,
11.x, 12.x) -- anything omitted is a version we haven't used.
I don't have an explanation for the cause, but I was able to set
logging to "debug5" and run an strace of the walreceiver PID when it
eventually happened. It appears as if the SIGTERM is coming from the
"postgres: startup" process.

The startup process intentionally SIGTERMs the walreceiver under
various circumstances, so I'm not sure that there's any surprise
here. Have you checked the postmaster log?

regards, tom lane
Yep, I included "debug5" output of the postmaster log in the initial post.
I assume it would be related to the following:
LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
since the walreceiver terminates just after this -- but I'm unclear
what precisely this means. Without digging into the code, I would
guess that it's unable to verify the checksum on the segment it just
received from the master; since there are multiple replicas here, that
would point to an issue on this client. However, it happens
everywhere -- we have ~16 replicas across 3 different clusters (on
different versions) and we see this uniformly across them all at
seemingly random times. Also, just to clarify, it only ever happens
on a single replica at a time.
On Thu, Apr 23, 2020 at 2:46 PM Justin King <kingpin867@gmail.com> wrote:
On Thu, Apr 23, 2020 at 12:47 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Justin King <kingpin867@gmail.com> writes:
We've seen unexpected termination of the WAL receiver process. This
stops streaming replication, but the replica stays available --
restarting the server resumes streaming replication where it left off.
We've seen this across nearly every recent version of PG (9.4, 9.5,
11.x, 12.x) -- anything omitted is a version we haven't used.
I don't have an explanation for the cause, but I was able to set
logging to "debug5" and run an strace of the walreceiver PID when it
eventually happened. It appears as if the SIGTERM is coming from the
"postgres: startup" process.

The startup process intentionally SIGTERMs the walreceiver under
various circumstances, so I'm not sure that there's any surprise
here. Have you checked the postmaster log?

regards, tom lane
Yep, I included "debug5" output of the postmaster log in the initial post.
Justin King <kingpin867@gmail.com> writes:
I assume it would be related to the following:
LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
since the walreceiver terminates just after this - but I'm unclear
what precisely this means.
What it indicates is corrupt data in the WAL stream. When reading WAL
after crash recovery, we assume that that indicates end of WAL. When
pulling live data from a source server, it suggests some actual problem
... but killing the walreceiver and trying to re-fetch the data might
be a reasonable response to that. I'm not sure offhand what the startup
code thinks it's doing in this context. It might either be attempting
to retry, or concluding that it's come to the end of WAL and it ought
to promote to being a live server. If you don't see the walreceiver
auto-restarting then I'd suspect that the latter is happening.
regards, tom lane
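For context, the checksum Tom refers to is a CRC-32C computed over each
WAL record and compared against the value stored in the record header; a
mismatch produces exactly this LOG message. A bit-by-bit illustration of
the algorithm (PostgreSQL's real implementation is a table-driven or
hardware-accelerated version, but the polynomial and result are the
same):

```python
def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC-32C (Castagnoli), the polynomial PostgreSQL uses for WAL."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for CRC-32C:
assert crc32c(b"123456789") == 0xE3069283

# A mismatch between the CRC stored in a record header and the CRC
# recomputed over the record body is what yields "incorrect resource
# manager data checksum in record at ...".
record = b"some WAL record payload"
stored = crc32c(record)
assert crc32c(record) == stored             # intact record passes
assert crc32c(record + b"\x00") != stored   # corrupted record fails
```

So the message means the record's bytes, as read by the standby, do not
match what the primary checksummed -- corruption somewhere between the
primary's WAL insert and the standby's read.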
On Thu, Apr 23, 2020 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Justin King <kingpin867@gmail.com> writes:
I assume it would be related to the following:
LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
since the walreceiver terminates just after this - but I'm unclear
what precisely this means.

What it indicates is corrupt data in the WAL stream. When reading WAL
after crash recovery, we assume that that indicates end of WAL. When
pulling live data from a source server, it suggests some actual problem
... but killing the walreceiver and trying to re-fetch the data might
be a reasonable response to that. I'm not sure offhand what the startup
code thinks it's doing in this context. It might either be attempting
to retry, or concluding that it's come to the end of WAL and it ought
to promote to being a live server. If you don't see the walreceiver
auto-restarting then I'd suspect that the latter is happening.

regards, tom lane
walreceiver is definitely not restarting -- replication stops cold
right at that segment. I'm a little unclear where to go from here --
is there additional info that would be useful?
Would there be anyone that might be able to help troubleshoot this
issue -- or at least give me something that would be helpful to look
for?
/messages/by-id/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1=tifg@mail.gmail.com
/messages/by-id/CANQ55Tsoa6=vk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw@mail.gmail.com
https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6
I'm not the first one to report something similar, and all of the
complaints have an unusual filesystem in common -- ZFS in most cases
(btrfs, in the bottom case). Is there anything more we can do here
to help narrow down this issue? I'm happy to help, but I honestly
wouldn't even know where to begin.
Thanks-
Justin King
flightaware.com
On Thu, Apr 23, 2020 at 4:40 PM Justin King <kingpin867@gmail.com> wrote:
On Thu, Apr 23, 2020 at 3:06 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Justin King <kingpin867@gmail.com> writes:
I assume it would be related to the following:
LOG: incorrect resource manager data checksum in record at 2D6/C259AB90
since the walreceiver terminates just after this - but I'm unclear
what precisely this means.

What it indicates is corrupt data in the WAL stream. When reading WAL
after crash recovery, we assume that that indicates end of WAL. When
pulling live data from a source server, it suggests some actual problem
... but killing the walreceiver and trying to re-fetch the data might
be a reasonable response to that. I'm not sure offhand what the startup
code thinks it's doing in this context. It might either be attempting
to retry, or concluding that it's come to the end of WAL and it ought
to promote to being a live server. If you don't see the walreceiver
auto-restarting then I'd suspect that the latter is happening.

regards, tom lane
walreceiver is definitely not restarting -- replication stops cold
right at that segment. I'm a little unclear where to go from here --
is there additional info that would be useful?
Hello.
At Mon, 4 May 2020 09:09:15 -0500, Justin King <kingpin867@gmail.com> wrote in
Would there be anyone that might be able to help troubleshoot this
issue -- or at least give me something that would be helpful to look
for?

/messages/by-id/CAGH8ccdWLLGC7qag5pDUFbh96LbyzN_toORh2eY32-2P1=tifg@mail.gmail.com
/messages/by-id/CANQ55Tsoa6=vk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw@mail.gmail.com
https://dba.stackexchange.com/questions/116569/postgresql-docker-incorrect-resource-manager-data-checksum-in-record-at-46f-6

I'm not the first one to report something similar, and all of the
complaints have an unusual filesystem in common -- ZFS in most cases
(btrfs, in the bottom case). Is there anything more we can do here
to help narrow down this issue? I'm happy to help, but I honestly
wouldn't even know where to begin.
The sendto() call at the end of your strace output is a "close
connection" request to the WAL sender, and normally it would be
followed by close() and kill(). If that really is the last strace
output, the sendto() is blocked because the send buffer is full.
My diagnosis of the situation is that your replication connection had
trouble and the TCP session broke in a way the WAL receiver could not
detect. As a result, feedback packets from the WAL receiver piled up
in the TCP send buffer, and finally the last sendto() blocked while
sending the close-connection message.
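That blocked-send failure mode is easy to demonstrate in isolation. A
small sketch (not PostgreSQL code; it uses a Unix socket pair for
brevity, but a TCP connection behaves the same once the window is
exhausted):

```python
import socket

# Connected pair; nothing ever reads from `peer`, mimicking a remote
# end that has silently gone away behind a firewall.
sock, peer = socket.socketpair()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
sock.setblocking(False)

sent = 0
try:
    while True:
        sent += sock.send(b"x" * 4096)
except BlockingIOError:
    # The send buffer is now full.  A *blocking* sendto() at this point
    # would hang indefinitely -- which is what the strace suggests
    # happened to the walreceiver's final close-connection message.
    pass

sock.close()
peer.close()
```

Until the buffer fills, every send() succeeds immediately, which is why
the walreceiver's periodic feedback messages can keep "succeeding" long
after the peer is unreachable.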
If it happens constantly, routers or firewalls between the primary and
standby may be discarding sessions inadvertently.
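If a session-dropping middlebox is suspected, enabling TCP keepalives
on the replication connection can make the standby notice a dead
session quickly. A hypothetical standby configuration (hostname and
user are placeholders; the keepalives_* settings are standard libpq
connection options):

```ini
# postgresql.conf on a PG 12 standby
primary_conninfo = 'host=primary.example.com user=replicator keepalives=1 keepalives_idle=30 keepalives_interval=10 keepalives_count=3'

# walreceiver gives up on a silent connection after this long
wal_receiver_timeout = '60s'
```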
I'm not sure how ZFS can be involved in this trouble, though.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center