Incorrect resource manager data checksum in record with zfs and compression

Started by John Bolliger, over 3 years ago · 3 messages · general
#1John Bolliger
johnbolliger@gmail.com

This is a follow-up to
/messages/by-id/CANQ55Tsoa6=vk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw@mail.gmail.com;
I am at the same company as the original poster.

Our architecture is similar, but all of the servers are now on ZFS, running
Postgres 13.8 on Ubuntu 18.04+, and still doing streaming replication, all
with ECC memory and 26-64 cores with 192 GB+ of RAM, on top of a ZPOOL made
of NVMe PCIe SSDs.

A101 (primary) -> A201 (replica) -> B101(primary) -> B201 (replica).

We are seeing this error occur about once per week (across all postgres
clusters/chains). It is the same pattern we have been seeing for a number
of years now.

Possibly relevant configuration options:

wal_init_zero=on

The last time this occurred I grabbed the good WAL file from the parent,
and the corrupted WAL file from the descendant. Comparing them showed no
differences until about 12 MB in, where the "good" WAL file continued and
the "bad" WAL file was zeros until the end of the file.
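For reference, the comparison above can be reproduced with a few lines of Python. This is a minimal sketch, not the tool I mentioned; the file names are hypothetical, and any two equally sized segment copies work:

```python
# Sketch: locate the first byte where two copies of a WAL segment diverge,
# and check whether the "bad" copy is all zeros from that point on.

def find_divergence(good_path, bad_path, chunk=1 << 20):
    """Return the byte offset of the first mismatch, or None if identical."""
    offset = 0
    with open(good_path, "rb") as g, open(bad_path, "rb") as b:
        while True:
            gc = g.read(chunk)
            bc = b.read(chunk)
            if not gc and not bc:
                return None  # files are identical
            if gc != bc:
                # scan this chunk byte by byte for the first mismatch
                for i, (x, y) in enumerate(zip(gc, bc)):
                    if x != y:
                        return offset + i
                return offset + min(len(gc), len(bc))  # one file is shorter
            offset += len(gc)

def tail_is_zero(path, offset):
    """True if every byte from `offset` to EOF is zero."""
    with open(path, "rb") as f:
        f.seek(offset)
        return all(byte == 0 for byte in f.read())
```

In the case described above, `find_divergence()` reported roughly 12 MB and `tail_is_zero()` on the descendant's copy returned True.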

When the "good" WAL file is copied from the parent to the descendant,
replication resumes and the descendant becomes a healthy replica once again.

After doing some investigation I partially suspected that
https://github.com/postgres/postgres/commit/dd9b3fced83edb51a3e2f44d3d4476a45d0f5a24
could possibly explain this behavior and fix the issue. But we saw this
occur on 13.8 between two nodes.

I am almost done with a tool that will tail the postgresql journal, walk
up the chain to the primary, get the WAL file name for the LSN in the
error message, and then sync the good WAL file from the parent over the
bad one. But I hope I can track down the actual bug here, rather than
relying upon a process to fix this after the fact.
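The LSN-to-WAL-file-name step the tool needs can be sketched as below, mirroring what pg_walfile_name() returns under the default 16 MB wal_segment_size. This is an assumption-laden sketch, not our actual tool: the function name is hypothetical, and the timeline ID has to come from the log context or pg_controldata rather than the error message itself.

```python
# Sketch: map an LSN as printed in PostgreSQL logs (e.g. "A/12B45678")
# to its WAL segment file name, assuming the default 16 MB segment size.
WAL_SEG_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def walfile_name(timeline: int, lsn: str) -> str:
    hi, lo = lsn.split("/")
    pos = (int(hi, 16) << 32) | int(lo, 16)   # byte position in the WAL stream
    segno = pos // WAL_SEG_SIZE               # global segment number
    segs_per_xlogid = 0x100000000 // WAL_SEG_SIZE  # 256 for 16 MB segments
    # File name layout: TTTTTTTT XXXXXXXX SSSSSSSS (all zero-padded hex)
    return "%08X%08X%08X" % (timeline,
                             segno // segs_per_xlogid,
                             segno % segs_per_xlogid)
```

For example, `walfile_name(1, "0/3000028")` gives `000000010000000000000003`, the same answer as `SELECT pg_walfile_name('0/3000028')` on a timeline-1 cluster with default settings.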

--
John Bolliger

#2Michael Paquier
michael@paquier.xyz
In reply to: John Bolliger (#1)
Re: Incorrect resource manager data checksum in record with zfs and compression

On Mon, Oct 03, 2022 at 12:41:23PM -0700, John Bolliger wrote:

Our architecture is similar but all of the servers are now on ZFS now and
Postgres 13.8 with Ubuntu 18.04+ and still doing streaming replication, all
with ECC memory and 26-64 cores with 192gb ram+ on top of a ZPOOL made out
of NVMe PCIe SSDs.

A101 (primary) -> A201 (replica) -> B101(primary) -> B201 (replica).

replica -> primary does not really make sense for physical
replication. Or do you mean that B101 is itself a standby doing
streaming from A201?
--
Michael

#3John Bolliger
johnbolliger@gmail.com
In reply to: Michael Paquier (#2)
Re: Incorrect resource manager data checksum in record with zfs and compression

I mistyped; yes, B101 is a replica node, not a primary.

The correct architecture is
A101(primary) -> A201(replica) -> B101(replica) -> B201(replica)

On Mon, Oct 3, 2022 at 5:55 PM Michael Paquier <michael@paquier.xyz> wrote:


On Mon, Oct 03, 2022 at 12:41:23PM -0700, John Bolliger wrote:

Our architecture is similar but all of the servers are now on ZFS now and
Postgres 13.8 with Ubuntu 18.04+ and still doing streaming replication, all
with ECC memory and 26-64 cores with 192gb ram+ on top of a ZPOOL made out
of NVMe PCIe SSDs.

A101 (primary) -> A201 (replica) -> B101(primary) -> B201 (replica).

replica -> primary does not really make sense for physical
replication. Or do you mean that B101 is itself a standby doing
streaming from A201?
--
Michael