bad wal on replica / incorrect resource manager data checksum in record / zfs
Hello Postgres Hackers -
We are having a recurring issue on 2 of our replicas where replication
stops due to this message:
"incorrect resource manager data checksum in record at ..."
This has been occurring on average once every 1 to 2 weeks during large
data imports (100s of GBs being written),
on one of two replicas.
Fixing the issue has been relatively straightforward: shut down the replica,
remove the bad WAL file, restart the replica, and
the good WAL file is retrieved from the master.
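For anyone following along, the procedure looks roughly like this (a minimal
sketch; the data directory and segment name are hypothetical):

    # data directory and bad segment name below are hypothetical
    PGDATA=/var/lib/pgsql/10/data
    BAD_SEG=000000010000000A0000002B
    pg_ctl -D "$PGDATA" stop -m fast
    mkdir -p /tmp/bad_wal
    mv "$PGDATA/pg_wal/$BAD_SEG" /tmp/bad_wal/   # keep a copy for later diffing
    pg_ctl -D "$PGDATA" start
    # on restart, streaming replication re-fetches the segment from the master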
We are doing streaming replication using replication slots.
However twice now, the master had already removed the WAL file, so the file
had to be retrieved from the WAL archive.
The WAL log directories on the master and the replicas are on ZFS file
systems.
All servers are running RHEL 7.7 (Maipo)
PostgreSQL 10.11
ZFS v0.7.13-1
The issue seems similar to
/messages/by-id/CANQ55Tsoa6=vk2YkeVUN7qO-2YdqJf_AMVQxqsVTYJm0qqQQuw@mail.gmail.com
and to https://github.com/timescale/timescaledb/issues/1443
One quirk in our ZFS setup is that ZFS is not handling our RAID array, so ZFS
sees our array as a single device.
Right before the issue started we did some upgrades and altered some
postgres configs and ZFS settings.
We have been slowly rolling back changes but so far the issue continues.
Some interesting data points while debugging:
We had lowered the ZFS recordsize from 128K to 32K and for that week the
issue started happening every other day.
Using xxd and diff we compared "good" and "bad" wal files and the
differences were not random bad bytes.
The bad file had either a block of zeros or other data that was not in the
good file at that position. Occasionally the bad data has contained
legible strings not in the good file at that position. At least one of
those exact strings has existed elsewhere in the files, though I am not
sure if that is the case for all of them.
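(For reference, the comparison was along these lines; paths and segment name
are hypothetical:)

    # hex-dump the good and bad copies and diff them; paths hypothetical
    xxd /path/to/master/pg_wal/000000010000000A0000002B > /tmp/good.hex
    xxd /tmp/bad_wal/000000010000000A0000002B > /tmp/bad.hex
    diff /tmp/good.hex /tmp/bad.hex | less
    # xxd prints offsets, so diff shows where in the file the copies diverge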
This made me think that maybe there was an issue w/ WAL file recycling and
ZFS under heavy load, so we tried lowering min_wal_size in order to
"discourage" WAL file recycling, but my understanding is that a low value
discourages recycling yet it will still happen (unless wal_recycle is
disabled, which is only possible on PostgreSQL 12).
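(Roughly what we tried; the value shown is hypothetical:)

    # lower min_wal_size to discourage recycling; value hypothetical
    psql -c "ALTER SYSTEM SET min_wal_size = '512MB'"
    psql -c "SELECT pg_reload_conf()"
    # on PostgreSQL 12+ recycling can be disabled outright:
    #   psql -c "ALTER SYSTEM SET wal_recycle = off"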
There is a third replica where this bug has not (yet?) surfaced.
This leads me to guess the bad data does not originate on the master.
This replica is older than the other replicas, with slower CPUs, less RAM,
and a WAL disk array of spinning disks.
The OS, version of Postgres, and version of ZFS are the same as the other
replicas.
This replica is not using a replication slot.
This replica does not serve users so load/contention is much lower than the
others.
The other replicas often have 100% utilization of the disk array that
houses the (non-wal) data.
Any insight into the source of this bug or how to address it?
Since the master has a good copy of the WAL file, can the replica
re-request the file from the master? Or from the archive?
When using replication slots, what circumstances would cause the master to
not save the WAL file?
(I can't remember if it always had the next wal file or the one after that)
Thanks in advance,
Alex Malek
On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent@gmail.com> wrote:
<snip>
This made me think that maybe there was an issue w/ WAL file recycling and
ZFS under heavy load, so we tried lowering min_wal_size in order to
"discourage" WAL file recycling, but my understanding is that a low value
discourages recycling yet it will still happen (unless wal_recycle is
disabled, which is only possible on PostgreSQL 12).
We do print a message "recycled write-ahead log file .." in DEBUG2
mode. You either want to run the server with DEBUG2 or maybe change
the code to make it LOG and see if that is printed. If you do that,
you can verify if the corrupted WAL is the same as a recycled one.
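If you go the configuration route, something like this (revert once done,
since DEBUG2 is verbose):

    # on the system(s) where you suspect recycling
    psql -c "ALTER SYSTEM SET log_min_messages = 'debug2'"
    psql -c "SELECT pg_reload_conf()"
    # then grep the server log for "recycled write-ahead log file"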
<snip>
Any insight into the source of this bug or how to address it?
Since the master has a good copy of the WAL file, can the replica re-request the file from the master? Or from the archive?
I think we do check in the archive if we get the error during
streaming, but the archive might also have the same data, in which case
this problem still happens. Have you checked whether the archived WAL file
is different from the bad WAL? See the relevant bits of code in
WaitForWALToBecomeAvailable, especially the code near the comment below:
"Failure while streaming. Most likely, we got here because streaming
replication was terminated, or promotion was triggered. But we also
get here if we find an invalid record in the WAL streamed from master,
in which case something is seriously wrong. There's little chance that
the problem will just go away, but PANIC is not good for availability
either, especially in hot standby mode. So, we treat that the same as
disconnection, and retry from archive/pg_wal again. The WAL in the
archive should be identical to what was streamed, so it's unlikely
that it helps, but one can hope..."
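One quick way to check whether the archived copy matches the bad one
(paths and segment name hypothetical):

    # byte-for-byte comparison of the two copies
    cmp /wal_archive/000000010000000A0000002B \
        /tmp/bad_wal/000000010000000A0000002B \
        && echo "archive matches bad WAL" || echo "archive differs"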
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent@gmail.com> wrote:
<snip>
This made me think that maybe there was an issue w/ WAL file recycling and
ZFS under heavy load, so we tried lowering min_wal_size in order to
"discourage" WAL file recycling, but my understanding is that a low value
discourages recycling yet it will still happen (unless wal_recycle is
disabled, which is only possible on PostgreSQL 12).
We do print a message "recycled write-ahead log file .." in DEBUG2
mode. You either want to run the server with DEBUG2 or maybe change
the code to make it LOG and see if that is printed. If you do that,
you can verify if the corrupted WAL is the same as a recycled one.
Are you suggesting having the master, the replicas or all in debug mode?
How much extra logging would this generate?
A replica typically consumes over 1 TB of WAL files before a bad wal file
is encountered.
<snip>
Any insight into the source of this bug or how to address it?
Since the master has a good copy of the WAL file, can the replica
re-request the file from the master? Or from the archive?
I think we do check in the archive if we get the error during
streaming, but the archive might also have the same data, in which case
this problem still happens. Have you checked whether the archived WAL
file is different from the bad WAL? See the
Typically the master, the archive and the other replicas all have a good
copy of the WAL file.
relevant bits of code in
WaitForWALToBecomeAvailable, especially the code near the comment below:
"Failure while streaming. Most likely, we got here because streaming
replication was terminated, or promotion was triggered. But we also
get here if we find an invalid record in the WAL streamed from master,
in which case something is seriously wrong. There's little chance that
the problem will just go away, but PANIC is not good for availability
either, especially in hot standby mode. So, we treat that the same as
disconnection, and retry from archive/pg_wal again. The WAL in the
archive should be identical to what was streamed, so it's unlikely
that it helps, but one can hope..."
Thank you for this comment!
This made me realize that on the replicas I had mentioned, we had removed
the restore_command.
The replica we thought was not having the issue was actually also
getting/producing bad WAL files, but was eventually recovering by getting a
good WAL file from the archive b/c it had the restore_command defined.
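We have since put it back on all replicas; for reference, on 10 it lives in
recovery.conf (the archive location below is hypothetical):

    # recovery.conf on the replica (PostgreSQL 10); archive path hypothetical
    #   restore_command = 'cp /wal_archive/%f %p'
    # or, when the archive lives on another host:
    #   restore_command = 'scp archivehost:/wal_archive/%f %p'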
On Thu, Feb 20, 2020 at 7:40 PM Alex Malek <amalek@gmail.com> wrote:
On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent@gmail.com> wrote:
<snip>
Are you suggesting having the master, the replicas or all in debug mode?
The system(s) where you expect that WAL recycling would have
created the problem.
How much extra logging would this generate?
To some extent, it depends on your workload. It will certainly
generate much more than when you have not enabled the debug level.
But what other option do you have to identify the root cause, or at least
find out whether your suspicion is right or not? As mentioned
earlier, if you have the flexibility of changing code to find out the
reason, then you can change the code (at the place I mentioned yesterday)
to make the level LOG, in which case you can set
log_min_messages to LOG and it will generate far fewer logs on the
server.
A replica typically consumes over 1 TB of WAL files before a bad wal file is encountered.
<snip>
Thank you for this comment!
This made me realize that on the replicas I had mentioned, we had removed the restore_command.
The replica we thought was not having the issue was actually also getting/producing bad WAL files, but was eventually recovering by getting a good WAL file from the archive b/c it had the restore_command defined.
Good to know that there is some way to recover from the situation.
But I think it is better to find the root cause of what led to the bad
WAL files so that you can fix it if possible.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 20, 2020 at 12:01 PM Alex Malek <magicagent@gmail.com> wrote:
On Thu, Feb 20, 2020, 6:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Feb 20, 2020 at 3:06 AM Alex Malek <magicagent@gmail.com> wrote:
<snip>
Thank you for this comment!
This made me realize that on the replicas I had mentioned, we had removed
the restore_command.
The replica we thought was not having the issue was actually also
getting/producing bad WAL files, but was eventually recovering by getting a
good WAL file from the archive b/c it had the restore_command defined.
So ignoring what is causing the underlying issue, what would be involved in
adding the ability of the replica to try to re-request the WAL file first
from the master? It seems that would make replication more resilient and
address similar issues such as
/messages/by-id/CAPv0rXGZtFr2u5o3g70OMoH+WQYhmwq1aGsmL+PQHMjFf71Dkw@mail.gmail.com
that do not involve ZFS at all.
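In the meantime the manual equivalent is something like this (host, paths,
and segment name hypothetical):

    # pull the master's good copy onto the stopped replica
    scp master:/var/lib/pgsql/10/data/pg_wal/000000010000000A0000002B \
        /var/lib/pgsql/10/data/pg_wal/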
Thanks.
Alex
On Wed, Feb 19, 2020 at 4:35 PM Alex Malek <magicagent@gmail.com> wrote:
<snip>
An update in case someone else encounters the same issue.
About 5 weeks ago, on the master database server, we turned off ZFS
compression for the volume where the WAL log resides.
The error has not occurred on any replica since.
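For reference, the change was just this (the dataset name is hypothetical):

    # on the master; note this only affects newly written blocks
    zfs set compression=off tank/pg_wal
    zfs get compression tank/pg_wal   # verify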
Best,
Alex
Hi,
On 2020-02-19 16:35:53 -0500, Alex Malek wrote:
We are having a recurring issue on 2 of our replicas where replication
stops due to this message:
"incorrect resource manager data checksum in record at ..."
Could you show the *exact* log output please? Because this could
temporarily occur without signalling anything bad, if e.g. the
replication connection goes down.
<snip>
This made me think that maybe there was an issue w/ WAL file recycling and
ZFS under heavy load, so we tried lowering min_wal_size in order to
"discourage" WAL file recycling, but my understanding is that a low value
discourages recycling yet it will still happen (unless wal_recycle is
disabled, which is only possible on PostgreSQL 12).
This sounds a lot more like a broken filesystem than anything on the PG
level.
When using replication slots, what circumstances would cause the master to
not save the WAL file?
What do you mean by "save the WAL file"?
Greetings,
Andres Freund
On Thu, Apr 2, 2020 at 2:10 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2020-02-19 16:35:53 -0500, Alex Malek wrote:
We are having a recurring issue on 2 of our replicas where replication
stops due to this message:
"incorrect resource manager data checksum in record at ..."
Could you show the *exact* log output please? Because this could
temporarily occur without signalling anything bad, if e.g. the
replication connection goes down.
Feb 23 00:02:02 wrds-pgdata10-2-w postgres[68329]: [12491-1] 5e4aac44.10ae9
(@) LOG: incorrect resource manager data checksum in record at
39002/57AC0338
When it occurred, replication stopped. The only way to resume replication
was to stop the server and remove the bad WAL file.
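In case it is useful to others, the record it choked on can be inspected
with pg_waldump (the data directory path is hypothetical):

    # dump a few records starting at the failing LSN; pg_waldump stops with
    # an error at the bad record, which itself is informative
    pg_waldump -p /var/lib/pgsql/10/data/pg_wal -s 39002/57AC0338 -n 10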
<snip>
This sounds a lot more like a broken filesystem than anything on the PG
level.
Probably. As in my recent update, turning off ZFS compression on the
master seems to have fixed the issue.
However I will note that the WAL file stored on the master was always fine
upon inspection.
When using replication slots, what circumstances would cause the master
to not save the WAL file?
What do you mean by "save the WAL file"?
Typically, when using replication slots, when replication stops the master
will save the next needed WAL file.
However, once or twice when this error occurred, the master recycled/removed
the WAL file needed.
I suspect that perhaps b/c the replica had started to read the WAL file, it
sent some signal to the master that the WAL file was already consumed. I am
guessing, not knowing exactly what is happening, and w/ the caveat that this
situation was rare and not the norm. It is also possibly caused by a
different error.
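(Next time it happens we can check what the slot reports on the master,
e.g.:)

    # on the master: what the slot believes the replica still needs
    psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots"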
Thanks.
Alex