Corruption during WAL replay

Started by Teja Mupparti over 5 years ago, 75 messages
#1 Teja Mupparti
tejeswarm@hotmail.com
1 attachment(s)

Attachments:

bug-fix-patch (application/octet-stream)
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Teja Mupparti (#1)
Re: Corruption during WAL replay

Thanks for working on this.

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

This is my *first* attempt to submit a Postgres patch, please let me know if I missed any process or format of the patch

Welcome! The format looks fine to me. It would be better if it had a
commit message that explains what the patch does. (in the format that
git format-patch emits.)

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is: in the current code, the engine drops the buffers and then truncates the file, but a crash before the truncate and after the buffer-drop causes the corruption. The patch reverses the order, i.e. it truncates the file first and drops the buffers later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

However, note that md truncates a "file" in a non-atomic way. mdtruncate
truncates multiple segment files, from the last segment toward the
beginning. If mdtruncate successfully truncated the first several
segments and then failed, retaining all the buffers would trigger an
assertion failure in mdwrite during buffer flush.
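
To illustrate why that matters, a purely conceptual sketch of the segment-by-segment behavior (the helper below is hypothetical, not md.c code):

    /*
     * Segments are shortened one at a time, from the last one back toward
     * the truncation point, so a failure partway through leaves a mix of
     * truncated and untruncated segments behind.
     */
    for (segno = last_segno; segno >= first_affected_segno; segno--)
    {
        if (!truncate_or_unlink_segment(rnode, forknum, segno)) /* hypothetical */
            break;              /* some segments already shortened, others not */
    }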

Some typos found:

+ * a backround task might flush them to the disk right after we

s/backround/background/

+ * saved list of buffers that were marked as BM_IO_IN_PRGRESS just
s/BM_IO_IN_PRGRESS/BM_IO_IN_PROGRESS/

+ * as BM_IO_IN_PROGRES. Though the buffers are marked for IO, they
s/BM_IO_IN_PROGRES/BM_IO_IN_PROGRESS/

+ * being dicarded).
s/dicarded/discarded/

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#2)
Re: Corruption during WAL replay

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is: in the current code, the engine drops the buffers and then truncates the file, but a crash before the truncate and after the buffer-drop causes the corruption. The patch reverses the order, i.e. it truncates the file first and drops the buffers later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.
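
A pre-check of that sort could be as simple as the following sketch (illustrative only, not proposed code; path is assumed to be the relation segment's file name):

    /* access() is from <unistd.h>; refuse to start if we cannot write the file. */
    if (access(path, W_OK) != 0)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not access file \"%s\": %m", path)));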

Greetings,

Andres Freund

#4 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

At Mon, 30 Mar 2020 16:31:59 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is: in the current code, the engine drops the buffers and then truncates the file, but a crash before the truncate and after the buffer-drop causes the corruption. The patch reverses the order, i.e. it truncates the file first and drops the buffers later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?

# I would be wrong with far low odds..

"doesn't" is overstated. Is there a case where the buffer is already
flushed by checkpoint? (If that is the case, dropping clean buffers at
marking truncate would work?)

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).

Thank you for the opinion. I agree to that.

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

Hmm. If we allow a PANIC on truncation failure, why do we need to go
through the complicated steps? Wouldn't it be enough to enclose the
sequence (WAL insert - drop buffers - truncate) in a critical section?
I believed that this project aimed to fix the database breakage on
truncation failure by allowing rollback on truncation failure?
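
To make that shape concrete, a minimal sketch loosely modeled on RelationTruncate()/smgrtruncate() (simplified and abbreviated; this is not the actual patch, and the WAL-record setup is elided):

    START_CRIT_SECTION();

    /* WAL-log the truncation first and force it to disk. */
    lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
    XLogFlush(lsn);

    /*
     * smgrtruncate() drops the affected buffers and then shrinks the
     * physical file(s); once the WAL record exists, any failure in here
     * must escalate to PANIC rather than be reported as a soft ERROR.
     */
    smgrtruncate(rel->rd_smgr, forks, nforks, blocks);

    END_CRIT_SECTION();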

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.

Yeah, I like that direction.

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I think the question above is the core part of the problem.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5 Teja Mupparti
tejeswarm@hotmail.com
In reply to: Andres Freund (#3)
1 attachment(s)
Re: Corruption during WAL replay

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

Regards
Teja

________________________________
From: Andres Freund <andres@anarazel.de>
Sent: Monday, March 30, 2020 4:31 PM
To: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Cc: tejeswarm@hotmail.com <tejeswarm@hotmail.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; hexexpert@comcast.net <hexexpert@comcast.net>
Subject: Re: Corruption during WAL replay

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is: in the current code, the engine drops the buffers and then truncates the file, but a crash before the truncate and after the buffer-drop causes the corruption. The patch reverses the order, i.e. it truncates the file first and drops the buffers later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund

Attachments:

0001-Wal-replay-corruption.patch (application/octet-stream)
#6 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

On 2020-Mar-30, Andres Freund wrote:

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on. I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

I didn't review Teja's patch carefully, but the idea of panicking on
failure (causing WAL replay) seems better than the current behavior.
I'd rather have the server wait until the storage is really back.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7 Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#6)
Re: Corruption during WAL replay

Hi,

On 2020-04-10 20:49:05 -0400, Alvaro Herrera wrote:

On 2020-Mar-30, Andres Freund wrote:

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on. I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

Yea. In that case we're in a really bad state. Because we truncate after
throwing away the old buffer contents (even if dirty), we'll later read
page contents "from the past". Which won't end well...

Greetings,

Andres Freund

#8 Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Teja Mupparti (#5)
Re: Corruption during WAL replay

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

In the above idea, by remembering at RelationTruncate() all the
buffers that hold data of the to-be-truncated table, we reduce the
time spent checking buffers at commit time. Since we hold an
AccessExclusiveLock, the number of buffers holding the to-be-truncated
table's data never increases. A downside would be that, since we can
truncate multiple relations, we need to remember all the buffers of
each truncated relation, which is up
to (sizeof(int) * NBuffers) in total.
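
A rough sketch of the bookkeeping this would require (every name below is hypothetical, purely to illustrate the shape of the idea):

    /* Hypothetical structure; nothing like this exists in PostgreSQL today. */
    typedef struct PendingTruncate
    {
        RelFileNode rnode;          /* relation being truncated */
        BlockNumber newsize;        /* first block to be removed */
        int         nbuffers;       /* buffers remembered at WAL-log time */
        int        *buffer_ids;     /* up to sizeof(int) * NBuffers overall */
    } PendingTruncate;

    /*
     * At RelationTruncate(): WAL-log, then append a PendingTruncate entry.
     * At CommitTransaction(): re-check the remembered buffers, mark the ones
     * that still hold the relation's data as in-progress, truncate the
     * physical files, and invalidate those buffers; on error or abort,
     * revert the in-progress flags instead.
     */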

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9 Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#8)
Re: Corruption during WAL replay

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

What would this help with? If we still need the more complicated
truncation sequence?

Greetings,

Andres Freund

#10 Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Andres Freund (#9)
Re: Corruption during WAL replay

On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me currently what we can do for this issue would be to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation because it cannot solve the problem of inconsistency on the
standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we envelop the truncation operation by a critical
section because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping buffers,
proposed as (a) in [1], also seems not enough.

Regards,

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de
Doing sync before truncation

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#10)
Re: Corruption during WAL replay

At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

It was introduced in 2008 by commit 3396000684, for 8.4. So it can be
regarded as an oversight made when log-shipping was introduced.

The reason other operations like INSERTs (which extend the underlying
file) are "safe" after an extension failure is that the following
operations are performed in shared buffers as if the new page existed,
and then the file extension is retried. So if we continue working after
a truncation failure, we need to make shared buffers look as if the
truncated pages were gone. But we don't have room for another flag in
the buffer header. For example, BM_DIRTY && !BM_VALID might be usable
as the state meaning that the page should have been truncated but has
not been yet, but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far more
hopeless than that of an extension failure in most cases, and I doubt
that it's good to
introduce such a complex feature only to overcome such a hopeless
situation.

In short, I think we should PANIC in that case.

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me currently what we can do for this issue would be to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation because it cannot solve the problem of inconsistency on the
standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we envelop the truncation operation by a critical
section because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping buffers,
proposed as (a) in [1], also seems not enough.

Agreed. Since neither WAL-logging->not-performing nor
performing->WAL-logging is acceptable, there's no other way than
working as if the truncation succeeded (and trying again) even if it
actually failed. But that would be too complex.

Just making it a critical section seems the right thing here.

[1] /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de
Doing sync before truncation

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12 Teja Mupparti
tejeswarm@hotmail.com
In reply to: Kyotaro Horiguchi (#11)
Re: Corruption during WAL replay

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate, but I just want to emphasize the need for reversing the order of dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert)

Doing recovery, the truncate is not even reached; a WAL replay of the truncation will happen in the future, but the recovery fails (repeatedly) even before reaching that point.

Best regards,
Teja

________________________________
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Sent: Monday, April 13, 2020 7:35 PM
To: masahiko.sawada@2ndquadrant.com <masahiko.sawada@2ndquadrant.com>
Cc: andres@anarazel.de <andres@anarazel.de>; tejeswarm@hotmail.com <tejeswarm@hotmail.com>; pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; hexexpert@comcast.net <hexexpert@comcast.net>
Subject: Re: Corruption during WAL replay

At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* ok to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

It was introduced in 2008 by commit 3396000684, for 8.4. So it can be
regarded as an oversight made when log-shipping was introduced.

The reason other operations like INSERTs (which extend the underlying
file) are "safe" after an extension failure is that the following
operations are performed in shared buffers as if the new page existed,
and then the file extension is retried. So if we continue working after
a truncation failure, we need to make shared buffers look as if the
truncated pages were gone. But we don't have room for another flag in
the buffer header. For example, BM_DIRTY && !BM_VALID might be usable
as the state meaning that the page should have been truncated but has
not been yet, but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far more
hopeless than that of an extension failure in most cases, and I doubt
that it's good to
introduce such a complex feature only to overcome such a hopeless
situation.

In short, I think we should PANIC in that case.

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me currently what we can do for this issue would be to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation because it cannot solve the problem of inconsistency on the
standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we envelop the truncation operation by a critical
section because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping buffers,
proposed as (a) in [1], also seems not enough.

Agreed. Since neither WAL-logging->not-performing nor
performing->WAL-logging is acceptable, there's no other way than
working as if the truncation succeeded (and trying again) even if it
actually failed. But that would be too complex.

Just making it a critical section seems the right thing here.

[1] /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de
Doing sync before truncation

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13 Masahiko Sawada
masahiko.sawada@2ndquadrant.com
In reply to: Teja Mupparti (#12)
Re: Corruption during WAL replay

On Wed, 15 Apr 2020 at 04:04, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate, but I just want to emphasize the need for reversing the order of dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert)

I agree that when replaying the deletion of (3) the page LP 1 is
empty, but does that replay really fail with PANIC? I guess that we
record that page into invalid_page_tab but don't raise a PANIC in this
case.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14 Daniel Wood
hexexpert@comcast.net
In reply to: Teja Mupparti (#12)
Wait profiling

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

I feel that it would be useful to have the following two things. One PG enhancement and one standard extension.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

62.6% BufFileRead
50.0% CPU
9.3% LWLock

It uses a PG timer to do this.

2) An extension-based function like select pg_wait_profile(pid, nSeconds, timerFrequency) to return the same thing for an already-running query. Useful if you want to examine some already long-running query that is taking too long.

Neither of these would go through the heavyweight pg_stat_activity path; they would directly poll the wait event in PROC. I've already coded the EXPLAIN option.

Furthermore, can't we just remove the following "IF" test from pgstat_report_wait_{start,end}?
if (!pgstat_track_activities || !proc)
return;
Just do the assignment of wait_event_info always. We should use a dummy PGPROC assigned to MyProc until we assign the one in the procarray in shared memory. That way we don't need the "!proc" test.
About the only thing I'd want to verify is whether wait_event_info is on the same cache lines as anything else having to do with snapshots.
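
For reference, a sketch of the simplification being suggested (this is not committed code; it assumes MyProc always points at a valid PGPROC, e.g. a dummy one, before the real procarray entry is assigned):

    static inline void
    pgstat_report_wait_start(uint32 wait_event_info)
    {
        /*
         * No pgstat_track_activities / NULL-proc test: just store the value.
         * The field is a single four-byte word, so the store is atomic.
         */
        MyProc->wait_event_info = wait_event_info;
    }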

If I recall correctly, the blank lines above that I've used to make this more readable will disappear. :-(

- Dan Wood

#15 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Daniel Wood (#14)
Re: Wait profiling

On 2020-Jul-10, Daniel Wood wrote:

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

Yes, we have pg_stat_activity.wait_events which implement pretty much
what Ildus describes there.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

There's an out-of-core extension, search for pg_wait_sampling. I
haven't tested it yet ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16 Julien Rouhaud
rjuju123@gmail.com
In reply to: Alvaro Herrera (#15)
Re: Wait profiling

On Fri, Jul 10, 2020 at 10:37 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Jul-10, Daniel Wood wrote:

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

Yes, we have pg_stat_activity.wait_events which implement pretty much
what Ildus describes there.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

There's an out-of-core extension, search for pg_wait_sampling. I
haven't tested it yet ...

I use it, and I know multiple people that are also using it (or about
to, it's currently being packaged) in production. It's working quite
well and is compatible with pg_stat_statements' queryid. You can see
some examples of dashboards that can be built on top of this extension
at https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_wait_sampling.html.

#17 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Teja Mupparti (#12)
Re: Corruption during WAL replay

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.

 Repro details (when full page write = off)

         1) Page on disk has empty LP 1, Insert into page LP 1
         2) checkpoint START (Recovery REDO eventually starts here)
         3) Delete all rows on the page (page is empty now)
         4) Autovacuum kicks in and truncates the pages
                 DropRelFileNodeBuffers - Dirty page NOT written, LP 1
on disk still empty
         5) Checkpoint completes
         6) Crash
         7) smgrtruncate - Not reached (this is where we do the
physical truncate)

 Now the crash-recovery starts

         Delete-log-replay (above step-3) reads page with empty LP 1
and the delete fails with PANIC (old page on disk with no insert)

Doing recovery, the truncate is not even reached; a WAL replay of the
truncation will happen in the future but the recovery fails (repeatedly)
even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
truncating the file would open a different issue:

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and starts truncating
5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated
away, expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before
truncating the file to fix that, but I'm wary of that approach. Firstly,
it requires scanning the buffers that are dropped twice, which can take
a long time. I remember that people have already complained that
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
once. More importantly, abusing the BM_IO_IN_PROGRESS flag for this seems
bad. For starters, because you're not holding buffer's I/O lock, I
believe the checkpointer would busy-wait on the buffers until the
truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a
checkpoint, right after the redo point has been established. See
comments around the GetVirtualXIDsDelayingChkpt() function call in
CreateCheckPoint(). We could have a similar mechanism to wait out the
truncations before *completing* a checkpoint.
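
For reference, the existing wait-out loop near the start of CreateCheckPoint() looks roughly like this; the suggestion is to run a similar loop again before the checkpoint is allowed to complete:

    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
    if (nvxids > 0)
    {
        do
        {
            pg_usleep(10000L);  /* wait for the delaying backends */
        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
    }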

- Heikki

#18 Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#17)
Re: Corruption during WAL replay

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.

 Repro details (when full page write = off)

         1) Page on disk has empty LP 1, Insert into page LP 1
         2) checkpoint START (Recovery REDO eventually starts here)
         3) Delete all rows on the page (page is empty now)
         4) Autovacuum kicks in and truncates the pages
                 DropRelFileNodeBuffers - Dirty page NOT written, LP 1
on disk still empty
         5) Checkpoint completes
         6) Crash
         7) smgrtruncate - Not reached (this is where we do the
physical truncate)

 Now the crash-recovery starts

         Delete-log-replay (above step-3) reads page with empty LP 1
and the delete fails with PANIC (old page on disk with no insert)

Doing recovery, the truncate is not even reached; a WAL replay of the
truncation will happen in the future but the recovery fails (repeatedly)
even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
truncating the file would open a different issue:

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and starts truncating
5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated away,
expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before
truncating the file to fix that, but I'm wary of that approach. Firstly, it
requires scanning the buffers that are dropped twice, which can take a long
time.

I was thinking that we'd keep track of all the buffers marked as "in
progress" that way, avoiding the second scan.

It's also worth keeping in mind that this code is really only relevant
for partial truncations, which don't happen at the same frequency as
transactional truncations.

I remember that people have already complained that
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
once.

But that's when dropping many relations, normally. E.g. at the end of a
regression test.

More importantly, abusing the BM_IO_IN_PROGRESS flag for this seems
bad. For starters, because you're not holding buffer's I/O lock, I
believe the checkpointer would busy-wait on the buffers until the
truncation has completed. See StartBufferIO() and AbortBufferIO().

I think we should apply Robert's patch that makes io locks into
condition variables. Then we can fairly easily have many many buffers io
locked. Obviously there's some issues with doing so in the back
branches :(

I'm working on an AIO branch, and that also requires to be able to mark
multiple buffers as in-progress, FWIW.

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a checkpoint,
right after the redo point has been established. See comments around the
GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
have a similar mechanism to wait out the truncations before *completing* a
checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Greetings,

Andres Freund

#19 Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Andres Freund (#18)
Re: Corruption during WAL replay

Status update for a commitfest entry.

I see quite a few unanswered questions in the thread since the last patch version was sent. So, I move it to "Waiting on Author".

The new status of this patch is: Waiting on Author

#20 Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#18)
Re: Corruption during WAL replay

On Tue, Aug 18, 2020 at 3:22 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

+1

If we had a critical section in RelationTruncate(), crash recovery
would continue failing until the situation of the underlying file is
recovered if a PANIC happens. The current comment in
RelationTruncate() says it’s worse than the disease. But considering
physical replication, as Andres mentioned, a failure to truncate the
file after logging WAL is no longer a harmless failure. Also, the
critical section would be necessary even if we reversed the order of
truncation and dropping buffers and resolved the issue. So I agree to
proceed with the patch that adds a critical section independent of
fixing other related things discussed in this thread. If Teja seems
not to work on this I’ll write the patch.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#21 Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Masahiko Sawada (#20)
Re: Corruption during WAL replay

On 06.11.2020 14:40, Masahiko Sawada wrote:

So I agree to
proceed with the patch that adds a critical section independent of
fixing other related things discussed in this thread. If Teja seems
not to work on this I’ll write the patch.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

Status update for a commitfest entry.

The commitfest is closed now. As this entry is a bug fix, I am moving it
to the next CF.
Are you planning to continue working on it?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#22 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#18)
1 attachment(s)
Re: Corruption during WAL replay

At Mon, 17 Aug 2020 11:22:15 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

...

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a checkpoint,
right after the redo point has been established. See comments around the
GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
have a similar mechanism to wait out the truncations before *completing* a
checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Seems reasonable. The attached does that. It actually works for the
initial case.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

delay_chkpt_cmpl_after_trunc_sucess.patch (text/x-patch; charset=us-ascii)
#23 Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Kyotaro Horiguchi (#22)
Re: Corruption during WAL replay

On Wed, Jan 6, 2021 at 1:33 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:

At Mon, 17 Aug 2020 11:22:15 -0700, Andres Freund <andres@anarazel.de>
wrote in

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

...

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a
checkpoint, right after the redo point has been established. See
comments around the GetVirtualXIDsDelayingChkpt() function call in
CreateCheckPoint(). We could have a similar mechanism to wait out the
truncations before *completing* a checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Seems reasonable. The attached does that. It actually works for the
initial case.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

The regression tests are failing for this patch; do you mind looking at
that and sending an updated patch?

https://api.cirrus-ci.com/v1/task/6313174510075904/logs/test.log

...
t/006_logical_decoding.pl ............ ok
t/007_sync_rep.pl .................... ok
Bailout called. Further testing stopped: system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [Makefile:19: check] Error 255
make[1]: *** [Makefile:49: check-recovery-recurse] Error 2
make: *** [GNUmakefile:71: check-world-src/test-recurse] Error 2
...

--
Ibrar Ahmed

#24 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ibrar Ahmed (#23)
1 attachment(s)
Re: Corruption during WAL replay

At Thu, 4 Mar 2021 22:37:23 +0500, Ibrar Ahmed <ibrar.ahmad@gmail.com> wrote in

The regression tests are failing for this patch; do you mind looking at
that and sending an updated patch?

https://api.cirrus-ci.com/v1/task/6313174510075904/logs/test.log

...
t/006_logical_decoding.pl ............ ok
t/007_sync_rep.pl .................... ok
Bailout called. Further testing stopped: system pg_ctl failed
FAILED--Further testing stopped: system pg_ctl failed
make[2]: *** [Makefile:19: check] Error 255
make[1]: *** [Makefile:49: check-recovery-recurse] Error 2
make: *** [GNUmakefile:71: check-world-src/test-recurse] Error 2
...

(I regret that I sent this as a .patch file..)

Thanks for pointing that out!

The patch assumed that the CHKPT_START/COMPLETE barriers are used
exclusively of each other, but MarkBufferDirtyHint, which delays
checkpoint start, is called in RelationTruncate while checkpoint
completion is being delayed. That is neither strange nor harmful
behavior. I changed delayChkpt from an enum to a bitmap integer so
that both barriers can be triggered separately.

I'm not sure this is the way to go here, though. This fixes the issue
of a crash during RelationTruncate, but the issue of smgrtruncate
failure during RelationTruncate still remains (unless we treat that
failure as PANIC?).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

delay_chkpt_cmpl_after_trunc_sucess_2.patch.txt (text/plain; charset=us-ascii)
#25 Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#24)
Re: Corruption during WAL replay

On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The patch assumed that the CHKPT_START/COMPLETE barriers are used
exclusively of each other, but MarkBufferDirtyHint, which delays
checkpoint start, is called in RelationTruncate while checkpoint
completion is being delayed. That is neither strange nor harmful
behavior. I changed delayChkpt from an enum to a bitmap integer so
that both barriers can be triggered separately.

I'm not sure this is the way to go here, though. This fixes the issue
of a crash during RelationTruncate, but the issue of smgrtruncate
failure during RelationTruncate still remains (unless we treat that
failure as PANIC?).

I like this patch. As I understand it, we're currently cheating by
allowing checkpoints to complete without necessarily flushing all of
the pages that were dirty at the time we fixed the redo pointer out to
disk. We think this is OK because we know that those pages are going
to get truncated away, but it's not really OK because when the system
starts up, it has to replay WAL starting from the checkpoint's redo
pointer, but the state of the page is not the same as it was at the
time when the redo pointer was the end of WAL, so redo fails. In the
case described in
/messages/by-id/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
modifications are made to the page before the redo pointer is fixed
and those changes never make it to disk, but the truncation also never
makes it to the disk either. With this patch, that can't happen,
because no checkpoint can intervene between when we (1) decide we're
not going to bother writing those dirty pages and (2) actually
truncate them away. So either the pages will get written as part of
the checkpoint, or else they'll be gone before the checkpoint
completes. In the latter case, I suppose redo that would have modified
those pages will just be skipped, thus dodging the problem.

In RelationTruncate, I suggest that we ought to clear the
delay-checkpoint flag before rather than after calling
FreeSpaceMapVacuumRange. Since the free space map is not fully
WAL-logged, anything we're doing there should be non-critical. Also, I
think it might be better if MarkBufferDirtyHint stays closer to the
existing coding and just uses a Boolean and an if-test to decide
whether to clear the bit, instead of inventing a new mechanism. I
don't really see anything wrong with the new mechanism, but I think
it's better to keep the patch minimal.
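
Putting those suggestions together, the resulting ordering in RelationTruncate() would look roughly like this (a sketch using the flag names from the patch under discussion, with the WAL and truncation details elided into comments; not the committed code):

    /* Keep any concurrent checkpoint from completing in this window. */
    MyProc->delayChkpt |= DELAY_CHKPT_COMPLETE;

    /* WAL-log the truncation and XLogFlush() the record. */
    /* smgrtruncate(): drop the buffers, then shrink the physical files. */

    MyProc->delayChkpt &= ~DELAY_CHKPT_COMPLETE;

    /*
     * The free space map is not fully WAL-logged, so this part is
     * non-critical and can happen after the flag is cleared.
     */
    FreeSpaceMapVacuumRange(rel, nblocks, InvalidBlockNumber);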

As you say, this doesn't fix the problem that truncation might fail.
But as Andres and Sawada-san said, the solution to that is to get rid
of the comments saying that it's OK for truncation to fail and make it
a PANIC. However, I don't think that change needs to be part of this
patch. Even if we do that, we still need to do this. And even if we do
this, we still need to do that.

--
Robert Haas
EDB: http://www.enterprisedb.com

#26 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

I like this patch.

I think the basic idea is about right, but I'm not happy with the
three-way delayChkpt business; that seems too cute by three-quarters.
I think two independent boolean flags, one saying "I'm preventing
checkpoint start" and one saying "I'm preventing checkpoint completion",
would be much less confusing and also more future-proof. Who's to say
that we won't ever need both states to be set in the same process?

I also dislike the fact that the patch has made procarray.h depend
on proc.h ... maybe I'm wrong, but I thought that there was a reason
for keeping those independent, if indeed this hasn't actually resulted
in a circular-includes situation. If we avoid inventing that enum type
then there's no need for that. If we do need an enum, maybe it could
be put in some already-common prerequisite header.

regards, tom lane

#27 Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
Re: Corruption during WAL replay

On Fri, Sep 24, 2021 at 3:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I like this patch.

I think the basic idea is about right, but I'm not happy with the
three-way delayChkpt business; that seems too cute by three-quarters.
I think two independent boolean flags, one saying "I'm preventing
checkpoint start" and one saying "I'm preventing checkpoint completion",
would be much less confusing and also more future-proof. Who's to say
that we won't ever need both states to be set in the same process?

Nobody, but the version of the patch that I was looking at uses a
separate bit for each one:

+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START (1<<0)
+#define DELAY_CHKPT_COMPLETE (1<<1)

One could instead use separate Booleans, but there doesn't seem to be
anything three-way about this?

--
Robert Haas
EDB: http://www.enterprisedb.com

#28 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Sep 24, 2021 at 3:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think the basic idea is about right, but I'm not happy with the
three-way delayChkpt business; that seems too cute by three-quarters.

Nobody, but the version of the patch that I was looking at uses a
separate bit for each one:

+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START (1<<0)
+#define DELAY_CHKPT_COMPLETE (1<<1)

Hm, that's not in the patch version that the CF app claims to be
latest [1]. It does this:

+/* type for PGPROC.delayChkpt */
+typedef enum DelayChkptType
+{
+	DELAY_CHKPT_NONE = 0,
+	DELAY_CHKPT_START,
+	DELAY_CHKPT_COMPLETE
+} DelayChkptType;

which seems like a distinct disimprovement over what you're quoting.

regards, tom lane

[1]: /messages/by-id/20210106.173327.1444585955309078930.horikyota.ntt@gmail.com

#29 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#28)
Re: Corruption during WAL replay

Thanks for looking at this, Robert and Tom.

At Fri, 24 Sep 2021 16:22:28 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote in

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Sep 24, 2021 at 3:42 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I think the basic idea is about right, but I'm not happy with the
three-way delayChkpt business; that seems too cute by three-quarters.

Nobody, but the version of the patch that I was looking at uses a
separate bit for each one:

+/* symbols for PGPROC.delayChkpt */
+#define DELAY_CHKPT_START (1<<0)
+#define DELAY_CHKPT_COMPLETE (1<<1)

Hm, that's not in the patch version that the CF app claims to be
latest [1]. It does this:

+/* type for PGPROC.delayChkpt */
+typedef enum DelayChkptType
+{
+	DELAY_CHKPT_NONE = 0,
+	DELAY_CHKPT_START,
+	DELAY_CHKPT_COMPLETE
+} DelayChkptType;

which seems like a distinct disimprovement over what you're quoting.

Yeah, that is because the latest patch was attached not as *.patch/diff
but as *.txt. I didn't name it *.patch in order to avoid adding a noise
patch to that thread, although it was too late. On the contrary, that
seems to have led to another kind of trouble..

Tom's concern is valid. Both events can in fact happen simultaneously,
but the latest *.patch.txt handles that case in the way Robert
described.

One advantage of having the two flags as one bitmap integer is that it
slightly simplifies the logic in GetVirtualXIDsDelayingChkpt and
HaveVirtualXIDsDelayingChkpt. On the other hand, it very slightly
complicates how to set/reset the flags.

GetVirtualXIDsDelayingChkpt:
+ if ((proc->delayChkpt & type) != 0)

vs

+		if (delayStart)
+			delayflag = proc->delayChkptStart;
+		else
+			delayflag = proc->delayChkptEnd;
+		if (delayflag != 0)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#30Kyotaro Horiguchi
Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#25)
1 attachment(s)
Re: Corruption during WAL replay

Thank you for the comments! (Sorry for the late response.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The patch assumed that the CHKPT_START/COMPLETE barriers are used
exclusively of each other, but MarkBufferDirtyHint, which delays
checkpoint start, is called in RelationTruncate while delaying
checkpoint completion. That is neither a strange nor a harmful
behavior. I changed delayChkpt from an enum to a bitmap integer so
that both barriers are triggered separately.

I'm not sure this is the way to go here, though. This fixes the issue
of a crash during RelationTruncate, but the issue of smgrtruncate
failure during RelationTruncate still remains (unless we treat that
failure as PANIC?).

I like this patch. As I understand it, we're currently cheating by
allowing checkpoints to complete without necessarily flushing all of
the pages that were dirty at the time we fixed the redo pointer out to
disk. We think this is OK because we know that those pages are going
to get truncated away, but it's not really OK because when the system
starts up, it has to replay WAL starting from the checkpoint's redo
pointer, but the state of the page is not the same as it was at the
time when the redo pointer was the end of WAL, so redo fails. In the
case described in
/messages/by-id/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
modifications are made to the page before the redo pointer is fixed
and those changes never make it to disk, but the truncation also never
makes it to the disk either. With this patch, that can't happen,
because no checkpoint can intervene between when we (1) decide we're
not going to bother writing those dirty pages and (2) actually
truncate them away. So either the pages will get written as part of
the checkpoint, or else they'll be gone before the checkpoint
completes. In the latter case, I suppose redo that would have modified
those pages will just be skipped, thus dodging the problem.

I think your understanding is right.

In RelationTruncate, I suggest that we ought to clear the
delay-checkpoint flag before rather than after calling
FreeSpaceMapVacuumRange. Since the free space map is not fully
WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.

think it might be better if MarkBufferDirtyHint stays closer to the
existing coding and just uses a Boolean and an if-test to decide
whether to clear the bit, instead of inventing a new mechanism. I
don't really see anything wrong with the new mechanism, but I think
it's better to keep the patch minimal.

Yeah, that was kind of silly. Fixed.

As you say, this doesn't fix the problem that truncation might fail.
But as Andres and Sawada-san said, the solution to that is to get rid
of the comments saying that it's OK for truncation to fail and make it
a PANIC. However, I don't think that change needs to be part of this
patch. Even if we do that, we still need to do this. And even if we do
this, we still need to do that.

OK. In addition to the above, I rewrote the comment in RelationTruncate.

+	 * Delay the concurrent checkpoint's completion until this truncation
+	 * successfully completes, so that we don't establish a redo-point between
+	 * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+	 * file content against the WAL records after the REDO position and future
+	 * recovery fails.
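
To restate the ordering that comment protects, here is a standalone sketch
with stubbed-out storage calls (names are illustrative; in the real patch the
flag is the DELAY_CHKPT_COMPLETE bit in PGPROC.delayChkpt, set and cleared
inside RelationTruncate, and the checkpointer waits for it before writing the
checkpoint record):

#include <stdio.h>

#define DELAY_CHKPT_COMPLETE (1 << 1)

static int delayChkpt = 0;  /* stand-in for the per-backend flags field */

/* stubs for the real storage-layer calls */
static void log_smgr_truncate(void) { puts("WAL: log the truncation"); }
static void drop_buffers(void)      { puts("drop to-be-truncated buffers (not flushed)"); }
static void truncate_files(void)    { puts("truncate the relation forks on disk"); }

static void
relation_truncate(void)
{
    /*
     * Block checkpoint *completion* first, so that no redo point can be
     * established between dropping the buffers and shortening the file.
     */
    delayChkpt |= DELAY_CHKPT_COMPLETE;

    log_smgr_truncate();
    drop_buffers();
    truncate_files();

    /* the on-disk file now matches the WAL; checkpoints may complete again */
    delayChkpt &= ~DELAY_CHKPT_COMPLETE;
}

int
main(void)
{
    relation_truncate();
    return 0;
}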

However, one problem for me right now is that I cannot reproduce the
issue.

To avoid further confusion, the attached is named as *.patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

delay_chkpt_cmpl_after_trunc_sucess_v3.patchtext/x-patch; charset=us-ascii
#31Daniel Shelepanov
Daniel Shelepanov
deniel1495@mail.ru
In reply to: Kyotaro Horiguchi (#30)
Re: Corruption during WAL replay

On 27.09.2021 11:30, Kyotaro Horiguchi wrote:

Thank you for the comments! (Sorry for the late response.)

At Tue, 10 Aug 2021 14:14:05 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Thu, Mar 4, 2021 at 10:01 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The patch assumed that the CHKPT_START/COMPLETE barriers are used
exclusively of each other, but MarkBufferDirtyHint, which delays
checkpoint start, is called in RelationTruncate while delaying
checkpoint completion. That is neither a strange nor a harmful
behavior. I changed delayChkpt from an enum to a bitmap integer so
that both barriers are triggered separately.

I'm not sure this is the way to go here, though. This fixes the issue
of a crash during RelationTruncate, but the issue of smgrtruncate
failure during RelationTruncate still remains (unless we treat that
failure as PANIC?).

I like this patch. As I understand it, we're currently cheating by
allowing checkpoints to complete without necessarily flushing all of
the pages that were dirty at the time we fixed the redo pointer out to
disk. We think this is OK because we know that those pages are going
to get truncated away, but it's not really OK because when the system
starts up, it has to replay WAL starting from the checkpoint's redo
pointer, but the state of the page is not the same as it was at the
time when the redo pointer was the end of WAL, so redo fails. In the
case described in
/messages/by-id/BYAPR06MB63739B2692DC6DBB3C5F186CABDA0@BYAPR06MB6373.namprd06.prod.outlook.com
modifications are made to the page before the redo pointer is fixed
and those changes never make it to disk, but the truncation also never
makes it to the disk either. With this patch, that can't happen,
because no checkpoint can intervene between when we (1) decide we're
not going to bother writing those dirty pages and (2) actually
truncate them away. So either the pages will get written as part of
the checkpoint, or else they'll be gone before the checkpoint
completes. In the latter case, I suppose redo that would have modified
those pages will just be skipped, thus dodging the problem.

I think your understanding is right.

In RelationTruncate, I suggest that we ought to clear the
delay-checkpoint flag before rather than after calling
FreeSpaceMapVacuumRange. Since the free space map is not fully
WAL-logged, anything we're doing there should be non-critical. Also, I

Agreed and fixed.

think it might be better if MarkBufferDirtyHint stays closer to the
existing coding and just uses a Boolean and an if-test to decide
whether to clear the bit, instead of inventing a new mechanism. I
don't really see anything wrong with the new mechanism, but I think
it's better to keep the patch minimal.

Yeah, that was kind of silly. Fixed.

As you say, this doesn't fix the problem that truncation might fail.
But as Andres and Sawada-san said, the solution to that is to get rid
of the comments saying that it's OK for truncation to fail and make it
a PANIC. However, I don't think that change needs to be part of this
patch. Even if we do that, we still need to do this. And even if we do
this, we still need to do that.

OK. In addition to the above, I rewrote the comment in RelationTruncate.

+	 * Delay the concurrent checkpoint's completion until this truncation
+	 * successfully completes, so that we don't establish a redo-point between
+	 * buffer deletion and file-truncate. Otherwise we can leave inconsistent
+	 * file content against the WAL records after the REDO position and future
+	 * recovery fails.

However, one problem for me right now is that I cannot reproduce the
issue.

To avoid further confusion, the attached is named as *.patch.

regards.

Hi. This is my first attempt to review a patch so feel free to tell me
if I missed something.

As of today's state of REL_14_STABLE
(ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is reproducible
using the script provided by Daniel Wood in this
(1335373813.287510.1573611814107@connect.xfinity.com) message. Also, the
latest patch seems not to be applicable and requires some minor tweaks.

Regards,

Daniel Shelepanov

#32Kyotaro Horiguchi
Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Shelepanov (#31)
1 attachment(s)
Re: Corruption during WAL replay

At Mon, 24 Jan 2022 23:33:20 +0300, Daniel Shelepanov <deniel1495@mail.ru> wrote in

Hi. This is my first attempt to review a patch so feel free to tell me
if I missed something.

Welcome!

As of today's state of REL_14_STABLE
(ef9706bbc8ce917a366e4640df8c603c9605817a), the problem is
reproducible using the script provided by Daniel Wood in this
(1335373813.287510.1573611814107@connect.xfinity.com) message. Also,
the latest patch seems not to be applicable and requires some minor
tweaks.

Thanks for the info. The reason for my failure was that checksums were
enabled.. Disabling both fpw and checksums (and wal_log_hints) allows
me to reproduce the issue. And what I found is:

v3 patch:

    vxids = GetVirtualXIDsDelayingChkpt(&nvxids, DELAY_CHKPT_COMPLETE);
!>  if (0 && nvxids > 0)
    {

Ugggggggh! It looks like a debugging tweak but it prevents everything
from working.

The attached is the fixed version and it surely works with the repro.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v4-0001-Delay-checkpoint-completion-until-truncation-comp.patchtext/x-patch; charset=us-ascii
#33Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#32)
1 attachment(s)
Re: Corruption during WAL replay

On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The attached is the fixed version and it surely works with the repro.

Hi,

I spent the morning working on this patch and came up with the
attached version. I wrote substantial comments in RelationTruncate(),
where I tried to make it more clear exactly what the bug is here, and
also in storage/proc.h, where I tried to clarify the use of the
DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
version, I plan to commit it.

I think it should be back-patched, too, but that looks like a bit of a
pain. I think every back-branch will require different adjustments.

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachments:

v5-0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlap.patchapplication/octet-stream; name=v5-0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlap.patch
#34Kyotaro Horiguchi
Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#33)
Re: Corruption during WAL replay

At Tue, 15 Mar 2022 12:44:49 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Wed, Jan 26, 2022 at 3:25 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

The attached is the fixed version and it surely works with the repro.

Hi,

I spent the morning working on this patch and came up with the
attached version. I wrote substantial comments in RelationTruncate(),
where I tried to make it more clear exactly what the bug is here, and
also in storage/proc.h, where I tried to clarify the use of the
DELAY_CHKPT_* flags in general terms. If nobody is too sad about this
version, I plan to commit it.

Thanks for taking this on, and for your time. The additional comments
seem to describe the flags more clearly.

storage.c:
+	 * Make sure that a concurrent checkpoint can't complete while truncation
+	 * is in progress.
+	 *
+	 * The truncation operation might drop buffers that the checkpoint
+	 * otherwise would have flushed. If it does, then it's essential that
+	 * the files actually get truncated on disk before the checkpoint record
+	 * is written. Otherwise, if reply begins from that checkpoint, the
+	 * to-be-truncated buffers might still exist on disk but have older
+	 * contents than expected, which can cause replay to fail. It's OK for
+	 * the buffers to not exist on disk at all, but not for them to have the
+	 * wrong contents.

FWIW, this seems slightly confusing about the distinction between a
buffer and its contents. I can read it correctly, so I don't mind if
it is natural enough.

Otherwise all the added/revised comments look fine. Thanks for the
labor.

I think it should be back-patched, too, but that looks like a bit of a
pain. I think every back-branch will require different adjustments.

I'll try that; if you are already working on it, please let me know.
(It is more than likely too late already..)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#35Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#34)
Re: Corruption during WAL replay

On Wed, Mar 16, 2022 at 1:14 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

storage.c:
+        * Make sure that a concurrent checkpoint can't complete while truncation
+        * is in progress.
+        *
+        * The truncation operation might drop buffers that the checkpoint
+        * otherwise would have flushed. If it does, then it's essential that
+        * the files actually get truncated on disk before the checkpoint record
+        * is written. Otherwise, if reply begins from that checkpoint, the
+        * to-be-truncated buffers might still exist on disk but have older
+        * contents than expected, which can cause replay to fail. It's OK for
+        * the buffers to not exist on disk at all, but not for them to have the
+        * wrong contents.

FWIW, this seems slightly confusing about the distinction between a
buffer and its contents. I can read it correctly, so I don't mind if
it is natural enough.

Hmm. I think the last two instances of "buffers" in this comment
should actually say "blocks".

I'll try that; if you are already working on it, please let me know.
(It is more than likely too late already..)

If you want to take a crack at that, I'd be delighted.

--
Robert Haas
EDB: http://www.enterprisedb.com

#36Kyotaro Horiguchi
Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#35)
6 attachment(s)
Re: Corruption during WAL replay

At Wed, 16 Mar 2022 10:14:56 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

Hmm. I think the last two instances of "buffers" in this comment
should actually say "blocks".

Ok. I replaced them with "blocks" and it looks nicer. Thanks!

I'll try that; if you are already working on it, please let me know.
(It is more than likely too late already..)

If you want to take a crack at that, I'd be delighted.

Finally, no two of the branches from 10 to 14 accept the same patch.

As a cross-version check, I compared the patches for every pair of
adjacent versions and confirmed that no hunks are lost.

All versions pass check world.

The differences between each two adjacent versions are as follows.

master->14:

A hunk fails due to the change in how to access rel->rd_smgr.

14->13:

Several hunks fail due to simple context differences.

13->12:

Many hunks fail due to the migration of delayChkpt from PGPROC to
PGXACT and to context differences from the change in the FSM truncation
logic in RelationTruncate.

12->11:

Several hunks fail due to the removal of the volatile qualifier from
pointers to PGPROC/PGXACT.

11->10:

A hunk fails due to a context difference from the additional PGPROC
member tempNamespaceId.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlaps-a.patchtext/x-patch; charset=iso-8859-1
0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlaps_14.txttext/plain; charset=iso-8859-1
0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlaps_13.txttext/plain; charset=iso-8859-1
0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlaps_12.txttext/plain; charset=iso-8859-1
0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlaps_11.txttext/plain; charset=iso-8859-1
0001-Fix-possible-recovery-trouble-if-TRUNCATE-overlaps_10.txttext/plain; charset=iso-8859-1
#37Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#36)
Re: Corruption during WAL replay

On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Finally, no two of the branches from 10 to 14 accept the same patch.

As a cross-version check, I compared the patches for every pair of
adjacent versions and confirmed that no hunks are lost.

All versions pass check world.

Thanks, committed.

--
Robert Haas
EDB: http://www.enterprisedb.com

#38Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#37)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

Thanks, committed.

Some of the buildfarm is seeing failures in the pg_checksums test.

regards, tom lane

#39Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)
Re: Corruption during WAL replay

On Thu, Mar 24, 2022 at 6:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

Thanks, committed.

Some of the buildfarm is seeing failures in the pg_checksums test.

Hmm. So the tests seem to be failing because 002_actions.pl stops the
database cluster, runs pg_checksums (which passes), writes some zero
bytes over the line pointer array of the first block of pg_class, and
then runs pg_checksums again. In the failing buildfarm runs,
pg_checksums fails to detect the corruption: the second run succeeds,
while pg_checksums expects it to fail. That's pretty curious, because
if the database cluster is stopped, and things are OK at that point,
then how could a server bug of any kind cause a Perl script to be
unable to corrupt a file on disk?

A possible clue is that I also see a few machines failing in
recoveryCheck. And the code that is failing there looks like this:

# We've seen occasional cases where multiple walsender pids are active. An
# immediate shutdown may hide evidence of a locking bug. So if multiple
# walsenders are observed, shut down in fast mode, and collect some more
# information.
if (not like($senderpid, qr/^[0-9]+$/, "have walsender pid $senderpid"))
{
my ($stdout, $stderr);
$node_primary3->psql('postgres',
"\\a\\t\nSELECT * FROM pg_stat_activity",
stdout => \$stdout, stderr => \$stderr);
diag $stdout, $stderr;
$node_primary3->stop('fast');
$node_standby3->stop('fast');
die "could not determine walsender pid, can't continue";
}

And the failure looks like this:

# Failed test 'have walsender pid 1047504
# 1047472'
# at t/019_replslot_limit.pl line 343.

That sure looks like there are multiple walsender PIDs active, and the
pg_stat_activity output confirms it. 1047504 is running
START_REPLICATION SLOT "rep3" 0/700000 TIMELINE 1 and 1047472 is
running START_REPLICATION SLOT "pg_basebackup_1047472" 0/600000
TIMELINE 1.

Both of these failures could possibly be explained by some failure of
things to shut down properly, but it's not the same things. In the
first case, the database server would have had to still be running
after we run $node->stop, and it would have had to overwrite the bad
contents of pg_class with some good contents. In the second case, the
cluster's supposed to still be running, but the backends that were
creating those replication slots should have exited sooner.

I've been running the pg_checksums test in a loop here for a bit now
in the hopes of being able to reproduce the failure, but it doesn't
seem to want to fail here. And I've also looked over the commit and I
can't quite see how it would cause a process, or the cluster, to fail
to shutdown, unless perhaps it's the checkpointer that gets stuck, but
that doesn't really seem to match the symptoms.

Any ideas?

--
Robert Haas
EDB: http://www.enterprisedb.com

#40Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#39)
Re: Corruption during WAL replay

On Thu, Mar 24, 2022 at 8:37 PM Robert Haas <robertmhaas@gmail.com> wrote:

Any ideas?

And ... right after hitting send, I see that the recovery check
failures are under separate troubleshooting and thus probably
unrelated. But that leaves me even more confused. How can a change to
only the server code cause a client utility to fail to detect
corruption that is being created by Perl while the server is stopped?

--
Robert Haas
EDB: http://www.enterprisedb.com

#41Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#40)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

And ... right after hitting send, I see that the recovery check
failures are under separate troubleshooting and thus probably
unrelated.

Yeah, we've been chasing those for months.

But that leaves me even more confused. How can a change to
only the server code cause a client utility to fail to detect
corruption that is being created by Perl while the server is stopped?

Hmm, I'd supposed that the failing test cases were new as of 412ad7a55.
Now I see they're not, which indeed puts quite a different spin on
things. Your thought about maybe the server isn't shut down yet is
interesting --- did 412ad7a55 touch anything about the shutdown
sequence?

regards, tom lane

#42Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#41)
Re: Corruption during WAL replay

On Thu, Mar 24, 2022 at 8:45 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Hmm, I'd supposed that the failing test cases were new as of 412ad7a55.
Now I see they're not, which indeed puts quite a different spin on
things. Your thought about maybe the server isn't shut down yet is
interesting --- did 412ad7a55 touch anything about the shutdown
sequence?

I hate to say "no" because the evidence suggests that the answer might
be "yes" -- but it definitely isn't intending to change anything about
the shutdown sequence. It just introduces a mechanism for backends to
force the checkpointer to delay writing the checkpoint record.

--
Robert Haas
EDB: http://www.enterprisedb.com

#43Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#42)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

I hate to say "no" because the evidence suggests that the answer might
be "yes" -- but it definitely isn't intending to change anything about
the shutdown sequence. It just introduces a mechanism to backends to
force the checkpointer to delay writing the checkpoint record.

Wait a minute, I think we may be barking up the wrong tree.

The three commits that serinus saw as new in its first failure were

ce95c54376 Thu Mar 24 20:33:13 2022 UTC Fix pg_statio_all_tables view for multiple TOAST indexes.
7dac61402e Thu Mar 24 19:51:40 2022 UTC Remove unused module imports from TAP tests
412ad7a556 Thu Mar 24 18:52:28 2022 UTC Fix possible recovery trouble if TRUNCATE overlaps a checkpoint.

I failed to look closely at dragonet, but I now see that its
first failure saw

ce95c54376 Thu Mar 24 20:33:13 2022 UTC Fix pg_statio_all_tables view for multiple TOAST indexes.
7dac61402e Thu Mar 24 19:51:40 2022 UTC Remove unused module imports from TAP tests

serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
confident that the failure is repeatable for them. That means that the
culprit must be ce95c54376 or 7dac61402e, not anything nearby such as
412ad7a556.

It's *really* hard to see how the pg_statio_all_tables change could
have affected this. So that leaves 7dac61402e, which did this to
the test script that's failing:

use strict;
use warnings;
-use Config;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;

Discuss.

regards, tom lane

#44Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#40)
Re: Corruption during WAL replay

Hi,

On 2022-03-24 20:39:27 -0400, Robert Haas wrote:

But that leaves me even more confused. How can a change to only the server
code cause a client utility to fail to detect corruption that is being
created by Perl while the server is stopped?

I guess it could somehow cause the first page to be all zeroes, in which case
overwriting it with more zeroes wouldn't cause a problem that pg_checksums can
see? But I have a somewhat more realistic idea:

I'm suspicious of pg_checksums --filenode. If I understand correctly
--filenode scans the data directory, including all tablespaces, for a file
matching that filenode. If we somehow end up with a leftover file in the pre
ALTER TABLE SET TABLESPACE location, it'd not notice that there *also* is a
file in a different place?

Perhaps the --filenode mode should print out the file location...

Randomly noticed: The test fetches the block size without doing anything with
it afaics.

Andres

#45Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#43)
Re: Corruption during WAL replay

I wrote:

... So that leaves 7dac61402e, which did this to
the test script that's failing:

use strict;
use warnings;
-use Config;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;

Discuss.

Another thing that seems quite baffling, but is becoming clearer by
the hour, is that only serinus and dragonet are seeing this failure.
How is that? They're not very similarly configured --- one is gcc,
one clang, and one uses jit and one doesn't. They do share the same
perl version, 5.34.0; but so do twenty-three other animals, many of
which have reported in cleanly. I'm at a loss to explain that.
Andres, can you think of anything that's peculiar to those two
animals?

regards, tom lane

#46Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#43)
Re: Corruption during WAL replay

Hi,

On 2022-03-24 21:22:38 -0400, Tom Lane wrote:

serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
confident that the failure is repeatable for them.

That's weird. They run on the same host, but otherwise they have very little
in common. There are plenty of other animals running on the same machine that
didn't report errors.

I copied serinus' configuration, ran the tests repeatedly, without reproducing
the failure so far. Odd.

Combined with the replslot failure I'd be prepared to think the machine has
issues, except that the replslot thing triggered on other machines too.

I looked through logs on the machine without finding anything indicating
something odd.

I turned on keep_error_builds for serinus. Hopefully that'll leave us with
on-disk files to inspect.

Greetings,

Andres Freund

#47Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#45)
Re: Corruption during WAL replay

Hi,

On 2022-03-24 21:59:08 -0400, Tom Lane wrote:

Another thing that seems quite baffling, but is becoming clearer by
the hour, is that only serinus and dragonet are seeing this failure.
How is that? They're not very similarly configured --- one is gcc,
one clang, and one uses jit and one doesn't. They do share the same
perl version, 5.34.0; but so do twenty-three other animals, many of
which have reported in cleanly. I'm at a loss to explain that.
Andres, can you think of anything that's peculiar to those two
animals?

No, I'm quite baffled myself. As I noted in an email I just sent, before
reading this one, I can't explain it, and at least in simple attempts, can't
reproduce it either. And there are animals much closer to each other than
those two...

I forced a run while writing the other email, with keep_error_whatnot, and I
just saw it failing... Looking whether there's anything interesting to glean.

Greetings,

Andres Freund

#48Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#46)
Re: Corruption during WAL replay

On Fri, Mar 25, 2022 at 3:14 PM Andres Freund <andres@anarazel.de> wrote:

On 2022-03-24 21:22:38 -0400, Tom Lane wrote:

serinus is 0-for-3 since then, and dragonet 0-for-4, so we can be pretty
confident that the failure is repeatable for them.

That's weird. They run on the same host, but otherwise they have very little
in common. There's plenty other animals running on the same machine that
didn't report errors.

One random thing I've noticed about serinus is that it seems to drop
UDP packets more than others, but dragonet apparently doesn't:

tmunro=> select animal, count(*) from run where result = 'FAILURE' and
'stats' = any(fail_tests) and snapshot > now() - interval '3 month'
group by 1 order by 2 desc;
animal | count
--------------+-------
serinus | 14
flaviventris | 6
mandrill | 2
bonito | 1
seawasp | 1
crake | 1
sungazer | 1
(7 rows)

Example: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2001:00:14

#49Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#48)
Re: Corruption during WAL replay

Hi,

On 2022-03-25 15:23:24 +1300, Thomas Munro wrote:

One random thing I've noticed about serinus is that it seems to drop
UDP packets more than others, but dragonet apparently doesn't:

Serinus is built with optimization. Which I guess could lead to other backends
reporting stats more quickly? And of course could lead to running more often
(due to finishing before the next cron invocation). I think I've also
configured my animals to run more often than many other owners.

So I'm not sure how much can be gleaned from raw "failure counts" without
taking the number of runs into account as well?

- Andres

#50Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#47)
Re: Corruption during WAL replay

Hi,

On 2022-03-24 19:20:10 -0700, Andres Freund wrote:

I forced a run while writing the other email, with keep_error_whatnot, and I
just saw it failing... Looking whether there's anything interesting to glean.

Unfortunately the test drops the table and it doesn't report the filepath of
the failure. So I haven't learned much from the data dir so far.

I still don't see a failure when running the tests in a separate source
tree. Can't explain that. Going to try to get closer to the buildfarm script
run - it'd be a whole lot easier to be able to edit the source of the test and
reproduce...

Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
leave dragonet's alone for now.

Greetings,

Andres Freund

#51Thomas Munro
Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#49)
Re: Corruption during WAL replay

On Fri, Mar 25, 2022 at 3:35 PM Andres Freund <andres@anarazel.de> wrote:

So I'm not sure how much can be gleaned from raw "failure counts" without
taking the number of runs into account as well?

Ah, right, it does indeed hold the record for most runs in 3 months,
and taking runs into account its "stats" failure rate is clustered
with mandrill and seawasp. Anyway, clearly not relevant because
dragonet doesn't even show up in the list.

animal | runs | stats_test_fail_fraction
---------------+------+--------------------------
mandrill | 158 | 0.0126582278481013
seawasp | 85 | 0.0117647058823529
serinus | 1299 | 0.0107775211701309
sungazer | 174 | 0.00574712643678161
flaviventris | 1292 | 0.00464396284829721
bonito | 313 | 0.00319488817891374
crake | 743 | 0.00134589502018843

#52Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#50)
Re: Corruption during WAL replay

Hi,

On 2022-03-24 19:43:02 -0700, Andres Freund wrote:

Just to be sure I'm going to clean out serinus' ccache dir and rerun. I'll
leave dragonet's alone for now.

Turns out they had the same dir. But it didn't help.

I haven't yet figured out why, but I now *am* able to reproduce the problem in
the buildfarm built tree. Wonder if there's a path length issue or such
somewhere?

Either way, I can now manipulate the tests and still repro. I made the test
abort after the first failure.

hexedit shows that the file is modified, as we'd expect:
00000000 00 00 00 00 C0 01 5B 01 16 7D 00 00 A0 03 C0 03 00 20 04 20 00 00 00 00 00 00 00 00 00 00 00 00 ......[..}....... . ............
00000020 00 9F 38 00 80 9F 38 00 60 9F 38 00 40 9F 38 00 20 9F 38 00 00 9F 38 00 E0 9E 38 00 C0 9E 38 00 ..8...8.`.8.@.8. .8...8...8...8.

And we are checking the right file:

bf@andres-postgres-edb-buildfarm-v1:~/build/buildfarm-serinus/HEAD/pgsql.build$ tmp_install/home/bf/build/buildfarm-serinus/HEAD/inst/bin/pg_checksums --check -D /home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata --filenode 16391 -v
pg_checksums: checksums verified in file "/home/bf/build/buildfarm-serinus/HEAD/pgsql.build/src/bin/pg_checksums/tmp_check/t_002_actions_node_checksum_data/pgdata/pg_tblspc/16387/PG_15_202203241/5/16391"
Checksum operation completed
Files scanned: 1
Blocks scanned: 45
Bad checksums: 0
Data checksum version: 1

If I twiddle further bits, I see that page failing checksum verification, as
expected.

I made the script copy the file before twiddling it around:
00000000 00 00 00 00 C0 01 5B 01 16 7D 00 00 A0 03 C0 03 00 20 04 20 00 00 00 00 E0 9F 38 00 C0 9F 38 00 ......[..}....... . ......8...8.
00000020 A0 9F 38 00 80 9F 38 00 60 9F 38 00 40 9F 38 00 20 9F 38 00 00 9F 38 00 E0 9E 38 00 C0 9E 38 00 ..8...8.`.8.@.8. .8...8...8...8.

So it's indeed modified.

The only thing I can really conclude here is that we apparently end up with
the same checksum for exactly the modifications we are doing? Just on those
two damn instances? Reliably?

Gotta make some food. Suggestions what exactly to look at welcome.

Greetings,

Andres Freund

PS: I should really rename the hostname of that machine one of these days...

#53Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#52)
Re: Corruption during WAL replay

Andres Freund <andres@anarazel.de> writes:

The only thing I can really conclude here is that we apparently end up with
the same checksum for exactly the modifications we are doing? Just on those
two damn instances? Reliably?

IIRC, the table's OID or relfilenode enters into the checksum.
Could it be that assigning a specific OID to the table allows
this to happen, and these two animals are somehow assigning
that OID while others are using some slightly different OID?

regards, tom lane

#54Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#53)
Re: Corruption during WAL replay

Hi,

On 2022-03-25 00:08:20 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

The only thing I can really conclude here is that we apparently end up with
the same checksum for exactly the modifications we are doing? Just on those
two damn instances? Reliably?

IIRC, the table's OID or relfilenode enters into the checksum.
Could it be that assigning a specific OID to the table allows
this to happen, and these two animals are somehow assigning
that OID while others are using some slightly different OID?

It's just the block number that goes into it.

I do see that the LSN that ends up on the page is the same across a few runs
of the test on serinus. Which presumably differs between different
animals. Surprised that it's this predictable - but I guess the run is short
enough that there's no variation due to autovacuum, checkpoints etc.

If I add a 'SELECT txid_current()' before the CREATE TABLE in
check_relation_corruption(), the test doesn't fail anymore, because there's an
additional WAL record.

16bit checksums for the win.
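
To make the arithmetic concrete: as I read checksum_impl.h, the page checksum
is an FNV-1a-derived hash over the whole page, with the block number XORed in
at the end and the result folded into 16 bits. A standalone sketch of just
that final reduction step (the hash values below are made up) shows how two
different page images can end up with the same stored checksum:

#include <stdint.h>
#include <stdio.h>

/* final reduction step only; the real code hashes the whole page first */
static uint16_t
reduce_checksum(uint32_t block_hash, uint32_t blkno)
{
    uint32_t checksum = block_hash ^ blkno;     /* catches transposed pages */

    return (uint16_t) ((checksum % 65535) + 1); /* 16 bits, never zero */
}

int
main(void)
{
    uint32_t blkno = 0;
    uint32_t h1 = 123456;           /* pretend hash of the original page */
    uint32_t h2 = 123456 + 65535;   /* a corrupted page could hash to this */

    /* prints the same value twice: the corruption goes undetected */
    printf("%u %u\n", reduce_checksum(h1, blkno), reduce_checksum(h2, blkno));
    return 0;
}

For a fixed block number, any two page hashes that differ by a multiple of
65535 collide, which is exactly the one-in-64K window this test fell into.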

Greetings,

Andres Freund

#55Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#54)
Re: Corruption during WAL replay

Andres Freund <andres@anarazel.de> writes:

I do see that the LSN that ends up on the page is the same across a few runs
of the test on serinus. Which presumably differs between different
animals. Surprised that it's this predictable - but I guess the run is short
enough that there's no variation due to autovacuum, checkpoints etc.

Uh-huh. I'm not surprised that it's repeatable on a given animal.
What remains to be explained:

1. Why'd it start failing now? I'm guessing that ce95c5437 *was* the
culprit after all, by slightly changing the amount of catalog data
written during initdb, and thus moving the initial LSN.

2. Why just these two animals? If initial LSN is the critical thing,
then the results of "locale -a" would affect it, so platform
dependence is hardly surprising ... but I'd have thought that all
the animals on that host would use the same initial set of
collations. OTOH, I see petalura and pogona just fell over too.
Do you have some of those animals --with-icu and others not?

16bit checksums for the win.

Yay :-(

As for a fix, would damaging more of the page help? I guess
it'd just move around the one-in-64K chance of failure.
Maybe we have to intentionally corrupt (e.g. invert) the
checksum field specifically.

regards, tom lane

#56Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#54)
Re: Corruption during WAL replay

Hi,

On 2022-03-24 21:54:38 -0700, Andres Freund wrote:

I do see that the LSN that ends up on the page is the same across a few runs
of the test on serinus. Which presumably differs between different
animals. Surprised that it's this predictable - but I guess the run is short
enough that there's no variation due to autovacuum, checkpoints etc.

This actually explains how the issue could start to be visible with
ce95c543763. It changes the amount of WAL initdb generates and therefore
influences what LSN the page ends up with. I've verified that the failing
test is reproducible with ce95c543763, but not its parent 7dac61402e3. While
of course ce95c543763 isn't "actually responsible".

Ah, and that's finally also the explanation for why I couldn't reproduce the
failure in a different directory, with an otherwise identically configured
PG: The length of the path to the tablespace influences the size of the
XLOG_TBLSPC_CREATE record.

Not sure what to do here... I guess we can just change the value we overwrite
the page with and hope to not hit this again? But that feels deeply deeply
unsatisfying.

Perhaps it would be enough to write into multiple parts of the page? I am very
much not a cryptography expert, but given the way pg_checksum_block() works, it
looks to me like "multiple" changes within a 16-byte chunk have a smaller
influence on the overall result than the same "amount" of changes to separate
16-byte chunks.

I might have to find a store still selling strong beverages at this hour.

- Andres

#57Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#55)
Re: Corruption during WAL replay

Hi,

On 2022-03-25 01:23:00 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

I do see that the LSN that ends up on the page is the same across a few runs
of the test on serinus. Which presumably differs between different
animals. Surprised that it's this predictable - but I guess the run is short
enough that there's no variation due to autovacuum, checkpoints etc.

Uh-huh. I'm not surprised that it's repeatable on a given animal.
What remains to be explained:

1. Why'd it start failing now? I'm guessing that ce95c5437 *was* the
culprit after all, by slightly changing the amount of catalog data
written during initdb, and thus moving the initial LSN.

Yep, verified that (see mail I just sent).

2. Why just these two animals? If initial LSN is the critical thing,
then the results of "locale -a" would affect it, so platform
dependence is hardly surprising ... but I'd have thought that all
the animals on that host would use the same initial set of
collations.

I think it's the animal's name that makes the difference, due to the
tablespace path length thing. And while I was confused for a second by

petalura
pogona
serinus
dragonet

failing, despite different name lengths, it still makes sense: We MAXALIGN the
start of records. Which explains why flaviventris didn't fail the same way.

As for a fix, would damaging more of the page help? I guess
it'd just move around the one-in-64K chance of failure.

As I wrote in the other email, I think spreading the changes out wider might
help. But it's still not great. However:

Maybe we have to intentionally corrupt (e.g. invert) the
checksum field specifically.

seems like it'd do the trick? Even a single bit change of the checksum ought
to do, as long as we don't set it to 0.

Greetings,

Andres Freund

#58Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#56)
Re: Corruption during WAL replay

Andres Freund <andres@anarazel.de> writes:

Ah, and that's finally also the explanation for why I couldn't reproduce the
failure in a different directory, with an otherwise identically configured
PG: The length of the path to the tablespace influences the size of the
XLOG_TBLSPC_CREATE record.

Ooooohhh ... yeah, that could explain a lot of cross-animal variation.

Not sure what to do here... I guess we can just change the value we overwrite
the page with and hope to not hit this again? But that feels deeply deeply
unsatisfying.

AFAICS, this strategy of whacking a predetermined chunk of the page with
a predetermined value is going to fail 1-out-of-64K times. We have to
change the test so that it's guaranteed to produce an invalid checksum.
Inverting just the checksum field, without doing anything else, would
do that ... but that feels pretty unsatisfying too.
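
Spelling out why that is guaranteed: as I understand pg_checksum_page(), it
zeroes pd_checksum before hashing, so the recomputed value doesn't depend on
what is stored in that field, and for any 16-bit value x, x ^ 0xFFFF differs
from x. A trivial standalone check of that last claim:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    /* inverting a 16-bit checksum always yields a different 16-bit value */
    for (uint32_t x = 0; x <= 0xFFFF; x++)
        assert((uint16_t) (x ^ 0xFFFF) != (uint16_t) x);
    puts("no fixed point: an inverted pd_checksum can never match");
    return 0;
}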

regards, tom lane

#59Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#58)
Re: Corruption during WAL replay

Hi,

On 2022-03-25 01:38:45 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Not sure what to do here... I guess we can just change the value we overwrite
the page with and hope to not hit this again? But that feels deeply deeply
unsatisfying.

AFAICS, this strategy of whacking a predetermined chunk of the page with
a predetermined value is going to fail 1-out-of-64K times.

Yea. I suspect that, the way the modifications and checksumming are done, the
chance is actually higher than 1/64k. But even if it actually is 1/64k, it's
not great to wait for (#animals * #catalog-changes) to approach a decent
fraction of 64k.

I was curious whether there have been similar issues in the past. Querying
the buildfarm logs suggests not, at least not in the pg_checksums test.

We have to change the test so that it's guaranteed to produce an invalid
checksum. Inverting just the checksum field, without doing anything else,
would do that ... but that feels pretty unsatisfying too.

We really ought to find a way to get to wider checksums :/

Greetings,

Andres Freund

#60Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#59)
Re: Corruption during WAL replay

On Fri, Mar 25, 2022 at 2:07 AM Andres Freund <andres@anarazel.de> wrote:

We really ought to find a way to get to wider checksums :/

Eh, let's just use longer names for the buildfarm animals and call it good. :-)

--
Robert Haas
EDB: http://www.enterprisedb.com

#61Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#59)
Re: Corruption during WAL replay

Andres Freund <andres@anarazel.de> writes:

On 2022-03-25 01:38:45 -0400, Tom Lane wrote:

AFAICS, this strategy of whacking a predetermined chunk of the page with
a predetermined value is going to fail 1-out-of-64K times.

Yea. I suspect that, the way the modifications and checksumming are done, the
chance is actually higher than 1/64k. But even if it actually is 1/64k, it's
not great to wait for (#animals * #catalog-changes) to approach a decent
fraction of 64k.

Exactly.

I was curious whether there have been similar issues in the past. Querying
the buildfarm logs suggests not, at least not in the pg_checksums test.

That test has only been there since 2018 (b34e84f16). We've probably
accumulated a couple hundred initial-catalog-contents changes since
then, so maybe this failure arrived right on schedule :-(.

We really ought to find a way to get to wider checksums :/

That'll just reduce the probability of failure, not eliminate it.

regards, tom lane

#62Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#61)
Re: Corruption during WAL replay

On Fri, Mar 25, 2022 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That'll just reduce the probability of failure, not eliminate it.

I mean, if the expected time to the first failure on even 1 machine
exceeds the time until the heat death of the universe by 10 orders of
magnitude, it's probably good enough.

--
Robert Haas
EDB: http://www.enterprisedb.com

#63Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#62)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 25, 2022 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That'll just reduce the probability of failure, not eliminate it.

I mean, if the expected time to the first failure on even 1 machine
exceeds the time until the heat death of the universe by 10 orders of
magnitude, it's probably good enough.

Adding another 16 bits won't get you to that, sadly. Yeah, it *might*
extend the MTTF to more than the project's likely lifespan, but that
doesn't mean we couldn't get unlucky next week.

regards, tom lane

#64Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#63)
Re: Corruption during WAL replay

On Fri, Mar 25, 2022 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 25, 2022 at 9:49 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

That'll just reduce the probability of failure, not eliminate it.

I mean, if the expected time to the first failure on even 1 machine
exceeds the time until the heat death of the universe by 10 orders of
magnitude, it's probably good enough.

Adding another 16 bits won't get you to that, sadly. Yeah, it *might*
extend the MTTF to more than the project's likely lifespan, but that
doesn't mean we couldn't get unlucky next week.

I suspect that the number of bits Andres wants to add is no less than 48.

--
Robert Haas
EDB: http://www.enterprisedb.com

#65Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#64)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 25, 2022 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Adding another 16 bits won't get you to that, sadly. Yeah, it *might*
extend the MTTF to more than the project's likely lifespan, but that
doesn't mean we couldn't get unlucky next week.

I suspect that the number of bits Andres wants to add is no less than 48.

I dunno. Compatibility and speed concerns aside, that seems like an awful
lot of bits to be expending on every page compared to the value.

regards, tom lane

#66Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#65)
Re: Corruption during WAL replay

On Fri, Mar 25, 2022 at 10:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I dunno. Compatibility and speed concerns aside, that seems like an awful
lot of bits to be expending on every page compared to the value.

I dunno either, but over on the TDE thread people seemed quite willing
to expend like 16-32 *bytes* for page verifiers and nonces and things.
For compatibility and speed reasons, I doubt we could ever get by with
doing that in every cluster, but I do have some hope of introducing
something like that someday at least as an optional feature. It's not
like a 16-bit checksum was state-of-the-art even when we introduced
it. We just did it because we had 2 bytes that we could repurpose
relatively painlessly, and not any larger number. And that's still the
case today, so at least in the short term we will have to choose some
other solution to this problem.

--
Robert Haas
EDB: http://www.enterprisedb.com

#67Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#66)
1 attachment(s)
Re: Corruption during WAL replay

Robert Haas <robertmhaas@gmail.com> writes:

... It's not
like a 16-bit checksum was state-of-the-art even when we introduced
it. We just did it because we had 2 bytes that we could repurpose
relatively painlessly, and not any larger number. And that's still the
case today, so at least in the short term we will have to choose some
other solution to this problem.

Indeed. I propose the attached, which also fixes the unsafe use
of seek() alongside syswrite(), directly contrary to what "man perlfunc"
says to do.

regards, tom lane

Attachments:

make-checksum-corruption-more-reliable.patchtext/x-diff; charset=us-ascii; name=make-checksum-corruption-more-reliable.patch
#68Andres Freund
Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#67)
Re: Corruption during WAL replay

Hi,

On 2022-03-25 11:50:48 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

... It's not
like a 16-bit checksum was state-of-the-art even when we introduced
it. We just did it because we had 2 bytes that we could repurpose
relatively painlessly, and not any larger number. And that's still the
case today, so at least in the short term we will have to choose some
other solution to this problem.

Indeed. I propose the attached, which also fixes the unsafe use
of seek() alongside syswrite(), directly contrary to what "man perlfunc"
says to do.

That looks reasonable. Although I wonder if we lose something by not testing
the influence of the rest of the block - but I don't really see anything.

The same code also exists in src/bin/pg_basebackup/t/010_pg_basebackup.pl,
which presumably has the same collision risks. Perhaps we should put a
function into Cluster.pm and use it from both?

Greetings,

Andres Freund

#69Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#68)
Re: Corruption during WAL replay

Andres Freund <andres@anarazel.de> writes:

The same code also exists in src/bin/pg_basebackup/t/010_pg_basebackup.pl,
which presumably has the same collision risks.

Oooh, I missed that.

Perhaps we should put a
function into Cluster.pm and use it from both?

+1, I'll make it so.

regards, tom lane

#70Dagfinn Ilmari Mannsåker
Dagfinn Ilmari Mannsåker
ilmari@ilmari.org
In reply to: Tom Lane (#67)
Re: Corruption during WAL replay

Tom Lane <tgl@sss.pgh.pa.us> writes:

Robert Haas <robertmhaas@gmail.com> writes:

... It's not
like a 16-bit checksum was state-of-the-art even when we introduced
it. We just did it because we had 2 bytes that we could repurpose
relatively painlessly, and not any larger number. And that's still the
case today, so at least in the short term we will have to choose some
other solution to this problem.

Indeed. I propose the attached, which also fixes the unsafe use
of seek() alongside syswrite(), directly contrary to what "man perlfunc"
says to do.

LGTM, but it would be good to include $! in the die messages.

- ilmari


regards, tom lane

diff --git a/src/bin/pg_checksums/t/002_actions.pl b/src/bin/pg_checksums/t/002_actions.pl
index 62c608eaf6..8c70453a45 100644
--- a/src/bin/pg_checksums/t/002_actions.pl
+++ b/src/bin/pg_checksums/t/002_actions.pl
@@ -24,6 +24,7 @@ sub check_relation_corruption
my $tablespace = shift;
my $pgdata     = $node->data_dir;
+	# Create table and discover its filesystem location.
$node->safe_psql(
'postgres',
"CREATE TABLE $table AS SELECT a FROM generate_series(1,10000) AS a;
@@ -37,9 +38,6 @@ sub check_relation_corruption
my $relfilenode_corrupted = $node->safe_psql('postgres',
"SELECT relfilenode FROM pg_class WHERE relname = '$table';");

- # Set page header and block size
- my $pageheader_size = 24;
- my $block_size = $node->safe_psql('postgres', 'SHOW block_size;');
$node->stop;

# Checksums are correct for single relfilenode as the table is not
@@ -55,8 +53,12 @@ sub check_relation_corruption

# Time to create some corruption
open my $file, '+<', "$pgdata/$file_corrupted";
-	seek($file, $pageheader_size, SEEK_SET);
-	syswrite($file, "\0\0\0\0\0\0\0\0\0");
+	my $pageheader;
+	sysread($file, $pageheader, 24) or die "sysread failed";
+	# This inverts the pd_checksum field (only); see struct PageHeaderData
+	$pageheader ^= "\0\0\0\0\0\0\0\0\xff\xff";
+	sysseek($file, 0, 0) or die "sysseek failed";
+	syswrite($file, $pageheader) or die "syswrite failed";
close $file;

# Checksum checks on single relfilenode fail

#71Tom Lane
Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dagfinn Ilmari Mannsåker (#70)
Re: Corruption during WAL replay

Dagfinn Ilmari Mannsåker <ilmari@ilmari.org> writes:

LGTM, but it would be good to include $! in the die messages.

Roger, will do.

regards, tom lane

#72Michael Paquier
Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#65)
Re: Corruption during WAL replay

On Fri, Mar 25, 2022 at 10:34:49AM -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Fri, Mar 25, 2022 at 10:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Adding another 16 bits won't get you to that, sadly. Yeah, it *might*
extend the MTTF to more than the project's likely lifespan, but that
doesn't mean we couldn't get unlucky next week.

I suspect that the number of bits Andres wants to add is no less than 48.

I dunno. Compatibility and speed concerns aside, that seems like an awful
lot of bits to be expending on every page compared to the value.

Err. And there are not that many bits that could be recycled for this
purpose in the current page layout, are there?
--
Michael

#73Kyotaro Horiguchi
Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#37)
Re: Corruption during WAL replay

At Thu, 24 Mar 2022 15:33:29 -0400, Robert Haas <robertmhaas@gmail.com> wrote in

On Thu, Mar 17, 2022 at 9:21 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

All versions pass check world.

Thanks, committed.

(I was overwhelmed by the flood of discussion that followed..)

Anyway, thanks for picking this up and committing it!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#74Stephen Frost
Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#66)
Re: Corruption during WAL replay

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Fri, Mar 25, 2022 at 10:34 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

I dunno. Compatibility and speed concerns aside, that seems like an awful
lot of bits to be expending on every page compared to the value.

I dunno either, but over on the TDE thread people seemed quite willing
to expend like 16-32 *bytes* for page verifiers and nonces and things.

Absolutely.

For compatibility and speed reasons, I doubt we could ever get by with
doing that in every cluster, but I do have some hope of introducing
something like that someday at least as an optional feature. It's not
like a 16-bit checksum was state-of-the-art even when we introduced
it. We just did it because we had 2 bytes that we could repurpose
relatively painlessly, and not any larger number. And that's still the
case today, so at least in the short term we will have to choose some
other solution to this problem.

I agree that this would be great as an optional feature. Those patches
to allow the system to be built with reserved space for $whatever would
allow us to have a larger checksum for those who want it and perhaps a
nonce with TDE for those who wish that in the future. I mentioned
before that I thought it might be a good way to introduce page-level
epochs for 64bit xids too though it never seemed to get much traction.

Anyhow, this whole thread has struck me as a good reason to polish those
patches off and add an extended checksum ability on top of them first,
independent of TDE, removing the dependency of those patches on the TDE
effort and instead letting it just leverage that ability. I still suspect
we'll have some folks who will want TDE w/o a per-page nonce, and that
could be an option, but we'd be able to support TDE w/ integrity pretty
easily, which would be fantastic.

Thanks,

Stephen

#75Robert Haas
Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#74)
Re: Corruption during WAL replay

On Tue, Mar 29, 2022 at 12:34 PM Stephen Frost <sfrost@snowman.net> wrote:

Anyhow, this whole thread has struck me as a good reason to polish those
patches off and add on top of them an extended checksum ability, first,
independent of TDE, and remove the dependency of those patches from the
TDE effort and instead allow it to just leverage that ability. I still
suspect we'll have some folks who will want TDE w/o a per-page nonce and
that could be an option but we'd be able to support TDE w/ integrity
pretty easily, which would be fantastic.

Yes, I like that idea. Once we get beyond feature freeze, perhaps we
can try to coordinate to avoid duplication of effort -- or absence of
effort.

--
Robert Haas
EDB: http://www.enterprisedb.com