Corruption during WAL replay

Started by Teja Mupparti · about 6 years ago · 75 messages · pgsql-hackers
#1Teja Mupparti
tejeswarm@hotmail.com

Attachments:

bug-fix-patch (application/octet-stream), +217/-8
#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Teja Mupparti (#1)
Re: Corruption during WAL replay

Thanks for working on this.

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

This is my *first* attempt to submit a Postgres patch; please let me know if I missed any part of the process or the patch format.

Welcome! The format looks fine to me. It would be better if it had a
commit message that explains what the patch does. (in the format that
git format-patch emits.)

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix: in the current code, the engine drops the buffer and then truncates the file, but a crash after the buffer-drop and before the truncate causes the corruption. The patch reverses the order, i.e. it truncates the file first and drops the buffer later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose
strategy (c) in [1], since that protocol aims to allow the server to
continue running after a truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

However, note that md truncates a "file" in a non-atomic way. mdtruncate
truncates multiple segment files from the last one toward the
beginning. If mdtruncate successfully truncates several segments and
then fails, retaining all the buffers triggers an assertion failure
in mdwrite when those buffers are flushed.
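The segment-at-a-time walk can be sketched with a toy, self-contained model (an illustration of mine, not the real md.c code; the segment layout and error injection are simplified):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RELSEG_SIZE 131072  /* default blocks per 1GB segment file */

/*
 * Toy model of mdtruncate's segment walk.  seg_blocks[i] is the number of
 * blocks currently stored in segment i; new_total is the desired relation
 * size in blocks; fail_at injects a simulated ftruncate() failure on that
 * segment number (-1 for none).  Returns true only on full success.
 */
static bool
toy_mdtruncate(int seg_blocks[], int nsegs, int new_total, int fail_at)
{
    for (int seg = nsegs - 1; seg >= 0; seg--)
    {
        int seg_start = seg * TOY_RELSEG_SIZE;

        if (new_total <= seg_start)
        {
            /* whole segment is beyond the new EOF: truncate it to 0 */
            if (seg == fail_at)
                return false;   /* error: higher segments already gone! */
            seg_blocks[seg] = 0;
        }
        else
        {
            /* new EOF lands inside this segment: truncate partially, stop */
            if (seg == fail_at)
                return false;
            if (new_total < seg_start + seg_blocks[seg])
                seg_blocks[seg] = new_total - seg_start;
            break;
        }
    }
    return true;
}
```

A failure partway through leaves the trailing segments truncated while the rest keep their old length, which is the state that breaks the "retain all buffers" idea.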

Some typos found:

+ * a backround task might flush them to the disk right after we

s/backround/background/

+ * saved list of buffers that were marked as BM_IO_IN_PRGRESS just
s/BM_IO_IN_PRGRESS/BM_IO_IN_PROGRESS/

+ * as BM_IO_IN_PROGRES. Though the buffers are marked for IO, they
s/BM_IO_IN_PROGRES/BM_IO_IN_PROGRESS/

+ * being dicarded).
s/dicarded/discarded/

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#2)
Re: Corruption during WAL replay

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before the truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and drop the buffer later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.
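The divergence argument can be made concrete with a toy model (illustrative only; nothing here is PostgreSQL code): WAL-log the truncation, skip the action on failure, and let a replica replay the record:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one relation on one node: just its length in blocks. */
typedef struct { int nblocks; } ToyNode;

/*
 * What the pre-critical-section reasoning permits on the primary:
 * WAL-log the truncation first, then attempt it, but soldier on
 * (instead of PANICing) if the filesystem operation fails.
 */
static void
primary_truncate(ToyNode *primary, int new_size, bool truncate_fails,
                 int wal[], int *wal_len)
{
    wal[(*wal_len)++] = new_size;       /* record is written and shipped */
    if (!truncate_fails)
        primary->nblocks = new_size;    /* ...but the action may not happen */
}

/* A replica replays every record it receives, unconditionally. */
static void
replica_replay(ToyNode *replica, const int wal[], int wal_len)
{
    for (int i = 0; i < wal_len; i++)
        replica->nblocks = wal[i];
}
```

Once the record is shipped, a failed-but-ignored truncate on the primary leaves the two nodes with different relation sizes.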

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

At Mon, 30 Mar 2020 16:31:59 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before the truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and drop the buffer later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?

# Though I may well be wrong here.

"doesn't" was an overstatement. Is there a case where the buffer has
already been flushed by a checkpoint? (If that is the case, would
dropping only clean buffers at the point of marking the truncate work?)

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).

Thank you for the opinion. I agree to that.

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

Hmm. If we allow a PANIC on truncation failure, why do we need to go
through the complicated steps? Wouldn't it be enough to enclose the
sequence (WAL insert - drop buffers - truncate) in a critical section? I
believed this project aimed to fix the database breakage by allowing
rollback on truncation failure?

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.

Yeah, I like that direction.

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I think the question above is the core part of the problem.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Teja Mupparti
tejeswarm@hotmail.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also added the critical section (emulated with a try-catch block, since the palloc()s cause problems in the truncate code). This time I used git format-patch.

Regards
Teja


Attachments:

0001-Wal-replay-corruption.patch (application/octet-stream), +252/-37
#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

On 2020-Mar-30, Andres Freund wrote:

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on. I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

I didn't review Teja's patch carefully, but the idea of panicking on
failure (forcing WAL replay) seems better than the current behavior.
I'd rather have the server wait until the storage is really back.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#6)
Re: Corruption during WAL replay

Hi,

On 2020-04-10 20:49:05 -0400, Alvaro Herrera wrote:

On 2020-Mar-30, Andres Freund wrote:

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on. I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

Yea. In that case we're in a really bad state. Because we truncate after
throwing away the old buffer contents (even if dirty), we'll later read
page contents "from the past". Which won't end well...

Greetings,

Andres Freund

#8Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Teja Mupparti (#5)
Re: Corruption during WAL replay

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

As a second idea, I wonder if we can defer the truncation until commit
time, like the smgrDoPendingDeletes mechanism. The sequence would be:

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

In the above idea, by remembering at RelationTruncate() all buffers
containing data of the to-be-truncated table, we reduce the time spent
checking buffers at commit time. Since we acquire an
AccessExclusiveLock, the number of buffers containing the
to-be-truncated table's data never increases. A downside is that,
since we can truncate multiple relations, we need to remember all
buffers of each truncated relation, which is up to
(sizeof(int) * NBuffers) in total.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#8)
Re: Corruption during WAL replay

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* OK to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

What would this help with? If we still need the more complicated
truncation sequence?

Greetings,

Andres Freund

#10Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#9)
Re: Corruption during WAL replay

On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* OK to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me that currently what we can do for this issue is to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation, because that cannot solve the problem of inconsistency on
the standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we wrap the truncation operation in a critical section,
because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping them,
proposed as (a) in [1], also seems not enough.

Regards,

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de (Doing sync before truncation)

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#10)
Re: Corruption during WAL replay

At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* OK to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

It was introduced in 2008 by commit 3396000684, for 8.4. So it can be
seen as an oversight made when log-shipping was introduced.

The reason other operations like INSERT (which extend the underlying
file) are "safe" after an extension failure is that subsequent
operations are performed in shared buffers as if the new page existed,
and the file extension is then simply retried. So if we continued
working after a truncation failure, we would need to make shared
buffers look as if the truncated pages were gone. But we don't have
room for another flag in the buffer header. For example, BM_DIRTY &&
!BM_VALID might be usable as the state "this page should have been
truncated but hasn't been yet", but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far more
hopeless than that of an extension failure in most cases, and I doubt
it is worth introducing such a complex feature only to overcome such a
hopeless situation.

In short, I think we should PANIC in that case.

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me currently what we can do for this issue would be to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation because it cannot solve the problem of inconsistency on the
standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we envelop the truncation operation by a critical
section because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping buffers,
proposed as (a) in [1], also seems not enough.

Agreed. Since neither WAL-logging-then-not-performing nor
performing-then-WAL-logging is acceptable, there's no other way than
to work as if the truncation succeeded (and retry it) even when it
actually failed. But that would be too complex.

Just making it a critical section seems the right thing here.

[1] /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de
Doing sync before truncation

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Teja Mupparti
tejeswarm@hotmail.com
In reply to: Kyotaro Horiguchi (#11)
Re: Corruption during WAL replay

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical section around truncate, but I just want to emphasize the need for reversing the order of dropping the buffers and truncating the file.

Repro details (with full_page_writes = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log replay (of step 3 above) reads the page with empty LP 1, and the delete fails with a PANIC (the old page is on disk, with no insert).

During recovery the truncate record is not even reached; a WAL replay of the truncation would happen later, but recovery fails (repeatedly) before getting to that point.
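The timeline can be replayed in a toy, self-contained model (again an illustration of mine, not server code) to confirm that the delete record cannot be applied to the stale on-disk page:

```c
#include <assert.h>
#include <stdbool.h>

enum lp_state { LP_EMPTY, LP_USED };

/* One heap page, as seen on disk and in shared buffers. */
typedef struct
{
    enum lp_state disk_lp1;     /* on-disk state of line pointer 1 */
    enum lp_state buf_lp1;      /* shared-buffer copy */
    bool          buf_valid;
    bool          buf_dirty;
} ToyPage;

/*
 * Steps 4 and 6 of the repro: DropRelFileNodeBuffers throws the dirty
 * buffer away without writing it, and we crash before smgrtruncate runs.
 */
static void
drop_buffers_then_crash(ToyPage *pg)
{
    pg->buf_valid = false;      /* dirty contents are simply discarded */
    pg->buf_dirty = false;
}

/*
 * Replay of the DELETE from step 3.  Redo starts at step 2, after the
 * insert of step 1, so the insert is never replayed; replay can succeed
 * only if the tuple the record wants to delete exists on the disk page.
 */
static bool
replay_delete_lp1(const ToyPage *pg)
{
    return pg->disk_lp1 == LP_USED;
}
```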

Best regards,
Teja


#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Teja Mupparti (#12)
Re: Corruption during WAL replay

On Wed, 15 Apr 2020 at 04:04, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate, but I just want to emphasize the need for reversing the order of the dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert)

I agree that when replaying the deletion of step (3) the page's LP 1 is
empty, but does that replay really fail with a PANIC? I guess that we
record that page in invalid_page_tab but don't raise a PANIC in this
case.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14Daniel Wood
hexexpert@comcast.net
In reply to: Teja Mupparti (#12)
Wait profiling

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

I feel that it would be useful to have the following two things: one PG enhancement and one standard extension.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

62.6% BufFileRead
50.0% CPU
9.3% LWLock

It uses a PG timer to do this.

2) An extension-based function like: select pg_wait_profile(pid, nSeconds, timerFrequency) to return the same thing for an already-running query. Useful if you want to examine some long-running query that is taking too long.

Neither of these would go through the heavyweight pg_stat_activity machinery, but would directly poll the wait event in PGPROC. I've already coded the EXPLAIN option.

Furthermore, can't we just remove the following "IF" test from pgstat_report_wait_{start,end}?
if (!pgstat_track_activities || !proc)
return;
Just do the assignment of wait_event_info always. We should use a dummy PGPROC assigned to MyProc until we assign the one in the procarray in shared memory. That way we don't need the "!proc" test.
About the only thing I'd want to verify is whether wait_event_info is on the same cache lines as anything else having to do with snapshots.

If I recall correctly, the blank lines I've used above to make this more readable will disappear. :-(

- Dan Wood

#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Daniel Wood (#14)
Re: Wait profiling

On 2020-Jul-10, Daniel Wood wrote:

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

Yes, we have pg_stat_activity.wait_event, which implements pretty much
what Ildus describes there.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

There's an out-of-core extension, search for pg_wait_sampling. I
haven't tested it yet ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Julien Rouhaud
rjuju123@gmail.com
In reply to: Alvaro Herrera (#15)
Re: Wait profiling

On Fri, Jul 10, 2020 at 10:37 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Jul-10, Daniel Wood wrote:

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

Yes, we have pg_stat_activity.wait_event, which implements pretty much
what Ildus describes there.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

There's an out-of-core extension, search for pg_wait_sampling. I
haven't tested it yet ...

I use it, and I know multiple people who are also using it (or are
about to; it's currently being packaged) in production. It's working quite
well and is compatible with pg_stat_statements' queryid. You can see
some examples of dashboards that can be built on top of this extension
at https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_wait_sampling.html.

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Teja Mupparti (#12)
Re: Corruption during WAL replay

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
   DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the
delete fails with PANIC (old page on disk with no insert)

During recovery, the truncate is not even reached; a WAL replay of the
truncation would happen in the future, but recovery fails (repeatedly)
even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
truncating the file would open a different issue:

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and starts truncating
5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated
away, expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before
truncating the file to fix that, but I'm wary of that approach. Firstly,
it requires scanning the buffers that are dropped twice, which can take
a long time. I remember that people have already complained that
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems
bad. For starters, because you're not holding buffer's I/O lock, I
believe the checkpointer would busy-wait on the buffers until the
truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a
checkpoint, right after the redo point has been established. See
comments around the GetVirtualXIDsDelayingChkpt() function call in
CreateCheckPoint(). We could have a similar mechanism to wait out the
truncations before *completing* a checkpoint.

- Heikki

#18Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#17)
Re: Corruption during WAL replay

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
   DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the
delete fails with PANIC (old page on disk with no insert)

During recovery, the truncate is not even reached; a WAL replay of the
truncation would happen in the future, but recovery fails (repeatedly)
even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
truncating the file would open a different issue:

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and starts truncating
5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated away,
expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before
truncating the file to fix that, but I'm wary of that approach. Firstly, it
requires scanning the buffers that are dropped twice, which can take a long
time.

I was thinking that we'd keep track of all the buffers marked as "in
progress" that way, avoiding the second scan.

It's also worth keeping in mind that this code is really only relevant
for partial truncations, which don't happen at the same frequency as
transactional truncations.

I remember that people have already complained that
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
once.

But that's normally when dropping many relations, e.g. at the end of a
regression test.

More importantly, abusing the BM_IO_INPROGRESS flag for this seems
bad. For starters, because you're not holding buffer's I/O lock, I
believe the checkpointer would busy-wait on the buffers until the
truncation has completed. See StartBufferIO() and AbortBufferIO().

I think we should apply Robert's patch that turns the I/O locks into
condition variables. Then we can fairly easily have many buffers
I/O-locked. Obviously there are some issues with doing so in the back
branches :(

I'm working on an AIO branch, and that also requires to be able to mark
multiple buffers as in-progress, FWIW.

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a checkpoint,
right after the redo point has been established. See comments around the
GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
have a similar mechanism to wait out the truncations before *completing* a
checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Greetings,

Andres Freund

#19Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Andres Freund (#18)
Re: Corruption during WAL replay

Status update for a commitfest entry.

I see quite a few unanswered questions in the thread since the last patch version was sent. So, I move it to "Waiting on Author".

The new status of this patch is: Waiting on Author

#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#18)
Re: Corruption during WAL replay

On Tue, Aug 18, 2020 at 3:22 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

+1

If we had a critical section in RelationTruncate() and a PANIC
happened, crash recovery would keep failing until the state of the
underlying file is repaired. The current comment in
RelationTruncate() says the cure is worse than the disease. But considering
physical replication, as Andres mentioned, a failure to truncate the
file after logging the WAL is no longer a harmless failure. Also, the
critical section would be necessary even if we reversed the order of
truncation and buffer dropping to resolve the issue. So I agree with
proceeding with the patch that adds a critical section, independent of
fixing the other related things discussed in this thread. If Teja
doesn't seem to be working on this, I'll write the patch.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#21Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Masahiko Sawada (#20)
#22Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#18)
#23Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Kyotaro Horiguchi (#22)
#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ibrar Ahmed (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
#29Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#28)
#30Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#25)
#31Daniel Shelepanov
deniel1495@mail.ru
In reply to: Kyotaro Horiguchi (#30)
#32Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Shelepanov (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#32)
#34Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#34)
#36Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#39)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#41)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#42)
#44Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#40)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#43)
#46Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#43)
#47Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#45)
#48Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#46)
#49Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#48)
#50Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#47)
#51Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#49)
#52Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#50)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#52)
#54Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#53)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#54)
#56Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#54)
#57Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#55)
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#56)
#59Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#58)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#59)
#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#59)
#62Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#61)
#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#62)
#64Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#63)
#65Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#64)
#66Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#65)
#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#66)
#68Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#67)
#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#68)
#70Dagfinn Ilmari Mannsåker
In reply to: Tom Lane (#67)
#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dagfinn Ilmari Mannsåker (#70)
#72Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#65)
#73Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#37)
#74Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#66)
#75Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#74)