Corruption during WAL replay

Started by Teja Mupparti · about 6 years ago · 75 messages · pgsql-hackers
#1Teja Mupparti
tejeswarm@hotmail.com

Attachments:

bug-fix-patch (application/octet-stream), +217/-8
#2Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Teja Mupparti (#1)
Re: Corruption during WAL replay

Thanks for working on this.

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

This is my *first* attempt to submit a Postgres patch; please let me know if I missed any part of the process or the patch format.

Welcome! The format looks fine to me. It would be better if it had a
commit message that explains what the patch does. (in the format that
git format-patch emits.)

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix: in the current code, the engine drops the buffer and then truncates the file, but a crash after the buffer-drop and before the truncate causes the corruption. The patch reverses the order, i.e. it truncates the file first and drops the buffer later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose
strategy (c) in [1], since that protocol aims to allow the server to
continue running after a truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

However, note that md truncates a "file" in a non-atomic way. mdtruncate
truncates multiple segment files from the last one toward the
beginning. If mdtruncate successfully truncates several segments and
then fails, retaining all the buffers triggers an assertion failure
in mdwrite when those buffers are flushed.
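The segment-at-a-time walk can be sketched with a toy, self-contained model (an illustration of mine, not the real md.c code; the segment layout and error injection are simplified):

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_RELSEG_SIZE 131072  /* default blocks per 1GB segment file */

/*
 * Toy model of mdtruncate's segment walk.  seg_blocks[i] is the number of
 * blocks currently stored in segment i; new_total is the desired relation
 * size in blocks; fail_at injects a simulated ftruncate() failure on that
 * segment number (-1 for none).  Returns true only on full success.
 */
static bool
toy_mdtruncate(int seg_blocks[], int nsegs, int new_total, int fail_at)
{
    for (int seg = nsegs - 1; seg >= 0; seg--)
    {
        int seg_start = seg * TOY_RELSEG_SIZE;

        if (new_total <= seg_start)
        {
            /* whole segment is beyond the new EOF: truncate it to 0 */
            if (seg == fail_at)
                return false;   /* error: higher segments already gone! */
            seg_blocks[seg] = 0;
        }
        else
        {
            /* new EOF lands inside this segment: truncate partially, stop */
            if (seg == fail_at)
                return false;
            if (new_total < seg_start + seg_blocks[seg])
                seg_blocks[seg] = new_total - seg_start;
            break;
        }
    }
    return true;
}
```

A failure partway through leaves the trailing segments truncated while the rest keep their old length, which is the state that breaks the "retain all buffers" idea.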

Some typos found:

+ * a backround task might flush them to the disk right after we

s/backround/background/

+ * saved list of buffers that were marked as BM_IO_IN_PRGRESS just
s/BM_IO_IN_PRGRESS/BM_IO_IN_PROGRESS/

+ * as BM_IO_IN_PROGRES. Though the buffers are marked for IO, they
s/BM_IO_IN_PROGRES/BM_IO_IN_PROGRESS/

+ * being dicarded).
s/dicarded/discarded/

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#2)
Re: Corruption during WAL replay

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before the truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and drop the buffer later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.
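The divergence argument can be made concrete with a toy model (illustrative only; nothing here is PostgreSQL code): WAL-log the truncation, skip the action on failure, and let a replica replay the record:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one relation on one node: just its length in blocks. */
typedef struct { int nblocks; } ToyNode;

/*
 * What the pre-critical-section reasoning permits on the primary:
 * WAL-log the truncation first, then attempt it, but soldier on
 * (instead of PANICing) if the filesystem operation fails.
 */
static void
primary_truncate(ToyNode *primary, int new_size, bool truncate_fails,
                 int wal[], int *wal_len)
{
    wal[(*wal_len)++] = new_size;       /* record is written and shipped */
    if (!truncate_fails)
        primary->nblocks = new_size;    /* ...but the action may not happen */
}

/* A replica replays every record it receives, unconditionally. */
static void
replica_replay(ToyNode *replica, const int wal[], int wal_len)
{
    for (int i = 0; i < wal_len; i++)
        replica->nblocks = wal[i];
}
```

Once the record is shipped, a failed-but-ignored truncate on the primary leaves the two nodes with different relation sizes.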

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

Greetings,

Andres Freund

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

At Mon, 30 Mar 2020 16:31:59 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2020-03-24 18:18:12 +0900, Kyotaro Horiguchi wrote:

At Mon, 23 Mar 2020 20:56:59 +0000, Teja Mupparti <tejeswarm@hotmail.com> wrote in

The original bug reporting-email and the relevant discussion is here

...

The crux of the fix is, in the current code, engine drops the buffer and then truncates the file, but a crash before the truncate and after the buffer-drop is causing the corruption. Patch reverses the order i.e. truncate the file and drop the buffer later.

BufferAlloc doesn't wait for the BM_IO_IN_PROGRESS for a valid buffer.

I don't think that's true. For any of this to be relevant the buffer has
to be dirty. In which case BufferAlloc() has to call
FlushBuffer(). Which in turn does a WaitIO() if BM_IO_IN_PROGRESS is
set.

What path are you thinking of? Or alternatively, what am I missing?

# Though I may well be wrong here.

"doesn't" was an overstatement. Is there a case where the buffer has
already been flushed by a checkpoint? (If that is the case, would
dropping only clean buffers at the point of marking the truncate work?)

I'm not sure it's acceptable to remember all to-be-deleted buffers
during truncation.

I don't see a real problem with it. Nor really a good alternative. Note
that for autovacuum truncations we'll only truncate a limited number of
buffers at once, and for most relation truncations we don't enter this
path (since we create a new relfilenode instead).

Thank you for the opinion. I agree to that.

+ /*START_CRIT_SECTION();*/

Is this a point of argument? It is not needed if we choose the
strategy (c) in [1], since the protocol is aiming to allow server to
continue running after truncation failure.

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de

I think it's entirely broken to continue running after a truncation
failure. We obviously have to first WAL log the truncation (since
otherwise we can crash just after doing the truncation). But we cannot
just continue running after WAL logging, but not performing the
associated action: The most obvious reason is that otherwise a replica
will execute the truncation, but the primary will not.

Hmm. If we allow a PANIC on truncation failure, why do we need to go
through the complicated steps? Wouldn't it be enough to enclose the
sequence (WAL insert - drop buffers - truncate) in a critical section? I
believed this project aimed to fix the database breakage by allowing
rollback on truncation failure?

The whole justification for that behaviour "It would turn a usually
harmless failure to truncate, that might spell trouble at WAL replay,
into a certain PANIC." was always dubious (since on-disk and in-memory
state now can diverge), but it's clearly wrong once replication had
entered the picture. There's just no alternative to a critical section
here.

Yeah, I like that direction.

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I think the question above is the core part of the problem.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Teja Mupparti
tejeswarm@hotmail.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also added the critical section (emulated with a try-catch block, since the palloc()s cause problems in the truncate code). This time I used git format-patch.

Regards
Teja


Attachments:

0001-Wal-replay-corruption.patch (application/octet-stream), +252/-37
#6Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#3)
Re: Corruption during WAL replay

On 2020-Mar-30, Andres Freund wrote:

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on. I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

I didn't review Teja's patch carefully, but the idea of panicking on
failure (forcing WAL replay) seems better than the current behavior.
I'd rather have the server wait until the storage is really back.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#7Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#6)
Re: Corruption during WAL replay

Hi,

On 2020-04-10 20:49:05 -0400, Alvaro Herrera wrote:

On 2020-Mar-30, Andres Freund wrote:

If we are really concerned with truncation failing - I don't know why we
would be, we accept that we have to be able to modify files etc to stay
up - we can add a pre-check ensuring that permissions are set up
appropriately to allow us to truncate.

I remember I saw a case where the datadir was NFS or some other network
filesystem thingy, and it lost connection just before autovacuum
truncation, or something like that -- so there was no permission
failure, but the truncate failed and yet PG soldiered on. I think the
connection was re-established soon thereafter and things went back to
"normal", with nobody realizing that a truncate had been lost.
Corruption was discovered a long time afterwards IIRC (weeks or months,
I don't remember).

Yea. In that case we're in a really bad state. Because we truncate after
throwing away the old buffer contents (even if dirty), we'll later read
page contents "from the past". Which won't end well...

Greetings,

Andres Freund

#8Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Teja Mupparti (#5)
Re: Corruption during WAL replay

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

As a second idea, I wonder if we can defer the truncation until commit
time, like the smgrDoPendingDeletes mechanism. The sequence would be:

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

In the above idea, by remembering at RelationTruncate() all buffers
containing data of the to-be-truncated table, we reduce the time spent
checking buffers at commit time. Since we acquire an
AccessExclusiveLock, the number of buffers containing the
to-be-truncated table's data never increases. A downside is that,
since we can truncate multiple relations, we need to remember all
buffers of each truncated relation, which is up to
(sizeof(int) * NBuffers) in total.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#9Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#8)
Re: Corruption during WAL replay

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* OK to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

At CommitTransaction(),
3. Revisit the remembered buffers to check if the buffer still has
table data that needs to be truncated.
4-a, If it has, we mark it as IO_IN_PROGRESS.
4-b, If it already has different table data, ignore it.
5, Truncate physical files.
6, Mark the buffer we marked at #4-a as invalid.

If an error occurs between #3 and #6 or in abort case, we revert all
IO_IN_PROGRESS flags on the buffers.

What would this help with? If we still need the more complicated
truncation sequence?

Greetings,

Andres Freund

#10Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#9)
Re: Corruption during WAL replay

On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Andres and Kyotaro for the quick review. I have fixed the typos and also included the critical section (emulated it with try-catch block since palloc()s are causing issues in the truncate code). This time I used git format-patch.

I briefly looked at the latest patch but I'm not sure it's the right
thing here to use PG_TRY/PG_CATCH to report the PANIC error. For
example, with the following code you changed, we will always end up
with emitting a PANIC "failed to truncate the relation" regardless of
the actual cause of the error.

+   PG_CATCH();
+   {
+       ereport(PANIC, (errcode(ERRCODE_INTERNAL_ERROR),
+           errmsg("failed to truncate the relation")));
+   }
+   PG_END_TRY();

And the comments of RelationTruncate() mentions:

I think that's just a workaround for mdtruncate not being usable in
critical sections.

/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* OK to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me that currently what we can do for this issue is to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation, because that cannot solve the problem of inconsistency on
the standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we wrap the truncation operation in a critical section,
because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping them,
proposed as (a) in [1], also seems not enough.

Regards,

[1]: /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de (Doing sync before truncation)

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#11Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Masahiko Sawada (#10)
Re: Corruption during WAL replay

At Mon, 13 Apr 2020 18:53:26 +0900, Masahiko Sawada <masahiko.sawada@2ndquadrant.com> wrote in

On Mon, 13 Apr 2020 at 17:40, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-04-13 15:24:55 +0900, Masahiko Sawada wrote:

On Sat, 11 Apr 2020 at 09:00, Teja Mupparti <tejeswarm@hotmail.com> wrote:
/*
* We WAL-log the truncation before actually truncating, which means
* trouble if the truncation fails. If we then crash, the WAL replay
* likely isn't going to succeed in the truncation either, and cause a
* PANIC. It's tempting to put a critical section here, but that cure
* would be worse than the disease. It would turn a usually harmless
* failure to truncate, that might spell trouble at WAL replay, into a
* certain PANIC.
*/

Yea, but that reasoning is just plain *wrong*. It's *never* OK to WAL
log something and then not perform the action. This leads to the primary
/ replica getting out of sync, crash recovery potentially not completing
(because of records referencing the should-be-truncated pages), ...

It was introduced in 2008 by commit 3396000684, for 8.4. So it can be
seen as an oversight made when log-shipping was introduced.

The reason other operations like INSERT (which extend the underlying
file) are "safe" after an extension failure is that subsequent
operations are performed in shared buffers as if the new page existed,
and the file extension is then simply retried. So if we continued
working after a truncation failure, we would need to make shared
buffers look as if the truncated pages were gone. But we don't have
room for another flag in the buffer header. For example, BM_DIRTY &&
!BM_VALID might be usable as the state "this page should have been
truncated but hasn't been yet", but I'm not sure.

Anyway, I think the prognosis of a truncation failure is far more
hopeless than that of an extension failure in most cases, and I doubt
it is worth introducing such a complex feature only to overcome such a
hopeless situation.

In short, I think we should PANIC in that case.

As a second idea, I wonder if we can defer truncation until commit
time like smgrDoPendingDeletes mechanism. The sequence would be:

This is mostly an issue during [auto]vacuum partially truncating the end
of the file. We intentionally release the AEL regularly to allow other
accesses to continue.

For transactional truncations we don't go down this path (as we create a
new relfilenode).

At RelationTruncate(),
1. WAL logging.
2. Remember buffers to be dropped.

You definitely cannot do that, as explained above.

Ah yes, you're right.

So it seems to me currently what we can do for this issue would be to
enclose the truncation operation in a critical section. IIUC it's not
enough just to reverse the order of dropping buffers and physical file
truncation because it cannot solve the problem of inconsistency on the
standby. And as Horiguchi-san mentioned, there is no need to reverse
that order if we envelop the truncation operation by a critical
section because we can recover page changes during crash recovery. The
strategy of writing out all dirty buffers before dropping buffers,
proposed as (a) in [1], also seems not enough.

Agreed. Since neither WAL-logging-then-not-performing nor
performing-then-WAL-logging is acceptable, there's no other way than
to work as if the truncation succeeded (and retry it) even when it
actually failed. But that would be too complex.

Just making it a critical section seems the right thing here.

[1] /messages/by-id/20191207001232.klidxnm756wqxvwx@alap3.anarazel.de
Doing sync before truncation

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#12Teja Mupparti
tejeswarm@hotmail.com
In reply to: Kyotaro Horiguchi (#11)
Re: Corruption during WAL replay

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical section around truncate, but I just want to emphasize the need for reversing the order of dropping the buffers and truncating the file.

Repro details (with full_page_writes = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log replay (of step 3 above) reads the page with empty LP 1, and the delete fails with a PANIC (the old page is on disk, with no insert).

During recovery the truncate record is not even reached; a WAL replay of the truncation would happen later, but recovery fails (repeatedly) before getting to that point.
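The timeline can be replayed in a toy, self-contained model (again an illustration of mine, not server code) to confirm that the delete record cannot be applied to the stale on-disk page:

```c
#include <assert.h>
#include <stdbool.h>

enum lp_state { LP_EMPTY, LP_USED };

/* One heap page, as seen on disk and in shared buffers. */
typedef struct
{
    enum lp_state disk_lp1;     /* on-disk state of line pointer 1 */
    enum lp_state buf_lp1;      /* shared-buffer copy */
    bool          buf_valid;
    bool          buf_dirty;
} ToyPage;

/*
 * Steps 4 and 6 of the repro: DropRelFileNodeBuffers throws the dirty
 * buffer away without writing it, and we crash before smgrtruncate runs.
 */
static void
drop_buffers_then_crash(ToyPage *pg)
{
    pg->buf_valid = false;      /* dirty contents are simply discarded */
    pg->buf_dirty = false;
}

/*
 * Replay of the DELETE from step 3.  Redo starts at step 2, after the
 * insert of step 1, so the insert is never replayed; replay can succeed
 * only if the tuple the record wants to delete exists on the disk page.
 */
static bool
replay_delete_lp1(const ToyPage *pg)
{
    return pg->disk_lp1 == LP_USED;
}
```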

Best regards,
Teja


#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Teja Mupparti (#12)
Re: Corruption during WAL replay

On Wed, 15 Apr 2020 at 04:04, Teja Mupparti <tejeswarm@hotmail.com> wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a consensus on the critical-section around truncate, but I just want to emphasize the need for reversing the order of the dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the delete fails with PANIC (old page on disk with no insert)

I agree that when replaying the deletion of step (3) the page's LP 1 is
empty, but does that replay really fail with a PANIC? I guess that we
record that page in invalid_page_tab but don't raise a PANIC in this
case.

Regards,

--
Masahiko Sawada http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14Daniel Wood
hexexpert@comcast.net
In reply to: Teja Mupparti (#12)
Wait profiling

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

I feel that it would be useful to have the following two things: one PG enhancement and one standard extension.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

62.6% BufFileRead
50.0% CPU
9.3% LWLock

It uses a PG timer to do this.

2) An extension-based function like: select pg_wait_profile(pid, nSeconds, timerFrequency) to return the same thing for an already-running query. Useful if you want to examine some long-running query that is taking too long.

Neither of these would go through the heavyweight pg_stat_activity machinery, but would directly poll the wait event in PGPROC. I've already coded the EXPLAIN option.

Furthermore, can't we just remove the following "IF" test from pgstat_report_wait_{start,end}?
if (!pgstat_track_activities || !proc)
return;
Just do the assignment of wait_event_info always. We should use a dummy PGPROC assigned to MyProc until we assign the one in the procarray in shared memory. That way we don't need the "!proc" test.
About the only thing I'd want to verify is whether wait_event_info is on the same cache lines as anything else having to do with snapshots.

If I recall correctly, the blank lines I've used above to make this more readable will disappear. :-(

- Dan Wood

#15Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Daniel Wood (#14)
Re: Wait profiling

On 2020-Jul-10, Daniel Wood wrote:

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

Yes, we have pg_stat_activity.wait_event, which implements pretty much
what Ildus describes there.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

There's an out-of-core extension, search for pg_wait_sampling. I
haven't tested it yet ...

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#16Julien Rouhaud
rjuju123@gmail.com
In reply to: Alvaro Herrera (#15)
Re: Wait profiling

On Fri, Jul 10, 2020 at 10:37 PM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2020-Jul-10, Daniel Wood wrote:

After nearly 5 years, does something like the following exist yet?
/messages/by-id/559D4729.9080704@postgrespro.ru

Yes, we have pg_stat_activity.wait_event, which implements pretty much
what Ildus describes there.

1) An option to "explain" to produce a wait events profile.
postgres=# explain (analyze, waitprofile) update pgbench_accounts set bid=bid+1 where aid < 2000000;
...
Execution time: 23111.231 ms

There's an out-of-core extension, search for pg_wait_sampling. I
haven't tested it yet ...

I use it, and I know multiple people who are also using it (or are
about to; it's currently being packaged) in production. It's working quite
well and is compatible with pg_stat_statements' queryid. You can see
some examples of dashboards that can be built on top of this extension
at https://powa.readthedocs.io/en/latest/components/stats_extensions/pg_wait_sampling.html.

#17Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Teja Mupparti (#12)
Re: Corruption during WAL replay

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
   DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the
delete fails with PANIC (old page on disk with no insert)

During recovery, the truncate is not even reached; a WAL replay of the
truncation would happen in the future, but recovery fails (repeatedly)
even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
truncating the file would open a different issue:

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and starts truncating
5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated
away, expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before
truncating the file to fix that, but I'm wary of that approach. Firstly,
it requires scanning the buffers that are dropped twice, which can take
a long time. I remember that people have already complained that
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
once. More importantly, abusing the BM_IO_INPROGRESS flag for this seems
bad. For starters, because you're not holding buffer's I/O lock, I
believe the checkpointer would busy-wait on the buffers until the
truncation has completed. See StartBufferIO() and AbortBufferIO().

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a
checkpoint, right after the redo point has been established. See
comments around the GetVirtualXIDsDelayingChkpt() function call in
CreateCheckPoint(). We could have a similar mechanism to wait out the
truncations before *completing* a checkpoint.

- Heikki

#18Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#17)
Re: Corruption during WAL replay

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

but I just want to emphasize the need for reversing the order of
dropping the buffers and the truncation.

Repro details (when full page write = off)

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and truncates the pages
   DropRelFileNodeBuffers - Dirty page NOT written, LP 1 on disk still empty
5) Checkpoint completes
6) Crash
7) smgrtruncate - Not reached (this is where we do the physical truncate)

Now the crash-recovery starts

Delete-log-replay (above step-3) reads page with empty LP 1 and the
delete fails with PANIC (old page on disk with no insert)

During recovery, the truncate is not even reached; a WAL replay of the
truncation would happen in the future, but recovery fails (repeatedly)
even before reaching that point.

Hmm. I think simply reversing the order of DropRelFileNodeBuffers() and
truncating the file would open a different issue:

1) Page on disk has empty LP 1, Insert into page LP 1
2) checkpoint START (Recovery REDO eventually starts here)
3) Delete all rows on the page (page is empty now)
4) Autovacuum kicks in and starts truncating
5) smgrtruncate() truncates the file
6) checkpoint writes out buffers for pages that were just truncated away,
expanding the file again.

Your patch had a mechanism to mark the buffers as io-in-progress before
truncating the file to fix that, but I'm wary of that approach. Firstly, it
requires scanning the buffers that are dropped twice, which can take a long
time.

I was thinking that we'd keep track of all the buffers marked as "in
progress" that way, avoiding the second scan.

It's also worth keeping in mind that this code is really only relevant
for partial truncations, which don't happen at the same frequency as
transactional truncations.

I remember that people have already complained that
DropRelFileNodeBuffers() is slow, when it has to scan all the buffers
once.

But that's normally when dropping many relations, e.g. at the end of a
regression test.

More importantly, abusing the BM_IO_INPROGRESS flag for this seems
bad. For starters, because you're not holding buffer's I/O lock, I
believe the checkpointer would busy-wait on the buffers until the
truncation has completed. See StartBufferIO() and AbortBufferIO().

I think we should apply Robert's patch that turns the I/O locks into
condition variables. Then we can fairly easily have many buffers
I/O-locked. Obviously there are some issues with doing so in the back
branches :(

I'm working on an AIO branch, and that also requires to be able to mark
multiple buffers as in-progress, FWIW.

Perhaps a better approach would be to prevent the checkpoint from
completing, until all in-progress truncations have completed. We have a
mechanism to wait out in-progress commits at the beginning of a checkpoint,
right after the redo point has been established. See comments around the
GetVirtualXIDsDelayingChkpt() function call in CreateCheckPoint(). We could
have a similar mechanism to wait out the truncations before *completing* a
checkpoint.

What I outlined earlier *is* essentially a way to do so, by preventing
checkpointing from finishing the buffer scan while a dangerous state
exists.

Greetings,

Andres Freund

#19Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Andres Freund (#18)
Re: Corruption during WAL replay

Status update for a commitfest entry.

I see quite a few unanswered questions in the thread since the last patch version was sent. So, I move it to "Waiting on Author".

The new status of this patch is: Waiting on Author

#20Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#18)
Re: Corruption during WAL replay

On Tue, Aug 18, 2020 at 3:22 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2020-08-17 14:05:37 +0300, Heikki Linnakangas wrote:

On 14/04/2020 22:04, Teja Mupparti wrote:

Thanks Kyotaro and Masahiko for the feedback. I think there is a
consensus on the critical-section around truncate,

+1

I'm inclined to think that we should do that independent of the far more
complicated fix for other related issues.

+1

If we had a critical section in RelationTruncate() and a PANIC
happened, crash recovery would keep failing until the state of the
underlying file is repaired. The current comment in
RelationTruncate() says the cure is worse than the disease. But considering
physical replication, as Andres mentioned, a failure to truncate the
file after logging the WAL is no longer a harmless failure. Also, the
critical section would be necessary even if we reversed the order of
truncation and buffer dropping to resolve the issue. So I agree with
proceeding with the patch that adds a critical section, independent of
fixing the other related things discussed in this thread. If Teja
doesn't seem to be working on this, I'll write the patch.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#21Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Masahiko Sawada (#20)
#22Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#18)
#23Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Kyotaro Horiguchi (#22)
#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ibrar Ahmed (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#24)
#26Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#25)
#27Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#26)
#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#27)
#29Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Tom Lane (#28)
#30Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#25)
#31Daniel Shelepanov
deniel1495@mail.ru
In reply to: Kyotaro Horiguchi (#30)
#32Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Daniel Shelepanov (#31)
#33Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#32)
#34Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#33)
#35Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#34)
#36Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#35)
#37Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#36)
#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#37)
#39Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#38)
#40Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#39)
#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#40)
#42Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#41)
#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#42)
#44Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#40)
#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#43)
#46Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#43)
#47Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#45)
#48Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#46)
#49Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#48)
#50Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#47)
#51Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#49)
#52Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#50)
#53Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#52)
#54Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#53)
#55Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#54)
#56Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#54)
#57Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#55)
#58Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#56)
#59Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#58)
#60Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#59)
#61Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#59)
#62Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#61)
#63Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#62)
#64Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#63)
#65Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#64)
#66Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#65)
#67Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#66)
#68Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#67)
#69Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#68)
#70Dagfinn Ilmari Mannsåker
In reply to: Tom Lane (#67)
#71Tom Lane
tgl@sss.pgh.pa.us
In reply to: Dagfinn Ilmari Mannsåker (#70)
#72Michael Paquier
michael@paquier.xyz
In reply to: Tom Lane (#65)
#73Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Robert Haas (#37)
#74Stephen Frost
sfrost@snowman.net
In reply to: Robert Haas (#66)
#75Robert Haas
robertmhaas@gmail.com
In reply to: Stephen Frost (#74)