FSM corruption leading to errors

Started by Pavan Deolasee · over 9 years ago · 35 messages · pgsql-hackers
#1Pavan Deolasee
pavan.deolasee@gmail.com

I investigated a bug report from one of our customers and it looked very
similar to previous bug reports here [1], [2], [3] (and probably more). In
these reports, the error looks something like this:

ERROR: could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint(). The function
returns without setting the DIRTY bit on the standby:

3413             /*
3414              * If we're in recovery we cannot dirty a page because of a hint.
3415              * We can set the hint, just not dirty the page as a result so the
3416              * hint is lost when we evict the page or shutdown.
3417              *
3418              * See src/backend/storage/page/README for longer discussion.
3419              */
3420             if (RecoveryInProgress())
3421                 return;
3422

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
the FSM. I think that's usually alright because FSM changes are not WAL
logged, and if the FSM ever returns a block with less free space than the
caller needs, the caller is prepared to update the FSM and request a new
block. But if it returns a block that is beyond the end of the relation,
then we're in trouble: the very next ReadBuffer() fails to handle such a
block and throws the error.

When a relation is truncated, the FSM is truncated too to remove references
to the heap blocks that are being truncated. But since the FSM buffer may
not be marked DIRTY on the standby, if the buffer gets evicted from the
buffer cache, the on-disk copy of the FSM page may be left with references
to the truncated heap pages. When the standby is later promoted to be the
master, and an insert/update is attempted to the table, the FSM may return
a block that is outside the valid range of the relation. That results in
the said error.

Once this was clear, it was easy to put together a fully reproducible test
case. See the attached script; you'll need to adjust to your environment.
This affects all releases starting with 9.3, and the script can reproduce
the problem on all of them.
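The attached script itself is not shown here, but the sequence it exercises can be sketched roughly as follows (table name and sizes are illustrative, not taken from the actual script, which must be adjusted per environment):

```sql
-- On the primary: grow a table, then shrink it so that VACUUM truncates
-- the relation (and, with it, the FSM).
CREATE TABLE fsm_check (a int, b text);
INSERT INTO fsm_check
SELECT g, repeat('x', 100) FROM generate_series(1, 100000) g;
DELETE FROM fsm_check WHERE a > 1000;
VACUUM fsm_check;          -- truncates the heap and the FSM

-- On the standby: the FSM truncation is replayed, but the FSM buffer is
-- dirtied only via MarkBufferDirtyHint(), which is a no-op in recovery.
-- If the buffer is then evicted (restart or cache pressure), the stale
-- on-disk FSM page survives.

-- After promoting the standby: an insert consults the stale FSM, gets a
-- block number beyond the end of the relation, and ReadBuffer() throws
-- the "could not read block ...: read only 0 of 8192 bytes" error.
INSERT INTO fsm_check VALUES (0, 'boom');
```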

I believe the fix is very simple. The FSM change during truncation is
critical, and the buffer must be marked with MarkBufferDirty(), i.e. those
changes must make it to disk. I think it's alright not to WAL-log them,
because XLOG_SMGR_TRUNCATE redo will redo them if a crash occurs, but they
must not be lost across a checkpoint. Also, since this happens only during
relation truncation, I don't see any problem from a performance perspective.

What bothers me is how to fix the problem for already-affected standbys. If
the FSM for some table is already corrupted on the standby, users won't
notice it until the standby is promoted to be the new master. If the
standby suddenly starts throwing errors after a failover, it will be a very
bad situation for the users, as we saw with our customers. The fix itself is
simple (delete the FSM and VACUUM the table), but that is not a nice
experience, and users would not know to do it until they hit the problem.
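For reference, the manual repair mentioned above can be sketched like this (the table name is hypothetical; the FSM fork is the file with the `_fsm` suffix next to the main relation file):

```sql
-- Locate the relation's files; the free space map fork has an "_fsm" suffix.
SELECT pg_relation_filepath('some_table');   -- e.g. base/16390/572026

-- Stop the server, remove base/16390/572026_fsm from the data directory,
-- start the server again, and rebuild the free space map:
VACUUM some_table;
```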

One idea is to always check whether the block returned by the FSM is outside
the valid range, and discard such blocks after updating the FSM (the
attached patch does that). The problem with that approach is that
RelationGetNumberOfBlocks() is not really cheap, and invoking it every time
the FSM is consulted may not be a bright idea. Can we cache that value in
RelationData or some such place (BulkInsertState?) and use it as a hint for
quickly checking whether the block is (potentially) outside the range and
discarding it? Any other ideas?

The other concern I have (and, to be honest, what I initially thought was
the real problem until I saw the RecoveryInProgress()-specific code) is:
can this also affect stand-alone masters? The comments at
MarkBufferDirtyHint() made me think so:

3358  * 3. This function does not guarantee that the buffer is always marked dirty
3359  *    (due to a race condition), so it cannot be used for important changes.

So I was working with a theory that somehow updates to the FSM page are
lost because the race mentioned in the comment actually kicks in. But I'm
not sure if the race is only possible when the caller is holding a SHARE
lock on the buffer. When the FSM is truncated, the caller holds an
EXCLUSIVE lock on the FSM buffer. So probably we're safe. I could not
reproduce the issue on a stand-alone master. But probably worth checking.

It might also be a good idea to inspect other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
standby perspective. I did one round, and couldn't see another problem.

Thanks,
Pavan

[1]: /messages/by-id/CAJakt-8=aXa-F7uFeLAeSYhQ4wFuaX3+ytDuDj9c8Gx6S_ou=w@mail.gmail.com
[2]: /messages/by-id/20160601134819.30392.85324@wrigleys.postgresql.org
[3]: /messages/by-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0@AMSPR06MB504.eurprd06.prod.outlook.com

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

pg_fsm_corruption_fix.patch (application/octet-stream, +72 −4)
pg_trigger_fsm_error.sh (application/x-sh)
#2Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Pavan Deolasee (#1)
Re: FSM corruption leading to errors

06.10.2016 20:59, Pavan Deolasee:

I investigated a bug report from one of our customers and it looked
very similar to previous bug reports here [1], [2], [3] (and probably
more). In these reports, the error looks something like this:

ERROR: could not read block 28991 in file "base/16390/572026": read
only 0 of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint(). The
function returns without setting the DIRTY bit on the standby:

3413             /*
3414              * If we're in recovery we cannot dirty a page because of a hint.
3415              * We can set the hint, just not dirty the page as a result so the
3416              * hint is lost when we evict the page or shutdown.
3417              *
3418              * See src/backend/storage/page/README for longer discussion.
3419              */
3420             if (RecoveryInProgress())
3421                 return;
3422

freespace.c freely uses MarkBufferDirtyHint() whenever changes are
made to the FSM. I think that's usually alright because FSM changes
are not WAL logged and if FSM ever returns a block with less free
space than the caller needs, the caller is usually prepared to update
the FSM and request for a new block. But if it returns a block that is
outside the size of the relation, then we've a trouble. The very next
ReadBuffer() fails to handle such a block and throws the error.

When a relation is truncated, the FSM is truncated too to remove
references to the heap blocks that are being truncated. But since the
FSM buffer may not be marked DIRTY on the standby, if the buffer gets
evicted from the buffer cache, the on-disk copy of the FSM page may be
left with references to the truncated heap pages. When the standby is
later promoted to be the master, and an insert/update is attempted to
the table, the FSM may return a block that is outside the valid range
of the relation. That results in the said error.

Once this was clear, it was easy to put together a fully reproducible
test case. See the attached script; you'll need to adjust to your
environment. This affects all releases starting 9.3 and the script can
reproduce the problem on all these releases.

I believe the fix is very simple. The FSM change during truncation is
critical and the buffer must be marked by MarkBufferDirty() i.e. those
changes must make to the disk. I think it's alright not to WAL log
them because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs.
But it must not be lost across a checkpoint. Also, since it happens
only during relation truncation, I don't see any problem from
performance perspective.

What bothers me is how to fix the problem for already affected
standbys. If the FSM for some table is already corrupted at the
standby, users won't notice it until the standby is promoted to be the
new master. If the standby starts throwing errors suddenly after
failover, it will be a very bad situation for the users, like we
noticed with our customers. The fix is simple and users can just
delete the FSM (and VACUUM the table), but that doesn't sound nice and
they would not know until they see the problem.

One idea is to always check if the block returned by the FSM is
outside the range and discard such blocks after setting the FSM
(attached patch does that). The problem with that approach is that
RelationGetNumberOfBlocks() is not really cheap and invoking it
everytime FSM is consulted may not be a bright idea. Can we cache that
value in the RelationData or some such place (BulkInsertState?) and
use that as a hint for quickly checking if the block is (potentially)
outside the range and discard it? Any other ideas?

The other concern I've and TBH that's what I initially thought as the
real problem, until I saw RecoveryInProgress() specific code, is: can
this also affect stand-alone masters? The comments at
MarkBufferDirtyHint() made me think so:

3358  * 3. This function does not guarantee that the buffer is always marked dirty
3359  *    (due to a race condition), so it cannot be used for important changes.

So I was working with a theory that somehow updates to the FSM page
are lost because the race mentioned in the comment actually kicks in.
But I'm not sure if the race is only possible when the caller is
holding a SHARE lock on the buffer. When the FSM is truncated, the
caller holds an EXCLUSIVE lock on the FSM buffer. So probably we're
safe. I could not reproduce the issue on a stand-alone master. But
probably worth checking.

It might also be a good idea to inspect other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially
from standby perspective. I did one round, and couldn't see another
problem.

Thanks,
Pavan

[1]
/messages/by-id/CAJakt-8=aXa-F7uFeLAeSYhQ4wFuaX3+ytDuDj9c8Gx6S_ou=w@mail.gmail.com
[2]
/messages/by-id/20160601134819.30392.85324@wrigleys.postgresql.org
[3]
/messages/by-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0@AMSPR06MB504.eurprd06.prod.outlook.com

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#3Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#1)
Re: FSM corruption leading to errors

On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

I investigated a bug report from one of our customers and it looked very
similar to previous bug reports here [1], [2], [3] (and probably more). In
these reports, the error looks something like this:

ERROR: could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint().

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
the FSM. I think that's usually alright because FSM changes are not WAL
logged and if FSM ever returns a block with less free space than the caller
needs, the caller is usually prepared to update the FSM and request for a
new block. But if it returns a block that is outside the size of the
relation, then we've a trouble. The very next ReadBuffer() fails to handle
such a block and throws the error.

To be honest, I have been seeing a very similar issue for a couple of
weeks now on some test beds on a couple of relations involving a
promoted standby, this error happening more frequently for relations
having a more aggressive autovacuum setup. I did not take the time to
look at that seriously, and I am very glad to see this email.

When a relation is truncated, the FSM is truncated too to remove references
to the heap blocks that are being truncated. But since the FSM buffer may
not be marked DIRTY on the standby, if the buffer gets evicted from the
buffer cache, the on-disk copy of the FSM page may be left with references
to the truncated heap pages. When the standby is later promoted to be the
master, and an insert/update is attempted to the table, the FSM may return a
block that is outside the valid range of the relation. That results in the
said error.

Oops.

I believe the fix is very simple. The FSM change during truncation is
critical and the buffer must be marked by MarkBufferDirty() i.e. those
changes must make to the disk. I think it's alright not to WAL log them
because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
not be lost across a checkpoint. Also, since it happens only during relation
truncation, I don't see any problem from performance perspective.

Agreed. I happen to notice that the VM is similarly careful when it comes
to truncating it (visibilitymap_truncate).

What bothers me is how to fix the problem for already affected standbys. If
the FSM for some table is already corrupted at the standby, users won't
notice it until the standby is promoted to be the new master. If the standby
starts throwing errors suddenly after failover, it will be a very bad
situation for the users, like we noticed with our customers. The fix is
simple and users can just delete the FSM (and VACUUM the table), but that
doesn't sound nice and they would not know until they see the problem.

+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way, the
FSM will remain corrupted. And isn't that going to break once the
relation is extended again? I'd suggest instead putting in the release
notes a query that lets one identify which relations are broken and
have them fixed directly. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.
I have also been thinking about the comment you added and we could
just do something like that:
+       /*
+        * Mark the buffer still holding references to truncated blocks as
+        * dirty to be sure that this makes it to disk and is kept consistent
+        * with the truncated relation.
+        */
+       MarkBufferDirty(buf)

It might also be a good idea to inspect other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
standby perspective. I did one round, and couldn't see another problem.

Haven't looked at those yet, will do so tomorrow.

At the same time, I have translated your script into a TAP test; I
found that more useful when testing.
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#3)
Re: FSM corruption leading to errors

On Mon, Oct 10, 2016 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..

Well... Here is the actual patch.
--
Michael

Attachments:

pg_fsm_corruption_fix_v2.patch (text/x-diff, charset=US-ASCII, +98 −1)
#5Michael Paquier
michael@paquier.xyz
In reply to: Anastasia Lubennikova (#2)
Re: FSM corruption leading to errors

On Fri, Oct 7, 2016 at 11:50 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.

Here you go:
https://commitfest.postgresql.org/11/817/
--
Michael


#6Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#3)
Re: FSM corruption leading to errors

On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the
valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.

No, because the code above updates the FSM for those out-of-range
blocks. But now that I look at it again, maybe this is not correct: it
may get into an endless loop if the relation is repeatedly extended
concurrently.

And isn't that going to break once the
relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I
added the above code mostly to deal with already-corrupt FSMs. Maybe we
can just document it and leave it to the user to run some correctness
checks (see below), especially given that the code is not cheap and adds
overhead for everybody, irrespective of whether they have or will ever have
a corrupt FSM.

I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.

Right, that's how I tested for broken FSMs. A challenge with any such query
is that if the shared buffer copy of the FSM page is intact, then the query
won't return problematic FSMs. Of course, if the fix is applied to the
standby and is restarted, then corrupt FSMs can be detected.

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..

Thanks for doing that.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#6)
Re: FSM corruption leading to errors

On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+   /*
+    * See comments in GetPageWithFreeSpace about handling outside the
valid
+    * range blocks
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.

No, because the code above updates the FSM of those out-of-the range blocks.
But now that I look at it again, may be this is not correct and it may get
into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call for
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

And isn't that going to break once the
relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I
added the above code to mostly deal with already corrupt FSMs. May be we can
just document and leave it to the user to run some correctness checks (see
below), especially given that the code is not cheap and adds overheads for
everybody, irrespective of whether they have or will ever have corrupt FSM.

Yep. I'd leave it to the release notes to provide a diagnostic method.
That's annoying, but it has been done in the past, as for the
multixact issues.

I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.

Right, that's how I tested for broken FSMs. A challenge with any such query
is that if the shared buffer copy of the FSM page is intact, then the query
won't return problematic FSMs. Of course, if the fix is applied to the
standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
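Spelled out as a query (using the pg_freespacemap extension; the relkind filter and the casts are my own guesses at the details, not something proposed in the thread), that diagnostic might look like:

```sql
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- After restarting the standby, so that stale on-disk FSM pages are read
-- back, any relation whose FSM records free space for the block just past
-- the end of the relation is a corruption candidate:
SELECT c.oid::regclass AS rel
FROM pg_class c
WHERE c.relkind = 'r'
  AND pg_freespace(c.oid::regclass,
                   pg_relation_size(c.oid) /
                   current_setting('block_size')::bigint) != 0;
```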
--
Michael


#8Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#7)
Re: FSM corruption leading to errors

On Tue, Oct 11, 2016 at 5:20 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Once the underlying bug is fixed, I don't see why it should break again. I
added the above code to mostly deal with already corrupt FSMs. May be we
can just document and leave it to the user to run some correctness checks
(see below), especially given that the code is not cheap and adds overheads
for everybody, irrespective of whether they have or will ever have corrupt
FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

I'm okay with that. It's annoying, especially because the bug may show up
when your primary is down and you have just failed over for HA, only to
find that the standby won't work correctly. But I don't have any ideas for
fixing existing corruption without adding a significant penalty to the
normal path.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)

Yes, that will be enough once the fix is in place.

I think this is a major bug and I would appreciate any ideas to get the
patch in a committable shape before the next minor release goes out. We
probably need a committer to get interested in this to make progress.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#8)
Re: FSM corruption leading to errors

On Mon, Oct 17, 2016 at 4:14 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

I think this is a major bug and I would appreciate any ideas to get the
patch in a committable shape before the next minor release goes out. We
probably need a committer to get interested in this to make progress.

Same opinion here; this is very annoying for some of my internal
users. I don't have more to offer, though.
--
Michael


#10Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Michael Paquier (#3)
Re: FSM corruption leading to errors

On 10/10/2016 05:25 PM, Michael Paquier wrote:

On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

I believe the fix is very simple. The FSM change during truncation is
critical and the buffer must be marked by MarkBufferDirty() i.e. those
changes must make to the disk. I think it's alright not to WAL log them
because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But it must
not be lost across a checkpoint. Also, since it happens only during relation
truncation, I don't see any problem from performance perspective.

Agreed. I happen to notice that the VM is similarly careful when it comes
to truncating it (visibilitymap_truncate).

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are
truncated. But visibilitymap_truncate() has already modified and dirtied
the page. If the VM page change is flushed to disk before the WAL
record, and you crash, you might have a torn VM page and a checksum failure.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty()
in FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's
LSN to make sure the WAL record is flushed first.

I think we need something like the attached.

- Heikki

Attachments:

0001-WIP-Fix-FSM-corruption-leading-to-errors.patch (text/x-patch, +179 −50)
#11Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#10)
Re: FSM corruption leading to errors

On Mon, Oct 17, 2016 at 8:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are truncated.
But visibilitymap_truncate() has already modified and dirtied the page. If
the VM page change is flushed to disk before the WAL record, and you crash,
you might have a torn VM page and a checksum failure.

Good point! I didn't think about that.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
to make sure the WAL record is flushed first.

Right. All the code paths calling FreeSpaceMapTruncateRel or
visibilitymap_truncate do flush the record, but it definitely makes
things more robust to set the LSN on the page. So +1 for doing it this way.

I think we need something like the attached.

OK, I had a look at that.

MarkBufferDirty(mapBuffer);
+ if (lsn != InvalidXLogRecPtr)
+ PageSetLSN(page, lsn);
UnlockReleaseBuffer(mapBuffer);
Nit: using XLogRecPtrIsInvalid() makes more sense for such checks?

pg_visibility also calls visibilitymap_truncate, but this patch did
not update it. And it would be better to do the actual truncation after
writing the WAL record, I think.

You are also breaking XLogFlush() in RelationTruncate(): vm and fsm are
assigned via a call to smgrexists(), which happens before XLogFlush() is
called on HEAD, but after it with your patch. So with your patch,
XLogFlush() is never called on relation truncation, even if there is a VM
or an FSM.

Attached is an updated version with those issues fixed. The final
version does not need the test as that's really a corner-case, but I
still included it to help testing for now.
--
Michael

Attachments:

truncation-vm-fsm.patch (application/x-download, +190 −54)
#12Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: FSM corruption leading to errors

On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are truncated.
But visibilitymap_truncate() has already modified and dirtied the page. If
the VM page change is flushed to disk before the WAL record, and you crash,
you might have a torn VM page and a checksum failure.

Right, I missed the problem with checksums.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
to make sure the WAL record is flushed first.

I'm not sure I fully understand this. If the page is flushed before the WAL
record, we might have a truncated FSM whereas the relation truncation is
not replayed. But that's not the same problem, right? I mean, you might
have an FSM which is not accurate, but at the least it won't return blocks
outside the range. Having said that, I agree your proposed changes
are more robust.

BTW any thoughts on race-condition on the primary? Comments at
MarkBufferDirtyHint() seems to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode. While most of
the reports so far involved standbys, and the bug can also hit a standalone
master during crash recovery, I wonder if a race can occur even on a live
system.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Pavan Deolasee (#12)
Re: FSM corruption leading to errors

On 10/18/2016 07:01 AM, Pavan Deolasee wrote:

On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are truncated.
But visibilitymap_truncate() has already modified and dirtied the page. If
the VM page change is flushed to disk before the WAL record, and you crash,
you might have a torn VM page and a checksum failure.

Right, I missed the problem with checksums.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
to make sure the WAL record is flushed first.

I'm not sure I fully understand this. If the page is flushed before the WAL
record, we might have a truncated FSM where as the relation truncation is
not replayed. But that's not the same problem, right? I mean, you might
have an FSM which is not accurate, but it won't at the least return the
blocks outside the range. Having said that, I agree your proposed changes
are more robust.

Actually, this is still not 100% safe. Flushing the WAL before modifying
the FSM page is not enough. We also need to WAL-log a full-page image of
the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my
previous attempt. I replaced the MarkBufferDirtyHint() calls with
MarkBufferDirty(), to fix the original issue, plus WAL-logging a
full-page image to fix the torn page issue.

BTW any thoughts on race-condition on the primary? Comments at
MarkBufferDirtyHint() seems to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode.

I think the race condition can only happen when the page is locked in
shared mode. In any case, with this proposed fix, we'll use
MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.

- Heikki

Attachments:

truncation-vm-fsm-2.patch (application/x-download, +125 −1)
#14Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: FSM corruption leading to errors

On Wed, Oct 19, 2016 at 2:37 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Actually, this is still not 100% safe. Flushing the WAL before modifying
the FSM page is not enough. We also need to WAL-log a full-page image of
the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my
previous attempt. I replaced the MarkBufferDirtyHint() calls with
MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page
image to fix the torn page issue.

Looks good to me.

BTW, any thoughts on the race condition on the primary? Comments at
MarkBufferDirtyHint() seem to suggest that a race condition is possible
which might leave the buffer without the DIRTY flag, but I'm not sure if
that can only happen when the page is locked in shared mode.

I think the race condition can only happen when the page is locked in
shared mode. In any case, with this proposed fix, we'll use
MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.

Yes, the fix will cover that problem (if it exists). The reason why I was
curious to know is that there are several reports of a similar error in the
past, and some of them did not involve a standby. Those reports mostly
remained unresolved, and I wondered if this explains them. But yeah, my
conclusion was that the race is not possible with the page locked in
EXCLUSIVE mode. So maybe there is another problem somewhere, or crash
recovery may have left the FSM in an inconsistent state.

Anyways, we seem good to go with the patch.

Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Pavan Deolasee (#14)
Re: FSM corruption leading to errors

On 10/19/2016 01:07 PM, Pavan Deolasee wrote:

Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#15)
Re: FSM corruption leading to errors

On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:

On 10/19/2016 01:07 PM, Pavan Deolasee wrote:

Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

Oh, forgot that this needs to be backported, of course. Will do that
shortly...

- Heikki


#17Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#15)
Re: FSM corruption leading to errors

On Wed, Oct 19, 2016 at 8:29 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/19/2016 01:07 PM, Pavan Deolasee wrote:

Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

Thanks! I am surprised that you kept the TAP test at the end.
--
Michael


#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#16)
Re: FSM corruption leading to errors

On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:

On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:

On 10/19/2016 01:07 PM, Pavan Deolasee wrote:

Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

Oh, forgot that this needs to be backported, of course. Will do that
shortly...

Done.

This didn't include anything to cope with an already-corrupt FSM, BTW.
Do we still want to try something for that? I think it's good enough if
we prevent the FSM corruption from happening, but I'm not sure what the
consensus on that might be...

- Heikki


#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#18)
Re: FSM corruption leading to errors

Heikki Linnakangas <hlinnaka@iki.fi> writes:

This didn't include anything to cope with an already-corrupt FSM, BTW.
Do we still want to try something for that? I think it's good enough if
we prevent the FSM corruption from happening, but not sure what the
consensus on that might be..

Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?) We're going to need to be able to explain how to
fix busted visibility maps in 9.6.1, too.

regards, tom lane


#20Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: FSM corruption leading to errors

On Wed, Oct 19, 2016 at 6:44 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:

Oh, forgot that this needs to be backported, of course. Will do that
shortly...

Done.

Thanks!

This didn't include anything to cope with an already-corrupt FSM, BTW. Do
we still want to try something for that? I think it's good enough if we
prevent the FSM corruption from happening, but not sure what the consensus
on that might be..

I thought it would be nice to handle an already-corrupt FSM, since our
customer hit the problem immediately after a failover and it was a serious
issue. In one case, a system table was affected, preventing all DDLs from
running. Having said that, I don't have a better idea for handling the
problem without causing non-trivial overhead for normal cases (see my
original patch). If you have better ideas, they might be worth pursuing.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#19)
#22Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#21)
#23Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#22)
#24Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#23)
#25Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#24)
#26Michael Paquier
michael@paquier.xyz
In reply to: Pavan Deolasee (#25)
#27Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Michael Paquier (#26)
#28Michael Paquier
michael@paquier.xyz
In reply to: Jim Nasby (#27)
#29Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#28)
#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#29)
#31Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#30)
#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#30)
#33Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#32)
#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#32)
#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#34)