FSM corruption leading to errors

Started by Pavan Deolasee over 9 years ago, 35 messages
#1 Pavan Deolasee
pavan.deolasee@gmail.com
2 attachment(s)

I investigated a bug report from one of our customers and it looked very
similar to previous bug reports here [1], [2], [3] (and probably more). In
these reports, the error looks something like this:

ERROR: could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint(). The function
returns without setting the DIRTY bit on the standby:

3413         /*
3414          * If we're in recovery we cannot dirty a page because of a hint.
3415          * We can set the hint, just not dirty the page as a result so the
3416          * hint is lost when we evict the page or shutdown.
3417          *
3418          * See src/backend/storage/page/README for longer discussion.
3419          */
3420         if (RecoveryInProgress())
3421                 return;
3422

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
the FSM. I think that's usually alright because FSM changes are not WAL
logged, and if the FSM ever returns a block with less free space than the
caller needs, the caller is usually prepared to update the FSM and request
a new block. But if it returns a block that is outside the size of the
relation, then we're in trouble. The very next ReadBuffer() fails to handle
such a block and throws the error.

When a relation is truncated, the FSM is truncated too to remove references
to the heap blocks that are being truncated. But since the FSM buffer may
not be marked DIRTY on the standby, if the buffer gets evicted from the
buffer cache, the on-disk copy of the FSM page may be left with references
to the truncated heap pages. When the standby is later promoted to be the
master, and an insert/update is attempted to the table, the FSM may return
a block that is outside the valid range of the relation. That results in
the said error.

Once this was clear, it was easy to put together a fully reproducible test
case. See the attached script; you'll need to adjust to your environment.
This affects all releases starting 9.3 and the script can reproduce the
problem on all these releases.
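
In essence, the failure sequence boils down to the following (a condensed
sketch of the attached script; it omits an extra insert/delete/vacuum round
and the exact interleaving of the standby-side steps, which the script gets
right; prepared transactions require max_prepared_transactions > 0):

-- On the master: build a table whose tail pages end up empty.
create table testtab (a int, b char(100));
insert into testtab select generate_series(1,1000), 'foo';
insert into testtab select generate_series(1,1000), 'foo';
delete from testtab where ctid > '(8,0)';

-- Hold a lock in a prepared transaction so that VACUUM updates the FSM
-- but cannot truncate the relation yet.
begin;
lock table testtab in row share mode;
prepare transaction 'p1';
vacuum testtab;
checkpoint;

-- (On the standby: checkpoint, so its copy of the FSM buffer is clean.)

-- Release the lock; this VACUUM truncates the relation and its FSM.
rollback prepared 'p1';
vacuum testtab;

-- (Promote and restart the standby; an INSERT into testtab there then
-- fails with the "could not read block" error above.)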

I believe the fix is very simple. The FSM change during truncation is
critical, and the buffer must be marked by MarkBufferDirty(), i.e. those
changes must make it to disk. I think it's alright not to WAL-log them
because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But they must
not be lost across a checkpoint. Also, since this happens only during
relation truncation, I don't see any problem from a performance perspective.

What bothers me is how to fix the problem for already affected standbys. If
the FSM for some table is already corrupted at the standby, users won't
notice it until the standby is promoted to be the new master. If the
standby starts throwing errors suddenly after failover, it will be a very
bad situation for the users, like we noticed with our customers. The fix is
simple and users can just delete the FSM (and VACUUM the table), but that
doesn't sound nice and they would not know until they see the problem.
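
For the record, that manual fix boils down to this (a sketch, assuming the
affected node can be restarted; the table name is hypothetical):

-- With the server stopped, remove the relation's FSM fork from the data
-- directory (base/<dboid>/<relfilenode>_fsm), then start the server and:
vacuum verbose affected_table;   -- rebuilds the free space map from scratch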

One idea is to always check whether the block returned by the FSM is outside
the valid range and discard such blocks after updating the FSM (the attached
patch does that). The problem with that approach is that
RelationGetNumberOfBlocks() is not really cheap, and invoking it every time
the FSM is consulted may not be a bright idea. Can we cache that value in the
RelationData or some such place (BulkInsertState?) and use it as a hint for
quickly checking whether the block is (potentially) outside the range, and
discard it if so? Any other ideas?

The other concern I have, and TBH that's what I initially thought was the
real problem until I saw the RecoveryInProgress()-specific code, is: can this
also affect stand-alone masters? The comments at MarkBufferDirtyHint() made
me think so:

3358  * 3. This function does not guarantee that the buffer is always marked dirty
3359  *    (due to a race condition), so it cannot be used for important changes.

So I was working with a theory that somehow updates to the FSM page are
lost because the race mentioned in the comment actually kicks in. But I'm
not sure if the race is only possible when the caller is holding a SHARE
lock on the buffer. When the FSM is truncated, the caller holds an
EXCLUSIVE lock on the FSM buffer. So probably we're safe. I could not
reproduce the issue on a stand-alone master. But probably worth checking.

It might also be a good idea to inspect other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
standby perspective. I did one round, and couldn't see another problem.

Thanks,
Pavan

[1] /messages/by-id/CAJakt-8=aXa-F7uFeLAeSYhQ4wFuaX3+ytDuDj9c8Gx6S_ou=w@mail.gmail.com
[2] /messages/by-id/20160601134819.30392.85324@wrigleys.postgresql.org
[3] /messages/by-id/AMSPR06MB504CD8FE8AA30D4B7C958AAE39E0@AMSPR06MB504.eurprd06.prod.outlook.com

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

pg_fsm_corruption_fix.patch (application/octet-stream)
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index bbd90c9..2edc8f6 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -132,8 +132,37 @@ BlockNumber
 GetPageWithFreeSpace(Relation rel, Size spaceNeeded)
 {
 	uint8		min_cat = fsm_space_needed_to_cat(spaceNeeded);
+	BlockNumber	target_block, nblocks;
 
-	return fsm_search(rel, min_cat);
+	target_block = fsm_search(rel, min_cat);
+
+	/*
+	 * The callers are usually equipped to handle cases where FSM
+	 * information is not accurate and the returned block contains less free
+	 * space than the caller expected. But if FSM returns a block which is
+	 * beyond the current size of the relation, then that's a critical
+	 * situation. If such a situation arises because of say some corruption in
+	 * FSM, the database may get into unrecoverable state. So instead of
+	 * returning a wrong block, we keep trying until we find a block within a
+	 * valid range or no block is found. To ensure, we don't create infinite
+	 * recursion, we update FSM for each block found outside the valid range so
+	 * that the same block is not visited twice.
+	 *
+	 * RelationGetNumberOfBlocks() call is not exactly cheap, but there are
+	 * other optimizations to minimize calls to GetPageWithFreeSpace() and
+	 * hence this should not be a big drag on performance.
+	 *
+	 * Keep retrying until FSM hands out a block outside the range or no
+	 * suitable block is found.
+	 */
+	nblocks = RelationGetNumberOfBlocks(rel);
+	while (target_block >= nblocks && target_block != InvalidBlockNumber)
+	{
+		target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+				spaceNeeded);
+	}
+
+	return target_block;
 }
 
 /*
@@ -154,6 +183,7 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
 	FSMAddress	addr;
 	uint16		slot;
 	int			search_slot;
+	BlockNumber	target_block, nblocks;
 
 	/* Get the location of the FSM byte representing the heap block */
 	addr = fsm_get_location(oldPage, &slot);
@@ -165,9 +195,22 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
 	 * Otherwise, search as usual.
 	 */
 	if (search_slot != -1)
-		return fsm_get_heap_blk(addr, search_slot);
+		target_block = fsm_get_heap_blk(addr, search_slot);
 	else
-		return fsm_search(rel, search_cat);
+		target_block = fsm_search(rel, search_cat);
+
+	/*
+	 * See comments in GetPageWithFreeSpace about handling blocks outside
+	 * the valid range
+	 */
+	nblocks = RelationGetNumberOfBlocks(rel);
+	while (target_block >= nblocks && target_block != InvalidBlockNumber)
+	{
+		target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+				spaceNeeded);
+	}
+
+	return target_block;
 }
 
 /*
@@ -328,7 +371,32 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 			return;				/* nothing to do; the FSM was already smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
-		MarkBufferDirtyHint(buf, false);
+
+		/*
+		 * It's not enough to just use a hint here because
+		 * MarkBufferDirtyHint() may not mark the buffer as DIRTY on a standby.
+		 * Failure to write the buffer to the permanent storage may result in
+		 * FSM corruption. Now typically FSM corruption in terms of wrong
+		 * reporting of available free space is tolerable and callers are
+		 * prepared to handle those cases, but if FSM returns a block outside
+		 * the valid range of the main relation, then that's a catastrophic
+		 * error and may result in an unusable database. So we must ensure
+		 * that the FSM truncation always makes it to disk along with the
+		 * main relation truncation.
+		 *
+		 * Note that there is no separate WAL logging for this operation, but
+		 * this is covered by XLOG_SMGR_TRUNCATE and smgr_redo() will take care
+		 * of truncating the FSM during replay, if necessary. If a CHECKPOINT
+		 * occurs in between, marking the FSM buffer dirty will guarantee that
+		 * the buffer is written to disk.
+		 *
+		 * Note: To ensure that the system can cope with FSM corruption that
+		 * may have already happened before we fixed this bug, we also ensure
+		 * that RecordPageWithFreeSpace and RecordAndGetPageWithFreeSpace
+		 * check the main relation size and never return a block beyond the
+		 * valid range.
+		 */
+		MarkBufferDirty(buf);
+
 		UnlockReleaseBuffer(buf);
 
 		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
pg_trigger_fsm_error.sh (application/x-sh)
#2 Anastasia Lubennikova
a.lubennikova@postgrespro.ru
In reply to: Pavan Deolasee (#1)
Re: FSM corruption leading to errors

06.10.2016 20:59, Pavan Deolasee:

I investigated a bug report from one of our customers and it looked
very similar to previous bug reports here [1], [2], [3] (and probably
more). [...]

Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#3 Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#1)
Re: FSM corruption leading to errors

On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

I investigated a bug report from one of our customers and it looked very
similar to previous bug reports here [1], [2], [3] (and probably more). In
these reports, the error looks something like this:

ERROR: could not read block 28991 in file "base/16390/572026": read only 0
of 8192 bytes

I traced it to the following code in MarkBufferDirtyHint().

freespace.c freely uses MarkBufferDirtyHint() whenever changes are made to
the FSM. I think that's usually alright because FSM changes are not WAL
logged, and if the FSM ever returns a block with less free space than the
caller needs, the caller is usually prepared to update the FSM and request
a new block. But if it returns a block that is outside the size of the
relation, then we're in trouble. The very next ReadBuffer() fails to handle
such a block and throws the error.

To be honest, I have been seeing a very similar issue for a couple of
weeks now on some test beds on a couple of relations involving a
promoted standby, this error happening more frequently for relations
having a more aggressive autovacuum setup. I did not take the time to
look at that seriously, and I am very glad to see this email.

When a relation is truncated, the FSM is truncated too to remove references
to the heap blocks that are being truncated. But since the FSM buffer may
not be marked DIRTY on the standby, if the buffer gets evicted from the
buffer cache, the on-disk copy of the FSM page may be left with references
to the truncated heap pages. When the standby is later promoted to be the
master, and an insert/update is attempted to the table, the FSM may return a
block that is outside the valid range of the relation. That results in the
said error.

Oops.

I believe the fix is very simple. The FSM change during truncation is
critical, and the buffer must be marked by MarkBufferDirty(), i.e. those
changes must make it to disk. I think it's alright not to WAL-log them
because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But they must
not be lost across a checkpoint. Also, since this happens only during relation
truncation, I don't see any problem from a performance perspective.

Agreed. I happen to notice that the VM is similarly careful when it comes
to truncating it (visibilitymap_truncate).

What bothers me is how to fix the problem for already affected standbys. If
the FSM for some table is already corrupted at the standby, users won't
notice it until the standby is promoted to be the new master. If the standby
starts throwing errors suddenly after failover, it will be a very bad
situation for the users, like we noticed with our customers. The fix is
simple and users can just delete the FSM (and VACUUM the table), but that
doesn't sound nice and they would not know until they see the problem.

+   /*
+    * See comments in GetPageWithFreeSpace about handling blocks outside
+    * the valid range
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted. And isn't that going to break once the
relation is extended again? I'd suggest instead putting in the release
notes a query that allows one to identify which relations are broken
and have them fixed directly. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.
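
For instance (a sketch, assuming the pg_freespacemap extension is installed
and 'broken_table' stands in for a suspect relation):

-- Non-zero free space reported for the first block past the end of the
-- relation indicates a stale FSM entry.
select pg_freespace('broken_table',
        pg_relation_size('broken_table') / current_setting('block_size')::bigint);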
I have also been thinking about the comment you added, and we could
just do something like this:
+       /*
+        * Mark the buffer still holding references to truncated blocks as
+        * dirty to be sure that this makes it to disk and is kept consistent
+        * with the truncated relation.
+        */
+       MarkBufferDirty(buf);

It might also be a good idea to inspect other callers of
MarkBufferDirtyHint() and see if any of them is vulnerable, especially from
standby perspective. I did one round, and couldn't see another problem.

Haven't looked at those yet, will do so tomorrow.

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..
--
Michael


#4 Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#3)
1 attachment(s)
Re: FSM corruption leading to errors

On Mon, Oct 10, 2016 at 11:25 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..

Well... Here is the actual patch.
--
Michael

Attachments:

pg_fsm_corruption_fix_v2.patch (text/x-diff)
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index bbd90c9..a6c6760 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -328,7 +328,14 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 			return;				/* nothing to do; the FSM was already smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
 		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
-		MarkBufferDirtyHint(buf, false);
+
+		/*
+		 * Mark the buffer still holding references to truncated blocks as
+		 * dirty to be sure that this makes it to disk and is kept consistent
+		 * with the truncated relation.
+		 */
+		MarkBufferDirty(buf);
+
 		UnlockReleaseBuffer(buf);
 
 		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
diff --git a/src/test/recovery/t/008_fsm_check.pl b/src/test/recovery/t/008_fsm_check.pl
new file mode 100644
index 0000000..2c58e31
--- /dev/null
+++ b/src/test/recovery/t/008_fsm_check.pl
@@ -0,0 +1,90 @@
+# Set of tests to check free-space map consistency on standbys
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1);
+
+$node_master->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = replica
+wal_log_hints = on
+max_prepared_transactions = 5
+autovacuum = off
+});
+
+# Create a master node and its standby, initializing both with
+# some data at the same time.
+$node_master->start;
+
+$node_master->backup('master_backup');
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, 'master_backup',
+	has_streaming => 1);
+$node_standby->start;
+
+$node_master->psql('postgres', qq{
+create table testtab (a int, b char(100));
+insert into testtab select generate_series(1,1000), 'foo';
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+});
+
+# Take a lock on the table to prevent following vacuum from truncating it
+$node_master->psql('postgres', qq{
+begin;
+lock table testtab in row share mode;
+prepare transaction 'p1';
+});
+
+# Vacuum, update FSM without truncation
+$node_master->psql('postgres', 'vacuum verbose testtab');
+
+# Force a checkpoint
+$node_master->psql('postgres', 'checkpoint');
+
+# Now do some more insert/deletes, another vacuum to ensure FPW writes are
+# done
+$node_master->psql('postgres', qq{
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+vacuum verbose testtab;
+});
+
+# Ensure all buffers are now clean on the standby
+$node_standby->psql('postgres', 'checkpoint');
+
+# Release the lock, vacuum again which should lead to truncation
+$node_master->psql('postgres', qq{
+rollback prepared 'p1';
+vacuum verbose testtab;
+});
+
+$node_master->psql('postgres', 'checkpoint');
+my $until_lsn =
+	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+
+# Wait long enough for standby to receive and apply all WAL
+my $caughtup_query =
+	"SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for standby to catch up";
+
+# now promote the standby
+$node_standby->promote;
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery()")
+  or die "Timed out while waiting for promotion of standby";
+$node_standby->psql('postgres', 'checkpoint');
+
+# restart to discard in-memory copy of FSM
+$node_standby->restart;
+
+# Insert should work on standby
+is($node_standby->psql('postgres',
+   qq{insert into testtab select generate_series(1,1000), 'foo';}),
+   0, 'INSERT succeeds with truncated relation FSM');
#5 Michael Paquier
michael.paquier@gmail.com
In reply to: Anastasia Lubennikova (#2)
Re: FSM corruption leading to errors

On Fri, Oct 7, 2016 at 11:50 PM, Anastasia Lubennikova
<a.lubennikova@postgrespro.ru> wrote:

Could you please add the patches to commitfest?
I'm going to test them and write a review in a few days.

Here you go:
https://commitfest.postgresql.org/11/817/
--
Michael


#6 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#3)
Re: FSM corruption leading to errors

On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+   /*
+    * See comments in GetPageWithFreeSpace about handling blocks outside
+    * the valid range
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.

No, because the code above updates the FSM for those out-of-range
blocks. But now that I look at it again, maybe this is not correct and it
may get into an endless loop if the relation is repeatedly extended
concurrently.

And isn't that going to break once the
relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I
added the above code mostly to deal with already corrupt FSMs. Maybe we
can just document it and leave it to the user to run some correctness checks
(see below), especially given that the code is not cheap and adds overhead
for everybody, irrespective of whether they have or will ever have a corrupt
FSM.

I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.

Right, that's how I tested for broken FSMs. A challenge with any such query
is that if the shared buffer copy of the FSM page is intact, then the query
won't return problematic FSMs. Of course, if the fix is applied to the
standby and is restarted, then corrupt FSMs can be detected.

At the same time, I have translated your script into a TAP test, I
found that more useful when testing..

Thanks for doing that.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#6)
Re: FSM corruption leading to errors

On Mon, Oct 10, 2016 at 11:41 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

On Mon, Oct 10, 2016 at 7:55 PM, Michael Paquier <michael.paquier@gmail.com>
wrote:

+   /*
+    * See comments in GetPageWithFreeSpace about handling blocks outside
+    * the valid range
+    */
+   nblocks = RelationGetNumberOfBlocks(rel);
+   while (target_block >= nblocks && target_block != InvalidBlockNumber)
+   {
+       target_block = RecordAndGetPageWithFreeSpace(rel, target_block, 0,
+               spaceNeeded);
+   }
Hm. This is just a workaround. Even if things are done this way the
FSM will remain corrupted.

No, because the code above updates the FSM for those out-of-range blocks.
But now that I look at it again, maybe this is not correct and it may get
into an endless loop if the relation is repeatedly extended concurrently.

Ah yes, that's what the call to
RecordAndGetPageWithFreeSpace()/fsm_set_and_search() is for. I missed
that yesterday before sleeping.

And isn't that going to break once the
relation is extended again?

Once the underlying bug is fixed, I don't see why it should break again. I
added the above code mostly to deal with already corrupt FSMs. Maybe we can
just document it and leave it to the user to run some correctness checks (see
below), especially given that the code is not cheap and adds overhead for
everybody, irrespective of whether they have or will ever have a corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

I'd suggest instead putting in the release
notes a query that allows one to analyze what are the relations broken
and directly have them fixed. That's annoying, but it would be really
better than a workaround. One idea here is to use pg_freespace() and
see if it returns a non-zero value for an out-of-range block on a
standby.

Right, that's how I tested for broken FSMs. A challenge with any such query
is that if the shared buffer copy of the FSM page is intact, then the query
won't return problematic FSMs. Of course, if the fix is applied to the
standby and is restarted, then corrupt FSMs can be detected.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)
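
Spelled out over the whole database, that check could look like this (a
sketch, again assuming the pg_freespacemap extension; it only probes the
first block past the end of each relation):

-- Run on the restarted standby: flag relations whose FSM claims free
-- space for a block beyond the end of the relation.
select c.oid::regclass as rel
from pg_class c
where c.relkind in ('r', 'm', 't')
  and c.relpersistence = 'p'
  and pg_freespace(c.oid::regclass,
        pg_relation_size(c.oid) / current_setting('block_size')::bigint) <> 0;
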
--
Michael


#8 Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#7)
Re: FSM corruption leading to errors

On Tue, Oct 11, 2016 at 5:20 AM, Michael Paquier <michael.paquier@gmail.com>
wrote:

Once the underlying bug is fixed, I don't see why it should break again. I
added the above code mostly to deal with already corrupt FSMs. Maybe we can
just document it and leave it to the user to run some correctness checks (see
below), especially given that the code is not cheap and adds overhead for
everybody, irrespective of whether they have or will ever have a corrupt FSM.

Yep. I'd leave it for the release notes to hold a diagnostic method.
That's annoying, but this has been done in the past like for the
multixact issues..

I'm okay with that. It's annoying, especially because the bug may show up
when your primary is down and you just failed over for HA, only to find
that the standby won't work correctly. But I don't have ideas how to fix
existing corruption without adding significant penalty to normal path.

What if you restart the standby, and then do a diagnostic query?
Wouldn't that be enough? (Something just based on
pg_freespace(pg_relation_size(oid) / block_size) != 0)

Yes, that will be enough once the fix is in place.

I think this is a major bug and I would appreciate any ideas to get the
patch in a committable shape before the next minor release goes out. We
probably need a committer to get interested in this to make progress.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9 Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#8)
Re: FSM corruption leading to errors

On Mon, Oct 17, 2016 at 4:14 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

I think this is a major bug and I would appreciate any ideas to get the
patch in a committable shape before the next minor release goes out. We
probably need a committer to get interested in this to make progress.

Same opinion here, this is very annoying for some of my internal
users.. I don't have more to offer though.
--
Michael


#10 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Michael Paquier (#3)
1 attachment(s)
Re: FSM corruption leading to errors

On 10/10/2016 05:25 PM, Michael Paquier wrote:

On Fri, Oct 7, 2016 at 2:59 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:

I believe the fix is very simple. The FSM change during truncation is
critical, and the buffer must be marked by MarkBufferDirty(), i.e. those
changes must make it to disk. I think it's alright not to WAL-log them
because XLOG_SMGR_TRUNCATE will redo() them if a crash occurs. But they must
not be lost across a checkpoint. Also, since this happens only during relation
truncation, I don't see any problem from a performance perspective.

Agreed. I happen to notice that the VM is similarly careful when it comes
to truncating it (visibilitymap_truncate).

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are
truncated. But visibilitymap_truncate() has already modified and dirtied
the page. If the VM page change is flushed to disk before the WAL
record, and you crash, you might have a torn VM page and a checksum failure.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty()
in FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's
LSN to make sure the WAL record is flushed first.

I think we need something like the attached.

- Heikki

Attachments:

0001-WIP-Fix-FSM-corruption-leading-to-errors.patch (text/x-patch)
From 7b100e951236c21656eb7bd63b4631c0c32cc573 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Mon, 17 Oct 2016 14:02:52 +0300
Subject: [PATCH 1/1] WIP: Fix "FSM corruption leading to errors"

---
 src/backend/access/heap/visibilitymap.c   |   9 ++-
 src/backend/catalog/storage.c             | 110 ++++++++++++++++++------------
 src/backend/storage/freespace/freespace.c |  14 +++-
 src/include/access/visibilitymap.h        |   2 +-
 src/include/storage/freespace.h           |   3 +-
 src/test/recovery/t/008_fsm_check.pl      |  90 ++++++++++++++++++++++++
 6 files changed, 179 insertions(+), 49 deletions(-)
 create mode 100644 src/test/recovery/t/008_fsm_check.pl

diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..377e511 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -457,9 +457,14 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
  * before they access the VM again.
  *
  * nheapblocks is the new size of the heap.
+ *
+ * The caller must have WAL-logged the truncation already (if the relation
+ * needs WAL-logging at all). 'lsn' is the LSN of the XLOG record. It is
+ * used to stamp the last remaining VM page, so that it doesn't get flushed
+ * to disk before the WAL record.
  */
 void
-visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
+visibilitymap_truncate(Relation rel, BlockNumber nheapblocks, XLogRecPtr lsn)
 {
 	BlockNumber newnblocks;
 
@@ -524,6 +529,8 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
 		map[truncByte] &= (1 << truncOffset) - 1;
 
 		MarkBufferDirty(mapBuffer);
+		if (lsn != InvalidXLogRecPtr)
+			PageSetLSN(page, lsn);
 		UnlockReleaseBuffer(mapBuffer);
 	}
 	else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0d8311c..5478953 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -228,6 +229,7 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 {
 	bool		fsm;
 	bool		vm;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
 
 	/* Open it at the smgr level if not already done */
 	RelationOpenSmgr(rel);
@@ -239,56 +241,76 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	rel->rd_smgr->smgr_fsm_nblocks = InvalidBlockNumber;
 	rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
 
-	/* Truncate the FSM first if it exists */
-	fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
-	if (fsm)
-		FreeSpaceMapTruncateRel(rel, nblocks);
-
-	/* Truncate the visibility map too if it exists. */
-	vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
-	if (vm)
-		visibilitymap_truncate(rel, nblocks);
-
-	/*
-	 * We WAL-log the truncation before actually truncating, which means
-	 * trouble if the truncation fails. If we then crash, the WAL replay
-	 * likely isn't going to succeed in the truncation either, and cause a
-	 * PANIC. It's tempting to put a critical section here, but that cure
-	 * would be worse than the disease. It would turn a usually harmless
-	 * failure to truncate, that might spell trouble at WAL replay, into a
-	 * certain PANIC.
-	 */
-	if (RelationNeedsWAL(rel))
+	PG_TRY();
 	{
 		/*
-		 * Make an XLOG entry reporting the file truncation.
+		 * We WAL-log the truncation before actually truncating, which means
+		 * trouble if the truncation fails. If we then crash, the WAL replay
+		 * likely isn't going to succeed in the truncation either, and cause a
+		 * PANIC. It's tempting to put a critical section here, but that cure
+		 * would be worse than the disease. It would turn a usually harmless
+		 * failure to truncate, that might spell trouble at WAL replay, into a
+		 * certain PANIC.
 		 */
-		XLogRecPtr	lsn;
-		xl_smgr_truncate xlrec;
-
-		xlrec.blkno = nblocks;
-		xlrec.rnode = rel->rd_node;
-		xlrec.flags = SMGR_TRUNCATE_ALL;
+		if (RelationNeedsWAL(rel))
+		{
+			/*
+			 * Make an XLOG entry reporting the file truncation.
+			 */
+			xl_smgr_truncate xlrec;
+
+			/*
+			 * Since we're not holding a buffer lock, use delayChkpt to
+			 * prevent a checkpoint from happening between WAL-logging, and
+			 * performing the truncation.
+			 */
+			MyPgXact->delayChkpt = true;
+
+			xlrec.blkno = nblocks;
+			xlrec.rnode = rel->rd_node;
+			xlrec.flags = SMGR_TRUNCATE_ALL;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+			lsn = XLogInsert(RM_SMGR_ID,
+							 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+			/*
+			 * Flush, because otherwise the truncation of the main relation
+			 * might hit the disk before the WAL record, and the truncation of
+			 * the FSM or visibility map. If we crashed during that window,
+			 * we'd be left with a truncated heap, but the FSM or visibility
+			 * map would still contain entries for the non-existent heap
+			 * pages.
+			 */
+			if (fsm || vm)
+				XLogFlush(lsn);
+		}
 
-		XLogBeginInsert();
-		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		/* Truncate the FSM first if it exists */
+		fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
+		if (fsm)
+			FreeSpaceMapTruncateRel(rel, nblocks, lsn);
 
-		lsn = XLogInsert(RM_SMGR_ID,
-						 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+		/* Truncate the visibility map too if it exists. */
+		vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
+		if (vm)
+			visibilitymap_truncate(rel, nblocks, lsn);
 
-		/*
-		 * Flush, because otherwise the truncation of the main relation might
-		 * hit the disk before the WAL record, and the truncation of the FSM
-		 * or visibility map. If we crashed during that window, we'd be left
-		 * with a truncated heap, but the FSM or visibility map would still
-		 * contain entries for the non-existent heap pages.
-		 */
-		if (fsm || vm)
-			XLogFlush(lsn);
+		/* Do the real work */
+		smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
 	}
+	PG_CATCH();
+	{
+		if (RelationNeedsWAL(rel))
+			MyPgXact->delayChkpt = false;
+		PG_RE_THROW();
+	}
+	PG_END_TRY();
 
-	/* Do the real work */
-	smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+	if (RelationNeedsWAL(rel))
+		MyPgXact->delayChkpt = false;
 }
 
 /*
@@ -536,10 +558,10 @@ smgr_redo(XLogReaderState *record)
 
 		if ((xlrec->flags & SMGR_TRUNCATE_FSM) != 0 &&
 			smgrexists(reln, FSM_FORKNUM))
-			FreeSpaceMapTruncateRel(rel, xlrec->blkno);
+			FreeSpaceMapTruncateRel(rel, xlrec->blkno, lsn);
 		if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0 &&
 			smgrexists(reln, VISIBILITYMAP_FORKNUM))
-			visibilitymap_truncate(rel, xlrec->blkno);
+			visibilitymap_truncate(rel, xlrec->blkno, lsn);
 
 		FreeFakeRelcacheEntry(rel);
 	}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index bbd90c9..4483984 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -294,9 +294,14 @@ GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk)
  * before they access the FSM again.
  *
  * nblocks is the new size of the heap.
+ *
+ * The caller must have WAL-logged the truncation already (if the relation
+ * needs WAL-logging at all). 'lsn' is the LSN of the XLOG record. It is
+ * used to stamp the last remaining FSM page, so that it doesn't get flushed
+ * to disk before the WAL record.
  */
 void
-FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
+FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks, XLogRecPtr lsn)
 {
 	BlockNumber new_nfsmblocks;
 	FSMAddress	first_removed_address;
@@ -327,8 +332,13 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 		if (!BufferIsValid(buf))
 			return;				/* nothing to do; the FSM was already smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
 		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
-		MarkBufferDirtyHint(buf, false);
+
+		MarkBufferDirty(buf);
+		if (lsn != InvalidXLogRecPtr)
+			PageSetLSN(BufferGetPage(buf), lsn);
+
 		UnlockReleaseBuffer(buf);
 
 		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 00bbd4c..8fe8906 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -44,6 +44,6 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 				  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
-extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
+extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks, XLogRecPtr lsn);
 
 #endif   /* VISIBILITYMAP_H */
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 77b3bc3..ac014a4 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -14,6 +14,7 @@
 #ifndef FREESPACE_H_
 #define FREESPACE_H_
 
+#include "access/xlogdefs.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
 #include "utils/relcache.h"
@@ -30,7 +31,7 @@ extern void RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
 extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 							Size spaceAvail);
 
-extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
+extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks, XLogRecPtr lsn);
 extern void FreeSpaceMapVacuum(Relation rel);
 extern void UpdateFreeSpaceMap(Relation rel,
 				   BlockNumber startBlkNum,
diff --git a/src/test/recovery/t/008_fsm_check.pl b/src/test/recovery/t/008_fsm_check.pl
new file mode 100644
index 0000000..2c58e31
--- /dev/null
+++ b/src/test/recovery/t/008_fsm_check.pl
@@ -0,0 +1,90 @@
+# Set of tests to check free-space map consistency on standbys
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1);
+
+$node_master->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = replica
+wal_log_hints = on
+max_prepared_transactions = 5
+autovacuum = off
+});
+
+# Create a master node and its standby, initializing both with
+# some data at the same time.
+$node_master->start;
+
+$node_master->backup('master_backup');
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, 'master_backup',
+	has_streaming => 1);
+$node_standby->start;
+
+$node_master->psql('postgres', qq{
+create table testtab (a int, b char(100));
+insert into testtab select generate_series(1,1000), 'foo';
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+});
+
+# Take a lock on the table to prevent following vacuum from truncating it
+$node_master->psql('postgres', qq{
+begin;
+lock table testtab in row share mode;
+prepare transaction 'p1';
+});
+
+# Vacuum, update FSM without truncation
+$node_master->psql('postgres', 'vacuum verbose testtab');
+
+# Force a checkpoint
+$node_master->psql('postgres', 'checkpoint');
+
+# Now do some more insert/deletes, another vacuum to ensure FPW writes are
+# done
+$node_master->psql('postgres', qq{
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+vacuum verbose testtab;
+});
+
+# Ensure all buffers are now clean on the standby
+$node_standby->psql('postgres', 'checkpoint');
+
+# Release the lock, vacuum again which should lead to truncation
+$node_master->psql('postgres', qq{
+rollback prepared 'p1';
+vacuum verbose testtab;
+});
+
+$node_master->psql('postgres', 'checkpoint');
+my $until_lsn =
+	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+
+# Wait long enough for standby to receive and apply all WAL
+my $caughtup_query =
+	"SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for standby to catch up";
+
+# now promote the standby
+$node_standby->promote;
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery()")
+  or die "Timed out while waiting for promotion of standby";
+$node_standby->psql('postgres', 'checkpoint');
+
+# restart to discard in-memory copy of FSM
+$node_standby->restart;
+
+# Insert should work on standby
+is($node_standby->psql('postgres',
+   qq{insert into testtab select generate_series(1,1000), 'foo';}),
+   0, 'INSERT succeeds with truncated relation FSM');
-- 
2.9.3

#11 Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#10)
1 attachment(s)
Re: FSM corruption leading to errors

On Mon, Oct 17, 2016 at 8:04 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

visibilitymap_truncate is actually also wrong, in a different way. The
truncation WAL record is written only after the VM (and FSM) are truncated.
But visibilitymap_truncate() has already modified and dirtied the page. If
the VM page change is flushed to disk before the WAL record, and you crash,
you might have a torn VM page and a checksum failure.

Good point! I didn't think about that.

Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
FreeSpaceMapTruncateRel would have the same issue. If you call
MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
to make sure the WAL record is flushed first.

Right.. All the code paths calling FreeSpaceMapTruncateRel or
visibilitymap_truncate do flush the record, but it definitely makes
things more robust to set the LSN on the page. So +1 this way.

I think we need something like the attached.

OK, I had a look at that.

 		MarkBufferDirty(mapBuffer);
+		if (lsn != InvalidXLogRecPtr)
+			PageSetLSN(page, lsn);
 		UnlockReleaseBuffer(mapBuffer);
Nit: using XLogRecPtrIsInvalid() makes more sense for such checks?

pg_visibility also calls visibilitymap_truncate, but this patch did
not update it. And it would be better to do the actual truncate after
writing the WAL record, I think.

You are also breaking the XLogFlush() call in RelationTruncate(): fsm and
vm are assigned via a call to smgrexists(), which happens before
XLogFlush() is called on HEAD, but after it with your patch. So your patch
ends up never calling XLogFlush() on relation truncation even if there is
a VM or an FSM.

Attached is an updated version with those issues fixed. The final
version does not need the test as that's really a corner-case, but I
still included it to help testing for now.
--
Michael

Attachments:

truncation-vm-fsm.patch (application/x-download)
diff --git a/contrib/pg_visibility/pg_visibility.c b/contrib/pg_visibility/pg_visibility.c
index 9985e3e..84f6a11 100644
--- a/contrib/pg_visibility/pg_visibility.c
+++ b/contrib/pg_visibility/pg_visibility.c
@@ -366,6 +366,7 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
 {
 	Oid			relid = PG_GETARG_OID(0);
 	Relation	rel;
+	XLogRecPtr	lsn = InvalidXLogRecPtr;
 
 	rel = relation_open(relid, AccessExclusiveLock);
 
@@ -380,8 +381,6 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
 	RelationOpenSmgr(rel);
 	rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
 
-	visibilitymap_truncate(rel, 0);
-
 	if (RelationNeedsWAL(rel))
 	{
 		xl_smgr_truncate xlrec;
@@ -393,9 +392,11 @@ pg_truncate_visibility_map(PG_FUNCTION_ARGS)
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 
-		XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+		lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
 	}
 
+	visibilitymap_truncate(rel, 0, lsn);
+
 	/*
 	 * Release the lock right away, not at commit time.
 	 *
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..e93806d 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -457,9 +457,14 @@ visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_fro
  * before they access the VM again.
  *
  * nheapblocks is the new size of the heap.
+ *
+ * The caller must have WAL-logged the truncation already (if the relation
+ * needs WAL-logging at all). 'lsn' is the LSN of the XLOG record. It is
+ * used to stamp the last remaining VM page, so that it doesn't get flushed
+ * to disk before the WAL record.
  */
 void
-visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
+visibilitymap_truncate(Relation rel, BlockNumber nheapblocks, XLogRecPtr lsn)
 {
 	BlockNumber newnblocks;
 
@@ -524,6 +529,8 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
 		map[truncByte] &= (1 << truncOffset) - 1;
 
 		MarkBufferDirty(mapBuffer);
+		if (!XLogRecPtrIsInvalid(lsn))
+			PageSetLSN(page, lsn);
 		UnlockReleaseBuffer(mapBuffer);
 	}
 	else
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 0d8311c..6c6e21f 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -28,6 +28,7 @@
 #include "catalog/storage.h"
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
+#include "storage/proc.h"
 #include "storage/smgr.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -226,9 +227,6 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 void
 RelationTruncate(Relation rel, BlockNumber nblocks)
 {
-	bool		fsm;
-	bool		vm;
-
 	/* Open it at the smgr level if not already done */
 	RelationOpenSmgr(rel);
 
@@ -239,56 +237,85 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	rel->rd_smgr->smgr_fsm_nblocks = InvalidBlockNumber;
 	rel->rd_smgr->smgr_vm_nblocks = InvalidBlockNumber;
 
-	/* Truncate the FSM first if it exists */
-	fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
-	if (fsm)
-		FreeSpaceMapTruncateRel(rel, nblocks);
-
-	/* Truncate the visibility map too if it exists. */
-	vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
-	if (vm)
-		visibilitymap_truncate(rel, nblocks);
-
-	/*
-	 * We WAL-log the truncation before actually truncating, which means
-	 * trouble if the truncation fails. If we then crash, the WAL replay
-	 * likely isn't going to succeed in the truncation either, and cause a
-	 * PANIC. It's tempting to put a critical section here, but that cure
-	 * would be worse than the disease. It would turn a usually harmless
-	 * failure to truncate, that might spell trouble at WAL replay, into a
-	 * certain PANIC.
-	 */
-	if (RelationNeedsWAL(rel))
+	PG_TRY();
 	{
+		bool		fsm;
+		bool		vm;
+		XLogRecPtr	lsn = InvalidXLogRecPtr;
+
 		/*
-		 * Make an XLOG entry reporting the file truncation.
+		 * Check first whether the VM and FSM exist, to properly control
+		 * the flush of the WAL record marking the relation truncation.
 		 */
-		XLogRecPtr	lsn;
-		xl_smgr_truncate xlrec;
+		fsm = smgrexists(rel->rd_smgr, FSM_FORKNUM);
+		vm = smgrexists(rel->rd_smgr, VISIBILITYMAP_FORKNUM);
 
-		xlrec.blkno = nblocks;
-		xlrec.rnode = rel->rd_node;
-		xlrec.flags = SMGR_TRUNCATE_ALL;
+		/*
+		 * We WAL-log the truncation before actually truncating, which means
+		 * trouble if the truncation fails. If we then crash, the WAL replay
+		 * likely isn't going to succeed in the truncation either, and cause a
+		 * PANIC. It's tempting to put a critical section here, but that cure
+		 * would be worse than the disease. It would turn a usually harmless
+		 * failure to truncate, that might spell trouble at WAL replay, into a
+		 * certain PANIC.
+		 */
+		if (RelationNeedsWAL(rel))
+		{
+			/*
+			 * Make an XLOG entry reporting the file truncation.
+			 */
+			xl_smgr_truncate xlrec;
+
+			/*
+			 * Since we're not holding a buffer lock, use delayChkpt to
+			 * prevent a checkpoint from happening between WAL-logging, and
+			 * performing the truncation.
+			 */
+			MyPgXact->delayChkpt = true;
+
+			xlrec.blkno = nblocks;
+			xlrec.rnode = rel->rd_node;
+			xlrec.flags = SMGR_TRUNCATE_ALL;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+
+			lsn = XLogInsert(RM_SMGR_ID,
+							 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+
+			/*
+			 * Flush, because otherwise the truncation of the main relation
+			 * might hit the disk before the WAL record, and the truncation of
+			 * the FSM or visibility map. If we crashed during that window,
+			 * we'd be left with a truncated heap, but the FSM or visibility
+			 * map would still contain entries for the non-existent heap
+			 * pages.
+			 */
+			if (fsm || vm)
+				XLogFlush(lsn);
+		}
 
-		XLogBeginInsert();
-		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		/* Truncate the FSM first if it exists */
+		if (fsm)
+			FreeSpaceMapTruncateRel(rel, nblocks, lsn);
 
-		lsn = XLogInsert(RM_SMGR_ID,
-						 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+		/* Truncate the visibility map too if it exists. */
+		if (vm)
+			visibilitymap_truncate(rel, nblocks, lsn);
 
-		/*
-		 * Flush, because otherwise the truncation of the main relation might
-		 * hit the disk before the WAL record, and the truncation of the FSM
-		 * or visibility map. If we crashed during that window, we'd be left
-		 * with a truncated heap, but the FSM or visibility map would still
-		 * contain entries for the non-existent heap pages.
-		 */
-		if (fsm || vm)
-			XLogFlush(lsn);
+		/* Do the real work */
+		smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+	}
+	PG_CATCH();
+	{
+		if (RelationNeedsWAL(rel))
+			MyPgXact->delayChkpt = false;
+		PG_RE_THROW();
 	}
+	PG_END_TRY();
 
-	/* Do the real work */
-	smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
+	if (RelationNeedsWAL(rel))
+		MyPgXact->delayChkpt = false;
 }
 
 /*
@@ -536,10 +563,10 @@ smgr_redo(XLogReaderState *record)
 
 		if ((xlrec->flags & SMGR_TRUNCATE_FSM) != 0 &&
 			smgrexists(reln, FSM_FORKNUM))
-			FreeSpaceMapTruncateRel(rel, xlrec->blkno);
+			FreeSpaceMapTruncateRel(rel, xlrec->blkno, lsn);
 		if ((xlrec->flags & SMGR_TRUNCATE_VM) != 0 &&
 			smgrexists(reln, VISIBILITYMAP_FORKNUM))
-			visibilitymap_truncate(rel, xlrec->blkno);
+			visibilitymap_truncate(rel, xlrec->blkno, lsn);
 
 		FreeFakeRelcacheEntry(rel);
 	}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index bbd90c9..0e675c8 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -294,9 +294,14 @@ GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk)
  * before they access the FSM again.
  *
  * nblocks is the new size of the heap.
+ *
+ * The caller must have WAL-logged the truncation already (if the relation
+ * needs WAL-logging at all). 'lsn' is the LSN of the XLOG record. It is
+ * used to stamp the last remaining FSM page, so that it doesn't get flushed
+ * to disk before the WAL record.
  */
 void
-FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
+FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks, XLogRecPtr lsn)
 {
 	BlockNumber new_nfsmblocks;
 	FSMAddress	first_removed_address;
@@ -327,8 +332,13 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 		if (!BufferIsValid(buf))
 			return;				/* nothing to do; the FSM was already smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
 		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
-		MarkBufferDirtyHint(buf, false);
+
+		MarkBufferDirty(buf);
+		if (!XLogRecPtrIsInvalid(lsn))
+			PageSetLSN(BufferGetPage(buf), lsn);
+
 		UnlockReleaseBuffer(buf);
 
 		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
diff --git a/src/include/access/visibilitymap.h b/src/include/access/visibilitymap.h
index 00bbd4c..8fe8906 100644
--- a/src/include/access/visibilitymap.h
+++ b/src/include/access/visibilitymap.h
@@ -44,6 +44,6 @@ extern void visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 				  uint8 flags);
 extern uint8 visibilitymap_get_status(Relation rel, BlockNumber heapBlk, Buffer *vmbuf);
 extern void visibilitymap_count(Relation rel, BlockNumber *all_visible, BlockNumber *all_frozen);
-extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks);
+extern void visibilitymap_truncate(Relation rel, BlockNumber nheapblocks, XLogRecPtr lsn);
 
 #endif   /* VISIBILITYMAP_H */
diff --git a/src/include/storage/freespace.h b/src/include/storage/freespace.h
index 77b3bc3..ac014a4 100644
--- a/src/include/storage/freespace.h
+++ b/src/include/storage/freespace.h
@@ -14,6 +14,7 @@
 #ifndef FREESPACE_H_
 #define FREESPACE_H_
 
+#include "access/xlogdefs.h"
 #include "storage/block.h"
 #include "storage/relfilenode.h"
 #include "utils/relcache.h"
@@ -30,7 +31,7 @@ extern void RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
 extern void XLogRecordPageWithFreeSpace(RelFileNode rnode, BlockNumber heapBlk,
 							Size spaceAvail);
 
-extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks);
+extern void FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks, XLogRecPtr lsn);
 extern void FreeSpaceMapVacuum(Relation rel);
 extern void UpdateFreeSpaceMap(Relation rel,
 				   BlockNumber startBlkNum,
diff --git a/src/test/recovery/t/008_fsm_check.pl b/src/test/recovery/t/008_fsm_check.pl
new file mode 100644
index 0000000..2c58e31
--- /dev/null
+++ b/src/test/recovery/t/008_fsm_check.pl
@@ -0,0 +1,90 @@
+# Set of tests to check free-space map consistency on standbys
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1);
+
+$node_master->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = replica
+wal_log_hints = on
+max_prepared_transactions = 5
+autovacuum = off
+});
+
+# Create a master node and its standby, initializing both with
+# some data at the same time.
+$node_master->start;
+
+$node_master->backup('master_backup');
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, 'master_backup',
+	has_streaming => 1);
+$node_standby->start;
+
+$node_master->psql('postgres', qq{
+create table testtab (a int, b char(100));
+insert into testtab select generate_series(1,1000), 'foo';
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+});
+
+# Take a lock on the table to prevent the following vacuum from truncating it
+$node_master->psql('postgres', qq{
+begin;
+lock table testtab in row share mode;
+prepare transaction 'p1';
+});
+
+# Vacuum, update FSM without truncation
+$node_master->psql('postgres', 'vacuum verbose testtab');
+
+# Force a checkpoint
+$node_master->psql('postgres', 'checkpoint');
+
+# Now do some more inserts/deletes and another vacuum to ensure FPW writes
+# are done
+$node_master->psql('postgres', qq{
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+vacuum verbose testtab;
+});
+
+# Ensure all buffers are now clean on the standby
+$node_standby->psql('postgres', 'checkpoint');
+
+# Release the lock, vacuum again which should lead to truncation
+$node_master->psql('postgres', qq{
+rollback prepared 'p1';
+vacuum verbose testtab;
+});
+
+$node_master->psql('postgres', 'checkpoint');
+my $until_lsn =
+	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+
+# Wait long enough for standby to receive and apply all WAL
+my $caughtup_query =
+	"SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for standby to catch up";
+
+# now promote the standby
+$node_standby->promote;
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery()")
+  or die "Timed out while waiting for promotion of standby";
+$node_standby->psql('postgres', 'checkpoint');
+
+# restart to discard in-memory copy of FSM
+$node_standby->restart;
+
+# Insert should work on standby
+is($node_standby->psql('postgres',
+   qq{insert into testtab select generate_series(1,1000), 'foo';}),
+   0, 'INSERT succeeds with truncated relation FSM');
#12Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: FSM corruption leading to errors

On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> visibilitymap_truncate is actually also wrong, in a different way. The
> truncation WAL record is written only after the VM (and FSM) are truncated.
> But visibilitymap_truncate() has already modified and dirtied the page. If
> the VM page change is flushed to disk before the WAL record, and you crash,
> you might have a torn VM page and a checksum failure.

Right, I missed the problem with checksums.

> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
> FreeSpaceMapTruncateRel would have the same issue. If you call
> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
> to make sure the WAL record is flushed first.

I'm not sure I fully understand this. If the page is flushed before the WAL
record, we might have a truncated FSM whereas the relation truncation is
not replayed. But that's not the same problem, right? I mean, you might
have an FSM which is not accurate, but at least it won't return blocks
outside the valid range. Having said that, I agree your proposed changes
are more robust.

BTW, any thoughts on the race condition on the primary? The comments in
MarkBufferDirtyHint() suggest that a race condition is possible which might
leave the buffer without the DIRTY flag, but I'm not sure if that can only
happen when the page is locked in shared mode. While most of the reports so
far involved standbys (and the bug can also hit a standalone master during
crash recovery), I wonder if a race can occur even on a live system.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#13Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Pavan Deolasee (#12)
1 attachment(s)
Re: FSM corruption leading to errors

On 10/18/2016 07:01 AM, Pavan Deolasee wrote:

> On Mon, Oct 17, 2016 at 4:34 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
>> visibilitymap_truncate is actually also wrong, in a different way. The
>> truncation WAL record is written only after the VM (and FSM) are truncated.
>> But visibilitymap_truncate() has already modified and dirtied the page. If
>> the VM page change is flushed to disk before the WAL record, and you crash,
>> you might have a torn VM page and a checksum failure.
>
> Right, I missed the problem with checksums.
>
>> Simply replacing the MarkBufferDirtyHint() call with MarkBufferDirty() in
>> FreeSpaceMapTruncateRel would have the same issue. If you call
>> MarkBufferDirty(), you must WAL-log the change, and also set the page's LSN
>> to make sure the WAL record is flushed first.
>
> I'm not sure I fully understand this. If the page is flushed before the WAL
> record, we might have a truncated FSM whereas the relation truncation is
> not replayed. But that's not the same problem, right? I mean, you might
> have an FSM which is not accurate, but at least it won't return blocks
> outside the valid range. Having said that, I agree your proposed changes
> are more robust.

Actually, this is still not 100% safe. Flushing the WAL before modifying
the FSM page is not enough. We also need to WAL-log a full-page image of
the FSM page, otherwise we are still vulnerable to the torn page problem.

I came up with the attached. This is fortunately much simpler than my
previous attempt. I replaced the MarkBufferDirtyHint() calls with
MarkBufferDirty(), to fix the original issue, plus WAL-logging a
full-page image to fix the torn page issue.

> BTW, any thoughts on the race condition on the primary? The comments in
> MarkBufferDirtyHint() suggest that a race condition is possible which might
> leave the buffer without the DIRTY flag, but I'm not sure if that can only
> happen when the page is locked in shared mode.

I think the race condition can only happen when the page is locked in
shared mode. In any case, with this proposed fix, we'll use
MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.

- Heikki

Attachments:

truncation-vm-fsm-2.patch (application/x-download)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 3ad4a9f..f020737 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -508,6 +508,9 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
 
 		LockBuffer(mapBuffer, BUFFER_LOCK_EXCLUSIVE);
 
+		/* NO EREPORT(ERROR) from here till changes are logged */
+		START_CRIT_SECTION();
+
 		/* Clear out the unwanted bytes. */
 		MemSet(&map[truncByte + 1], 0, MAPSIZE - (truncByte + 1));
 
@@ -523,7 +526,20 @@ visibilitymap_truncate(Relation rel, BlockNumber nheapblocks)
 		 */
 		map[truncByte] &= (1 << truncOffset) - 1;
 
+		/*
+		 * Truncation of a relation is WAL-logged at a higher-level, and we
+		 * will be called at WAL replay. But if checksums are enabled, we need
+		 * to still write a WAL record to protect against a torn page, if the
+		 * page is flushed to disk before the truncation WAL record. We cannot
+		 * use MarkBufferDirtyHint here, because that will not dirty the page
+		 * during recovery.
+		 */
 		MarkBufferDirty(mapBuffer);
+		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+			log_newpage_buffer(mapBuffer, false);
+
+		END_CRIT_SECTION();
+
 		UnlockReleaseBuffer(mapBuffer);
 	}
 	else
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index bbd90c9..4138b04 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -327,8 +327,26 @@ FreeSpaceMapTruncateRel(Relation rel, BlockNumber nblocks)
 		if (!BufferIsValid(buf))
 			return;				/* nothing to do; the FSM was already smaller */
 		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+
+		/* NO EREPORT(ERROR) from here till changes are logged */
+		START_CRIT_SECTION();
+
 		fsm_truncate_avail(BufferGetPage(buf), first_removed_slot);
-		MarkBufferDirtyHint(buf, false);
+
+		/*
+		 * Truncation of a relation is WAL-logged at a higher-level, and we
+		 * will be called at WAL replay. But if checksums are enabled, we need
+		 * to still write a WAL record to protect against a torn page, if the
+		 * page is flushed to disk before the truncation WAL record. We cannot
+		 * use MarkBufferDirtyHint here, because that will not dirty the page
+		 * during recovery.
+		 */
+		MarkBufferDirty(buf);
+		if (!InRecovery && RelationNeedsWAL(rel) && XLogHintBitIsNeeded())
+			log_newpage_buffer(buf, false);
+
+		END_CRIT_SECTION();
+
 		UnlockReleaseBuffer(buf);
 
 		new_nfsmblocks = fsm_logical_to_physical(first_removed_address) + 1;
diff --git a/src/test/recovery/t/008_fsm_check.pl b/src/test/recovery/t/008_fsm_check.pl
new file mode 100644
index 0000000..2c58e31
--- /dev/null
+++ b/src/test/recovery/t/008_fsm_check.pl
@@ -0,0 +1,90 @@
+# Set of tests to check free-space map consistency on standbys
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 1;
+
+my $node_master = get_new_node('master');
+$node_master->init(allows_streaming => 1);
+
+$node_master->append_conf('postgresql.conf', qq{
+fsync = on
+wal_level = replica
+wal_log_hints = on
+max_prepared_transactions = 5
+autovacuum = off
+});
+
+# Create a master node and its standby, initializing both with
+# some data at the same time.
+$node_master->start;
+
+$node_master->backup('master_backup');
+my $node_standby = get_new_node('standby');
+$node_standby->init_from_backup($node_master, 'master_backup',
+	has_streaming => 1);
+$node_standby->start;
+
+$node_master->psql('postgres', qq{
+create table testtab (a int, b char(100));
+insert into testtab select generate_series(1,1000), 'foo';
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+});
+
+# Take a lock on the table to prevent the following vacuum from truncating it
+$node_master->psql('postgres', qq{
+begin;
+lock table testtab in row share mode;
+prepare transaction 'p1';
+});
+
+# Vacuum, update FSM without truncation
+$node_master->psql('postgres', 'vacuum verbose testtab');
+
+# Force a checkpoint
+$node_master->psql('postgres', 'checkpoint');
+
+# Now do some more inserts/deletes and another vacuum to ensure FPW writes
+# are done
+$node_master->psql('postgres', qq{
+insert into testtab select generate_series(1,1000), 'foo';
+delete from testtab where ctid > '(8,0)';
+vacuum verbose testtab;
+});
+
+# Ensure all buffers are now clean on the standby
+$node_standby->psql('postgres', 'checkpoint');
+
+# Release the lock, vacuum again which should lead to truncation
+$node_master->psql('postgres', qq{
+rollback prepared 'p1';
+vacuum verbose testtab;
+});
+
+$node_master->psql('postgres', 'checkpoint');
+my $until_lsn =
+	$node_master->safe_psql('postgres', "SELECT pg_current_xlog_location();");
+
+# Wait long enough for standby to receive and apply all WAL
+my $caughtup_query =
+	"SELECT '$until_lsn'::pg_lsn <= pg_last_xlog_replay_location()";
+$node_standby->poll_query_until('postgres', $caughtup_query)
+	or die "Timed out while waiting for standby to catch up";
+
+# now promote the standby
+$node_standby->promote;
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery()")
+  or die "Timed out while waiting for promotion of standby";
+$node_standby->psql('postgres', 'checkpoint');
+
+# restart to discard in-memory copy of FSM
+$node_standby->restart;
+
+# Insert should work on standby
+is($node_standby->psql('postgres',
+   qq{insert into testtab select generate_series(1,1000), 'foo';}),
+   0, 'INSERT succeeds with truncated relation FSM');
#14Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: FSM corruption leading to errors

On Wed, Oct 19, 2016 at 2:37 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> Actually, this is still not 100% safe. Flushing the WAL before modifying
> the FSM page is not enough. We also need to WAL-log a full-page image of
> the FSM page, otherwise we are still vulnerable to the torn page problem.
>
> I came up with the attached. This is fortunately much simpler than my
> previous attempt. I replaced the MarkBufferDirtyHint() calls with
> MarkBufferDirty(), to fix the original issue, plus WAL-logging a full-page
> image to fix the torn page issue.

Looks good to me.

>> BTW, any thoughts on the race condition on the primary? The comments in
>> MarkBufferDirtyHint() suggest that a race condition is possible which might
>> leave the buffer without the DIRTY flag, but I'm not sure if that can only
>> happen when the page is locked in shared mode.
>
> I think the race condition can only happen when the page is locked in
> shared mode. In any case, with this proposed fix, we'll use
> MarkBufferDirty() rather than MarkBufferDirtyHint(), so it's moot.

Yes, the fix will cover that problem (if it exists). The reason I was
curious is that there are several reports of a similar error in the past,
and some of them did not involve a standby. Those reports mostly remained
unresolved, and I wondered if this explains them. But yeah, my conclusion
was that the race is not possible with the page locked in EXCLUSIVE mode.
So maybe there is another problem somewhere, or crash recovery may have
left the FSM in an inconsistent state.

Anyways, we seem good to go with the patch.

Thanks,
Pavan
--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Pavan Deolasee (#14)
Re: FSM corruption leading to errors

On 10/19/2016 01:07 PM, Pavan Deolasee wrote:

> Anyways, we seem good to go with the patch.

Ok, committed. Thanks for the analysis!

- Heikki

#16Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#15)
Re: FSM corruption leading to errors

On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:

> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Oh, forgot that this needs to be backported, of course. Will do that
shortly...

- Heikki

#17Michael Paquier
michael.paquier@gmail.com
In reply to: Heikki Linnakangas (#15)
Re: FSM corruption leading to errors

On Wed, Oct 19, 2016 at 8:29 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>
>> Anyways, we seem good to go with the patch.
>
> Ok, committed. Thanks for the analysis!

Thanks! I am surprised that you kept the TAP test at the end.
--
Michael

#18Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#16)
Re: FSM corruption leading to errors

On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:

> On 10/19/2016 02:29 PM, Heikki Linnakangas wrote:
>
>> On 10/19/2016 01:07 PM, Pavan Deolasee wrote:
>>
>>> Anyways, we seem good to go with the patch.
>>
>> Ok, committed. Thanks for the analysis!
>
> Oh, forgot that this needs to be backported, of course. Will do that
> shortly...

Done.

This didn't include anything to cope with an already-corrupt FSM, BTW.
Do we still want to try something for that? I think it's good enough if
we prevent the FSM corruption from happening, but not sure what the
consensus on that might be..

- Heikki

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#18)
Re: FSM corruption leading to errors

Heikki Linnakangas <hlinnaka@iki.fi> writes:

> This didn't include anything to cope with an already-corrupt FSM, BTW.
> Do we still want to try something for that? I think it's good enough if
> we prevent the FSM corruption from happening, but not sure what the
> consensus on that might be..

Can we document an existing procedure for repairing FSM corruption?
(VACUUM, maybe?) We're going to need to be able to explain how to
fix busted visibility maps in 9.6.1, too.

regards, tom lane

#20Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: FSM corruption leading to errors

On Wed, Oct 19, 2016 at 6:44 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

> On 10/19/2016 02:32 PM, Heikki Linnakangas wrote:
>
>> Oh, forgot that this needs to be backported, of course. Will do that
>> shortly...
>
> Done.

Thanks!

> This didn't include anything to cope with an already-corrupt FSM, BTW. Do
> we still want to try something for that? I think it's good enough if we
> prevent the FSM corruption from happening, but not sure what the consensus
> on that might be..

I thought it would be nice to handle an already-corrupt FSM, since our
customer found it immediately after a failover and it was a serious issue.
In one case, a system table was affected, thus preventing all DDLs from
running. Having said that, I don't have a better idea for handling the
problem without causing non-trivial overhead for normal cases (see my
original patch). If you've got better ideas, it might be worth pursuing.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#21Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#19)
Re: FSM corruption leading to errors

On Wed, Oct 19, 2016 at 6:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Can we document an existing procedure for repairing FSM corruption?
> (VACUUM, maybe?)

I'm afraid it may not be easy to repair the corruption with existing
facilities. Most often the corruption will be on the standby, and a VACUUM
may not actually touch the affected pages on the master (because they may
not even exist on the master, or may be skipped because of the visibility
map). It may not even trigger relation truncation. So AFAICS it may not
generate any WAL activity that can fix the corruption on the standby.

One possible way would be to delete the FSM (and VM) information on the
master and standby and then run VACUUM so these maps are rebuilt. We
obviously don't need to do this for all tables, but we need a way to find
the tables with a corrupt FSM [1].

Suggested procedure could be:

1. Upgrade master and standby to the latest minor release (which involves a
restart).
2. Install the pg_freespacemap extension and run the query [1] on the master
to find possible corruption cases; a single-table version is sketched below.
The query checks if the FSM reports free space in a block outside the size
of the relation. Unfortunately, we might have false positives if the
relation is extended while the query is running.
3. Repeat the same query on the standby (if it's running in hot standby
mode; otherwise the corruption can only be detected once it's promoted to
be a master).
4. Remove the FSM and VM files for the affected tables (I don't think it's
safe to do this on a running server).
5. VACUUM the affected tables so that the FSM and VM are rebuilt.
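
For step 2, here is a minimal sketch probing a single table (the table name
'testtab' is illustrative; it uses the two-argument form of pg_freespace(),
since the one-argument form only visits blocks that still exist in the heap):

CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- Probe FSM slots that lie beyond the current end of the heap; any row
-- returned means the FSM still advertises free space in truncated blocks.
SELECT blkno, pg_freespace('testtab'::regclass, blkno) AS avail
FROM generate_series(
         pg_relation_size('testtab') / current_setting('block_size')::bigint,
         pg_relation_size('testtab', 'fsm') / 2) AS blkno
WHERE pg_freespace('testtab'::regclass, blkno) > 0;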

Another idea is to implement a pg_freespace_repair() function in
pg_freespacemap which takes an AccessExclusiveLock on the table and
truncates it to its current size, thus generating a WAL record that the
standby will replay to fix the corruption. This probably looks more
promising, easier to explain, and less error prone.

[1]:
SELECT *
FROM (
    SELECT oid::regclass as relname, EXISTS (
        SELECT *
        FROM (
            SELECT blkno, pg_freespace(oid::regclass, blkno)
            FROM generate_series(pg_relation_size(oid::regclass) /
                     current_setting('block_size')::bigint,
                     pg_relation_size(oid::regclass, 'fsm') / 2) as blkno
        ) as avail
        WHERE pg_freespace > 0
    ) as corrupt_fsm
    FROM pg_class
    WHERE relkind = 'r'
) b
WHERE b.corrupt_fsm = true;

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#22Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#21)
Re: FSM corruption leading to errors

On Thu, Oct 20, 2016 at 2:11 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

> 4. Remove the FSM and VM files for the affected tables (I don't think it's
> safe to do this on a running server).

Definitely not while the server is running... For VMs a good way would
be to use pg_visibility's pg_truncate_visibility_map(), but only for
9.6~. For FSM there is no real solution, and actually a
pg_truncate_fsm would prove to be useful here. So users would need to
stop the server if there are corrupted VMs or FSMs, delete them
manually, and then run VACUUM to rebuild them.
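
For 9.6 and later, that looks like the sketch below
(pg_truncate_visibility_map() ships in the pg_visibility contrib module;
the pg_truncate_fsm() call is only the hypothetical counterpart mentioned
above and does not exist):

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Removes the table's visibility map fork entirely; a later VACUUM
-- rebuilds it ('testtab' is an illustrative name).
SELECT pg_truncate_visibility_map('testtab'::regclass);

-- Hypothetical, not in core or contrib:
-- SELECT pg_truncate_fsm('testtab'::regclass);
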
--
Michael

#23Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#22)
Re: FSM corruption leading to errors

On Thu, Oct 20, 2016 at 10:50 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> For VMs a good way would
> be to use pg_visibility's pg_truncate_visibility_map(), but only for
> 9.6~.

Ah ok..

> For FSM there is no real solution, and actually a
> pg_truncate_fsm would prove to be useful here.

Right, that's what I proposed as an alternate idea. I agree this is a much
cleaner and less error-prone approach.

Actually, if we could add an API which can truncate FSM to the given heap
block, then the user may not even need to run VACUUM, which could be costly
for very large tables. Also, AFAICS we will need to backport
pg_truncate_visibility_map() to older releases because unless the VM is
truncated along with the FSM, VACUUM may not scan all pages and the FSM for
those pages won't be recorded.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#24Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#23)
Re: FSM corruption leading to errors

On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

> Actually, if we could add an API which can truncate FSM to the given heap
> block, then the user may not even need to run VACUUM, which could be costly
> for very large tables.

FreeSpaceMapTruncateRel()?

> Also, AFAICS we will need to backport
> pg_truncate_visibility_map() to older releases because unless the VM is
> truncated along with the FSM, VACUUM may not scan all pages and the FSM for
> those pages won't be recorded.

This would need a careful lookup, and it would not be that complicated
to implement. And this bug justifies the presence of a tool like that
actually... But usually those don't get a backpatch, so the
probability of getting that backported is low IMO, personally I am not
sure I like that either.
--
Michael

#25Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Michael Paquier (#24)
Re: FSM corruption leading to errors

On Thu, Oct 20, 2016 at 11:34 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> On Thu, Oct 20, 2016 at 2:50 PM, Pavan Deolasee
> <pavan.deolasee@gmail.com> wrote:
>
>> Actually, if we could add an API which can truncate FSM to the given heap
>> block, then the user may not even need to run VACUUM, which could be
>> costly for very large tables.
>
> FreeSpaceMapTruncateRel()?

Right. We need a SQL-callable function to invoke that.

>> Also, AFAICS we will need to backport
>> pg_truncate_visibility_map() to older releases because unless the VM is
>> truncated along with the FSM, VACUUM may not scan all pages and the FSM
>> for those pages won't be recorded.
>
> This would need a careful lookup, and it would not be that complicated
> to implement. And this bug justifies the presence of a tool like that
> actually... But usually those don't get a backpatch, so the
> probability of getting that backported is low IMO, personally I am not
> sure I like that either.

Just to clarify, I meant that if we truncate the entire FSM then we'll need
an API to truncate the VM as well, so that VACUUM rebuilds everything
completely. OTOH if we provide a function just to truncate the FSM to match
the size of the table, then we don't need to rebuild the FSM. So that's
probably a better way to handle FSM corruption, as far as this particular
issue is concerned.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#26Michael Paquier
michael.paquier@gmail.com
In reply to: Pavan Deolasee (#25)
1 attachment(s)
Re: FSM corruption leading to errors

On Thu, Oct 20, 2016 at 3:37 PM, Pavan Deolasee
<pavan.deolasee@gmail.com> wrote:

> Just to clarify, I meant that if we truncate the entire FSM then we'll need
> an API to truncate the VM as well, so that VACUUM rebuilds everything
> completely. OTOH if we provide a function just to truncate the FSM to match
> the size of the table, then we don't need to rebuild the FSM. So that's
> probably a better way to handle FSM corruption, as far as this particular
> issue is concerned.

To be honest, I think that just having in the release notes the method
that does not involve the use of any extra extension or SQL routine is
fine. So we could just tell users to:
1) Run something like the query you gave upthread, giving the user
a list of the files that are corrupted. And add this query to the
release notes.
2) If anything is found, stop the server and delete the files manually.
3) Re-start the server.
OK, that's troublesome and costly for large relations, but we know
that's the safest way to go for any versions, and there is no need to
complicate the code with any one-time repairing extensions.
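
For step 2), the paths of the files to delete can be collected in SQL before
stopping the server (a sketch, with 'testtab' standing for any relation
flagged by the query):

-- Paths are relative to the data directory; the FSM and VM forks live
-- next to the main fork with '_fsm' and '_vm' suffixes.
SELECT pg_relation_filepath('testtab') || '_fsm' AS fsm_path,
       pg_relation_filepath('testtab') || '_vm'  AS vm_path;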

Speaking of which, I implemented a small extension able to truncate
the FSM up to the size of the relation, as attached. But as I looked at
it, SMGR_TRUNCATE_FSM was introduced in 9.6, so its range of action
is rather limited... And I pushed as well a version on github:
https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
The limited range of such an extension is an argument good enough to
just rely on the stop/delete-FSM/start method to fix an instance and
let VACUUM do the rest of the work. It looks to work, but use it at
your own risk.

This bug would be a good blog topic by the way...
--
Michael

Attachments:

pg_fix_truncation.tar.gz (application/x-gzip)
#27Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Michael Paquier (#26)
Re: FSM corruption leading to errors

On 10/20/16 10:15 PM, Michael Paquier wrote:

> 2) If anything is found, stop the server and delete the files manually.
> 3) Re-start the server.
> OK, that's troublesome and costly for large relations, but we know
> that's the safest way to go for any versions, and there is no need to
> complicate the code with any one-time repairing extensions.

Wouldn't you need to run around to all your replicas and do that as well?

> Speaking of which, I implemented a small extension able to truncate
> the FSM up to the size of the relation, as attached. But as I looked at
> it, SMGR_TRUNCATE_FSM was introduced in 9.6, so its range of action
> is rather limited... And I pushed as well a version on github:
> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
> The limited range of such an extension is an argument good enough to
> just rely on the stop/delete-FSM/start method to fix an instance and
> let VACUUM do the rest of the work. It looks to work, but use it at
> your own risk.

Shouldn't the truncation be logged before it's performed?
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532) mobile: 512-569-9461

#28Michael Paquier
michael.paquier@gmail.com
In reply to: Jim Nasby (#27)
Re: FSM corruption leading to errors

On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:

> On 10/20/16 10:15 PM, Michael Paquier wrote:
>
>> 2) If anything is found, stop the server and delete the files manually.
>> 3) Re-start the server.
>> OK, that's troublesome and costly for large relations, but we know
>> that's the safest way to go for any versions, and there is no need to
>> complicate the code with any one-time repairing extensions.
>
> Wouldn't you need to run around to all your replicas and do that as well?

Yeah...

>> Speaking of which, I implemented a small extension able to truncate
>> the FSM up to the size of the relation, as attached. But as I looked at
>> it, SMGR_TRUNCATE_FSM was introduced in 9.6, so its range of action
>> is rather limited... And I pushed as well a version on github:
>> https://github.com/michaelpq/pg_plugins/tree/master/pg_fix_truncation
>> The limited range of such an extension is an argument good enough to
>> just rely on the stop/delete-FSM/start method to fix an instance and
>> let VACUUM do the rest of the work. It looks to work, but use it at
>> your own risk.
>
> Shouldn't the truncation be logged before it's performed?

Doh. Yes, thanks for the reminder. And I commented on that upthread..
--
Michael

#29Michael Paquier
michael.paquier@gmail.com
In reply to: Michael Paquier (#28)
Re: FSM corruption leading to errors

On Sat, Oct 22, 2016 at 7:31 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:

> On Sat, Oct 22, 2016 at 5:17 AM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>
>> On 10/20/16 10:15 PM, Michael Paquier wrote:
>>
>>> 2) If anything is found, stop the server and delete the files manually.
>>> 3) Re-start the server.
>>> OK, that's troublesome and costly for large relations, but we know
>>> that's the safest way to go for any versions, and there is no need to
>>> complicate the code with any one-time repairing extensions.
>>
>> Wouldn't you need to run around to all your replicas and do that as well?
>
> Yeah...

The release notes refer to this page for methods to fix corrupted instances:
https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
I just typed up something based on Pavan's method upthread; feel free to
jump in and improve it as necessary.
--
Michael

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Michael Paquier (#29)
Re: FSM corruption leading to errors

Michael Paquier <michael.paquier@gmail.com> writes:

> The release notes refer to this page for methods to fix corrupted instances:
> https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
> I just typed up something based on Pavan's method upthread; feel free to
> jump in and improve it as necessary.

Thanks for drafting that, but isn't the query wrong?
Specifically the bit about

SELECT blkno, pg_freespace(oid::regclass, blkno)
FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
                     pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno

It looks to me like this is approximating the highest block number that
could possibly have an FSM entry as size of the FSM fork (in bytes)
divided by 2. But the FSM stores one byte per block. There is overhead
for the FSM search tree, but in a large relation it's not going to be as
much as a factor of 2. So I think that to be conservative we need to
drop the "/ 2". Am I missing something?

regards, tom lane

#31Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#30)
Re: FSM corruption leading to errors

On Mon, Oct 24, 2016 at 9:34 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> SELECT blkno, pg_freespace(oid::regclass, blkno)
> FROM generate_series(pg_relation_size(oid::regclass) / current_setting('block_size')::BIGINT,
>                      pg_relation_size(oid::regclass, 'fsm') / 2) AS blkno
>
> It looks to me like this is approximating the highest block number that
> could possibly have an FSM entry as size of the FSM fork (in bytes)
> divided by 2. But the FSM stores one byte per block. There is overhead
> for the FSM search tree, but in a large relation it's not going to be as
> much as a factor of 2. So I think that to be conservative we need to
> drop the "/ 2". Am I missing something?

I went by these comments in fsm_internals.h, which suggest that
SlotsPerFSMPage is limited to somewhat less than BLCKSZ divided by 2.

/*
* Number of non-leaf and leaf nodes, and nodes in total, on an FSM page.
* These definitions are internal to fsmpage.c.
*/
#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \
                      offsetof(FSMPageData, fp_nodes))

#define NonLeafNodesPerPage (BLCKSZ / 2 - 1)
#define LeafNodesPerPage (NodesPerPage - NonLeafNodesPerPage)

/*
* Number of FSM "slots" on a FSM page. This is what should be used
* outside fsmpage.c.
*/
#define SlotsPerFSMPage LeafNodesPerPage
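
For what it's worth, plugging in the default 8 kB block size (and assuming
the usual 24-byte page header plus the 4-byte fp_next_slot field ahead of
fp_nodes) gives:

-- NodesPerPage        = 8192 - 24 - 4 = 8164
-- NonLeafNodesPerPage = 8192 / 2 - 1  = 4095
-- SlotsPerFSMPage     = 8164 - 4095   = 4069, just under BLCKSZ / 2
SELECT (8192 - 24 - 4) - (8192 / 2 - 1) AS slots_per_fsm_page;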

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#32Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#30)
Re: FSM corruption leading to errors

I wrote:

> It looks to me like this is approximating the highest block number that
> could possibly have an FSM entry as size of the FSM fork (in bytes)
> divided by 2. But the FSM stores one byte per block. There is overhead
> for the FSM search tree, but in a large relation it's not going to be as
> much as a factor of 2. So I think that to be conservative we need to
> drop the "/ 2". Am I missing something?

Ah, scratch that, after rereading the FSM README I see it's correct,
because there's a binary tree within each page; I'd only remembered
that there was a search tree of pages.

Also, we could at least discount the FSM root page and first intermediate
page, no? That is, the upper limit could be

pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT

I think this is a worthwhile improvement because it reduces the time spent
on small relations. For me, the query as given takes 9 seconds to examine
the regression database, which seems like a lot. Discounting two pages
reduces that to 20 ms.

regards, tom lane

#33Pavan Deolasee
pavan.deolasee@gmail.com
In reply to: Tom Lane (#32)
Re: FSM corruption leading to errors

On Mon, Oct 24, 2016 at 9:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Also, we could at least discount the FSM root page and first intermediate
> page, no? That is, the upper limit could be
>
> pg_relation_size(oid::regclass, 'fsm') / 2 -
>     2*current_setting('block_size')::BIGINT

+1 for doing that.

Thanks,
Pavan

--
Pavan Deolasee http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#34Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Tom Lane (#32)
Re: FSM corruption leading to errors

Tom Lane wrote:

> Ah, scratch that, after rereading the FSM README I see it's correct,
> because there's a binary tree within each page; I'd only remembered
> that there was a search tree of pages.
>
> Also, we could at least discount the FSM root page and first intermediate
> page, no? That is, the upper limit could be
>
> pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
>
> I think this is a worthwhile improvement because it reduces the time spent
> on small relations. For me, the query as given takes 9 seconds to examine
> the regression database, which seems like a lot. Discounting two pages
> reduces that to 20 ms.

Hah, good one. We spent some time thinking about subtracting some value
to make the value more accurate, but it didn't occur to me to just use a
constant two.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#34)
Re: FSM corruption leading to errors

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

> Tom Lane wrote:
>
>> Also, we could at least discount the FSM root page and first intermediate
>> page, no? That is, the upper limit could be
>>
>> pg_relation_size(oid::regclass, 'fsm') / 2 - 2*current_setting('block_size')::BIGINT
>>
>> I think this is a worthwhile improvement because it reduces the time spent
>> on small relations. For me, the query as given takes 9 seconds to examine
>> the regression database, which seems like a lot. Discounting two pages
>> reduces that to 20 ms.
>
> Hah, good one. We spent some time thinking about subtracting some value
> to make the value more accurate, but it didn't occur to me to just use a
> constant two.

I got the arithmetic wrong in the above, it should be like

(pg_relation_size(oid::regclass, 'fsm') - 2*current_setting('block_size')::BIGINT) / 2

With that, the runtime on HEAD's regression DB is about 700 ms, which is
still a nice win over 9000 ms.
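
Spelled out for a single table, the probe with the corrected bound becomes
(a sketch; 'testtab' is an illustrative name):

SELECT blkno, pg_freespace('testtab'::regclass, blkno) AS avail
FROM generate_series(
         pg_relation_size('testtab') / current_setting('block_size')::bigint,
         (pg_relation_size('testtab', 'fsm')
          - 2 * current_setting('block_size')::bigint) / 2) AS blkno
WHERE pg_freespace('testtab'::regclass, blkno) > 0;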

I've put up draft wiki pages about these two problems at

https://wiki.postgresql.org/wiki/Free_Space_Map_Problems
https://wiki.postgresql.org/wiki/Visibility_Map_Problems

(thanks to Michael Paquier for initial work on the first one).
They're meant to be reasonably generic about FSM/VM problems
rather than only being about our current bugs. Please review.

regards, tom lane
