Why are we PageInit'ing buffers in RelationAddExtraBlocks()?
Hi,
The zheap patchset, even after being based on pluggable storage,
currently has the following condition in RelationAddExtraBlocks():
if (RelationStorageIsZHeap(relation))
{
Assert(BufferGetBlockNumber(buffer) != ZHEAP_METAPAGE);
ZheapInitPage(page, BufferGetPageSize(buffer));
freespace = PageGetZHeapFreeSpace(page);
}
else
{
PageInit(page, BufferGetPageSize(buffer), 0);
freespace = PageGetHeapFreeSpace(page);
}
I.e. it initializes the page differently when zheap is used versus
heap.
Thinking about whether it's worth to allow to extend that function in an
extensible manner made me wonder: Is it actually a good idea to
initialize the page at that point, including marking it dirty?
As far as I can tell that has several downsides:
- Dirtying the buffer for initialization will cause potentially
superfluous IO, with no interesting data in the write except for a
newly initialized page.
- As there's no sort of interlock, it's entirely possible that, after a
crash, the blocks will come up empty, but with the FSM still returning
them as free space, so that path needs to be supported anyway.
- It adds heap-specific code to a routine that otherwise could be
generic for different table access methods.
It seems to me this could be optimized by *not* initializing the page,
and instead adding a PageIsNew() check at the places that access the
page, initializing it there in that case.
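Roughly, the check at the point of use (in RelationGetBufferForTuple(),
using its local variable names) could look like the below - an untested,
illustrative sketch only:

    /*
     * After pinning and locking the target buffer: initialize the page
     * lazily here, instead of doing so at extension time.
     */
    page = BufferGetPage(buffer);

    if (PageIsNew(page))
    {
        PageInit(page, BufferGetPageSize(buffer), 0);
        MarkBufferDirty(buffer);
    }

    pageFreeSpace = PageGetHeapFreeSpace(page);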
Right now we'll just ignore such pages when we encounter them
(e.g. after a crash) until vacuum initializes them. But given that we've
accepted that we'll potentially create empty pages anyway, I don't see
what advantages that provides? The warnings emitted by vacuum are
pretty scary, and can happen in a lot of legitimate cases, so removing
them seems like a good idea anyway.
I'm pretty tired, so I might be missing something large and obvious
here.
Greetings,
Andres Freund
On Wed, Dec 19, 2018 at 2:09 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
The zheap patchset, even after being based on pluggable storage,
currently has the following condition in RelationAddExtraBlocks():
if (RelationStorageIsZHeap(relation))
{
Assert(BufferGetBlockNumber(buffer) != ZHEAP_METAPAGE);
ZheapInitPage(page, BufferGetPageSize(buffer));
freespace = PageGetZHeapFreeSpace(page);
}
else
{
PageInit(page, BufferGetPageSize(buffer), 0);
freespace = PageGetHeapFreeSpace(page);
}
I.e. it initializes the page differently when zheap is used versus
heap.
Thinking about whether it's worth to allow to extend that function in an
extensible manner made me wonder: Is it actually a good idea to
initialize the page at that point, including marking it dirty?
As far as I can tell that has several downsides:
- Dirtying the buffer for initialization will cause potentially
superfluous IO, with no interesting data in the write except for a
newly initialized page.
- As there's no sort of interlock, it's entirely possible that, after a
crash, the blocks will come up empty, but with the FSM still returning
them as free space, so that path needs to be supported anyway.
- It adds heap specific code to a routine that otherwise could be
generic for different table access methods
IIUC, your proposal is to remove the page initialization and
MarkBufferDirty() from RelationAddExtraBlocks(), but still record the
new blocks in the FSM. Is my understanding correct? If so, I don't see
any problem with that, and as you have mentioned, it will be generally
advantageous as well.
It seems to me this could be optimized by *not* initializing the page,
and instead adding a PageIsNew() check at the places that access the
page, initializing it there in that case.
makes sense to me.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 19, 2018 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
Thinking about whether it's worth to allow to extend that function in an
extensible manner made me wonder: Is it actually a good idea to
initialize the page at that point, including marking it dirty?
As far as I can recall, my motivation was to avoid increasing the
number of warnings produced by VACUUM. If those warnings are going
away, then I don't know that there's any reason to keep that code as
it is. But I am not sure whether such a move would provoke any
opposition.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-12-19 17:28:17 -0500, Robert Haas wrote:
On Wed, Dec 19, 2018 at 3:39 AM Andres Freund <andres@anarazel.de> wrote:
Thinking about whether it's worth to allow to extend that function in an
extensible manner made me wonder: Is it actually a good idea to
initialize the page at that point, including marking it dirty?
As far as I can recall, my motivation was to avoid increasing the
number of warnings produced by VACUUM. If those warnings are going
away, then I don't know that there's any reason to keep that code as
it is. But I am not sure whether such a move would provoke any
opposition.
What's gained by the logic of emitting that warning in VACUUM after a
crash? I don't really see any robustness advantages in it. If the logic
were that we'd never reuse empty pages because they can hide corruption
that normally would be discovered by checksums, then we shouldn't
reinitialize them at all and instead error out hard - but we can't do
that, because it's normal that they occur. Right now we have empty
pages on-disk whenever a busy server is restarted in immediate mode,
after all.
Greetings,
Andres Freund
On Wed, Dec 19, 2018 at 5:37 PM Andres Freund <andres@anarazel.de> wrote:
What's gained by the logic of emitting that warning in VACUUM after a
crash? I don't really see any robustness advantages in it. If the logic
were that we'd never reuse empty pages because they can hide corruption
that normally would discovered by checksums, then we shouldn't
reinitialize them at all and instead error out hard - but we can't do
that, because it's normal that they occur. Right now we have empty
pages on-disk whenever a busy server is restarted in immediate mode,
after all.
I don't know. I am just normally reluctant to change things
precipitously that are of long tenure.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Dec 19, 2018 at 5:37 PM Andres Freund <andres@anarazel.de> wrote:
What's gained by the logic of emitting that warning in VACUUM after a
crash? I don't really see any robustness advantages in it.
I don't know. I am just normally reluctant to change things
precipitously that are of long tenure.
Me too, but I think Andres has a point here. Those warnings in VACUUM
are ancient, probably predating the introduction of WAL :-(. At the
time there was good reason to be suspicious of zeroed pages in tables.
Now, though, we have (what we think is) a bulletproof crash recovery
procedure in which possibly-zeroed pages are to be expected; so we're
just causing users unnecessary alarm by warning about them. I think
it's reasonable to, if not remove the messages entirely, at least
downgrade them to a low DEBUG level.
regards, tom lane
Hi,
On 2018-12-19 19:39:33 -0500, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Dec 19, 2018 at 5:37 PM Andres Freund <andres@anarazel.de> wrote:
What's gained by the logic of emitting that warning in VACUUM after a
crash? I don't really see any robustness advantages in it.
I don't know. I am just normally reluctant to change things
precipitously that are of long tenure.
Me too, but I think Andres has a point here. Those warnings in VACUUM
are ancient, probably predating the introduction of WAL :-(. At the
time there was good reason to be suspicious of zeroed pages in tables.
Now, though, we have (what we think is) a bulletproof crash recovery
procedure in which possibly-zeroed pages are to be expected; so we're
just causing users unnecessary alarm by warning about them. I think
it's reasonable to, if not remove the messages entirely, at least
downgrade them to a low DEBUG level.
Yea, I think that'd be reasonable.
I think we ought to add them, as new/zero pages, to the FSM. If we did
so both during vacuum and RelationAddExtraBlocks() we'd avoid the
redundant writes, and avoid depending on heap layout in
RelationAddExtraBlocks().
I wonder if it'd make sense to only log a DEBUG if there's no
corresponding FSM entry. That'd obviously not be bulletproof, but it'd
reduce the spamminess considerably.
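On the RelationAddExtraBlocks() side, what I have in mind is roughly the
following (illustrative sketch only, using the existing local variable
names; not an actual patch):

    /*
     * Don't PageInit()/MarkBufferDirty() the new page; just record its
     * eventual usable space in the FSM. An all-zeroes page's usable
     * space is the whole page minus the header it'll get when used.
     */
    blockNum = BufferGetBlockNumber(buffer);
    freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;

    UnlockReleaseBuffer(buffer);

    /* same call the current code uses to publish the free space */
    RecordPageWithFreeSpace(relation, blockNum, freespace);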
Greetings,
Andres Freund
Hi,
On 2018-12-19 16:56:36 -0800, Andres Freund wrote:
On 2018-12-19 19:39:33 -0500, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Dec 19, 2018 at 5:37 PM Andres Freund <andres@anarazel.de> wrote:
What's gained by the logic of emitting that warning in VACUUM after a
crash? I don't really see any robustness advantages in it.
I don't know. I am just normally reluctant to change things
precipitously that are of long tenure.
Me too, but I think Andres has a point here. Those warnings in VACUUM
are ancient, probably predating the introduction of WAL :-(. At the
time there was good reason to be suspicious of zeroed pages in tables.
Now, though, we have (what we think is) a bulletproof crash recovery
procedure in which possibly-zeroed pages are to be expected; so we're
just causing users unnecessary alarm by warning about them. I think
it's reasonable to, if not remove the messages entirely, at least
downgrade them to a low DEBUG level.
Yea, I think that'd be reasonable.
I think we ought to add them, as new/zero pages, to the FSM. If we did
so both during vacuum and RelationAddExtraBlocks() we'd avoid the
redundant writes, and avoid depending on heap layout in
RelationAddExtraBlocks().
I wonder if it'd make sense to only log a DEBUG if there's no
corresponding FSM entry. That'd obviously not be bulletproof, but it'd
reduce the spamminess considerably.
Here's a prototype patch implementing this change. I'm not sure the new
DEBUG message is really worth it?
Looking at the surrounding code made me wonder about the wisdom of
entering empty pages as all-visible and all-frozen into the VM. That'll
mean we'll never re-discover them on a primary, after promotion. There's
no mechanism to add such pages to the FSM on a standby (in contrast to
e.g. pages where tuples are modified), as vacuum will never visit that
page again. Now obviously it has the advantage of avoiding
re-processing the page in the next vacuum, but is that really an
important goal? If there are frequent vacuums, there's got to be a fair
amount of modifications, which ought to lead to re-use of such pages
before too long.
Greetings,
Andres Freund
Attachments:
new-blocks-as-new.diff (text/x-diff; charset=us-ascii)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index b8b5871559b..aa0bc4d8f9c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -216,18 +216,15 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
BufferGetBlockNumber(buffer),
RelationGetRelationName(relation));
- PageInit(page, BufferGetPageSize(buffer), 0);
-
/*
- * We mark all the new buffers dirty, but do nothing to write them
- * out; they'll probably get used soon, and even if they are not, a
- * crash will leave an okay all-zeroes page on disk.
+ * Add the page to the FSM without initializing. If we were to
+ * initialize here the page would potentially get flushed out to disk
+ * before we add any useful content. There's no guarantee that that'd
+ * happen however, so we need to deal with unitialized pages anyway,
+ * thus let's avoid the potential for unnecessary writes.
*/
- MarkBufferDirty(buffer);
-
- /* we'll need this info below */
blockNum = BufferGetBlockNumber(buffer);
- freespace = PageGetHeapFreeSpace(page);
+ freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
UnlockReleaseBuffer(buffer);
@@ -479,6 +476,18 @@ loop:
* we're done.
*/
page = BufferGetPage(buffer);
+
+ /*
+ * Initialize page, it'll be used soon. We could avoid dirtying the
+ * buffer here, and rely on the caller to do so whenever it puts a
+ * tuple onto the page, but there seems not much benefit in doing so.
+ */
+ if (PageIsNew(page))
+ {
+ PageInit(page, BufferGetPageSize(buffer), 0);
+ MarkBufferDirty(buffer);
+ }
+
pageFreeSpace = PageGetHeapFreeSpace(page);
if (len + saveFreeSpace <= pageFreeSpace)
{
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 8134c52253e..2d8d3569372 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -862,42 +862,42 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
if (PageIsNew(page))
{
/*
- * An all-zeroes page could be left over if a backend extends the
- * relation but crashes before initializing the page. Reclaim such
- * pages for use.
+ * All-zeroes pages can be left over if either a backend extends
+ * the relation by a single page, but crashes before initializing
+ * the page, or when bulk-extending the relation (which creates a
+ * number of empty pages at the tail end of the relation, but
+ * enters them into the FSM).
*
- * We have to be careful here because we could be looking at a
- * page that someone has just added to the relation and not yet
- * been able to initialize (see RelationGetBufferForTuple). To
- * protect against that, release the buffer lock, grab the
- * relation extension lock momentarily, and re-lock the buffer. If
- * the page is still uninitialized by then, it must be left over
- * from a crashed backend, and we can initialize it.
+ * Make sure these pages are in the FSM, to ensure they can be
+ * reused. Do that by testing if there's any space recorded for
+ * the page. If not, enter it. We log that case on a debug level.
*
- * We don't really need the relation lock when this is a new or
- * temp relation, but it's probably not worth the code space to
- * check that, since this surely isn't a critical path.
- *
- * Note: the comparable code in vacuum.c need not worry because
- * it's got exclusive lock on the whole relation.
+ * Note we do not enter the page into the visibilitymap. That has
+ * the downside that we repeatedly visit this page in subsequent
+ * vacuums, but otherwise we'll never not discover the space on a
+ * promoted standby. The harm of repeated checking ought to
+ * normally not be too bad - the space usually should be used at
+ * some point, otherwise there wouldn't be any regular vacuums.
+ */
+ Size freespace = 0;
+
+ empty_pages++;
+
+ /*
+ * Perform checking of FSM after releasing lock, the fsm is
+ * approximate, after all.
*/
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
- LockBufferForCleanup(buf);
- if (PageIsNew(page))
- {
- ereport(WARNING,
- (errmsg("relation \"%s\" page %u is uninitialized --- fixing",
- relname, blkno)));
- PageInit(page, BufferGetPageSize(buf), 0);
- empty_pages++;
- }
- freespace = PageGetHeapFreeSpace(page);
- MarkBufferDirty(buf);
UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(onerel, blkno, freespace);
+ if (GetRecordedFreeSpace(onerel, blkno) == 0)
+ freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+ if (freespace > 0)
+ {
+ RecordPageWithFreeSpace(onerel, blkno, freespace);
+ elog(DEBUG1, "relation \"%s\" page %u is uninitialized and not in fsm, fixing",
+ relname, blkno);
+ }
continue;
}
@@ -2030,7 +2030,6 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
if (PageIsNew(page) || PageIsEmpty(page))
{
- /* PageIsNew probably shouldn't happen... */
UnlockReleaseBuffer(buf);
continue;
}
Hi,
On 2018-12-20 15:04:11 -0800, Andres Freund wrote:
On 2018-12-19 16:56:36 -0800, Andres Freund wrote:
On 2018-12-19 19:39:33 -0500, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Wed, Dec 19, 2018 at 5:37 PM Andres Freund <andres@anarazel.de> wrote:
What's gained by the logic of emitting that warning in VACUUM after a
crash? I don't really see any robustness advantages in it.
I don't know. I am just normally reluctant to change things
precipitously that are of long tenure.
Me too, but I think Andres has a point here. Those warnings in VACUUM
are ancient, probably predating the introduction of WAL :-(. At the
time there was good reason to be suspicious of zeroed pages in tables.
Now, though, we have (what we think is) a bulletproof crash recovery
procedure in which possibly-zeroed pages are to be expected; so we're
just causing users unnecessary alarm by warning about them. I think
it's reasonable to, if not remove the messages entirely, at least
downgrade them to a low DEBUG level.
Yea, I think that'd be reasonable.
I think we ought to add them, as new/zero pages, to the FSM. If we did
so both during vacuum and RelationAddExtraBlocks() we'd avoid the
redundant writes, and avoid depending on heap layout in
RelationAddExtraBlocks().
I wonder if it'd make sense to only log a DEBUG if there's no
corresponding FSM entry. That'd obviously not be bulletproof, but it'd
reduce the spamminess considerably.
Here's a prototype patch implementing this change. I'm not sure the new
DEBUG message is really worth it?
Looking at the surrounding code made me wonder about the wisdom of
entering empty pages as all-visible and all-frozen into the VM. That'll
mean we'll never re-discover them on a primary, after promotion. There's
no mechanism to add such pages to the FSM on a standby (in contrast to
e.g. pages where tuples are modified), as vacuum will never visit that
page again. Now obviously it has the advantage of avoiding
re-processing the page in the next vacuum, but is that really an
important goal? If there are frequent vacuums, there's got to be a fair
amount of modifications, which ought to lead to re-use of such pages
before too long.
Any comments on the approach in this patch?
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
Looking at the surrounding code made me wonder about the wisdom of
entering empty pages as all-visible and all-frozen into the VM. That'll
mean we'll never re-discover them on a primary, after promotion. There's
no mechanism to add such pages to the FSM on a standby (in contrast to
e.g. pages where tuples are modified), as vacuum will never visit that
page again. Now obviously it has the advantage of avoiding
re-processing the page in the next vacuum, but is that really an
important goal? If there are frequent vacuums, there's got to be a fair
amount of modifications, which ought to lead to re-use of such pages
before too long.
Any comments on the approach in this patch?
I agree with the concept of postponing page init till we're actually
going to do something with the page. However, the patch seems to need
some polish:
* There's a comment in RelationAddExtraBlocks, just above where you
changed, that
* Extend by one page. This should generally match the main-line
* extension code in RelationGetBufferForTuple, except that we hold
* the relation extension lock throughout.
This seems to be falsified by this patch, in that one of the two paths
does PageInit and the other doesn't.
* s/unitialized pages/uninitialized pages/
* This bit in vacuumlazy seems unnecessarily confusing:
+ Size freespace = 0;
...
+ if (GetRecordedFreeSpace(onerel, blkno) == 0)
+ freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+ if (freespace > 0)
+ {
+ RecordPageWithFreeSpace(onerel, blkno, freespace);
I'd write that as just
+ if (GetRecordedFreeSpace(onerel, blkno) == 0)
+ {
+ Size freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+ RecordPageWithFreeSpace(onerel, blkno, freespace);
I tend to agree that the DEBUG message isn't very necessary, or at least
could be lower than DEBUG1.
regards, tom lane
Hi,
On 2019-01-15 17:35:21 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Looking at the surrounding code made me wonder about the wisdom of
entering empty pages as all-visible and all-frozen into the VM. That'll
mean we'll never re-discover them on a primary, after promotion. There's
no mechanism to add such pages to the FSM on a standby (in contrast to
e.g. pages where tuples are modified), as vacuum will never visit that
page again. Now obviously it has the advantage of avoiding
re-processing the page in the next vacuum, but is that really an
important goal? If there are frequent vacuums, there's got to be a fair
amount of modifications, which ought to lead to re-use of such pages
before too long.
Any comments on the approach in this patch?
I agree with the concept of postponing page init till we're actually
going to do something with the page. However, the patch seems to need
some polish:
Yea, as I'd written in an earlier message, it was really meant as a
prototype to see whether there's buy-in to the change.
* There's a comment in RelationAddExtraBlocks, just above where you
changed, that
* Extend by one page. This should generally match the main-line
* extension code in RelationGetBufferForTuple, except that we hold
* the relation extension lock throughout.
This seems to be falsified by this patch, in that one of the two paths
does PageInit and the other doesn't.
* s/unitialized pages/uninitialized pages/
* This bit in vacuumlazy seems unnecessarily confusing:
+ Size freespace = 0;
...
+ if (GetRecordedFreeSpace(onerel, blkno) == 0)
+ freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+
+ if (freespace > 0)
+ {
+ RecordPageWithFreeSpace(onerel, blkno, freespace);
I'd write that as just
+ if (GetRecordedFreeSpace(onerel, blkno) == 0)
+ {
+ Size freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+ RecordPageWithFreeSpace(onerel, blkno, freespace);
Fixed, thanks for looking!
I tend to agree that the DEBUG message isn't very necessary, or at least
could be lower than DEBUG1.
After a bit of back and forth waffling, I've removed it now.
Besides a fair number of comment changes, the latest version has just
one functional change: lazy_check_needs_freeze() no longer indicates
that new pages require freezing. If we can't get a cleanup lock on such
a page, it's about to be written to, and we'd not do more than enter it
into the FSM anyway.
Greetings,
Andres Freund
Hi,
So, I'd pushed the latest version. And longfin came back with an
interesting error:
ERROR: page 135 of relation "pg_class" should be empty but is not
The only way I can currently imagine this happening is that there's a
concurrent vacuum that discovers the page is empty, enters it into the
FSM (which now isn't happening under an extension lock anymore), and
then a separate backend starts to actually use that buffer. That seems
tight but possible. Seems we need to re-add the
LockRelationForExtension/UnlockRelationForExtension() logic :(
Greetings,
Andres Freund
Hi,
On 2019-01-28 14:10:55 -0800, Andres Freund wrote:
So, I'd pushed the latest version. And longfin came back with an
interesting error:
ERROR: page 135 of relation "pg_class" should be empty but is not
The only way I can currently imagine this happening is that there's a
concurrent vacuum that discovers the page is empty, enters it into the
FSM (which now isn't happening under an extension lock anymore), and
then a separate backend starts to actually use that buffer. That seems
tight but possible. Seems we need to re-add the
LockRelationForExtension/UnlockRelationForExtension() logic :(
Hm, but thinking about this, isn't this a risk independent of this
change? The FSM isn't WAL-logged, and given its tree-ish structure it's
possible to "rediscover" free space (e.g. FreeSpaceMapVacuum()). ISTM
that after a crash the FSM might point to free space that doesn't
actually exist, and is rediscovered after independent changes. Not sure
if that's actually a realistic issue.
I'm inclined to put back the
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockRelationForExtension(onerel, ExclusiveLock);
UnlockRelationForExtension(onerel, ExclusiveLock);
LockBufferForCleanup(buf);
if (PageIsNew(page))
dance regardless, just to get the buildfarm to green?
But I do wonder if we should just make hio.c cope with this instead.
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I'm inclined to put back the
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockRelationForExtension(onerel, ExclusiveLock);
UnlockRelationForExtension(onerel, ExclusiveLock);
LockBufferForCleanup(buf);
if (PageIsNew(page))
dance regardless, just to get the buildfarm to green?
The buildfarm's got half a dozen reports now of a failure of this ilk,
so you'd better do something.
But I do wonder if we should just make hio.c cope with this instead.
Probably should not try to go that way under time pressure.
regards, tom lane
Hi,
On 2019-01-28 18:08:59 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I'm inclined to put back the
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockRelationForExtension(onerel, ExclusiveLock);
UnlockRelationForExtension(onerel, ExclusiveLock);
LockBufferForCleanup(buf);
if (PageIsNew(page))
dance regardless, just to get the buildfarm to green?
The buildfarm's got half a dozen reports now of a failure of this ilk,
so you'd better do something.
Yea, I was working on a patch. Was trying to come up with an explanation
of how this can be realistically hit on the BF, but failed. I've pushed
something now, let's see whether that fixes it.
But I do wonder if we should just make hio.c cope with this instead.
Probably should not try to go that way under time pressure.
Definitely.
Greetings,
Andres Freund
Hi,
On 2019-01-28 15:49:33 -0800, Andres Freund wrote:
On 2019-01-28 18:08:59 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I'm inclined to put back the
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockRelationForExtension(onerel, ExclusiveLock);
UnlockRelationForExtension(onerel, ExclusiveLock);
LockBufferForCleanup(buf);
if (PageIsNew(page))
dance regardless, just to get the buildfarm to green?
The buildfarm's got half a dozen reports now of a failure of this ilk,
so you'd better do something.
Yea, I was working on a patch. Was trying to come up with an explanation
of how this can be realistically hit on the BF, but failed. I've pushed
something now, let's see whether that fixes it.
It has not. Given that I don't understand what's happening here I'm
going to revert both commits unless I figure it out in the next ~30min.
Greetings,
Andres Freund
Hi,
On 2019-01-28 16:40:36 -0800, Andres Freund wrote:
On 2019-01-28 15:49:33 -0800, Andres Freund wrote:
On 2019-01-28 18:08:59 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I'm inclined to put back the
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
LockRelationForExtension(onerel, ExclusiveLock);
UnlockRelationForExtension(onerel, ExclusiveLock);
LockBufferForCleanup(buf);
if (PageIsNew(page))
dance regardless, just to get the buildfarm to green?
The buildfarm's got half a dozen reports now of a failure of this ilk,
so you'd better do something.
Yea, I was working on a patch. Was trying to come up with an explanation
of how this can be realistically hit on the BF, but failed. I've pushed
something now, let's see whether that fixes it.
It has not. Given that I don't understand what's happening here I'm
going to revert both commits unless I figure it out in the next ~30min.
I did that now. I couldn't reproduce it locally, despite a lot of
runs. Looking at the buildfarm it looks like the failures were,
excluding handfish which failed without recognizable symptoms before and
after, on BSD-derived platforms (netbsd, freebsd, OS X), which certainly
is interesting. I asked Thomas Munro whether he could run on freebsd,
and he gave me a login, where I just now reproduced the issue (2 of 5
make check runs failing).
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
I did that now. I couldn't reproduce it locally, despite a lot of
runs. Looking at the buildfarm it looks like the failures were,
excluding handfish which failed without recognizable symptoms before and
after, on BSD-derived platforms (netbsd, freebsd, OS X), which certainly
is interesting.
Isn't it now. Something about the BSD scheduler perhaps? But we've
got four or five different BSD-ish platforms that reported failures,
and it's hard to believe they've all got identical schedulers.
That second handfish failure does match the symptoms elsewhere:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=handfish&dt=2019-01-29%2000%3A20%3A22
--- /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-thread.stderr 2018-10-30 20:11:45.551967381 -0300
+++ /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-thread.stderr 2019-01-28 22:38:20.614211568 -0200
@@ -0,0 +1,20 @@
+SQL error: page 0 of relation "test_thread" should be empty but is not on line 125
so it's not quite 100% BSD, but certainly the failure rate on BSD is
way higher than elsewhere. Puzzling.
regards, tom lane
On 2019-01-28 22:37:53 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I did that now. I couldn't reproduce it locally, despite a lot of
runs. Looking at the buildfarm it looks like the failures were,
excluding handfish which failed without recognizable symptoms before and
after, on BSD-derived platforms (netbsd, freebsd, OS X), which certainly
is interesting.
Isn't it now. Something about the BSD scheduler perhaps? But we've
got four or five different BSD-ish platforms that reported failures,
and it's hard to believe they've all got identical schedulers.
That second handfish failure does match the symptoms elsewhere:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=handfish&dt=2019-01-29%2000%3A20%3A22
--- /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-thread.stderr 2018-10-30 20:11:45.551967381 -0300
+++ /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-thread.stderr 2019-01-28 22:38:20.614211568 -0200
@@ -0,0 +1,20 @@
+SQL error: page 0 of relation "test_thread" should be empty but is not on line 125
so it's not quite 100% BSD, but certainly the failure rate on BSD is
way higher than elsewhere. Puzzling.
Interesting.
While chatting with Robert about this issue I came across the following
section of code:
/*
* If the FSM knows nothing of the rel, try the last page before we
* give up and extend. This avoids one-tuple-per-page syndrome during
* bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
if (nblocks > 0)
targetBlock = nblocks - 1;
}
I think that explains the issue (albeit not why it is much more frequent
on BSDs). Because we're not going through the FSM, it's perfectly
possible to find a page that is uninitialized, *and* is not yet in the
FSM. The only reason this wasn't previously actively broken, I think, is
that while we previously *also* looked at that page (before the extending
backend acquired a lock!), when looking at the page
PageGetHeapFreeSpace(), via PageGetFreeSpace(), decides there's no free
space because it just interprets the zeroes in pd_upper - pd_lower as no
free space.
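For reference, PageGetFreeSpace() is roughly the following (bufpage.c); on
an all-zeroes page pd_upper == pd_lower == 0, so it reports no free space:

    Size
    PageGetFreeSpace(Page page)
    {
        int     space;

        /* all-zero header: pd_upper - pd_lower is 0, so we return 0 */
        space = (int) ((PageHeader) page)->pd_upper -
                (int) ((PageHeader) page)->pd_lower;

        if (space < (int) sizeof(ItemIdData))
            return 0;

        return (Size) (space - sizeof(ItemIdData));
    }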
Hm, thinking about what a good solution here could be.
Greetings,
Andres Freund
On 2019-01-29 11:25:41 -0800, Andres Freund wrote:
On 2019-01-28 22:37:53 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I did that now. I couldn't reproduce it locally, despite a lot of
runs. Looking at the buildfarm it looks like the failures were,
excluding handfish which failed without recognizable symptoms before and
after, on BSD derived platforms (netbsd, freebsd, OX), which certainly
is interesting.
Isn't it now. Something about the BSD scheduler perhaps? But we've
got four or five different BSD-ish platforms that reported failures,
and it's hard to believe they've all got identical schedulers.
That second handfish failure does match the symptoms elsewhere:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=handfish&dt=2019-01-29%2000%3A20%3A22
--- /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-thread.stderr 2018-10-30 20:11:45.551967381 -0300
+++ /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-thread.stderr 2019-01-28 22:38:20.614211568 -0200
@@ -0,0 +1,20 @@
+SQL error: page 0 of relation "test_thread" should be empty but is not on line 125
so it's not quite 100% BSD, but certainly the failure rate on BSD is
way higher than elsewhere. Puzzling.
Interesting.
While chatting with Robert about this issue I came across the following
section of code:
/*
* If the FSM knows nothing of the rel, try the last page before we
* give up and extend. This avoids one-tuple-per-page syndrome during
* bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
if (nblocks > 0)
targetBlock = nblocks - 1;
}
I think that explains the issue (albeit not why it is much more frequent
on BSDs). Because we're not going through the FSM, it's perfectly
possible to find a page that is uninitialized, *and* is not yet in the
FSM. The only reason this wasn't previously actively broken, I think, is
that while we previously *also* looked at that page (before the extending
backend acquired a lock!), when looking at the page
PageGetHeapFreeSpace(), via PageGetFreeSpace(), decides there's no free
space because it just interprets the zeroes in pd_upper - pd_lower as no
free space.
Hm, thinking about what a good solution here could be.
I wonder if we should just expand the logic we have for
RBM_ZERO_AND_LOCK so it can be used in hio.c (we probably could
just use it without any changes, but the name seems a bit confusing) -
because that'd prevent the current weirdness that it's possible that the
buffer can be locked by somebody between the ReadBufferBI(P_NEW) and
the LockBuffer(BUFFER_LOCK_EXCLUSIVE). I think that'd allow us to
altogether drop the cleanup lock logic we currently have, and also
protect us against the FSM issue I'd outlined upthread?
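Concretely, the extension path in hio.c would then look roughly like the
below - an illustrative sketch only, assuming ReadBufferBI() is taught to
accept a ReadBufferMode argument:

    /*
     * Extend and lock the new page in one step, so no other backend can
     * lock it between the P_NEW read and the LockBuffer() call, then
     * initialize it while still holding the lock.
     */
    buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
    page = BufferGetPage(buffer);

    if (!PageIsNew(page))
        elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
             BufferGetBlockNumber(buffer),
             RelationGetRelationName(relation));

    PageInit(page, BufferGetPageSize(buffer), 0);
    MarkBufferDirty(buffer);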
Greetings,
Andres Freund
On 2019-01-29 11:25:41 -0800, Andres Freund wrote:
While chatting with Robert about this issue I came across the following
section of code:
/*
* If the FSM knows nothing of the rel, try the last page before we
* give up and extend. This avoids one-tuple-per-page syndrome during
* bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
if (nblocks > 0)
targetBlock = nblocks - 1;
}
I think that explains the issue (albeit not why it is much more frequent
on BSDs). Because we're not going through the FSM, it's perfectly
possible to find a page that is uninitialized, *and* is not yet in the
FSM. The only reason this wasn't previously actively broken, I think, is
that while we previously *also* looked at that page (before the extending
backend acquired a lock!), when looking at the page
PageGetHeapFreeSpace(), via PageGetFreeSpace(), decides there's no free
space because it just interprets the zeroes in pd_upper - pd_lower as no
free space.
FWIW, after commenting out that block and adapting a few regression
tests to changed plans, I could not reproduce the issue on a FreeBSD
machine in 31 runs, where it previously triggered in roughly 1/3 cases.
Still don't quite understand why so much more likely on BSD...
Greetings,
Andres Freund
Hi,
On 2019-01-29 12:23:51 -0800, Andres Freund wrote:
On 2019-01-29 11:25:41 -0800, Andres Freund wrote:
On 2019-01-28 22:37:53 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
I did that now. I couldn't reproduce it locally, despite a lot of
runs. Looking at the buildfarm it looks like the failures were,
excluding handfish which failed without recognizable symptoms before and
after, on BSD-derived platforms (netbsd, freebsd, OS X), which certainly
is interesting.
Isn't it now. Something about the BSD scheduler perhaps? But we've
got four or five different BSD-ish platforms that reported failures,
and it's hard to believe they've all got identical schedulers.
That second handfish failure does match the symptoms elsewhere:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=handfish&dt=2019-01-29%2000%3A20%3A22
--- /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-thread.stderr 2018-10-30 20:11:45.551967381 -0300
+++ /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-thread.stderr 2019-01-28 22:38:20.614211568 -0200
@@ -0,0 +1,20 @@
+SQL error: page 0 of relation "test_thread" should be empty but is not on line 125
so it's not quite 100% BSD, but certainly the failure rate on BSD is
way higher than elsewhere. Puzzling.
Interesting.
While chatting with Robert about this issue I came across the following
section of code:
/*
* If the FSM knows nothing of the rel, try the last page before we
* give up and extend. This avoids one-tuple-per-page syndrome during
* bootstrapping or in a recently-started system.
*/
if (targetBlock == InvalidBlockNumber)
{
BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
if (nblocks > 0)
targetBlock = nblocks - 1;
}
I think that explains the issue (albeit not why it is much more frequent
on BSDs). Because we're not going through the FSM, it's perfectly
possible to find a page that is uninitialized, *and* is not yet in the
FSM. The only reason this wasn't previously actively broken, I think, is
that while we previously *also* looked at that page (before the extending
backend acquired a lock!), when looking at the page
PageGetHeapFreeSpace(), via PageGetFreeSpace(), decides there's no free
space because it just interprets the zeroes in pd_upper - pd_lower as no
free space.
Hm, thinking about what a good solution here could be.
I wonder if we should just expand the logic we have for
RBM_ZERO_AND_LOCK so it can be used in hio.c (we probably could
just use it without any changes, but the name seems a bit confusing) -
because that'd prevent the current weirdness that it's possible that the
buffer can be locked by somebody between the ReadBufferBI(P_NEW) and
the LockBuffer(BUFFER_LOCK_EXCLUSIVE). I think that'd allow us to
altogether drop the cleanup lock logic we currently have, and also
protect us against the FSM issue I'd outlined upthread?
Here's a version of the patch implementing this approach. I assume this
solves the FreeBSD issue, but I'm running tests in a loop on Thomas'
machine.
I did not rename RBM_ZERO_AND_LOCK. New buffers are zeroed too, so that
still seems apt enough.
Greetings,
Andres Freund
Attachments:
v3-0001-Move-page-initialization-from-RelationAddExtraBlo.patch (text/x-diff; charset=us-ascii)
From 2f211ba82bc3a9c0935bfd0cd7fb58355134b929 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 1 Feb 2019 07:08:10 -0800
Subject: [PATCH v3] Move page initialization from RelationAddExtraBlocks() to
use, take 2.
Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: It ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space, and have a different amount
of free space (previously determined by PageGetHeapFreeSpace()).
That we're relying on initializing pages, but not WAL logging the
initialization, also means the risk for getting
"WARNING: relation \"%s\" page %u is uninitialized --- fixing"
style warnings in vacuums after crashes/immediate shutdowns is
considerably higher. The warnings sound much more serious than they
are.
Fix those two issues together by not initializing pages in
RelationAddExtraBlocks() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it
to the FSM, but does nothing else. We chose to not issue a debug
message, much less a warning in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.
For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
for empty pages, there's no corresponding WAL records triggering FSM
updates during replay.
Previously when extending the relation, there was a moment between
extending the relation, and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the fsm; that only worked
because PageGetHeapFreeSpace() interprets the zero page header in a
new page as no free space. The lack of handling this properly
required reverting the previous attempt in 684200543b.
This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks.
Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
---
src/backend/access/heap/hio.c | 120 ++++++++++++++++++---------
src/backend/access/heap/vacuumlazy.c | 84 ++++++++++---------
2 files changed, 127 insertions(+), 77 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 3da0b49ccc4..5a108b7fe66 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -74,23 +74,31 @@ RelationPutHeapTuple(Relation relation,
}
/*
- * Read in a buffer, using bulk-insert strategy if bistate isn't NULL.
+ * Read in a buffer in mode, using bulk-insert strategy if bistate isn't NULL.
*/
static Buffer
ReadBufferBI(Relation relation, BlockNumber targetBlock,
- BulkInsertState bistate)
+ ReadBufferMode mode, BulkInsertState bistate)
{
Buffer buffer;
/* If not bulk-insert, exactly like ReadBuffer */
if (!bistate)
- return ReadBuffer(relation, targetBlock);
+ return ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
+ mode, NULL);
/* If we have the desired block already pinned, re-pin and return it */
if (bistate->current_buf != InvalidBuffer)
{
if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
{
+ /*
+ * Currently the LOCK variants are only used for extending
+ * relation, which should never reach this branch.
+ */
+ Assert(mode != RBM_ZERO_AND_LOCK &&
+ mode != RBM_ZERO_AND_CLEANUP_LOCK);
+
IncrBufferRefCount(bistate->current_buf);
return bistate->current_buf;
}
@@ -101,7 +109,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
/* Perform a read using the buffer strategy */
buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
- RBM_NORMAL, bistate->strategy);
+ mode, bistate->strategy);
/* Save the selected block as target for future inserts */
IncrBufferRefCount(buffer);
@@ -204,11 +212,10 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
/*
* Extend by one page. This should generally match the main-line
* extension code in RelationGetBufferForTuple, except that we hold
- * the relation extension lock throughout.
+ * the relation extension lock throughout, and we don't immediately
+ * initialize the page (see below).
*/
- buffer = ReadBufferBI(relation, P_NEW, bistate);
-
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
page = BufferGetPage(buffer);
if (!PageIsNew(page))
@@ -216,18 +223,18 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
BufferGetBlockNumber(buffer),
RelationGetRelationName(relation));
- PageInit(page, BufferGetPageSize(buffer), 0);
-
/*
- * We mark all the new buffers dirty, but do nothing to write them
- * out; they'll probably get used soon, and even if they are not, a
- * crash will leave an okay all-zeroes page on disk.
+ * Add the page to the FSM without initializing. If we were to
+ * initialize here the page would potentially get flushed out to disk
+ * before we add any useful content. There's no guarantee that that'd
+ * happen before a potential crash, so we need to deal with
+ * uninitialized pages anyway, thus avoid the potential for
+ * unnecessary writes.
*/
- MarkBufferDirty(buffer);
/* we'll need this info below */
blockNum = BufferGetBlockNumber(buffer);
- freespace = PageGetHeapFreeSpace(page);
+ freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
UnlockReleaseBuffer(buffer);
@@ -412,7 +419,7 @@ loop:
if (otherBuffer == InvalidBuffer)
{
/* easy case */
- buffer = ReadBufferBI(relation, targetBlock, bistate);
+ buffer = ReadBufferBI(relation, targetBlock, RBM_NORMAL, bistate);
if (PageIsAllVisible(BufferGetPage(buffer)))
visibilitymap_pin(relation, targetBlock, vmbuffer);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -479,6 +486,18 @@ loop:
* we're done.
*/
page = BufferGetPage(buffer);
+
+ /*
+ * Initialize page, it'll be used soon. We could avoid dirtying the
+ * buffer here, and rely on the caller to do so whenever it puts a
+ * tuple onto the page, but there seems not much benefit in doing so.
+ */
+ if (PageIsNew(page))
+ {
+ PageInit(page, BufferGetPageSize(buffer), 0);
+ MarkBufferDirty(buffer);
+ }
+
pageFreeSpace = PageGetHeapFreeSpace(page);
if (len + saveFreeSpace <= pageFreeSpace)
{
@@ -571,28 +590,7 @@ loop:
* it worth keeping an accurate file length in shared memory someplace,
* rather than relying on the kernel to do it for us?
*/
- buffer = ReadBufferBI(relation, P_NEW, bistate);
-
- /*
- * We can be certain that locking the otherBuffer first is OK, since it
- * must have a lower page number.
- */
- if (otherBuffer != InvalidBuffer)
- LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
-
- /*
- * Now acquire lock on the new page.
- */
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more. Note that we cannot release this lock before
- * we have buffer lock on the new page, or we risk a race condition
- * against vacuumlazy.c --- see comments therein.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
/*
* We need to initialize the empty new page. Double-check that it really
@@ -607,6 +605,52 @@ loop:
RelationGetRelationName(relation));
PageInit(page, BufferGetPageSize(buffer), 0);
+ MarkBufferDirty(buffer);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ */
+ if (needLock)
+ UnlockRelationForExtension(relation, ExclusiveLock);
+
+ /*
+ * Lock the other buffer. It's guaranteed to be of a lower page number
+ * than the new page. To conform with the deadlock prevent rules, we ought
+ * to lock otherBuffer first, but that would give other backends a chance
+ * to put tuples on our page. To reduce the likelihood of that, attempt to
+ * lock the other buffer conditionally, that's very likely to work.
+ * Otherwise we need to lock buffers in the correct order, and retry if
+ * the space has been used in the mean time.
+ *
+ * Alternatively, we could acquire the lock on otherBuffer before
+ * extending the relation, but that'd require holding the lock while
+ * performing IO, which seems worse than an unlikely retry.
+ */
+ if (otherBuffer != InvalidBuffer)
+ {
+ Assert(otherBuffer != buffer);
+
+ if (unlikely(!ConditionalLockBuffer(otherBuffer)))
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+ /*
+ * Because the buffer was unlocked for a while, it's possible,
+ * although unlikely, that the page was filled. If so, just retry
+ * from start.
+ */
+ if (len > PageGetHeapFreeSpace(page))
+ {
+ LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+ UnlockReleaseBuffer(buffer);
+
+ goto loop;
+ }
+ }
+ }
if (len > PageGetHeapFreeSpace(page))
{
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 37aa484ec3a..26dfb0c7e0f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -860,43 +860,46 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
if (PageIsNew(page))
{
+ bool still_new;
+
/*
- * An all-zeroes page could be left over if a backend extends the
- * relation but crashes before initializing the page. Reclaim such
- * pages for use.
+ * All-zeroes pages can be left over if either a backend extends
+ * the relation by a single page, but crashes before the newly
+ * initialized page has been written out, or when bulk-extending
+ * the relation (which creates a number of empty pages at the tail
+ * end of the relation, but enters them into the FSM).
*
- * We have to be careful here because we could be looking at a
- * page that someone has just added to the relation and not yet
- * been able to initialize (see RelationGetBufferForTuple). To
- * protect against that, release the buffer lock, grab the
- * relation extension lock momentarily, and re-lock the buffer. If
- * the page is still uninitialized by then, it must be left over
- * from a crashed backend, and we can initialize it.
+ * Make sure these pages are in the FSM, to ensure they can be
+ * reused. Do that by testing if there's any space recorded for
+ * the page. If not, enter it.
*
- * We don't really need the relation lock when this is a new or
- * temp relation, but it's probably not worth the code space to
- * check that, since this surely isn't a critical path.
- *
- * Note: the comparable code in vacuum.c need not worry because
- * it's got exclusive lock on the whole relation.
+ * Note we do not enter the page into the visibilitymap. That has
+ * the downside that we repeatedly visit this page in subsequent
+ * vacuums, but otherwise we'll never not discover the space on a
+ * promoted standby. The harm of repeated checking ought to
+ * normally not be too bad - the space usually should be used at
+ * some point, otherwise there wouldn't be any regular vacuums.
*/
- LockBuffer(buf, BUFFER_LOCK_UNLOCK);
- LockRelationForExtension(onerel, ExclusiveLock);
- UnlockRelationForExtension(onerel, ExclusiveLock);
- LockBufferForCleanup(buf);
- if (PageIsNew(page))
- {
- ereport(WARNING,
- (errmsg("relation \"%s\" page %u is uninitialized --- fixing",
- relname, blkno)));
- PageInit(page, BufferGetPageSize(buf), 0);
- empty_pages++;
- }
- freespace = PageGetHeapFreeSpace(page);
- MarkBufferDirty(buf);
+
+ /*
+ * Perform checking of FSM after releasing lock, the fsm is
+ * approximate, after all.
+ */
+ still_new = PageIsNew(page);
UnlockReleaseBuffer(buf);
- RecordPageWithFreeSpace(onerel, blkno, freespace);
+ if (still_new)
+ {
+ empty_pages++;
+
+ if (GetRecordedFreeSpace(onerel, blkno) == 0)
+ {
+ Size freespace;
+
+ freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+ RecordPageWithFreeSpace(onerel, blkno, freespace);
+ }
+ }
continue;
}
@@ -905,7 +908,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
empty_pages++;
freespace = PageGetHeapFreeSpace(page);
- /* empty pages are always all-visible and all-frozen */
+ /*
+ * Empty pages are always all-visible and all-frozen (note that
+ * the same is currently not true for new pages, see above).
+ */
if (!PageIsAllVisible(page))
{
START_CRIT_SECTION();
@@ -1639,12 +1645,13 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
*hastup = false;
- /* If we hit an uninitialized page, we want to force vacuuming it. */
- if (PageIsNew(page))
- return true;
-
- /* Quick out for ordinary empty page. */
- if (PageIsEmpty(page))
+ /*
+ * New and empty pages, obviously, don't contain tuples. We could make
+ * sure that the page is registered in the FSM, but it doesn't seem worth
+ * waiting for a cleanup lock just for that, especially because it's
+ * likely that the pin holder will do so.
+ */
+ if (PageIsNew(page) || PageIsEmpty(page))
return false;
maxoff = PageGetMaxOffsetNumber(page);
@@ -2029,7 +2036,6 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
if (PageIsNew(page) || PageIsEmpty(page))
{
- /* PageIsNew probably shouldn't happen... */
UnlockReleaseBuffer(buf);
continue;
}
--
2.18.0.rc2.dirty
Hi,
On 2019-02-01 07:14:11 -0800, Andres Freund wrote:
Here's a version of the patch implementing this approach. I assume this
solves the FreeBSD issue, but I'm running tests in a loop on Thomas'
machine.
I pushed this, and the buildfarm so far is showing more love.
There's a failure on jacana, but it looks like it's a machine issue:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-02-03%2013%3A17%3A57
Feb 03 08:36:16 rm: cannot remove `/home/pgrunner/bf/root/HEAD/pgsql.build/tmp_install/home/pgrunner/bf/root/HEAD/inst/bin/postgres.exe': Permission denied
And one on eelpout, which appears to be unrelated as well:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2019-02-03%2011%3A54%3A13
# Failed test 'intermediate client certificate is missing: matches'
# at t/001_ssltests.pl line 385.
# 'psql: server closed the connection unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# could not send startup packet: Connection reset by peer
# '
# doesn't match '(?^:SSL error)'
# Looks like you failed 1 test of 71.
[12:15:36] t/001_ssltests.pl ..
Dubious, test returned 1 (wstat 256, 0x100)
Greetings,
Andres Freund
On Sun, Feb 3, 2019 at 8:48 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2019-02-01 07:14:11 -0800, Andres Freund wrote:
Here's a version of the patch implementing this approach. I assume this
solves the FreeBSD issue, but I'm running tests in a loop on Thomas'
machine.
I pushed this, and the buildfarm so far is showing more love.
There's a failure on jacana, but it looks like it's a machine issue:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2019-02-03%2013%3A17%3A57
Feb 03 08:36:16 rm: cannot remove `/home/pgrunner/bf/root/HEAD/pgsql.build/tmp_install/home/pgrunner/bf/root/HEAD/inst/bin/postgres.exe': Permission denied
And one on eelpout, which appears to be unrelated as well:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2019-02-03%2011%3A54%3A13
# Failed test 'intermediate client certificate is missing: matches'
# at t/001_ssltests.pl line 385.
# 'psql: server closed the connection unexpectedly
# This probably means the server terminated abnormally
# before or while processing the request.
# could not send startup packet: Connection reset by peer
# '
# doesn't match '(?^:SSL error)'
# Looks like you failed 1 test of 71.
[12:15:36] t/001_ssltests.pl ..
Dubious, test returned 1 (wstat 256, 0x100)
None of these failures seem to be related to your commit.
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Feb 4, 2019 at 1:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sun, Feb 3, 2019 at 8:48 PM Andres Freund <andres@anarazel.de> wrote:
And one on eelpout, which appears to be unrelated as well:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=eelpout&dt=2019-02-03%2011%3A54%3A13
None of these failures seem to be related to your commit.
Yeah, that seems to be a problem with recent OpenSSL. I have some
analysis over here, but I don't know what exactly is going on yet:
https://postgr.es/m/CAEepm=2n6Nv+5tFfe8YnkUm1fXgvxR0Mm1FoD+QKG-vLNGLyKg@mail.gmail.com
--
Thomas Munro
http://www.enterprisedb.com