[PATCH] Btree BackwardScan race condition on Standby during VACUUM

Started by Michail Nikolaev, almost 6 years ago · 15 messages
#1 Michail Nikolaev
michail.nikolaev@gmail.com
1 attachment(s)

Hello, hackers.

------ ABSTRACT ------
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
Many versions are affected, including 12 and the upcoming 13.
It happens only on standby. It does not seem to cause invalid query
results.

------ REMARK ------
While working on support for index hint bits on standby [1], I started
getting
"ERROR: could not find left sibling of block XXXX in index XXXX"
during stress tests.

I was sure I had broken something in btree and spent a lot of time
trying to figure out what.
And later... I realized that it is a bug that has existed in btree
since very old times. Because scans are much faster with LP_DEAD
support on a standby, it happens much more frequently in my case.

------ HOW TO REPRODUCE ------
It is not easy to reproduce the issue, but you can try (tested on
REL_12_STABLE and master):

1) Set up the master (a sync replica and 'remote_apply' are not
required; they just make the scripts simpler):
autovacuum = off
synchronous_standby_names = '*'
synchronous_commit = 'remote_apply'

2) Set up the standby:
primary_conninfo = 'user=postgres host=127.0.0.1 port=5432
sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres
target_session_attrs=any'
port = 6543

3) Prepare a pgbench script file (test.bench) with the following content:
BEGIN;
select * from pgbench_accounts order by aid desc limit 1;
END;

4) Prepare the index:
./pgbench -i -s 10 -U postgres
./psql -U postgres -c "delete from pgbench_accounts where aid IN
(select aid from pgbench_accounts order by aid desc limit 500000)"

5) Start index scans on the standby:
./pgbench -f test.bench -j 1 -c ${NUMBER_OF_CORES} -n -P 1 -T 10000 -U
postgres -p 6543

6) Run vacuum on the master:
./psql -U postgres -c "vacuum pgbench_accounts"

7) You should see something like this:

progress: 1.0 s, 5.0 tps, lat 614.530 ms stddev 95.902
.....
progress: 5.0 s, 10.0 tps, lat 508.561 ms stddev 82.338
client 3 script 0 aborted in command 1 query 0: ERROR: could not find left sibling of block 1451 in index "pgbench_accounts_pkey"
progress: 6.0 s, 47.0 tps, lat 113.001 ms stddev 55.709
.....
progress: 12.0 s, 84.0 tps, lat 48.451 ms stddev 7.238
client 2 script 0 aborted in command 1 query 0: ERROR: could not find left sibling of block 2104 in index "pgbench_accounts_pkey"
progress: 13.0 s, 87.0 tps, lat 39.338 ms stddev 5.417
.....
progress: 16.0 s, 158.0 tps, lat 18.988 ms stddev 3.251
client 4 script 0 aborted in command 1 query 0: ERROR: could not find left sibling of block 2501 in index "pgbench_accounts_pkey"

I was able to reproduce the issue with vanilla PG_12 on different
systems, including a Windows machine.
On some servers it happens 100% of the time; on others, very seldom.

It is possible to radically increase the chance of reproducing the
issue by adding a sleep in btree_xlog_unlink_page [7]:

+ pg_usleep(10 * 1000L);

/* Rewrite target page as empty deleted page */
buffer = XLogInitBufferForRedo(record, 0);

------ WHAT HAPPENS ------
It is a race condition between btree_xlog_unlink_page and _bt_walk_left.

btree_xlog_unlink_page removes a page from the btree by changing the
btpo_prev and btpo_next pointers of its left and right siblings to
point to each other, and marks the target page as removed. All these
operations are done using one-page-at-a-time locking because of [4]:

* In normal operation, we would lock all the pages this WAL record
* touches before changing any of them. In WAL replay, it should be okay
* to lock just one page at a time, since no concurrent index updates can
* be happening, and readers should not care whether they arrive at the
* target page or not (since it's surely empty).

_bt_walk_left walks left in a very tricky way. Please refer to
src/backend/access/nbtree/README for details [5]:

Moving left in a backward scan is complicated because we must consider
the possibility that the left sibling was just split (meaning we must find
the rightmost page derived from the left sibling), plus the possibility
that the page we were just on has now been deleted and hence isn't in the
sibling chain at all anymore.

So, this is how the race happens:

0) This is the target page (B) and its siblings:
A <---> B <---> C ---> END

1) The walreceiver starts btree_xlog_unlink_page for B. It changes the
links from C to A and from A to C:

A <---- B ----> C ---> END
^               ^
\_______________/

But B is not yet marked as BTP_DELETED - the walreceiver stops at
nbtxlog.c:697 [2].

2) Another backend starts _bt_walk_left from B.
It checks A, goes from A to C via the updated btpo_next, and then sees
the end of the btree.
So, the next step is to check whether B was deleted (nbtsearch.c:2011
[3]) and try to recover.

But B is not yet deleted! It will be marked as BTP_DELETED by the
walreceiver a few milliseconds later, but not yet.
So, the error at nbtsearch.c:2046 [6] is raised.
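
For clarity, here is a condensed sketch (not the literal code, and with
approximate variable names) of the last-resort check in _bt_walk_left
that raises the error:

/*
 * Having failed to find a left sibling, _bt_walk_left re-reads the page
 * the scan started from (B).  In the race, B's sibling links are already
 * rerouted around it, but B itself is not yet marked BTP_DELETED, so we
 * fall through to the error instead of retrying.
 */
page = BufferGetPage(buf);
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
if (P_ISDELETED(opaque))
{
    /* B was deleted concurrently: walk right from it and retry */
    blkno = opaque->btpo_next;
}
else
    elog(ERROR, "could not find left sibling of block %u in index \"%s\"",
         obknum, RelationGetRelationName(rel));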

------ HOW TO FIX ------
The first idea was to mark the page as BTP_DELETED before updating the
sibling links.

The second was to update the pages in the following order:
* change btpo_next
* mark as BTP_DELETED
* change btpo_prev

Such changes fix exactly the race condition described above... but they
cause trickier ones to start happening.
And I think it is better to avoid overly complex, unclear solutions here.

Another idea was to sleep a little, waiting for the walreceiver to mark
the page as deleted. But that feels too ugly, and it is unclear how
long to wait.

So, I think the right way is to lock all three pages, as is done on the
primary. As far as I can see, it does not cause any real performance
regression.

The patch is attached (on top of REL_12_STABLE).

Thanks,
Michail.

[1]: /messages/by-id/CANtu0ohOvgteBYmCMc2KERFiJUvpWGB0bRTbK_WseQH-L1jkrQ@mail.gmail.com
[2]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L697
[3]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2011
[4]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L575-L581
[5]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/README#L289-L314
[6]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtsearch.c#L2046
[7]: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/backend/access/nbtree/nbtxlog.c#L696

Attachments:

btree-race.patch (text/x-patch; charset=US-ASCII)
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c1aa..4e302fdd69 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -649,7 +649,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
 	BlockNumber leftsib;
 	BlockNumber rightsib;
-	Buffer		buffer;
+	Buffer		leafbuf, lbuff = InvalidBuffer, rbuff, buff;
 	Page		page;
 	BTPageOpaque pageop;
 
@@ -657,47 +657,29 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	rightsib = xlrec->rightsib;
 
 	/*
-	 * In normal operation, we would lock all the pages this WAL record
-	 * touches before changing any of them.  In WAL replay, it should be okay
-	 * to lock just one page at a time, since no concurrent index updates can
-	 * be happening, and readers should not care whether they arrive at the
-	 * target page or not (since it's surely empty).
+	 * We have to lock the pages we need to modify in the moving right order.
+	 * Else we will go into the race against _bt_walk_left.
 	 */
 
-	/* Fix left-link of right sibling */
-	if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-		pageop->btpo_prev = leftsib;
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
 	/* Fix right-link of left sibling, if any */
 	if (leftsib != P_NONE)
 	{
-		if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+		if (XLogReadBufferForRedo(record, 1, &lbuff) == BLK_NEEDS_REDO)
 		{
-			page = (Page) BufferGetPage(buffer);
+			page = (Page) BufferGetPage(lbuff);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 			pageop->btpo_next = rightsib;
 
 			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
+			MarkBufferDirty(lbuff);
 		}
-		if (BufferIsValid(buffer))
-			UnlockReleaseBuffer(buffer);
 	}
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogInitBufferForRedo(record, 0);
-	page = (Page) BufferGetPage(buffer);
+	buff = XLogInitBufferForRedo(record, 0);
+	page = (Page) BufferGetPage(buff);
 
-	_bt_pageinit(page, BufferGetPageSize(buffer));
+	_bt_pageinit(page, BufferGetPageSize(buff));
 	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	pageop->btpo_prev = leftsib;
@@ -707,9 +689,26 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	pageop->btpo_cycleid = 0;
 
 	PageSetLSN(page, lsn);
-	MarkBufferDirty(buffer);
-	UnlockReleaseBuffer(buffer);
+	MarkBufferDirty(buff);
 
+	/* Fix left-link of right sibling */
+	if (XLogReadBufferForRedo(record, 2, &rbuff) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(rbuff);
+		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		pageop->btpo_prev = leftsib;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(rbuff);
+	}
+
+	/* Release all buffers */
+	if (BufferIsValid(lbuff))
+		UnlockReleaseBuffer(lbuff);
+	UnlockReleaseBuffer(buff);
+	if (BufferIsValid(rbuff))
+		UnlockReleaseBuffer(rbuff);
+
 	/*
 	 * If we deleted a parent of the targeted leaf page, instead of the leaf
 	 * itself, update the leaf to point to the next remaining child in the
@@ -723,10 +722,10 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 		 */
 		IndexTupleData trunctuple;
 
-		buffer = XLogInitBufferForRedo(record, 3);
-		page = (Page) BufferGetPage(buffer);
+		leafbuf = XLogInitBufferForRedo(record, 3);
+		page = (Page) BufferGetPage(leafbuf);
 
-		_bt_pageinit(page, BufferGetPageSize(buffer));
+		_bt_pageinit(page, BufferGetPageSize(leafbuf));
 		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 		pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
@@ -745,8 +744,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 			elog(ERROR, "could not add dummy high key to half-dead page");
 
 		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-		UnlockReleaseBuffer(buffer);
+		MarkBufferDirty(leafbuf);
+		UnlockReleaseBuffer(leafbuf);
 	}
 
 	/* Update metapage if needed */
#2 Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Michail Nikolaev (#1)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Hi Michail!

Very interesting bug.

On 16 March 2020, at 19:07, Michail Nikolaev <michail.nikolaev@gmail.com> wrote:

So, I think the right way is to lock all three pages, as is done on the
primary. As far as I can see, it does not cause any real performance
regression.

It seems to me that this is exactly the same check that I was trying to verify in my amcheck patch [0].
There it was verified inside amcheck; here it is verified by an index scan.

Basically, one cannot check that two vice-versa pointers are in agreement without locking both.
As a result, they must be changed under lock too.
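
To illustrate, here is a minimal sketch of the lock coupling such a
cross-check needs (variable names like lblkno are hypothetical, not
taken from the amcheck patch):

Buffer       lbuf, rbuf;
BTPageOpaque lopaque, ropaque;

lbuf = _bt_getbuf(rel, lblkno, BT_READ);    /* lock the left page first */
lopaque = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(lbuf));
/* take the right page while still holding the left one */
rbuf = _bt_getbuf(rel, lopaque->btpo_next, BT_READ);
ropaque = (BTPageOpaque) PageGetSpecialPointer(BufferGetPage(rbuf));

/* only now is it meaningful to check that the links agree */
Assert(ropaque->btpo_prev == lblkno);

_bt_relbuf(rel, rbuf);
_bt_relbuf(rel, lbuf);

With only one page locked at a time, the other side of the pair can
change between the two reads, so no reliable check is possible.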

In my view, lock coupling is necessary here. I'm not sure we really need to lock all three pages, though.

Is there a reason why the concurrency protocol on standby should not be exactly the same as on the primary?

Best regards, Andrey Borodin.

[0]: https://commitfest.postgresql.org/24/2254/

#3 Peter Geoghegan
pg@bowt.ie
In reply to: Michail Nikolaev (#1)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On Mon, Mar 16, 2020 at 7:08 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

------ ABSTRACT ------
There is a race condition between btree_xlog_unlink_page and _bt_walk_left.
Many versions are affected, including 12 and the upcoming 13.
It happens only on standby. It does not seem to cause invalid query
results.

(CC'ing Heikki, just in case.)

Good catch! I haven't tried to reproduce the problem here just yet,
but your explanation is very easy for me to believe.

As you pointed out, the best solution is likely to involve having the
standby imitate the buffer lock acquisitions that take place on the
primary. We don't do that for page splits and page deletions. I think
that it's okay in the case of page splits, since we're only failing to
perform the same bottom-up lock coupling (I added something about that
specific thing to the README recently). Even btree_xlog_unlink_page()
would probably be safe if we didn't have to worry about backwards
scans, which are really a special case. But we do.

FWIW, while I agree that this issue is more likely to occur due to the
effects of commit 558a9165, especially when running your test case, my
own work on B-Tree indexes for Postgres 12 might also be a factor. I
won't get into the reasons now, since they're very subtle, but I have
observed that the Postgres 12 work tends to make page deletion occur
far more frequently with certain workloads. This was really obvious
when I examined the structure of B-Tree indexes over many hours while
BenchmarkSQL/TPC-C [1] ran, for example.

[1]: https://github.com/petergeoghegan/benchmarksql
--
Peter Geoghegan

#4 Peter Geoghegan
pg@bowt.ie
In reply to: Andrey M. Borodin (#2)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On Mon, Mar 16, 2020 at 10:20 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

It seems to me that this is exactly the same check that I was trying to verify in my amcheck patch [0].
There it was verified inside amcheck; here it is verified by an index scan.

Maybe we can accept your patch after fixing this bug. My objection to
the patch was that it couples locks in a way that's not compatible
with btree_xlog_unlink_page(). But the problem now seems to have been
btree_xlog_unlink_page() itself. It's possible that there are problems
elsewhere, but my recollection is that btree_xlog_unlink_page() was
the problem.

--
Peter Geoghegan

#5Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Peter Geoghegan (#4)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On 18 March 2020, at 00:37, Peter Geoghegan <pg@bowt.ie> wrote:

On Mon, Mar 16, 2020 at 10:20 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:

It seems to me that this is exactly the same check that I was trying to verify in my amcheck patch [0].
There it was verified inside amcheck; here it is verified by an index scan.

Maybe we can accept your patch after fixing this bug. My objection to
the patch was that it couples locks in a way that's not compatible
with btree_xlog_unlink_page(). But the problem now seems to have been
btree_xlog_unlink_page() itself. It's possible that there are problems
elsewhere, but my recollection is that btree_xlog_unlink_page() was
the problem.

The problem was that btree_xlog_split() and btree_xlog_unlink_page() do not couple locks while fixing left links.
Probably, the patch in this thread should fix this in btree_xlog_split() too?

Best regards, Andrey Borodin.

#6Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Andrey M. Borodin (#5)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Hello.

Probably, the patch in this thread should fix this in btree_xlog_split() too?

I have spent some time trying to find any possible race condition
between btree_xlog_split and _bt_walk_left… but I cannot find any.
Also, I have tried to cause an issue by putting a pg_sleep into
btree_xlog_split (between releasing and taking locks), but without
any luck.

I agree it is better to keep the same locking logic for the primary and
standby in general. But that is possible scope for another patch.

Thanks,
Michail.

#7 Peter Geoghegan
pg@bowt.ie
In reply to: Michail Nikolaev (#6)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On Fri, Mar 27, 2020 at 8:58 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

I have spent some time trying to find any possible race condition
between btree_xlog_split and _bt_walk_left… but I cannot find any.
Also, I have tried to cause an issue by putting a pg_sleep into
btree_xlog_split (between releasing and taking locks), but without
any luck.

I pushed a commit that tries to clear up some of the details around
how locking works during page splits. See commit 9945ad6e.

I agree it is better to keep the same locking logic for the primary and
standby in general. But that is possible scope for another patch.

It seems useful, but only up to a point. We don't need to hold locks
across related atomic operations (i.e. across each phase of a page
split or page deletion). In particular, the lock coupling across page
levels that we perform on the primary when ascending the tree
following a page split doesn't need to occur on standbys. I added
something about this to the nbtree README in commit 9f83468b353.

I'm not surprised that you didn't find any problems in
btree_xlog_split(). It is already conservative about locking the
sibling/child pages. It could hardly be more conservative (though see
the code and comments at the end of btree_xlog_split(), which mention
locking and backwards scans directly).

--
Peter Geoghegan

#8 Peter Geoghegan
pg@bowt.ie
In reply to: Michail Nikolaev (#1)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On Mon, Mar 16, 2020 at 7:08 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

I was sure I had broken something in btree and spent a lot of time
trying to figure out what.
And later... I realized that it is a bug that has existed in btree
since very old times. Because scans are much faster with LP_DEAD
support on a standby, it happens much more frequently in my case.

On second thought, I wonder how commit 558a9165 could possibly be
relevant here. nbtree VACUUM doesn't care about the LP_DEAD bit at
all. Sure, btree_xlog_delete_get_latestRemovedXid() is not going to
have to run on the standby on Postgres 12, but that only ever happened
at the point where we might have to split the page on the primary
(i.e. when _bt_delitems_delete() is called on the primary) anyway.
_bt_delitems_delete()/btree_xlog_delete_get_latestRemovedXid() are not
related to page deletion by VACUUM.

It's true that VACUUM will routinely kill tuples that happen to have
their LP_DEAD bit set, but it isn't actually influenced by the fact
that somebody set (or didn't set) any tuple's LP_DEAD bit. VACUUM has
its own strategy for generating recovery conflicts (it relies on
conflicts generated during the pruning phase of heap VACUUMing).
VACUUM is not willing to generate ad-hoc conflicts (in the style of
_bt_delitems_delete()) just to kill a few more tuples in relatively
uncommon cases -- cases where some LP_DEAD bits were set after a
VACUUM process started, but before the VACUUM process reached an
affected (LP_DEAD bits set) leaf page.

Again, I suspect that the problem is more likely to occur on Postgres
12 in practice because page deletion is more likely to occur on that
version. IOW, due to my B-Tree work for Postgres 12: commit dd299df8,
and related commits. That's probably all that there is to it.

--
Peter Geoghegan

#9 Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Peter Geoghegan (#8)
1 attachment(s)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Hello, Peter.

I added
something about this to the nbtree README in commit 9f83468b353.

I have added some updates to your notes in the updated patch version.

I was also trying to keep the original wrapping of the paragraph, so
the patch looks a bit wordy.

Thanks,
Michail.

Attachments:

btree-race-and-docs.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 2d0f8f4b79..0aefb1a5fb 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -550,21 +550,21 @@ right to recover from a "concurrent" page split or page deletion, just
 like on the primary.)
 
 However, there are a couple of differences in how pages are locked by
-replay/the startup process as compared to the original write operation
-on the primary. The exceptions involve page splits and page deletions.
-The first phase and second phase of a page split are processed
-independently during replay, since they are independent atomic actions.
-We do not attempt to recreate the coupling of parent and child page
-write locks that took place on the primary. This is safe because readers
-never care about the incomplete split flag anyway. Holding on to an
-extra write lock on the primary is only necessary so that a second
-writer cannot observe the incomplete split flag before the first writer
-finishes the split. If we let concurrent writers on the primary observe
-an incomplete split flag on the same page, each writer would attempt to
-complete the unfinished split, corrupting the parent page.  (Similarly,
-replay of page deletion records does not hold a write lock on the target
-leaf page throughout; only the primary needs to block out concurrent
-writers that insert on to the page being deleted.)
+replay/the startup process as compared to the original write operation on
+the primary. This is mainly about the page splits. The first phase and
+second phase of a page split are processed independently during replay,
+since they are independent atomic actions. We do not attempt to recreate
+the coupling of parent and child page write locks that took place on the
+primary. This is safe because readers never care about the incomplete
+split flag anyway. Holding on to an extra write lock on the primary is
+only necessary so that a second writer cannot observe the incomplete split
+flag before the first writer finishes the split. If we let concurrent
+writers on the primary observe an incomplete split flag on the same page,
+each writer would attempt to complete the unfinished split, corrupting the
+parent page. (However, although only the primary needs to block out
+concurrent writers that insert on to the page being deleted – page
+deletion lock coupling on standby is the same as on primary. It is
+required to avoid race condition during the scan in a backward direction.)
 
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple. We do this because the oldest xmin
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 99d0914e72..679fc5866e 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -772,7 +772,7 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
 	BlockNumber leftsib;
 	BlockNumber rightsib;
-	Buffer		buffer;
+	Buffer		leafbuf, lbuff = InvalidBuffer, rbuff, buff;
 	Page		page;
 	BTPageOpaque pageop;
 
@@ -780,47 +780,29 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	rightsib = xlrec->rightsib;
 
 	/*
-	 * In normal operation, we would lock all the pages this WAL record
-	 * touches before changing any of them.  In WAL replay, it should be okay
-	 * to lock just one page at a time, since no concurrent index updates can
-	 * be happening, and readers should not care whether they arrive at the
-	 * target page or not (since it's surely empty).
+	 * We have to lock the pages we need to modify in the moving right order.
+	 * Else we will go into the race against _bt_walk_left.
 	 */
 
-	/* Fix left-link of right sibling */
-	if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-		pageop->btpo_prev = leftsib;
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
 	/* Fix right-link of left sibling, if any */
 	if (leftsib != P_NONE)
 	{
-		if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+		if (XLogReadBufferForRedo(record, 1, &lbuff) == BLK_NEEDS_REDO)
 		{
-			page = (Page) BufferGetPage(buffer);
+			page = (Page) BufferGetPage(lbuff);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 			pageop->btpo_next = rightsib;
 
 			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
+			MarkBufferDirty(lbuff);
 		}
-		if (BufferIsValid(buffer))
-			UnlockReleaseBuffer(buffer);
 	}
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogInitBufferForRedo(record, 0);
-	page = (Page) BufferGetPage(buffer);
+	buff = XLogInitBufferForRedo(record, 0);
+	page = (Page) BufferGetPage(buff);
 
-	_bt_pageinit(page, BufferGetPageSize(buffer));
+	_bt_pageinit(page, BufferGetPageSize(buff));
 	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	pageop->btpo_prev = leftsib;
@@ -830,8 +812,25 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	pageop->btpo_cycleid = 0;
 
 	PageSetLSN(page, lsn);
-	MarkBufferDirty(buffer);
-	UnlockReleaseBuffer(buffer);
+	MarkBufferDirty(buff);
+
+	/* Fix left-link of right sibling */
+	if (XLogReadBufferForRedo(record, 2, &rbuff) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(rbuff);
+		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		pageop->btpo_prev = leftsib;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(rbuff);
+	}
+
+	/* Release all buffers */
+	if (BufferIsValid(lbuff))
+		UnlockReleaseBuffer(lbuff);
+	UnlockReleaseBuffer(buff);
+	if (BufferIsValid(rbuff))
+		UnlockReleaseBuffer(rbuff);
 
 	/*
 	 * If we deleted a parent of the targeted leaf page, instead of the leaf
@@ -846,10 +845,10 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 		 */
 		IndexTupleData trunctuple;
 
-		buffer = XLogInitBufferForRedo(record, 3);
-		page = (Page) BufferGetPage(buffer);
+		leafbuf = XLogInitBufferForRedo(record, 3);
+		page = (Page) BufferGetPage(leafbuf);
 
-		_bt_pageinit(page, BufferGetPageSize(buffer));
+		_bt_pageinit(page, BufferGetPageSize(leafbuf));
 		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 		pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
@@ -868,8 +867,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 			elog(ERROR, "could not add dummy high key to half-dead page");
 
 		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-		UnlockReleaseBuffer(buffer);
+		MarkBufferDirty(leafbuf);
+		UnlockReleaseBuffer(leafbuf);
 	}
 
 	/* Update metapage if needed */
#10 Peter Geoghegan
pg@bowt.ie
In reply to: Michail Nikolaev (#9)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Hi Michail,

On Sun, Apr 5, 2020 at 10:04 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

I added
something about this to the nbtree README in commit 9f83468b353.

I have added some updates to your notes in the updated patch version.

My apologies for the extended delay here.

My intention is to commit this patch to the master branch only. While
it seems low risk, I don't see any reason to accept even a small risk
given the lack of complaints from users. We know that this bug existed
many years before you discovered it.

Thanks
--
Peter Geoghegan

#11 Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Peter Geoghegan (#10)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Hello, Peter.

Thanks for the update.

Yes, it is the right decision.
I only started to spot this bug while working on faster scans using
hint bits on replicas [1], so it is unlikely to be hit in production
at the moment.

Thanks,
Michail.

[1]: /messages/by-id/CANtu0ojmkN_6P7CQWsZ=uEgeFnSmpCiqCxyYaHnhYpTZHj7Ubw@mail.gmail.com

#12 Peter Geoghegan
pg@bowt.ie
In reply to: Michail Nikolaev (#1)
1 attachment(s)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On Mon, Mar 16, 2020 at 7:08 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

While working on support for index hint bits on standby [1], I started
getting
"ERROR: could not find left sibling of block XXXX in index XXXX"
during stress tests.

I reproduced the bug using your steps (including the pg_usleep() hack)
today. It was fairly easy to confirm the problem.

Attached is a revised version of your patch. It renames the buffer
variable names, and changes the precise order in which the locks are
released (for consistency with _bt_unlink_halfdead_page()). It also
changes the comments, and adds a new paragraph to the README. The
existing paragraph was about cross-level differences, this new one is
about same-level differences (plus a second new paragraph to talk
about backwards scans + page deletion).

This revised version is essentially the same as your original patch --
I have only made superficial adjustments. I think that I will be able
to commit this next week, barring objections.

--
Peter Geoghegan

Attachments:

v3-0001-Avoid-backwards-scan-page-deletion-standby-race.patch (application/octet-stream)
From c225826208c31a980e41e1fad0298bb552f6114c Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Fri, 31 Jul 2020 16:57:35 -0700
Subject: [PATCH v3] Avoid backwards scan page deletion standby race.

Author: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
---
 src/backend/access/nbtree/README    | 18 ++++++
 src/backend/access/nbtree/nbtxlog.c | 85 ++++++++++++++++++-----------
 2 files changed, 70 insertions(+), 33 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 32ad9e339a..a08c3de3f3 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -572,6 +572,24 @@ replay of page deletion records does not hold a write lock on the target
 leaf page throughout; only the primary needs to block out concurrent
 writers that insert on to the page being deleted.)
 
+There are also locking differences between the primary and WAL replay
+for the first stage of a page split (i.e. same-level differences in
+locking).  Replay of the first phase of a page split can get away with
+locking and updating the original right sibling page (which is also the
+new right sibling page's right sibling) after locks on the original page
+and its new right sibling have been released.  Again, this is okay
+because there are no writers.  Page deletion cannot get away with being
+lax about same-level locking during replay, though -- doing so risks
+confusing concurrent backwards scans.
+
+Page deletion's second phase locks the left sibling page, target page,
+and right page in order on the standby, just like on the primary.  This
+allows backwards scans running on a standby to reason about page
+deletion on the leaf level; a page cannot appear deleted without that
+being reflected in the sibling pages.  It's probably possible to be more
+lax about how locks are acquired on the standby during the second phase
+of page deletion, but that hardly seems worth it.
+
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 5d346da84f..8e46437dbe 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -774,7 +774,10 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
 	BlockNumber leftsib;
 	BlockNumber rightsib;
-	Buffer		buffer;
+	Buffer		leftbuf;
+	Buffer		target;
+	Buffer		rightbuf;
+	Buffer		leafbuf;
 	Page		page;
 	BTPageOpaque pageop;
 
@@ -783,46 +786,38 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 
 	/*
 	 * In normal operation, we would lock all the pages this WAL record
-	 * touches before changing any of them.  In WAL replay, it should be okay
-	 * to lock just one page at a time, since no concurrent index updates can
-	 * be happening, and readers should not care whether they arrive at the
-	 * target page or not (since it's surely empty).
+	 * touches before changing any of them.  In WAL replay, we at least lock
+	 * the pages in the same standard left-to-right order (leftsib, target,
+	 * rightsib).
+	 *
+	 * btree_xlog_split() can get away with fixing its right sibling page's
+	 * left link last of all, after dropping all other locks.  We prefer to
+	 * avoid dropping locks on same-level pages early compared to normal
+	 * operation.  This keeps things simple for backwards scans.  See
+	 * nbtree/README.
 	 */
 
-	/* Fix left-link of right sibling */
-	if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-		pageop->btpo_prev = leftsib;
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
 	/* Fix right-link of left sibling, if any */
 	if (leftsib != P_NONE)
 	{
-		if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+		if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
 		{
-			page = (Page) BufferGetPage(buffer);
+			page = (Page) BufferGetPage(leftbuf);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 			pageop->btpo_next = rightsib;
 
 			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
+			MarkBufferDirty(leftbuf);
 		}
-		if (BufferIsValid(buffer))
-			UnlockReleaseBuffer(buffer);
 	}
+	else
+		leftbuf = InvalidBuffer;
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogInitBufferForRedo(record, 0);
-	page = (Page) BufferGetPage(buffer);
+	target = XLogInitBufferForRedo(record, 0);
+	page = (Page) BufferGetPage(target);
 
-	_bt_pageinit(page, BufferGetPageSize(buffer));
+	_bt_pageinit(page, BufferGetPageSize(target));
 	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	pageop->btpo_prev = leftsib;
@@ -832,8 +827,27 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	pageop->btpo_cycleid = 0;
 
 	PageSetLSN(page, lsn);
-	MarkBufferDirty(buffer);
-	UnlockReleaseBuffer(buffer);
+	MarkBufferDirty(target);
+
+	/* Fix left-link of right sibling */
+	if (XLogReadBufferForRedo(record, 2, &rightbuf) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(rightbuf);
+		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		pageop->btpo_prev = leftsib;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(rightbuf);
+	}
+
+	/* Release siblings */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	if (BufferIsValid(rightbuf))
+		UnlockReleaseBuffer(rightbuf);
+
+	/* Release target */
+	UnlockReleaseBuffer(target);
 
 	/*
 	 * If we deleted a parent of the targeted leaf page, instead of the leaf
@@ -845,13 +859,18 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 		/*
 		 * There is no real data on the page, so we just re-create it from
 		 * scratch using the information from the WAL record.
+		 *
+		 * Note that we don't end up here when the target page is also the
+		 * leafbuf page.  There is no need to add a dummy hikey item with a
+		 * top parent link when deleting leafbuf because it's the last page
+		 * we'll delete in the subtree undergoing deletion.
 		 */
 		IndexTupleData trunctuple;
 
-		buffer = XLogInitBufferForRedo(record, 3);
-		page = (Page) BufferGetPage(buffer);
+		leafbuf = XLogInitBufferForRedo(record, 3);
+		page = (Page) BufferGetPage(leafbuf);
 
-		_bt_pageinit(page, BufferGetPageSize(buffer));
+		_bt_pageinit(page, BufferGetPageSize(leafbuf));
 		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 		pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
@@ -870,8 +889,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 			elog(ERROR, "could not add dummy high key to half-dead page");
 
 		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-		UnlockReleaseBuffer(buffer);
+		MarkBufferDirty(leafbuf);
+		UnlockReleaseBuffer(leafbuf);
 	}
 
 	/* Update metapage if needed */
-- 
2.25.1

#13 Daniel Gustafsson
daniel@yesql.se
In reply to: Peter Geoghegan (#12)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On 1 Aug 2020, at 20:30, Peter Geoghegan <pg@bowt.ie> wrote:

This revised version is essentially the same as your original patch --
I have only made superficial adjustments. I think that I will be able
to commit this next week, barring objections.

As we're out of time for the July CF where this is registered, I've moved this
to 2020-09. Based on the above comment, I've marked it Ready for Committer.

cheers ./daniel

#14 Michail Nikolaev
michail.nikolaev@gmail.com
In reply to: Peter Geoghegan (#12)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

Hello, Peter.

Attached is a revised version of your patch

Thanks for your work, the patch is looking better now.

Michail.

#15 Peter Geoghegan
pg@bowt.ie
In reply to: Michail Nikolaev (#14)
Re: [PATCH] Btree BackwardScan race condition on Standby during VACUUM

On Sun, Aug 2, 2020 at 9:07 AM Michail Nikolaev
<michail.nikolaev@gmail.com> wrote:

Thanks for your work, the patch is looking better now.

Pushed -- thanks!

--
Peter Geoghegan