heap_inplace_lock vs. autovacuum w/ LOCKTAG_TUPLE
intra-grant-inplace-db.spec got a novel failure today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58
The isolationtester_waiting query is supposed to detect that step vac2 is
blocked. vac2 didn't finish within the timeout, but isolationtester_waiting
never considered it blocked. Most-relevant postmaster.log lines:
===
2024-10-26 14:22:51.791 UTC [2444667:6] isolation/intra-grant-inplace-db/s1 LOG: statement: BEGIN;
2024-10-26 14:22:51.791 UTC [2444667:7] isolation/intra-grant-inplace-db/s1 LOG: statement:
GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee;
2024-10-26 14:22:51.791 UTC [2444668:6] isolation/intra-grant-inplace-db/s2 LOG: statement: VACUUM (FREEZE);
2024-10-26 14:22:51.813 UTC [2444663:1] LOG: skipping analyze of "pg_attribute" --- lock not available
2024-10-26 14:22:51.815 UTC [2444666:7] isolation/intra-grant-inplace-db/control connection LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{2444667,2444668,2444669}')
2024-10-26 14:22:51.815 UTC [2444666:8] isolation/intra-grant-inplace-db/control connection DETAIL: Parameters: $1 = '2444668'
[... omitting most other lines from PID 2444666: >37000 isolationtester_waiting calls ...]
2024-10-26 14:22:51.851 UTC [2444663:2] LOG: skipping analyze of "pg_class" --- lock not available
2024-10-26 14:22:51.892 UTC [2444663:3] LOG: skipping analyze of "pg_depend" --- lock not available
2024-10-26 14:23:52.833 UTC [2444663:4] ERROR: canceling autovacuum task
2024-10-26 14:23:52.833 UTC [2444663:5] CONTEXT: while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:24:52.768 UTC [2447791:1] ERROR: canceling autovacuum task
2024-10-26 14:24:52.768 UTC [2447791:2] CONTEXT: while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:25:52.777 UTC [2451018:1] ERROR: canceling autovacuum task
2024-10-26 14:25:52.777 UTC [2451018:2] CONTEXT: while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:26:52.789 UTC [2454071:1] ERROR: canceling autovacuum task
2024-10-26 14:26:52.789 UTC [2454071:2] CONTEXT: while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:27:16.322 UTC [2442306:1] LOG: checkpoint starting: time
2024-10-26 14:27:52.798 UTC [2457311:1] ERROR: canceling autovacuum task
2024-10-26 14:27:52.798 UTC [2457311:2] CONTEXT: while updating tuple (0,20) in relation "pg_database"
2024-10-26 14:28:51.755 UTC [2444666:37889] isolation/intra-grant-inplace-db/control connection LOG: execute isolationtester_waiting: SELECT pg_catalog.pg_isolation_test_session_is_blocked($1, '{2444667,2444668,2444669}')
2024-10-26 14:28:51.755 UTC [2444666:37890] isolation/intra-grant-inplace-db/control connection DETAIL: Parameters: $1 = '2444668'
2024-10-26 14:28:51.810 UTC [2463672:1] [unknown] LOG: connection received: host=[local]
2024-10-26 14:28:51.812 UTC [2444668:7] isolation/intra-grant-inplace-db/s2 ERROR: canceling statement due to user request
2024-10-26 14:28:51.812 UTC [2444668:8] isolation/intra-grant-inplace-db/s2 CONTEXT: while scanning block 0 of relation "pg_catalog.pg_database"
2024-10-26 14:28:51.812 UTC [2444668:9] isolation/intra-grant-inplace-db/s2 STATEMENT: VACUUM (FREEZE);
===
I would have expected these locking-relevant events:
- GRANT stamped a pg_database tuple with an xmax.
- auto-analyze sought to update the same tuple, so it acquired LOCKTAG_TUPLE
and entered XactLockTableWait(xmax-from-GRANT).
- Non-auto VACUUM from step vac2 blocked on acquiring LOCKTAG_TUPLE.
pg_isolation_test_session_is_blocked() considers this not blocked, since
blocking on autovacuum doesn't count for test output purposes.
- Deadlock detector notices auto-analyze is blocking vac2, hence "canceling
autovacuum task" after 1s.
- vac2 finishes its LOCKTAG_TUPLE acquisition and enters
XactLockTableWait(xmax-from-GRANT).
- pg_isolation_test_session_is_blocked() now considers vac2 blocked.
The timeout suggests some subset of that didn't happen. It's odd that a new
auto-analyze starts every minute, each of which exits due to blocking vac2. I
likely got something wrong in commit aac2c9b or its parent (2024-09-24). I'll
work on reproducing this.
Thanks,
nm
On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote:
intra-grant-inplace-db.spec got a novel failure today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58
The isolationtester_waiting query is supposed to detect that step vac2 is
blocked. vac2 didn't finish within the timeout, but isolationtester_waiting
never considered it blocked.
... work on reproducing this.
I'm running loops of three workloads that might reproduce this, on s390x:
- A buildfarm config like buildfarm member sarus.
- A workload where autovacuum vac_update_datfrozenxid() takes >10s, but
non-auto "VACUUM (ONLY_DATABASE_STATS)" runs in a loop under
lock_timeout=5s. This notices if the non-auto VACUUM ever fails to cancel
autovacuum. I apply the attached inplace200-bench-vac-v0.1.patch, then run
contrib/amcheck/t/089_vac.pl.
- A workload to make vac2's deadlock detector run when an autovacuum worker is
starting. That's one of my candidate explanations of the events behind the
failure log. I apply the attached inplace190-repro-autovacuum-v0.1.patch,
then run intra-grant-inplace-db.spec with debug_parallel_query=regress and
autovacuum_naptime=1s. This regularly gets vac2 to cancel autovacuum.
So far, those have seen nothing like the failed run.
I'm inclined to make isolationtester log pg_stat_activity and pg_locks
whenever a step times out. That would have ruled out many explanations for
what happened on sarus. However, logging on timeout won't help much until
something can reproduce the timeout.
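The dump I have in mind would be roughly this (a sketch only; the exact
query and output format are open, but the column names below exist in
pg_stat_activity and pg_locks today):
  SELECT a.pid, a.state, a.wait_event_type, a.wait_event,
         left(a.query, 60) AS query,
         l.locktype, l.relation::regclass AS relation, l.page, l.tuple,
         l.mode, l.granted
  FROM pg_stat_activity a
  LEFT JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
  WHERE a.datname = current_database();
Captured at timeout, that would at least distinguish a heavyweight lock wait
from other wait kinds or from a step that's merely slow.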
It's odd that a new
auto-analyze starts every minute, each of which exits due to blocking vac2.
Every autovacuum_naptime, a new worker cancels the old one due to wanting the
old worker's LOCKTAG_DATABASE_FROZEN_IDS. One can see this with "BEGIN; GRANT
TEMP ON DATABASE postgres TO pg_monitor;". The xmax that open transaction
stamps on the pg_database row prevents vac_update_datfrozenxid() from
finishing. While that transaction
remains open, every autovacuum_naptime after the first will log a cancellation
of the previous autovacuum worker. (This assumes each autovacuum worker,
having negligible real work to do, reaches vac_update_datfrozenxid() before
the next autovacuum_naptime.) That's likely not ideal, but it doesn't explain
the sarus failure.
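Spelled out, the recipe for seeing that pattern is just the following (a
sketch; it assumes autovacuum_naptime=1s so the cancellations arrive
quickly):
  -- session 1: leave this transaction open
  BEGIN;
  GRANT TEMP ON DATABASE postgres TO pg_monitor;
  -- While it stays open, the server log gains a "canceling autovacuum task"
  -- line once per autovacuum_naptime after the first, as each new worker
  -- wants the previous worker's LOCKTAG_DATABASE_FROZEN_IDS.
  -- ROLLBACK; ends the demonstration.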
Attachments:
inplace200-bench-vac-v0.1.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
diff --git a/contrib/amcheck/t/089_vac.pl b/contrib/amcheck/t/089_vac.pl
new file mode 100644
index 0000000..a28ebd9
--- /dev/null
+++ b/contrib/amcheck/t/089_vac.pl
@@ -0,0 +1,50 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Test VACUUM vs. slow autovac
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+my ($node, $result);
+
+#
+# Test set-up
+#
+$node = PostgreSQL::Test::Cluster->new('main');
+$node->init;
+$node->append_conf('postgresql.conf', 'autovacuum_naptime = 1s');
+$node->append_conf('postgresql.conf', 'log_lock_waits = on');
+$node->append_conf('postgresql.conf', 'deadlock_timeout = 10ms');
+$node->append_conf('postgresql.conf', q{backtrace_functions = 'ProcessInterrupts'});
+$node->start;
+
+my $grant_h = $node->background_psql('postgres');
+
+# If this ran, it would block every vac_update_datfrozenxid(), so lock_timeout would always fire:
+# $grant_h->query_safe(
+# q(
+# BEGIN;
+# GRANT TEMP ON DATABASE postgres TO pg_monitor;
+# ));
+
+{
+ local $ENV{PGOPTIONS} = '-clock_timeout=5s';
+$node->pgbench(
+ '--no-vacuum --client=1 --time=600',
+ 0,
+ [qr{actually processed}],
+ [qr{^$}],
+ 'vac',
+ {
+ 'vac' => 'VACUUM (ONLY_DATABASE_STATS);',
+ });
+}
+
+$grant_h->quit;
+$node->stop;
+done_testing();
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index ac8f5d9..20fe426 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1582,6 +1582,7 @@ vac_update_relstats(Relation relation,
* commits. As with vac_update_relstats, this avoids leaving dead tuples
* behind after a VACUUM.
*/
+#include "utils/fmgrprotos.h"
void
vac_update_datfrozenxid(void)
{
@@ -1607,6 +1608,9 @@ vac_update_datfrozenxid(void)
*/
LockDatabaseFrozenIds(ExclusiveLock);
+ if (AmAutoVacuumWorkerProcess())
+ DirectFunctionCall1(pg_sleep, Float8GetDatum(10.0));
+
/*
* Initialize the "min" calculation with
* GetOldestNonRemovableTransactionId(), which is a reasonable
inplace190-repro-autovacuum-v0.1.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
diff --git a/src/test/isolation/Makefile b/src/test/isolation/Makefile
index ade2256..ca487de 100644
--- a/src/test/isolation/Makefile
+++ b/src/test/isolation/Makefile
@@ -64,6 +64,9 @@ installcheck: all
check: all
$(pg_isolation_regress_check) --schedule=$(srcdir)/isolation_schedule
+nmcheck: all temp-install
+ $(pg_isolation_regress_check) intra-grant-inplace-db
+
# Non-default tests. It only makes sense to run these if set up to use
# prepared transactions, via TEMP_CONFIG for the check case, or via the
# postgresql.conf for the installcheck case.
diff --git a/src/test/isolation/expected/intra-grant-inplace-db.out b/src/test/isolation/expected/intra-grant-inplace-db.out
index a91402c..118cbaa 100644
--- a/src/test/isolation/expected/intra-grant-inplace-db.out
+++ b/src/test/isolation/expected/intra-grant-inplace-db.out
@@ -8,6 +8,12 @@ step snap3:
step b1: BEGIN;
step grant1:
GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee;
+ SELECT pg_sleep(random(0.000, 1.200));
+
+pg_sleep
+--------
+
+(1 row)
step vac2: VACUUM (FREEZE); <waiting ...>
step snap3:
diff --git a/src/test/isolation/isolationtester.c b/src/test/isolation/isolationtester.c
index 0b342b5..8a5e585 100644
--- a/src/test/isolation/isolationtester.c
+++ b/src/test/isolation/isolationtester.c
@@ -863,7 +863,7 @@ try_complete_step(TestSpec *testspec, PermutationStep *pstep, int flags)
while (PQisBusy(conn))
{
FD_SET(sock, &read_set);
- timeout.tv_sec = 0;
+ timeout.tv_sec = 8;
timeout.tv_usec = 10000; /* Check for lock waits every 10ms. */
ret = select(sock + 1, &read_set, NULL, NULL, &timeout);
diff --git a/src/test/isolation/specs/intra-grant-inplace-db.spec b/src/test/isolation/specs/intra-grant-inplace-db.spec
index 9de40ec..408e23f 100644
--- a/src/test/isolation/specs/intra-grant-inplace-db.spec
+++ b/src/test/isolation/specs/intra-grant-inplace-db.spec
@@ -18,6 +18,7 @@ session s1
step b1 { BEGIN; }
step grant1 {
GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee;
+ SELECT pg_sleep(random(0.000, 1.200));
}
step c1 { COMMIT; }
Hello Noah,
27.10.2024 07:09, Noah Misch wrote:
On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote:
intra-grant-inplace-db.spec got a novel failure today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58
The isolationtester_waiting query is supposed to detect that step vac2 is
blocked. vac2 didn't finish within the timeout, but isolationtester_waiting
never considered it blocked.
... work on reproducing this.
FWIW, there was a similar failure in August: [1], and I also could not
reproduce that locally, yet wrote a preliminary analysis at [2] in the
Unsorted section, in the hope of seeing it again and continuing the investigation.
[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57
[2]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
Best regards,
Alexander
On Sun, Oct 27, 2024 at 08:00:00AM +0300, Alexander Lakhin wrote:
27.10.2024 07:09, Noah Misch wrote:
On Sat, Oct 26, 2024 at 11:49:36AM -0700, Noah Misch wrote:
intra-grant-inplace-db.spec got a novel failure today:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sarus&dt=2024-10-26%2014%3A08%3A58
The isolationtester_waiting query is supposed to detect that step vac2 is
blocked. vac2 didn't finish within the timeout, but isolationtester_waiting
never considered it blocked.
... work on reproducing this.
The attached demo reproduces $SUBJECT for me. The clue was 'CONTEXT: while
scanning block 0 of relation "pg_catalog.pg_database"'. That
vacuum_error_callback() message indicated VACUUM_ERRCB_PHASE_SCAN_HEAP as the
current phase. That implies vac2 had not reached any inplace update, any
LOCKTAG_TUPLE, or any part of vac_update_datfrozenxid(). I bet vac2 was stuck
in LockBufferForCleanup(). Concurrently, a sequence of autovacuum workers
held a buffer pin blocking that cleanup. The demo makes two changes:
a. Start intra-grant-inplace-db.spec step vac2 just after some autovacuum
worker is blocked on the xid of the uncommitted GRANT. While blocked, the
worker holds a pin on the one pg_database page. vac2 needs
LockBufferForCleanup($THAT_PAGE).
b. Make LockBufferForCleanup() sleep 10ms after WAIT_EVENT_BUFFER_PIN, to give
the next autovacuum worker time to win the pinning race.
LockBufferForCleanup() waits for a pin count to fall to 1. While waiting, its
techniques are less sophisticated than what we have for lock.c heavyweight
locks. UnpinBufferNoOwner() notifies the cleanup waiter, but nothing stops
another backend from pinning before the cleanup waiter reacts. We have code
to cancel an autovacuum for the purpose of liberating a heavyweight lock, but
we don't have code like that to free a pin.
pg_isolation_test_session_is_blocked() doesn't detect buffer pin blockages.
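For what it's worth, the wait is visible in pg_stat_activity: a backend stuck
in LockBufferForCleanup() reports wait_event_type 'BufferPin'. A query like
this (a sketch) would have caught vac2 in that state, while
pg_isolation_test_session_is_blocked(), which consults only heavyweight-lock
and safe-snapshot wait information, keeps reporting it as not blocked:
  SELECT pid, wait_event_type, wait_event, left(query, 40) AS query
  FROM pg_stat_activity
  WHERE wait_event_type = 'BufferPin';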
Two fix candidates look most promising:
1. Unpin before heap_inplace_lock() sleeps.
2. Change intra-grant-inplace-db.spec to test freezing only MXIDs, not XIDs.
Let's do (1), as attached. Apart from some arguably-convoluted call stacks,
this has no disadvantages. An earlier version did unpin.
postgr.es/m/20240822073200.4f.nmisch@google.com stopped that, arguing,
"heap_update() doesn't optimize that way". In light of $SUBJECT, I no longer
see heap_update() as the applicable standard. Since autovacuum can run
anytime, it has an extra duty to be non-disruptive. heap_update() during
auto-analyze won't wait much, because analyze takes a self-exclusive table
lock and is the sole writer of stats tables. Inplace updates during
autovacuum are different. They do contend with other transactions. Since we
lack code to cancel an autovacuum to free a pin, a long-lived pin in
autovacuum is more disruptive than a long-lived lock in autovacuum.
While I've not tried it, I expect (2) would work as follows. pg_database
won't contain MXIDs, so lazy_scan_noprune() would approve continuing without
the cleanup lock. Unlike (1), this wouldn't help concurrency outside the
test.
FWIW, there was a similar failure in August: [1], and I also could not
reproduce that locally, yet wrote a preliminary analysis at [2] in the
Unsorted section, in the hope of seeing it again and continuing the investigation.
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=iguana&dt=2024-08-29%2013%3A57%3A57
[2] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures
Thanks. I had not known about iguana's failure. What are the chances that
the buffer pin could explain that one?
Attachments:
demo-inplace210-pin-starvation-v1.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
[not for commit] demo pin starvation from autovacuum inplace update
Test with intra-grant-inplace-db.spec. Changes here:
a. Start intra-grant-inplace-db.spec step vac2 just after some
autovacuum worker is blocked on the xid of the uncommitted GRANT.
While blocked, the worker holds a pin on the one pg_database page.
vac2 needs LockBufferForCleanup(that-page).
b. Make LockBufferForCleanup() sleep 10ms after WAIT_EVENT_BUFFER_PIN,
to give the next autovacuum worker time to win the pinning race.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0f02bf6..5268294 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -5344,8 +5344,17 @@ LockBufferForCleanup(Buffer buffer)
SetStartupBufferPinWaitBufId(-1);
}
else
+ {
ProcWaitForSignal(WAIT_EVENT_BUFFER_PIN);
+ /*
+ * Sleep to induce starvation. With heavyweight locks, the
+ * releaser grants to the next queue member, preventing
+ * starvation. Buffer pins have no such mechanism.
+ */
+ pg_usleep(10000);
+ }
+
/*
* Remove flag marking us as waiter. Normally this will not be set
* anymore, but ProcWaitForSignal() can return for other signals as
diff --git a/src/test/isolation/Makefile b/src/test/isolation/Makefile
index ade2256..ca487de 100644
--- a/src/test/isolation/Makefile
+++ b/src/test/isolation/Makefile
@@ -64,6 +64,9 @@ installcheck: all
check: all
$(pg_isolation_regress_check) --schedule=$(srcdir)/isolation_schedule
+nmcheck: all temp-install
+ $(pg_isolation_regress_check) intra-grant-inplace-db
+
# Non-default tests. It only makes sense to run these if set up to use
# prepared transactions, via TEMP_CONFIG for the check case, or via the
# postgresql.conf for the installcheck case.
diff --git a/src/test/isolation/expected/intra-grant-inplace-db.out b/src/test/isolation/expected/intra-grant-inplace-db.out
index a91402c..c019952 100644
--- a/src/test/isolation/expected/intra-grant-inplace-db.out
+++ b/src/test/isolation/expected/intra-grant-inplace-db.out
@@ -1,6 +1,6 @@
Parsed test spec with 3 sessions
-starting permutation: snap3 b1 grant1 vac2 snap3 c1 cmp3
+starting permutation: snap3 b1 grant1 wait vac2 snap3 c1 cmp3
step snap3:
INSERT INTO frozen_witness
SELECT datfrozenxid FROM pg_database WHERE datname = current_catalog;
@@ -9,6 +9,14 @@ step b1: BEGIN;
step grant1:
GRANT TEMP ON DATABASE isolation_regression TO regress_temp_grantee;
+step wait:
+ DO $$
+ BEGIN
+ WHILE NOT EXISTS(SELECT 1 FROM pg_locks WHERE NOT granted AND transactionid IS NOT NULL) LOOP
+ NULL;
+ END LOOP;
+ END$$;
+
step vac2: VACUUM (FREEZE); <waiting ...>
step snap3:
INSERT INTO frozen_witness
diff --git a/src/test/isolation/specs/intra-grant-inplace-db.spec b/src/test/isolation/specs/intra-grant-inplace-db.spec
index 9de40ec..7a9cae4 100644
--- a/src/test/isolation/specs/intra-grant-inplace-db.spec
+++ b/src/test/isolation/specs/intra-grant-inplace-db.spec
@@ -40,6 +40,19 @@ step cmp3 {
WHERE datname = current_catalog
AND age(datfrozenxid) > (SELECT min(age(x)) FROM frozen_witness);
}
+# Busy-wait until some autovacuum worker's vac_update_datfrozenxid() blocks
+# in heap_inplace_wait(), waiting for the xid of s1 to end. A vac2 at that
+# point will block trying to acquire a cleanup lock on the page autovacuum has
+# pinned. That's because systable_inplace_update_begin() acquires a pin that
+# we hold until systable_inplace_update_finish().
+step wait {
+ DO $$
+ BEGIN
+ WHILE NOT EXISTS(SELECT 1 FROM pg_locks WHERE NOT granted AND transactionid IS NOT NULL) LOOP
+ NULL;
+ END LOOP;
+ END$$;
+}
-permutation snap3 b1 grant1 vac2(c1) snap3 c1 cmp3
+permutation snap3 b1 grant1 wait vac2(c1) snap3 c1 cmp3
inplace220-unpin-v1.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Unpin buffer before inplace update waits for an XID to end.
Commit a07e03fd8fa7daf4d1356f7cb501ffe784ea6257 changed inplace updates
to wait for heap_update() commands like GRANT TABLE and GRANT DATABASE.
By keeping the pin during that wait, a sequence of autovacuum workers
and an uncommitted GRANT starved one foreground LockBufferForCleanup()
for six minutes, on buildfarm member sarus. Prevent, at the cost of a
bit of complexity. Back-patch to v12, like the earlier commit. That
commit and heap_inplace_lock() have not yet appeared in any release.
Reviewed by FIXME.
Discussion: https://postgr.es/m/20241026184936.ae.nmisch@google.com
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 4c8febd..f4f1cb4 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6165,8 +6165,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
* transaction. If compatible, return true with the buffer exclusive-locked,
* and the caller must release that by calling
* heap_inplace_update_and_unlock(), calling heap_inplace_unlock(), or raising
- * an error. Otherwise, return false after blocking transactions, if any,
- * have ended.
+ * an error. Otherwise, call release_callback(arg), wait for blocking
+ * transactions to end, and return false.
*
* Since this is intended for system catalogs and SERIALIZABLE doesn't cover
* DDL, this doesn't guarantee any particular predicate locking.
@@ -6200,7 +6200,8 @@ heap_abort_speculative(Relation relation, ItemPointer tid)
*/
bool
heap_inplace_lock(Relation relation,
- HeapTuple oldtup_ptr, Buffer buffer)
+ HeapTuple oldtup_ptr, Buffer buffer,
+ void (*release_callback) (void *), void *arg)
{
HeapTupleData oldtup = *oldtup_ptr; /* minimize diff vs. heap_update() */
TM_Result result;
@@ -6265,6 +6266,7 @@ heap_inplace_lock(Relation relation,
lockmode, NULL))
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ release_callback(arg);
ret = false;
MultiXactIdWait((MultiXactId) xwait, mxact_status, infomask,
relation, &oldtup.t_self, XLTW_Update,
@@ -6280,6 +6282,7 @@ heap_inplace_lock(Relation relation,
else
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ release_callback(arg);
ret = false;
XactLockTableWait(xwait, relation, &oldtup.t_self,
XLTW_Update);
@@ -6291,6 +6294,7 @@ heap_inplace_lock(Relation relation,
if (!ret)
{
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ release_callback(arg);
}
}
diff --git a/src/backend/access/index/genam.c b/src/backend/access/index/genam.c
index 69c3608..60c6103 100644
--- a/src/backend/access/index/genam.c
+++ b/src/backend/access/index/genam.c
@@ -814,6 +814,7 @@ systable_inplace_update_begin(Relation relation,
int retries = 0;
SysScanDesc scan;
HeapTuple oldtup;
+ BufferHeapTupleTableSlot *bslot;
/*
* For now, we don't allow parallel updates. Unlike a regular update,
@@ -835,10 +836,9 @@ systable_inplace_update_begin(Relation relation,
Assert(IsInplaceUpdateRelation(relation) || !IsSystemRelation(relation));
/* Loop for an exclusive-locked buffer of a non-updated tuple. */
- for (;;)
+ do
{
TupleTableSlot *slot;
- BufferHeapTupleTableSlot *bslot;
CHECK_FOR_INTERRUPTS();
@@ -864,11 +864,9 @@ systable_inplace_update_begin(Relation relation,
slot = scan->slot;
Assert(TTS_IS_BUFFERTUPLE(slot));
bslot = (BufferHeapTupleTableSlot *) slot;
- if (heap_inplace_lock(scan->heap_rel,
- bslot->base.tuple, bslot->buffer))
- break;
- systable_endscan(scan);
- };
+ } while (!heap_inplace_lock(scan->heap_rel,
+ bslot->base.tuple, bslot->buffer,
+ (void (*) (void *)) systable_endscan, scan));
*oldtupcopy = heap_copytuple(oldtup);
*state = scan;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b951466..96cf82f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -338,7 +338,8 @@ extern TM_Result heap_lock_tuple(Relation relation, HeapTuple tuple,
Buffer *buffer, struct TM_FailureData *tmfd);
extern bool heap_inplace_lock(Relation relation,
- HeapTuple oldtup_ptr, Buffer buffer);
+ HeapTuple oldtup_ptr, Buffer buffer,
+ void (*release_callback) (void *), void *arg);
extern void heap_inplace_update_and_unlock(Relation relation,
HeapTuple oldtup, HeapTuple tuple,
Buffer buffer);