refactoring relation extension and BufferAlloc(), faster COPY
Hi,
I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.
In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.
The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer; if that
buffer is dirty, we potentially flush the WAL and then write out that
buffer. Then we zero out the buffer contents and call smgrextend().
Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:
1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS
1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation extension lock, and
relation extension happens entirely within ReadBuffer* - there's no way to
use a narrower scope for the lock.
My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.
To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
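In condensed pseudo-C the victim acquisition looks roughly like the following
(a sketch of the GetVictimBuffer()/InvalidateVictimBuffer() helpers that
BufferAlloc() now uses, see the patches below; refcount reservation, the
content lock needed for the flush and the strategy-rejection path are
omitted):
/*
 * Sketch: pin, clean and unmap a victim buffer before any extension lock is
 * taken. Error handling and locking details are elided.
 */
static Buffer
GetVictimBufferSketch(BufferAccessStrategy strategy)
{
	for (;;)
	{
		uint32		buf_state;
		BufferDesc *buf_hdr = StrategyGetBuffer(strategy, &buf_state);

		/* pin the buffer, releasing the header spinlock */
		PinBuffer_Locked(buf_hdr);

		/* if dirty, write it out (possibly flushing WAL first) */
		if (buf_state & BM_DIRTY)
			FlushBuffer(buf_hdr, NULL);

		/*
		 * Remove the old buffer mapping entry and mark the buffer invalid.
		 * If somebody pinned or re-dirtied it concurrently, start over.
		 */
		if (InvalidateVictimBuffer(buf_hdr))
			return BufferDescriptorGetBuffer(buf_hdr);

		UnpinBuffer(buf_hdr);
	}
}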
This is similar to Yuri's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist etc.
Other interesting bits I found:
a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc()
does nearly doubles the amount of writes: first the kernel ends up writing
out all the zeroed-out buffers after a while, then again when we write out
the actual buffer contents.
The best fix for that seems to be to optionally use posix_fallocate() to
reserve space, without dirtying pages in the kernel page cache. However, it
looks like that's only beneficial when extending by multiple pages at once,
because it ends up causing one filesystem-journal entry for each extension
on at least some filesystems.
I added 'smgrzeroextend()' that can extend by multiple blocks, without the
caller providing a buffer to write out. When extending by 8 or more blocks,
posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
used to extend the file.
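From the caller's perspective that looks about like this (a sketch;
first_block has to be determined while holding the extension lock):
/*
 * Sketch: extend the main fork by 'extend_by' blocks in one call. Inside
 * mdzeroextend() (see the attached patch) this becomes posix_fallocate()
 * for larger extensions and pg_pwritev_with_retry() of a zeroed buffer
 * otherwise.
 */
static void
zeroextend_example(SMgrRelation smgr, int extend_by)
{
	/* both steps need to happen while holding the extension lock */
	BlockNumber first_block = smgrnblocks(smgr, MAIN_FORKNUM);

	smgrzeroextend(smgr, MAIN_FORKNUM, first_block, extend_by,
				   false /* skipFsync */ );
}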
b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily the
aforementioned dirty buffers that our current extension method causes.
One bit that stumped me for quite a while was knowing how much to extend the
relation by. RelationGetBufferForTuple() bases the decision whether / how
much to bulk-extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.
After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().
For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
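A sketch of that computation (the helper name and the way the number gets
passed down are illustrative; the real code is in the later patches):
/*
 * Sketch: estimate how many pages the not-yet-inserted tuples of one
 * heap_multi_insert() call will need, so RelationGetBufferForTuple() can
 * extend by roughly that much at once. Fillfactor is ignored for brevity;
 * the helper name is made up for this example.
 */
static int
pages_needed_sketch(HeapTuple *heaptuples, int ntuples, int done)
{
	Size		total = 0;

	for (int i = done; i < ntuples; i++)
		total += MAXALIGN(heaptuples[i]->t_len) + sizeof(ItemIdData);

	/* round up, using the usable space on an empty heap page */
	return (int) ((total + MaxHeapTupleSize - 1) / MaxHeapTupleSize);
}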
c) Contention on the FSM and the pages returned by it is a serious bottleneck
after a) and b).
The biggest issue is that the current bulk insertion logic in hio.c enters
all but one of the new pages into the freespacemap. That will immediately
cause all the other backends to contend on the first few pages returned by
the FSM, and cause contention on the FSM pages themselves.
I've, partially, addressed that by using the information about the required
number of pages from b). Whether we bulk insert or not, the number of pages
we know we're going to need for one heap_multi_insert() doesn't need to be
added to the FSM - we're going to use them anyway.
I've stashed the number of free blocks in the BulkInsertState for now, but
I'm not convinced that that's the right place.
If I revert just this part, the "concurrent COPY into unlogged table"
benchmark goes from ~240 tps to ~190 tps.
Even after that change the FSM is a major bottleneck. Below I included
benchmarks showing this by just removing the use of the FSM, but I haven't
done anything further about it. The contention seems to be both from
updating the FSM, as well as thundering-herd like symptoms from accessing
the FSM.
The update part could likely be addressed to some degree with a batch
update operation updating the state for multiple pages.
The access part could perhaps be addressed by adding an operation that gets
a page and immediately marks it as fully used, so other backends won't also
try to access it.
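Interface-wise I'm thinking of something vaguely like the following (purely
hypothetical, nothing like it is in the attached patches):
/*
 * Hypothetical FSM additions, just to illustrate the two ideas above.
 * Neither function exists.
 */

/* update the free space for a contiguous range of blocks in one pass */
extern void RecordPageRangeFreeSpace(Relation rel,
									 BlockNumber start, BlockNumber end,
									 Size spaceAvail);

/*
 * Return a page with at least spaceNeeded bytes free and immediately mark
 * it as full, so other backends don't pile onto the same page.
 */
extern BlockNumber GetAndClaimPageWithFreeSpace(Relation rel, Size spaceNeeded);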
d) doing
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
under the extension lock is surprisingly expensive on my two-socket
workstation (but much less noticeable on my laptop).
If I move the MemSet back under the extension lock, the "concurrent COPY
into unlogged table" benchmark goes from ~240 tps to ~200 tps.
e) When running a few benchmarks for this email, I noticed that there was a
sharp performance dropoff for the patched code for a pgbench -S -s100 on a
database with 1GB s_b, starting between 512 and 1024 clients. This started
with the patch only acquiring one buffer partition lock at a time. Lots of
debugging ensued, resulting in [3].
The problem isn't actually related to the change, it just makes it more
visible, because the "lock chains" between two partitions reduce the
average length of the wait queues substantially, by distributing them
across more partitions. [3] has a reproducer that's entirely independent
of this patchset.
Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that
the buffer[s] are locked (depending on mode and an argument indicating the
number of blocks to be locked), and TerminateBufferIO() is called.
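In rough pseudo-C the critical section therefore shrinks to something like
this (error handling, buffer-header/tag updates, partition locking and the
local-buffer path are elided - the real code is in the later patches):
/*
 * Sketch of the narrowed critical section for extending by 'nblocks' blocks;
 * the extension lock only covers steps 1)-3) plus the smgr call.
 */
static void
BulkExtendSketch(Relation rel, SMgrRelation smgr, ForkNumber fork,
				 Buffer *victim_buffers, uint32 nblocks,
				 bool skip_extension_lock)
{
	BlockNumber first_block;

	/* victim buffers were already pinned, cleaned and unmapped above */

	if (!skip_extension_lock)
		LockRelationForExtension(rel, ExclusiveLock);

	first_block = smgrnblocks(smgr, fork);	/* 1) */

	for (uint32 i = 0; i < nblocks; i++)
	{
		BufferDesc *buf_hdr = GetBufferDescriptor(victim_buffers[i] - 1);
		BufferTag	tag;

		InitBufferTag(&tag, &smgr->smgr_rlocator.locator, fork,
					  first_block + i);

		/* 2) insert into the buffer mapping table */
		BufTableInsert(&tag, BufTableHashCode(&tag), buf_hdr->buf_id);

		/* 3) mark as IO_IN_PROGRESS */
		StartBufferIO(buf_hdr, true);
	}

	/* reserve / zero the new blocks on disk */
	smgrzeroextend(smgr, fork, first_block, nblocks, false /* skipFsync */ );

	if (!skip_extension_lock)
		UnlockRelationForExtension(rel, ExclusiveLock);

	/* afterwards: lock the requested buffer[s], then TerminateBufferIO() */
}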
This requires two new pieces of infrastructure:
First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
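The limiting itself is simple; roughly (a sketch, the real helper may differ
in details):
/*
 * Sketch of LimitAdditionalPins(): cap the number of additional buffers a
 * backend may pin to its proportional share of shared_buffers, but always
 * allow at least one, matching today's behaviour for single-buffer pins.
 */
static void
LimitAdditionalPinsSketch(uint32 *additional_pins)
{
	uint32		max_pins = NBuffers / (MaxBackends + NUM_AUXILIARY_PROCS);

	if (max_pins < 1)
		max_pins = 1;
	if (*additional_pins > max_pins)
		*additional_pins = max_pins;
}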
Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added a ResourceOwnerRememberBufferIO() etc to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).
In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible. I'm not sure that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.
The new [Bulk]ExtendSharedRelationBuffered() functions currently have both a
Relation and an SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock). That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.
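For reference, the rough shape of the interface (illustrative only; the
precise parameter lists are in the later patches):
/*
 * Either 'rel' or 'smgr' may be NULL, but not both; recovery passes just
 * the SMgrRelation and sets skip_extension_lock, since redo doesn't need
 * the extension lock. The Bulk variant additionally takes an array of
 * buffers and a block count. Parameters shown here are illustrative.
 */
extern Buffer ExtendSharedRelationBuffered(Relation rel,
										   SMgrRelation smgr,
										   bool skip_extension_lock,
										   char relpersistence,
										   ForkNumber forkNum,
										   ReadBufferMode mode,
										   BufferAccessStrategy strategy);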
I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.
server "base" configuration:
max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB
benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to ringbuffer).
To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.
s_b=24GB
test: unlogged_small_files, format: text, files: 1024, 9015MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 58.63 207 50.22 242 54.35 224
2 32.67 372 25.82 472 27.30 446
4 22.53 540 13.30 916 14.33 851
8 15.14 804 7.43 1640 7.48 1632
16 14.69 829 4.79 2544 4.50 2718
32 15.28 797 4.41 2763 3.32 3710
64 15.34 794 5.22 2334 3.06 4061
128 15.49 786 4.97 2452 3.13 3926
256 15.85 768 5.02 2427 3.26 3769
512 16.02 760 5.29 2303 3.54 3471
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 68.18 178 59.41 205 63.43 192
2 39.71 306 33.10 368 34.99 348
4 27.26 446 19.75 617 20.09 607
8 18.84 646 12.86 947 12.68 962
16 15.96 763 9.62 1266 8.51 1436
32 15.43 789 8.20 1486 7.77 1579
64 16.11 756 8.91 1367 8.90 1383
128 16.41 742 10.00 1218 9.74 1269
256 17.33 702 11.91 1023 10.89 1136
512 18.46 659 14.07 866 11.82 1049
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 63.27 192 56.14 217 59.25 205
2 40.17 303 29.88 407 31.50 386
4 27.57 442 16.16 754 17.18 709
8 21.26 573 11.89 1025 11.09 1099
16 21.25 573 10.68 1141 10.22 1192
32 21.00 580 10.72 1136 10.35 1178
64 20.64 590 10.15 1200 9.76 1249
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 71.89 169 65.57 217 69.09 69.09
2 47.36 257 36.22 407 38.71 38.71
4 33.10 368 21.76 754 22.78 22.78
8 26.62 457 15.89 1025 15.30 15.30
16 24.89 489 17.08 1141 15.20 15.20
32 25.15 484 17.41 1136 16.14 16.14
64 26.11 466 17.89 1200 16.76 16.76
128 skipped
256 skipped
512 skipped
Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.
test: unlogged_small_files, format: binary, files: 1024, 9508MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 34.14 357 28.04 434 29.46 413
2 22.67 537 14.42 845 14.75 826
4 16.63 732 7.62 1599 7.69 1587
8 13.48 904 4.36 2795 4.13 2959
16 14.37 848 3.78 3224 2.74 4493
32 14.79 823 4.20 2902 2.07 5974
64 14.76 825 5.03 2423 2.21 5561
128 14.95 815 4.36 2796 2.30 5343
256 15.18 802 4.31 2828 2.49 4935
512 15.41 790 4.59 2656 2.84 4327
s_b=4GB
test: unlogged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.55 194 54.22 224
2 37.11 328 28.94 421
4 25.97 469 16.42 742
8 20.01 609 11.92 1022
16 19.55 623 11.05 1102
32 19.34 630 11.27 1081
64 19.07 639 12.04 1012
128 19.22 634 12.27 993
256 19.34 630 12.28 992
512 19.60 621 11.74 1038
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.71 169 63.63 191
2 46.93 259 36.31 335
4 30.37 401 22.41 543
8 22.86 533 16.90 721
16 20.18 604 14.07 866
32 19.64 620 13.06 933
64 19.71 618 15.08 808
128 19.95 610 15.47 787
256 20.48 595 16.53 737
512 21.56 565 16.86 722
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.65 194 55.74 218
2 40.25 302 29.45 413
4 27.37 445 16.26 749
8 22.07 552 11.75 1037
16 21.29 572 10.64 1145
32 20.98 580 10.70 1139
64 20.65 590 10.21 1193
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.72 169 65.12 187
2 46.46 262 35.74 341
4 32.61 373 21.60 564
8 26.69 456 16.30 747
16 25.31 481 17.00 716
32 24.96 488 17.47 697
64 26.05 467 17.90 680
128 skipped
256 skipped
512 skipped
test: unlogged_small_files, format: binary, files: 1024, 9505MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 37.62 323 32.77 371
2 28.35 429 18.89 645
4 20.87 583 12.18 1000
8 19.37 629 10.38 1173
16 19.41 627 10.36 1176
32 18.62 654 11.04 1103
64 18.33 664 11.89 1024
128 18.41 661 11.91 1023
256 18.52 658 12.10 1007
512 18.78 648 11.49 1060
benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.
The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...
One would need to run pgbench for impractically long to make that effect
vanish.
My not-so-great solution for this was to run with -T21 -P5 and use the best
5s as the tps.
s_b=1GB
tps tps
clients master patched
1 49541 48805
2 85342 90010
4 167340 168918
8 308194 303222
16 524294 523678
32 649516 649100
64 932547 937702
128 908249 906281
256 856496 903979
512 764254 934702
1024 653886 925113
2048 569695 917262
4096 526782 903258
s_b=128MB:
tps tps
clients master patched
1 40407 39854
2 73180 72252
4 143334 140860
8 240982 245331
16 429265 420810
32 544593 540127
64 706408 726678
128 713142 718087
256 611030 695582
512 552751 686290
1024 508248 666370
2048 474108 656735
4096 448582 633040
As there might be a small regression at the smallest end, I ran a more extreme
version of the above: a pipelined pgbench -S, with a single client, for
longer, with s_b=8MB.
To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.
master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4
Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.
My guess however is that it's the resource owner for in-progress IO that I
added - that adds an additional allocation inside the resowner machinery. I
commented those out (that's obviously incorrect!) just to see whether that
changes anything:
no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2
So it looks like it is indeed the resowner. I am a bit surprised, because we
already use that mechanism for pins, which is obviously more frequent.
I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:
- We could preallocate an initial element inside the ResourceArray struct, so
that a newly created resowner won't need to allocate immediately (a sketch of
that is below)
- We could only use resowners if there's more than one IO in progress at the
same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin'
resowner entry - on 64-bit systems there's plenty of space for that. But on
32-bit systems...
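For the first of those options, the sketch referenced above would look about
like this (existing field names per resowner.c; the added field is
illustrative):
/*
 * Sketch: give ResourceArray a bit of inline storage so a freshly created
 * resource owner can remember an in-progress IO without calling palloc().
 */
typedef struct ResourceArray
{
	Datum	   *itemsarr;		/* buffer for storing values */
	Datum		invalidval;		/* value that is considered invalid */
	uint32		capacity;		/* allocated length of itemsarr[] */
	uint32		nitems;			/* how many items are stored */
	uint32		maxitems;		/* current limit on nitems before enlarging */
	uint32		lastidx;		/* index of last item returned */
	Datum		initialitems[8];	/* inline storage used until the array
									 * grows past it, avoiding the initial
									 * palloc() */
} ResourceArray;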
The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.
Greetings,
Andres Freund
[0]: /messages/by-id/3b108afd19fa52ed20c464a69f64d545e4a14772.camel@postgrespro.ru
[1]: COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2]: COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3]: /messages/by-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Attachments:
v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch
From 57306a6a2a4e7300206d3a5e40a1787b427765b8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 27 Oct 2022 10:18:13 -0700
Subject: [PATCH v1 01/12] wip: lwlock: fix quadratic behaviour with very long
wait lists
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/proc.h | 3 ++-
src/backend/storage/lmgr/lwlock.c | 29 +++++++++++++++--------------
2 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 8d096fdeeb1..9a2615666a1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -217,7 +217,8 @@ struct PGPROC
bool recoveryConflictPending;
/* Info about LWLock the process is currently waiting for, if any. */
- bool lwWaiting; /* true if waiting for an LW lock */
+ int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if
+ * waiting to be woken */
uint8 lwWaitMode; /* lwlock mode being waited for */
proclist_node lwWaitLink; /* position in LW lock wait list */
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index d274c9b1dc9..2d1c8fae06e 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -987,6 +987,9 @@ LWLockWakeup(LWLock *lock)
wokeup_somebody = true;
}
+ /* signal that the process isn't on the wait list anymore */
+ waiter->lwWaiting = 2;
+
/*
* Once we've woken up an exclusive lock, there's no point in waking
* up anybody else.
@@ -1044,7 +1047,7 @@ LWLockWakeup(LWLock *lock)
* another lock.
*/
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = 0;
PGSemaphoreUnlock(waiter->sem);
}
}
@@ -1073,7 +1076,7 @@ LWLockQueueSelf(LWLock *lock, LWLockMode mode)
/* setting the flag is protected by the spinlock */
pg_atomic_fetch_or_u32(&lock->state, LW_FLAG_HAS_WAITERS);
- MyProc->lwWaiting = true;
+ MyProc->lwWaiting = 1;
MyProc->lwWaitMode = mode;
/* LW_WAIT_UNTIL_FREE waiters are always at the front of the queue */
@@ -1101,7 +1104,6 @@ static void
LWLockDequeueSelf(LWLock *lock)
{
bool found = false;
- proclist_mutable_iter iter;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -1114,17 +1116,14 @@ LWLockDequeueSelf(LWLock *lock)
LWLockWaitListLock(lock);
/*
- * Can't just remove ourselves from the list, but we need to iterate over
- * all entries as somebody else could have dequeued us.
+ * Remove ourselves from the waitlist, unless we've already been
+ * removed. The removal happens with the wait list lock held, so there's
+ * no race in this check.
*/
- proclist_foreach_modify(iter, &lock->waiters, lwWaitLink)
+ if (MyProc->lwWaiting == 1)
{
- if (iter.cur == MyProc->pgprocno)
- {
- found = true;
- proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
- break;
- }
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+ found = true;
}
if (proclist_is_empty(&lock->waiters) &&
@@ -1138,7 +1137,7 @@ LWLockDequeueSelf(LWLock *lock)
/* clear waiting state again, nice for debugging */
if (found)
- MyProc->lwWaiting = false;
+ MyProc->lwWaiting = 0;
else
{
int extraWaits = 0;
@@ -1772,6 +1771,8 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&lock->waiters, iter.cur, lwWaitLink);
proclist_push_tail(&wakeup, iter.cur, lwWaitLink);
+
+ waiter->lwWaiting = 2;
}
/* We are done updating shared state of the lock itself. */
@@ -1787,7 +1788,7 @@ LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 val)
proclist_delete(&wakeup, iter.cur, lwWaitLink);
/* check comment in LWLockWakeup() about this barrier */
pg_write_barrier();
- waiter->lwWaiting = false;
+ waiter->lwWaiting = 0;
PGSemaphoreUnlock(waiter->sem);
}
}
--
2.38.0
v1-0002-aio-Add-some-error-checking-around-pinning.patch
From 58ffcee444a3b28b37cd7458e179f891fe1578ae Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v1 02/12] aio: Add some error checking around pinning.
---
src/include/storage/bufmgr.h | 1 +
src/backend/storage/buffer/bufmgr.c | 42 ++++++++++++++++++++---------
2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6f4dfa09602..2c1f12d3066 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -120,6 +120,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
+extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6b953814812..dfb7cfe0ef3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1707,6 +1707,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
bool result;
PrivateRefCountEntry *ref;
+ Assert(!BufferIsLocal(b));
+
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)
@@ -1852,6 +1854,8 @@ UnpinBuffer(BufferDesc *buf)
PrivateRefCountEntry *ref;
Buffer b = BufferDescriptorGetBuffer(buf);
+ Assert(!BufferIsLocal(b));
+
/* not moving as we're likely deleting it soon anyway */
ref = GetPrivateRefCountEntry(b, false);
Assert(ref != NULL);
@@ -4209,6 +4213,25 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ GetPrivateRefCount(buffer));
+ }
+}
+
/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
@@ -4236,20 +4259,11 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- if (BufferIsLocal(buffer))
- {
- /* There should be exactly one pin */
- if (LocalRefCount[-buffer - 1] != 1)
- elog(ERROR, "incorrect local pin count: %d",
- LocalRefCount[-buffer - 1]);
- /* Nobody else to wait for */
- return;
- }
+ BufferCheckOneLocalPin(buffer);
- /* There should be exactly one local pin */
- if (GetPrivateRefCount(buffer) != 1)
- elog(ERROR, "incorrect local pin count: %d",
- GetPrivateRefCount(buffer));
+ /* Nobody else to wait for */
+ if (BufferIsLocal(buffer))
+ return;
bufHdr = GetBufferDescriptor(buffer - 1);
@@ -4757,6 +4771,8 @@ LockBufHdr(BufferDesc *desc)
SpinDelayStatus delayStatus;
uint32 old_buf_state;
+ Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
init_local_spin_delay(&delayStatus);
while (true)
--
2.38.0
v1-0003-hio-Release-extension-lock-before-initializing-pa.patch
From 0ee5910af08188ff894fb4f06bbe7907f9d7da86 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:28:06 -0700
Subject: [PATCH v1 03/12] hio: Release extension lock before initializing page
/ pinning VM
PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3
started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we
release the page lock. PageInit() zeroes the page, which isn't that cheap, so
deferring it until after the extension lock is released seems like a good idea.
Doing visibilitymap_pin() while holding the extension lock, introduced in
7db0cd2145f2, looks like an accident. Due to the restrictions on
HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems
better to move it out.
---
src/backend/access/heap/hio.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index b0ece666293..bc4035a6221 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -623,6 +623,13 @@ loop:
*/
buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ */
+ if (needLock)
+ UnlockRelationForExtension(relation, ExclusiveLock);
+
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
@@ -647,13 +654,6 @@ loop:
visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
/*
* Lock the other buffer. It's guaranteed to be of a lower page number
* than the new page. To conform with the deadlock prevent rules, we ought
--
2.38.0
v1-0004-Add-smgrzeroextend-FileZero-FileFallocate.patch
From ad8d317335092201682f9759cf88ee6382918f7d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:25:46 -0700
Subject: [PATCH v1 04/12] Add smgrzeroextend(), FileZero(), FileFallocate()
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, use FileZero() instead, as
using posix_fallocate() for small numbers of blocks is inefficient for some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/fd.h | 3 +
src/include/storage/md.h | 2 +
src/include/storage/smgr.h | 2 +
src/backend/storage/file/fd.c | 105 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 103 +++++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 21 +++++++
6 files changed, 236 insertions(+)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index c0a212487d9..3d309edb152 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -106,6 +106,9 @@ extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_i
extern int FileRead(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileWrite(File file, char *buffer, int amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
+extern int FileZero(File file, off_t offset, off_t len, uint32 wait_event_info);
+extern int FileFallocate(File file, off_t offset, off_t len, uint32 wait_event_info);
+
extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 10aa1b0109b..d0597f8a575 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,6 +28,8 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a07715356ba..503310e82ba 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 4151cafec54..47b620649aa 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -93,6 +93,7 @@
#include "common/pg_prng.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_iovec.h"
#include "portability/mem.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
@@ -2205,6 +2206,105 @@ FileSync(File file, uint32 wait_event_info)
return returnCode;
}
+/* So that FileZero() doesn't have to re-zero a block on every call */
+static const PGAlignedBlock zerobuf = {0};
+
+int
+FileZero(File file, off_t offset, off_t len, uint32 wait_event_info)
+{
+ int returnCode;
+ int numblocks;
+ struct iovec iov[PG_IOV_MAX];
+
+ /*
+ * FIXME: Quick-and-dirty implementation, to be replaced by
+ * pg_pwrite_zeros() from
+ * https://postgr.es/m/Y1oc%2BFjiyVjNZa%2BL%40paquier.xyz
+ *
+ * Otherwise it'd not at all be ok to rely on len being a multiple of
+ * BLCKSZ.
+ */
+ Assert((len % BLCKSZ) == 0);
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ numblocks = len / BLCKSZ;
+
+ for (int i = 0; i < Min(numblocks, lengthof(iov)); ++i)
+ {
+ iov[i].iov_base = (char *) zerobuf.data;
+ iov[i].iov_len = BLCKSZ;
+ }
+
+ while (numblocks > 0)
+ {
+ int iovcnt = Min(numblocks, lengthof(iov));
+ off_t seekpos_l = offset;
+ ssize_t ret;
+
+ pgstat_report_wait_start(wait_event_info);
+ ret = pg_pwritev_with_retry(VfdCache[file].fd, iov, iovcnt, seekpos_l);
+ pgstat_report_wait_end();
+
+ if (ret < 0)
+ return -1;
+
+ Assert(ret == iovcnt * BLCKSZ);
+ offset += iovcnt * BLCKSZ;
+ numblocks -= iovcnt;
+ }
+
+ return 0;
+}
+
+/*
+ * Try to reserve file space with posix_fallocate(). If posix_fallocate() is
+ * not implemented on the operating system or fails with EINVAL / EOPNOTSUPP,
+ * use FileZero() instead.
+ *
+ * Note that at least glibc() implements posix_fallocate() in userspace if not
+ * implemented by the filesystem. That's not the case for all environments
+ * though.
+ */
+int
+FileFallocate(File file, off_t offset, off_t len, uint32 wait_event_info)
+{
+ int returnCode;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+#ifdef HAVE_POSIX_FALLOCATE
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, len);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+ * Return in cases of a "real" failure, if fallocate is not supported,
+ * fall through to the FileZero() backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return returnCode;
+
+ if (returnCode == 0 ||
+ (returnCode != EINVAL && returnCode != EINVAL))
+ return returnCode;
+#endif
+
+ return FileZero(file, offset, len, wait_event_info);
+}
+
off_t
FileSize(File file)
{
@@ -2277,6 +2377,11 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file) < 0)
+ return -1;
+
+ FileAccess(file);
return VfdCache[file].fd;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index a515bb36ac1..eee3cae7c4e 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
+#include "common/file_utils.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -486,6 +487,108 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+ MdfdVec *v;
+ BlockNumber curblocknum = blocknum;
+ int remblocks = nblocks;
+
+ Assert(nblocks > 0);
+
+ /* This assert is too expensive to have on normally ... */
+#ifdef CHECK_WRITE_VS_EXTEND
+ Assert(blocknum >= mdnblocks(reln, forknum));
+#endif
+
+ /*
+ * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
+ * more --- we mustn't create a block whose number actually is
+ * InvalidBlockNumber or larger.
+ */
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend file \"%s\" beyond %u blocks",
+ relpath(reln->smgr_rlocator, forknum),
+ InvalidBlockNumber)));
+
+ while (remblocks > 0)
+ {
+ int segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+ int segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+ off_t seekpos = (off_t) BLCKSZ * segstartblock;
+ int numblocks;
+
+ if (segendblock > RELSEG_SIZE)
+ segendblock = RELSEG_SIZE;
+
+ numblocks = segendblock - segstartblock;
+
+ v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+
+ Assert(segstartblock < RELSEG_SIZE);
+ Assert(segendblock <= RELSEG_SIZE);
+
+ /*
+ * If available use posix_fallocate() to extend the relation. That's
+ * often more efficient than using write(), as it commonly won't cause
+ * the kernel to allocate page cache space for the extended pages.
+ *
+ * However, we shouldn't use fallocate() for small extensions, it
+ * defeats delayed allocation on some filesystems. Not clear where
+ * that decision should be made though? For now just use a cutoff of
+ * 8, anything between 4 and 8 worked OK in some local testing.
+ */
+ if (numblocks > 8)
+ {
+ int ret;
+
+ ret = FileFallocate(v->mdfd_vfd, seekpos,
+ (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret != 0)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space.")));
+ }
+ }
+ else
+ {
+ int ret;
+
+ /*
+ * Even if we don't have fallocate, we can still extend a bit more
+ * efficiently than writing each 8kB block individually.
+ * FileZero() uses pg_writev[with_retry] with a single zeroed
+ * buffer to avoid needing a zeroed buffer for the whole length of
+ * the extension.
+ */
+ ret = FileZero(v->mdfd_vfd, seekpos,
+ (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not extend file \"%s\": %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space.")));
+ }
+
+ if (!skipFsync && !SmgrIsTemp(reln))
+ register_dirty_segment(reln, forknum, v);
+
+ Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+ remblocks -= segendblock - segstartblock;
+ curblocknum += segendblock - segstartblock;
+ }
+}
+
/*
* mdopenfork() -- Open one fork of the specified relation.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c1a5febcbfd..6fb693cb062 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+ void (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
@@ -75,6 +77,7 @@ static const f_smgr smgrsw[] = {
.smgr_exists = mdexists,
.smgr_unlink = mdunlink,
.smgr_extend = mdextend,
+ .smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
.smgr_read = mdread,
.smgr_write = mdwrite,
@@ -507,6 +510,24 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
+void
+smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ int nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
+ nblocks, skipFsync);
+
+ /*
+ * Normally we expect this to increase nblocks by nblocks, but if the
+ * cached value isn't as expected, just invalidate it so the next call
+ * asks the kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+}
+
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
*
--
2.38.0
v1-0005-bufmgr-Add-Pin-UnpinLocalBuffer.patch
From a8804720e75a2cc8828091ea9eec967a2658fc3f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v1 05/12] bufmgr: Add Pin/UnpinLocalBuffer()
So far these were open-coded in quite a few places, without a good reason.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/buf_internals.h | 2 +
src/backend/storage/buffer/bufmgr.c | 30 +++----------
src/backend/storage/buffer/localbuf.c | 62 +++++++++++++++++----------
3 files changed, 46 insertions(+), 48 deletions(-)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 406db6be783..1cd6a968284 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -413,6 +413,8 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
+extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
+extern void UnpinLocalBuffer(Buffer buffer);
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dfb7cfe0ef3..a7342388a41 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -644,20 +644,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
/* Is it still valid and holding the right tag? */
if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
{
- /*
- * Bump buffer's ref and usage counts. This is equivalent of
- * PinBuffer for a shared buffer.
- */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner, recent_buffer);
+ PinLocalBuffer(bufHdr, true);
pgBufferUsage.local_blks_hit++;
@@ -1660,8 +1647,7 @@ ReleaseAndReadBuffer(Buffer buffer,
BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) &&
BufTagGetForkNum(&bufHdr->tag) == forkNum)
return buffer;
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[-buffer - 1]--;
+ UnpinLocalBuffer(buffer);
}
else
{
@@ -3938,15 +3924,9 @@ ReleaseBuffer(Buffer buffer)
elog(ERROR, "bad buffer ID: %d", buffer);
if (BufferIsLocal(buffer))
- {
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
-
- Assert(LocalRefCount[-buffer - 1] > 0);
- LocalRefCount[-buffer - 1]--;
- return;
- }
-
- UnpinBuffer(GetBufferDescriptor(buffer - 1));
+ UnpinLocalBuffer(buffer);
+ else
+ UnpinBuffer(GetBufferDescriptor(buffer - 1));
}
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 30d67d1c40d..e6349536a16 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -136,27 +136,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
#endif
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- /* this part is equivalent to PinBuffer for a shared buffer */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
- if (buf_state & BM_VALID)
- *foundPtr = true;
- else
- {
- /* Previous read attempt must have failed; try again */
- *foundPtr = false;
- }
+ *foundPtr = PinLocalBuffer(bufHdr, true);
return bufHdr;
}
@@ -193,9 +174,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
else
{
/* Found a usable buffer */
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
+ PinLocalBuffer(bufHdr, false);
break;
}
}
@@ -483,6 +462,43 @@ InitLocalBuffers(void)
NLocBuffer = nbufs;
}
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+ uint32 buf_state;
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ int bufid = -(buffer + 1);
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ if (LocalRefCount[bufid] == 0)
+ {
+ if (adjust_usagecount &&
+ BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
+ {
+ buf_state += BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+ }
+ LocalRefCount[bufid]++;
+ ResourceOwnerRememberBuffer(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf_hdr));
+
+ return buf_state & BM_VALID;
+}
+
+void
+UnpinLocalBuffer(Buffer buffer)
+{
+ int buffid = -buffer - 1;
+
+ Assert(BufferIsLocal(buffer));
+ Assert(LocalRefCount[buffid] > 0);
+
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
+ LocalRefCount[buffid]--;
+}
+
/*
* GUC check_hook for temp_buffers
*/
--
2.38.0
v1-0006-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
From 7e4a1c30a70a1c60fcf6cd505bf0c17463e7342b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:29:57 -0700
Subject: [PATCH v1 06/12] bufmgr: Acquire and clean victim buffer separately
Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations
needing to hold the extension lock while acquiring a victim buffer is
painful. By separating out the victim buffer acquisition, future commits will
be able to change relation extensions to scale better.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 570 ++++++++++++++------------
src/backend/storage/buffer/localbuf.c | 115 +++---
2 files changed, 381 insertions(+), 304 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a7342388a41..50550a8a70e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,6 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
+static Buffer GetVictimBuffer(BufferAccessStrategy strategy);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
@@ -1111,14 +1112,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
- uint32 oldFlags;
- int buf_id;
- BufferDesc *buf;
- bool valid;
- uint32 buf_state;
+ int existing_buf_id;
+
+ Buffer victim_buffer;
+ BufferDesc *victim_buf_hdr;
+ uint32 victim_buf_state;
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -1129,15 +1127,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- if (buf_id >= 0)
+ existing_buf_id = BufTableLookup(&newTag, newHash);
+ if (existing_buf_id >= 0)
{
+ BufferDesc *buf;
+ bool valid;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
* buffer pool, and check to see if the correct data has been loaded
* into the buffer.
*/
- buf = GetBufferDescriptor(buf_id);
+ buf = GetBufferDescriptor(existing_buf_id);
valid = PinBuffer(buf, strategy);
@@ -1174,266 +1175,96 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
- /* Loop here in case we have to try another victim buffer */
- for (;;)
+ /*
+ * Acquire a victim buffer. Somebody else might try to do the same, we
+ * don't hold any conflicting locks. If so we'll have to undo our work
+ * later.
+ */
+ victim_buffer = GetVictimBuffer(strategy);
+ victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);
+
+ /*
+ * Try to make a hashtable entry for the buffer under its new tag. If
+ * somebody else inserted another buffer for the tag, we'll release the
+ * victim buffer we acquired and use the already inserted one.
+ */
+ LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+ existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
+ if (existing_buf_id >= 0)
{
- /*
- * Ensure, while the spinlock's not yet held, that there's a free
- * refcount entry.
- */
- ReservePrivateRefCountEntry();
+ BufferDesc *existing_buf_hdr;
+ bool valid;
/*
- * Select a victim buffer. The buffer is returned with its header
- * spinlock still held!
+ * Got a collision. Someone has already done what we were about to
+ * do. We'll just handle this as if it were found in the buffer pool
+ * in the first place. First, give up the buffer we were planning to
+ * use.
+ *
+ * We could do this after releasing the partition lock, but then we'd
+ * have to call ResourceOwnerEnlargeBuffers() &
+ * ReservePrivateRefCountEntry() before acquiring the lock, for the
+ * rare case of such a collision.
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ UnpinBuffer(victim_buf_hdr);
- Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+ /* FIXME: Should we put the victim buffer onto the freelist? */
- /* Must copy buffer flags while we still hold the spinlock */
- oldFlags = buf_state & BUF_FLAG_MASK;
+ /* remaining code should match code at top of routine */
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
- /*
- * If the buffer was dirty, try to write it out. There is a race
- * condition here, in that someone might dirty it after we released it
- * above, or even while we are writing it out (since our share-lock
- * won't prevent hint-bit updates). We will recheck the dirty bit
- * after re-locking the buffer header.
- */
- if (oldFlags & BM_DIRTY)
- {
- /*
- * We need a share-lock on the buffer contents to write it out
- * (else we might write invalid data, eg because someone else is
- * compacting the page contents while we write). We must use a
- * conditional lock acquisition here to avoid deadlock. Even
- * though the buffer was not pinned (and therefore surely not
- * locked) when StrategyGetBuffer returned it, someone else could
- * have pinned and exclusive-locked it by the time we get here. If
- * we try to get the lock unconditionally, we'd block waiting for
- * them; if they later block waiting for us, deadlock ensues.
- * (This has been observed to happen when two backends are both
- * trying to split btree index pages, and the second one just
- * happens to be trying to split the page the first one got from
- * StrategyGetBuffer.)
- */
- if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
- LW_SHARED))
- {
- /*
- * If using a nondefault strategy, and writing the buffer
- * would require a WAL flush, let the strategy decide whether
- * to go ahead and write/reuse the buffer or to choose another
- * victim. We need lock to inspect the page LSN, so this
- * can't be done inside StrategyGetBuffer.
- */
- if (strategy != NULL)
- {
- XLogRecPtr lsn;
+ valid = PinBuffer(existing_buf_hdr, strategy);
- /* Read the LSN while holding buffer header lock */
- buf_state = LockBufHdr(buf);
- lsn = BufferGetLSN(buf);
- UnlockBufHdr(buf, buf_state);
-
- if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
- {
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(BufferDescriptorGetContentLock(buf));
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /* OK, do the I/O */
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
-
- FlushBuffer(buf, NULL);
- LWLockRelease(BufferDescriptorGetContentLock(buf));
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext,
- &buf->tag);
-
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
- }
- else
- {
- /*
- * Someone else has locked the buffer, so give it up and loop
- * back to get another one.
- */
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /*
- * To change the association of a valid buffer, we'll need to have
- * exclusive lock on both the old and new mapping partitions.
- */
- if (oldFlags & BM_TAG_VALID)
- {
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
- oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* remember we have no old-partition lock or tag */
- oldPartitionLock = NULL;
- /* keep the compiler quiet about uninitialized variables */
- oldHash = 0;
- }
-
- /*
- * Try to make a hashtable entry for the buffer under its new tag.
- * This could fail because while we were writing someone else
- * allocated another buffer for the same block we want to read in.
- * Note that we have not yet removed the hashtable entry for the old
- * tag.
- */
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
- if (buf_id >= 0)
- {
- /*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
- */
- UnpinBuffer(buf);
-
- /* Can give up that buffer's mapping partition lock now */
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = GetBufferDescriptor(buf_id);
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
- *foundPtr = true;
-
- if (!valid)
- {
- /*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed. We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID. StartBufferIO does it all.
- */
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer
- * must have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
- }
-
- return buf;
- }
-
- /*
- * Need to lock the buffer header too in order to change its tag.
- */
- buf_state = LockBufHdr(buf);
-
- /*
- * Somebody could have pinned or re-dirtied the buffer while we were
- * doing the I/O and making the new hashtable entry. If so, we can't
- * recycle this buffer; we must undo everything we've done and start
- * over with a new victim buffer.
- */
- oldFlags = buf_state & BUF_FLAG_MASK;
- if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
- break;
-
- UnlockBufHdr(buf, buf_state);
- BufTableDelete(&newTag, newHash);
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
+ /* Can release the mapping lock as soon as we've pinned it */
LWLockRelease(newPartitionLock);
- UnpinBuffer(buf);
+
+ *foundPtr = true;
+
+ if (!valid)
+ {
+ /*
+ * We can only get here if (a) someone else is still reading
+ * in the page, or (b) a previous read attempt failed. We
+ * have to wait for any active read attempt to finish, and
+ * then set up our own read attempt if the page is still not
+ * BM_VALID. StartBufferIO does it all.
+ */
+ if (StartBufferIO(existing_buf_hdr, true))
+ {
+ /*
+ * If we get here, previous attempts to read the buffer
+ * must have failed ... but we shall bravely try again.
+ */
+ *foundPtr = false;
+ }
+ }
+
+ return existing_buf_hdr;
}
/*
- * Okay, it's finally safe to rename the buffer.
- *
- * Clearing BM_VALID here is necessary, clearing the dirtybits is just
- * paranoia. We also reset the usage_count since any recency of use of
- * the old content is no longer relevant. (The usage_count starts out at
- * 1 so that the buffer can survive one clock-sweep pass.)
- *
+ * Need to lock the buffer header too in order to change its tag.
+ */
+ victim_buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
+ Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+
+ victim_buf_hdr->tag = newTag;
+
+ /*
* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*/
- buf->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
- BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
- BUF_USAGECOUNT_MASK);
+ victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
- buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
- else
- buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ victim_buf_state |= BM_PERMANENT;
- UnlockBufHdr(buf, buf_state);
-
- if (oldPartitionLock != NULL)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
+ UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);
@@ -1443,12 +1274,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* to read it before we did, so there's nothing left for BufferAlloc() to
* do.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(victim_buf_hdr, true))
*foundPtr = false;
else
*foundPtr = true;
- return buf;
+ return victim_buf_hdr;
}
/*
@@ -1557,6 +1388,239 @@ retry:
StrategyFreeBuffer(buf);
}
+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called with the buffer pinned, but without the buffer header
+ * spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+ uint32 buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+
+ /* can't change while we're holding the pin */
+ if (buf_state & BM_TAG_VALID)
+ {
+ uint32 hash;
+ LWLock *partition_lock;
+ BufferTag tag;
+
+ /* have buffer pinned, so it's safe to read tag without lock */
+ tag = buf_hdr->tag;
+
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ /* lock the buffer header */
+ buf_state = LockBufHdr(buf_hdr);
+
+ Assert(BufferTagsEqual(&buf_hdr->tag, &tag));
+
+ /*
+ * We have the buffer pinned, so nobody else should have been able to
+ * unset this concurrently.
+ */
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ /*
+ * If somebody else pinned the buffer since, or even worse, dirtied it,
+ * give up on this buffer: It's clearly in use.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) != 1 || (buf_state & BM_DIRTY))
+ {
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ UnlockBufHdr(buf_hdr, buf_state);
+ LWLockRelease(partition_lock);
+
+ return false;
+ }
+
+ /*
+ * Clear out the buffer's tag and flags and usagecount. We must do
+ * this to ensure that linear scans of the buffer array don't think
+ * the buffer is valid.
+ *
+ * XXX: This is a pre-existing comment I just moved, but isn't it
+ * entirely bogus with regard to the tag? We can't do anything with
+ * the buffer without taking BM_VALID / BM_TAG_VALID into
+ * account. Likely doesn't matter because we're already dirtying the
+ * cacheline, but still.
+ *
+ */
+ ClearBufferTag(&buf_hdr->tag);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ BufTableDelete(&tag, hash);
+
+ LWLockRelease(partition_lock);
+ }
+
+ Assert(!(buf_state & (BM_DIRTY | BM_VALID | BM_TAG_VALID)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf_hdr->state)) > 0);
+
+ return true;
+}
+
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy)
+{
+ Buffer cur_buf;
+ BufferDesc *cur_buf_hdr = NULL;
+ uint32 cur_buf_state;
+
+ /*
+ * Ensure, while the spinlock's not yet held, that there's a free
+ * refcount entry.
+ */
+ ReservePrivateRefCountEntry();
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+again:
+
+ /*
+ * Select a victim buffer. The buffer is returned with its header
+ * spinlock still held!
+ */
+ cur_buf_hdr = StrategyGetBuffer(strategy, &cur_buf_state);
+ cur_buf = BufferDescriptorGetBuffer(cur_buf_hdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(cur_buf_state) == 0);
+
+ /* Pin the buffer and then release the buffer spinlock */
+ PinBuffer_Locked(cur_buf_hdr);
+
+ /*
+ * We shouldn't have any other pins for this buffer.
+ */
+ BufferCheckOneLocalPin(cur_buf);
+
+ /*
+ * If the buffer was dirty, try to write it out. There is a race
+ * condition here, in that someone might dirty it after we released the
+ * buffer header lock above, or even while we are writing it out (since
+ * our share-lock won't prevent hint-bit updates). We will recheck the
+ * dirty bit after re-locking the buffer header.
+ */
+ if (cur_buf_state & BM_DIRTY)
+ {
+ LWLock *content_lock;
+
+ Assert(cur_buf_state & BM_TAG_VALID);
+ Assert(cur_buf_state & BM_VALID);
+
+ /*
+ * We need a share-lock on the buffer contents to write it out
+ * (else we might write invalid data, eg because someone else is
+ * compacting the page contents while we write). We must use a
+ * conditional lock acquisition here to avoid deadlock. Even
+ * though the buffer was not pinned (and therefore surely not
+ * locked) when StrategyGetBuffer returned it, someone else could
+ * have pinned and exclusive-locked it by the time we get here. If
+ * we try to get the lock unconditionally, we'd block waiting for
+ * them; if they later block waiting for us, deadlock ensues.
+ * (This has been observed to happen when two backends are both
+ * trying to split btree index pages, and the second one just
+ * happens to be trying to split the page the first one got from
+ * StrategyGetBuffer.)
+ */
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+ if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /*
+ * Someone else has locked the buffer, so give it up and loop
+ * back to get another one.
+ */
+ UnpinBuffer(cur_buf_hdr);
+ goto again;
+ }
+
+ /*
+ * If using a nondefault strategy, and writing the buffer would
+ * require a WAL flush, let the strategy decide whether to go ahead
+ * and write/reuse the buffer or to choose another victim. We need
+ * lock to inspect the page LSN, so this can't be done inside
+ * StrategyGetBuffer.
+ */
+ if (strategy != NULL)
+ {
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ cur_buf_state = LockBufHdr(cur_buf_hdr);
+ lsn = BufferGetLSN(cur_buf_hdr);
+ UnlockBufHdr(cur_buf_hdr, cur_buf_state);
+
+ if (XLogNeedsFlush(lsn)
+ && StrategyRejectBuffer(strategy, cur_buf_hdr))
+ {
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+ goto again;
+ }
+ }
+
+ /* OK, do the I/O */
+ /* FIXME: These used the wrong smgr before afaict? */
+ {
+ SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&cur_buf_hdr->tag),
+ InvalidBackendId);
+
+ TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(cur_buf_hdr->tag.forkNum,
+ cur_buf_hdr->tag.blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber);
+
+ FlushBuffer(cur_buf_hdr, smgr);
+ LWLockRelease(content_lock);
+
+ ScheduleBufferTagForWriteback(&BackendWritebackContext,
+ &cur_buf_hdr->tag);
+
+ TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(cur_buf_hdr->tag.forkNum,
+ cur_buf_hdr->tag.blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber);
+ }
+ }
+
+ /*
+ * If the buffer has an entry in the buffer mapping table, delete it. This
+ * can fail because another backend could have pinned or dirtied the
+ * buffer.
+ */
+ if (!InvalidateVictimBuffer(cur_buf_hdr))
+ {
+ UnpinBuffer(cur_buf_hdr);
+ goto again;
+ }
+
+ /* a final set of sanity checks */
+ cur_buf_state = pg_atomic_read_u32(&cur_buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(cur_buf_state) == 1);
+ Assert(!(cur_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+
+ BufferCheckOneLocalPin(cur_buf);
+
+ return cur_buf;
+}
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index e6349536a16..6a3f325ea6d 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -44,13 +44,14 @@ BufferDesc *LocalBufferDescriptors = NULL;
Block *LocalBufferBlockPointers = NULL;
int32 *LocalRefCount = NULL;
-static int nextFreeLocalBuf = 0;
+static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
+static Buffer GetLocalVictimBuffer(void);
/*
@@ -112,10 +113,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
BufferDesc *bufHdr;
- int b;
- int trycounter;
+ Buffer victim_buffer;
+ int bufid;
bool found;
- uint32 buf_state;
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -129,23 +129,51 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (hresult)
{
- b = hresult->id;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufid = hresult->id;
+ bufHdr = GetLocalBufferDescriptor(bufid);
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
-#endif
*foundPtr = PinLocalBuffer(bufHdr, true);
- return bufHdr;
+ }
+ else
+ {
+ uint32 buf_state;
+
+ victim_buffer = GetLocalVictimBuffer();
+ bufid = -(victim_buffer + 1);
+ bufHdr = GetLocalBufferDescriptor(bufid);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &newTag, HASH_ENTER, &found);
+ if (found) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ hresult->id = bufid;
+
+ /*
+ * it's all ours now.
+ */
+ bufHdr->tag = newTag;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+ *foundPtr = false;
}
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum,
- -nextFreeLocalBuf - 1);
-#endif
+ return bufHdr;
+}
+
+static Buffer
+GetLocalVictimBuffer(void)
+{
+ int victim_bufid;
+ int trycounter;
+ uint32 buf_state;
+ BufferDesc *bufHdr;
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
* Need to get a new buffer. We use a clock sweep algorithm (essentially
@@ -154,14 +182,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
trycounter = NLocBuffer;
for (;;)
{
- b = nextFreeLocalBuf;
+ victim_bufid = nextFreeLocalBufId;
- if (++nextFreeLocalBuf >= NLocBuffer)
- nextFreeLocalBuf = 0;
+ if (++nextFreeLocalBufId >= NLocBuffer)
+ nextFreeLocalBufId = 0;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufHdr = GetLocalBufferDescriptor(victim_bufid);
- if (LocalRefCount[b] == 0)
+ if (LocalRefCount[victim_bufid] == 0)
{
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -184,6 +212,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
errmsg("no empty local buffer available")));
}
+ /*
+ * lazy memory allocation: allocate space on first use of a buffer.
+ */
+ if (LocalBufHdrGetBlock(bufHdr) == NULL)
+ {
+ /* Set pointer for use by BufferGetBlock() macro */
+ LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
+ }
+
/*
* this buffer is not referenced but it might still be dirty. if that's
* the case, write it out before reusing it!
@@ -213,19 +250,12 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
}
/*
- * lazy memory allocation: allocate space on first use of a buffer.
- */
- if (LocalBufHdrGetBlock(bufHdr) == NULL)
- {
- /* Set pointer for use by BufferGetBlock() macro */
- LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
- }
-
- /*
- * Update the hash table: remove old entry, if any, and make new one.
+ * Remove the victim buffer from the hashtable and mark as invalid.
*/
if (buf_state & BM_TAG_VALID)
{
+ LocalBufferLookupEnt *hresult;
+
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &bufHdr->tag,
HASH_REMOVE, NULL);
@@ -233,28 +263,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
elog(ERROR, "local buffer hash table corrupted");
/* mark buffer invalid just in case hash insert fails */
ClearBufferTag(&bufHdr->tag);
- buf_state &= ~(BM_VALID | BM_TAG_VALID);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
}
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, (void *) &newTag, HASH_ENTER, &found);
- if (found) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- hresult->id = b;
-
- /*
- * it's all ours now.
- */
- bufHdr->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
- buf_state |= BM_TAG_VALID;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-
- *foundPtr = false;
- return bufHdr;
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
@@ -423,7 +436,7 @@ InitLocalBuffers(void)
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
- nextFreeLocalBuf = 0;
+ nextFreeLocalBufId = 0;
/* initialize fields that need to start off nonzero */
for (i = 0; i < nbufs; i++)
--
2.38.0
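
To make the new division of labor concrete: once GetVictimBuffer() has handed
back a pinned, clean, invalid buffer, only cheap work is left under the buffer
mapping partition lock. A rough sketch of that remaining step follows;
install_victim_sketch() is a made-up helper for illustration, not part of the
patch, and the real BufferAlloc() additionally pins and returns the
pre-existing buffer and starts IO on the installed one.

static bool
install_victim_sketch(BufferDesc *victim_buf_hdr, BufferTag tag,
                      char relpersistence, ForkNumber forkNum)
{
    uint32      hash = BufTableHashCode(&tag);
    LWLock     *partition_lock = BufMappingPartitionLock(hash);
    uint32      buf_state;

    LWLockAcquire(partition_lock, LW_EXCLUSIVE);

    if (BufTableInsert(&tag, hash, victim_buf_hdr->buf_id) >= 0)
    {
        /* somebody else created the buffer concurrently, hand the victim back */
        LWLockRelease(partition_lock);
        StrategyFreeBuffer(victim_buf_hdr);
        UnpinBuffer(victim_buf_hdr);
        return false;
    }

    /* cheap: set tag and flags, no IO and no WAL flush while the lock is held */
    buf_state = LockBufHdr(victim_buf_hdr);
    victim_buf_hdr->tag = tag;
    buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
    if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
        buf_state |= BM_PERMANENT;
    UnlockBufHdr(victim_buf_hdr, buf_state);

    LWLockRelease(partition_lock);
    return true;
}

That the expensive part (flushing WAL, writing out and zeroing the victim) no
longer overlaps with the mapping lock, nor with the relation extension lock in
the later patches, is the whole point of the split.
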
Attachment: v1-0007-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch (text/x-diff; charset=us-ascii)
From 40b53b7cfdbe8ff8090e31e593f82477f6a574aa Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v1 07/12] bufmgr: Support multiple in-progress IOs by using
resowner
---
src/include/storage/bufmgr.h | 2 +-
src/include/utils/resowner_private.h | 5 ++
src/backend/access/transam/xact.c | 4 +-
src/backend/postmaster/autovacuum.c | 1 -
src/backend/postmaster/bgwriter.c | 1 -
src/backend/postmaster/checkpointer.c | 1 -
src/backend/postmaster/walwriter.c | 1 -
src/backend/storage/buffer/bufmgr.c | 86 ++++++++++++---------------
src/backend/utils/resowner/resowner.c | 58 ++++++++++++++++++
9 files changed, 103 insertions(+), 56 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 2c1f12d3066..723243aeb96 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -167,7 +167,7 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern void AbortBufferIO(void);
+extern void AbortBufferIO(Buffer buffer);
extern void BufmgrCommit(void);
extern bool BgBufferSync(struct WritebackContext *wb_context);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index d01cccc27c1..61a72ed52f5 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -30,6 +30,11 @@ extern void ResourceOwnerEnlargeBuffers(ResourceOwner owner);
extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
+/* support for IO-in-progress management */
+extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
+extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
+extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
+
/* support for local lock management */
extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index fd5103a78e2..71b298d38fe 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2707,8 +2707,7 @@ AbortTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- /* Clean up buffer I/O and buffer context locks, too */
- AbortBufferIO();
+ /* Clean up buffer context locks, too */
UnlockBuffers();
/* Reset WAL record construction state */
@@ -5051,7 +5050,6 @@ AbortSubTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- AbortBufferIO();
UnlockBuffers();
/* Reset WAL record construction state */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 1e90b72b740..09b8986597a 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -526,7 +526,6 @@ AutoVacLauncherMain(int argc, char *argv[])
*/
LWLockReleaseAll();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
/* this is probably dead code, but let's be safe: */
if (AuxProcessResourceOwner)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 91e6f6ea18a..372cf21464e 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -167,7 +167,6 @@ BackgroundWriterMain(void)
*/
LWLockReleaseAll();
ConditionVariableCancelSleep();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 5fc076fc149..a3954e7cc0e 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -271,7 +271,6 @@ CheckpointerMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index beb46dcb55c..2bea4947c1b 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -163,7 +163,6 @@ WalWriterMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 50550a8a70e..b1404bd77dd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,10 +159,6 @@ int checkpoint_flush_after = 0;
int bgwriter_flush_after = 0;
int backend_flush_after = 0;
-/* local state for StartBufferIO and related functions */
-static BufferDesc *InProgressBuf = NULL;
-static bool IsForInput;
-
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
@@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();
@@ -4618,7 +4613,7 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
uint32 buf_state;
- Assert(!InProgressBuf);
+ ResourceOwnerEnlargeBufferIOs(CurrentResourceOwner);
for (;;)
{
@@ -4642,8 +4637,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
buf_state |= BM_IO_IN_PROGRESS;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = buf;
- IsForInput = forInput;
+ ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
return true;
}
@@ -4669,8 +4664,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
{
uint32 buf_state;
- Assert(buf == InProgressBuf);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
@@ -4682,13 +4675,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = NULL;
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
- * AbortBufferIO: Clean up any active buffer I/O after an error.
+ * AbortBufferIO: Clean up active buffer I/O after an error.
*
* All LWLocks we might have held have been released,
* but we haven't yet released buffer pins, so the buffer is still pinned.
@@ -4697,46 +4691,42 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
* possible the error condition wasn't related to the I/O.
*/
void
-AbortBufferIO(void)
+AbortBufferIO(Buffer buf)
{
- BufferDesc *buf = InProgressBuf;
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
- if (buf)
+ buf_state = LockBufHdr(buf_hdr);
+ Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+
+ if (!(buf_state & BM_VALID))
{
- uint32 buf_state;
-
- buf_state = LockBufHdr(buf);
- Assert(buf_state & BM_IO_IN_PROGRESS);
- if (IsForInput)
- {
- Assert(!(buf_state & BM_DIRTY));
-
- /* We'd better not think buffer is valid yet */
- Assert(!(buf_state & BM_VALID));
- UnlockBufHdr(buf, buf_state);
- }
- else
- {
- Assert(buf_state & BM_DIRTY);
- UnlockBufHdr(buf, buf_state);
- /* Issue notice if this is not the first failure... */
- if (buf_state & BM_IO_ERROR)
- {
- /* Buffer is pinned, so we can read tag without spinlock */
- char *path;
-
- path = relpathperm(BufTagGetRelFileLocator(&buf->tag),
- BufTagGetForkNum(&buf->tag));
- ereport(WARNING,
- (errcode(ERRCODE_IO_ERROR),
- errmsg("could not write block %u of %s",
- buf->tag.blockNum, path),
- errdetail("Multiple failures --- write error might be permanent.")));
- pfree(path);
- }
- }
- TerminateBufferIO(buf, false, BM_IO_ERROR);
+ Assert(!(buf_state & BM_DIRTY));
+ UnlockBufHdr(buf_hdr, buf_state);
}
+ else
+ {
+ Assert(buf_state & BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ /* Issue notice if this is not the first failure... */
+ if (buf_state & BM_IO_ERROR)
+ {
+ /* Buffer is pinned, so we can read tag without spinlock */
+ char *path;
+
+ path = relpathperm(BufTagGetRelFileLocator(&buf_hdr->tag),
+ BufTagGetForkNum(&buf_hdr->tag));
+ ereport(WARNING,
+ (errcode(ERRCODE_IO_ERROR),
+ errmsg("could not write block %u of %s",
+ buf_hdr->tag.blockNum, path),
+ errdetail("Multiple failures --- write error might be permanent.")));
+ pfree(path);
+ }
+ }
+
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR);
}
/*
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 37b43ee1f8c..62311831348 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
/* We have built-in support for remembering: */
ResourceArray bufferarr; /* owned buffers */
+ ResourceArray bufferioarr; /* in-progress buffer IO */
ResourceArray catrefarr; /* catcache references */
ResourceArray catlistrefarr; /* catcache-list pins */
ResourceArray relrefarr; /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
ResourceArrayInit(&(owner->bufferarr), BufferGetDatum(InvalidBuffer));
+ ResourceArrayInit(&(owner->bufferioarr), BufferGetDatum(InvalidBuffer));
ResourceArrayInit(&(owner->catrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->catlistrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->relrefarr), PointerGetDatum(NULL));
@@ -517,6 +519,24 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (phase == RESOURCE_RELEASE_BEFORE_LOCKS)
{
+ /*
+ * Abort failed buffer IO. AbortBufferIO()->TerminateBufferIO() calls
+ * ResourceOwnerForgetBufferIO(), so we just have to iterate till
+ * there are none.
+ *
+ * Needs to be before we release buffer pins.
+ *
+ * During a commit, there shouldn't be any in-progress IO.
+ */
+ while (ResourceArrayGetAny(&(owner->bufferioarr), &foundres))
+ {
+ Buffer res = DatumGetBuffer(foundres);
+
+ if (isCommit)
+ elog(PANIC, "lost track of buffer IO on buffer %u", res);
+ AbortBufferIO(res);
+ }
+
/*
* Release buffer pins. Note that ReleaseBuffer will remove the
* buffer entry from our array, so we just have to iterate till there
@@ -976,6 +996,44 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
buffer, owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * buffer array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeBufferIOs(ResourceOwner owner)
+{
+ /* We used to allow pinning buffers without a resowner, but no more */
+ Assert(owner != NULL);
+ ResourceArrayEnlarge(&(owner->bufferioarr));
+}
+
+/*
+ * Remember that a buffer IO is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeBufferIOs()
+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+
+/*
+ * Forget that a buffer IO is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ if (!ResourceArrayRemove(&(owner->bufferioarr), BufferGetDatum(buffer)))
+ elog(PANIC, "buffer IO %d is not owned by resource owner %s",
+ buffer, owner->name);
+}
+
/*
* Remember that a Local Lock is owned by a ResourceOwner
*
--
2.38.0
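
The caller-visible protocol around buffer IO does not change with this patch;
what changes is who cleans up after an error. Roughly, condensing
ReadBuffer_common()'s read path into a sketch (read_one_block_sketch() is just
an illustrative name, not code from the patch):

static void
read_one_block_sketch(SMgrRelation smgr, ForkNumber forkNum,
                      BlockNumber blockNum, BufferDesc *bufHdr)
{
    Block       bufBlock = BufHdrGetBlock(bufHdr);

    if (StartBufferIO(bufHdr, true))
    {
        /*
         * StartBufferIO() remembered the buffer in CurrentResourceOwner's
         * bufferioarr, so if smgrread() throws, resource owner release runs
         * AbortBufferIO() for this buffer, before the pin is released.
         */
        smgrread(smgr, forkNum, blockNum, (char *) bufBlock);

        /* drops the resource owner entry again and wakes any waiters */
        TerminateBufferIO(bufHdr, false, BM_VALID);
    }
}

Because in-progress IOs live in a resource owner array rather than in the
single InProgressBuf variable, a backend can have more than one IO in flight,
which is what both the bulk extension patch and the AIO work need.
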
Attachment: v1-0008-bufmgr-Move-relation-extension-handling-into-Bulk.patch (text/x-diff; charset=us-ascii)
From e5c1076a9961aba8175e723bd4cdda4bd438c288 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:44:02 -0700
Subject: [PATCH v1 08/12] bufmgr: Move relation extension handling into
[Bulk]ExtendRelationBuffered()
---
src/include/storage/buf_internals.h | 5 +
src/include/storage/bufmgr.h | 13 +
src/backend/storage/buffer/bufmgr.c | 546 ++++++++++++++++++--------
src/backend/storage/buffer/localbuf.c | 134 ++++++-
src/backend/utils/probes.d | 6 +-
doc/src/sgml/monitoring.sgml | 11 +-
6 files changed, 546 insertions(+), 169 deletions(-)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 1cd6a968284..34b28eea906 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -420,6 +420,11 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
+extern BlockNumber BulkExtendLocalRelationBuffered(SMgrRelation smgr,
+ ForkNumber fork,
+ ReadBufferMode mode,
+ uint32 *num_pages,
+ Buffer *buffers);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 723243aeb96..a9cb876c777 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -123,6 +123,19 @@ extern void IncrBufferRefCount(Buffer buffer);
extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
+extern Buffer ExtendRelationBuffered(Relation reln, struct SMgrRelationData *smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber forkNum, ReadBufferMode mode,
+ BufferAccessStrategy strategy);
+extern BlockNumber BulkExtendRelationBuffered(Relation rel, struct SMgrRelationData *smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers);
extern void InitBufferPoolAccess(void);
extern void AtEOXact_Buffers(bool isCommit);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b1404bd77dd..196b7d1ff25 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
@@ -459,6 +460,15 @@ static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
+static BlockNumber BulkExtendSharedRelationBuffered(Relation rel,
+ SMgrRelation smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers);
static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(BufferDesc *buf);
static void UnpinBuffer(BufferDesc *buf);
@@ -793,6 +803,73 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
mode, strategy, &hit);
}
+/*
+ * Convenience wrapper around BulkExtendRelationBuffered() extending by one
+ * block.
+ */
+Buffer
+ExtendRelationBuffered(Relation rel, struct SMgrRelationData *smgr,
+ bool skip_extension_lock,
+ char relpersistence, ForkNumber forkNum,
+ ReadBufferMode mode, BufferAccessStrategy strategy)
+{
+ Buffer buf;
+ uint32 num_pages = 1;
+
+ BulkExtendRelationBuffered(rel, smgr, skip_extension_lock, relpersistence,
+ forkNum, mode, strategy, &num_pages, num_pages, &buf);
+
+ return buf;
+}
+
+
+BlockNumber
+BulkExtendRelationBuffered(Relation rel,
+ SMgrRelation smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers)
+{
+ BlockNumber first_block;
+
+ Assert(rel != NULL || smgr != NULL);
+ Assert(rel != NULL || skip_extension_lock);
+
+ if (smgr == NULL)
+ smgr = RelationGetSmgr(rel);
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ *num_pages);
+
+ if (SmgrIsTemp(smgr))
+ first_block = BulkExtendLocalRelationBuffered(smgr,
+ fork, mode,
+ num_pages, buffers);
+ else
+ first_block = BulkExtendSharedRelationBuffered(rel, smgr,
+ skip_extension_lock, relpersistence,
+ fork, mode, strategy,
+ num_pages, num_locked_pages,
+ buffers);
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ *num_pages,
+ first_block);
+
+ return first_block;
+}
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -807,43 +884,32 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
- bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
*hit = false;
+ /*
+ * Backward compatibility path, most code should use
+ * ExtendRelationBuffered() instead, as acquiring the extension lock
+ * inside ExtendRelationBuffered() scales a lot better.
+ */
+ if (unlikely(blockNum == P_NEW))
+ return ExtendRelationBuffered(NULL, smgr, true, relpersistence, forkNum, mode, strategy);
+
/* Make sure we will have room to remember the buffer pin */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
- isExtend = (blockNum == P_NEW);
-
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend);
-
- /* Substitute proper block number if caller asked for P_NEW */
- if (isExtend)
- {
- blockNum = smgrnblocks(smgr, forkNum);
- /* Fail if relation is already at maximum possible length */
- if (blockNum == P_NEW)
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("cannot extend relation %s beyond %u blocks",
- relpath(smgr->smgr_rlocator, forkNum),
- P_NEW)));
- }
+ smgr->smgr_rlocator.backend);
if (isLocalBuf)
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
if (found)
pgBufferUsage.local_blks_hit++;
- else if (isExtend)
- pgBufferUsage.local_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.local_blks_read++;
@@ -858,8 +924,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
strategy, &found);
if (found)
pgBufferUsage.shared_blks_hit++;
- else if (isExtend)
- pgBufferUsage.shared_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.shared_blks_read++;
@@ -870,168 +934,88 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
- if (!isExtend)
- {
- /* Just need to update stats before we exit */
- *hit = true;
- VacuumPageHit++;
+ /* Just need to update stats before we exit */
+ *hit = true;
+ VacuumPageHit++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageHit;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageHit;
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend,
- found);
-
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
- * locked on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
-
- return BufferDescriptorGetBuffer(bufHdr);
- }
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ found);
/*
- * We get here only in the corner case where we are trying to extend
- * the relation but we found a pre-existing buffer marked BM_VALID.
- * This can happen because mdread doesn't complain about reads beyond
- * EOF (when zero_damaged_pages is ON) and so a previous attempt to
- * read a block beyond EOF could have left a "valid" zero-filled
- * buffer. Unfortunately, we have also seen this case occurring
- * because of buggy Linux kernels that sometimes return an
- * lseek(SEEK_END) result that doesn't account for a recent write. In
- * that situation, the pre-existing buffer would contain valid data
- * that we don't want to overwrite. Since the legitimate case should
- * always have left a zero-filled buffer, complain if not PageIsNew.
+ * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
+ * locked on return.
*/
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (!PageIsNew((Page) bufBlock))
- ereport(ERROR,
- (errmsg("unexpected data beyond EOF in block %u of relation %s",
- blockNum, relpath(smgr->smgr_rlocator, forkNum)),
- errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
-
- /*
- * We *must* do smgrextend before succeeding, else the page will not
- * be reserved by the kernel, and the next P_NEW call will decide to
- * return the same page. Clear the BM_VALID bit, do the StartBufferIO
- * call that BufferAlloc didn't, and proceed.
- */
- if (isLocalBuf)
+ if (!isLocalBuf)
{
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_EXCLUSIVE);
+ else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
}
- else
- {
- /*
- * Loop to handle the very small possibility that someone re-sets
- * BM_VALID between our clearing it and StartBufferIO inspecting
- * it.
- */
- do
- {
- uint32 buf_state = LockBufHdr(bufHdr);
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
- } while (!StartBufferIO(bufHdr, true));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
* if we have gotten to this point, we have allocated a buffer for the
* page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
* if it's a shared buffer.
- *
- * Note: if smgrextend fails, we will end up with a buffer that is
- * allocated but not marked BM_VALID. P_NEW will still select the same
- * block number (because the relation didn't get any longer on disk) and
- * so future attempts to extend the relation will find the same buffer (if
- * it's not been recycled) but come right back here to try smgrextend
- * again.
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (isExtend)
- {
- /* new buffers are zero-filled */
+ /*
+ * Read in the page, unless the caller intends to overwrite it and
+ * just wants us to allocate a buffer.
+ */
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
- /* don't set checksum for all-zero page */
- smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
-
- /*
- * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
- * although we're essentially performing a write. At least on linux
- * doing so defeats the 'delayed allocation' mechanism, leading to
- * increased file fragmentation.
- */
- }
else
{
- /*
- * Read in the page, unless the caller intends to overwrite it and
- * just wants us to allocate a buffer.
- */
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
- else
+ instr_time io_start,
+ io_time;
+
+ if (track_io_timing)
+ INSTR_TIME_SET_CURRENT(io_start);
+
+ smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+
+ if (track_io_timing)
{
- instr_time io_start,
- io_time;
+ INSTR_TIME_SET_CURRENT(io_time);
+ INSTR_TIME_SUBTRACT(io_time, io_start);
+ pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+ INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+ }
- if (track_io_timing)
- INSTR_TIME_SET_CURRENT(io_start);
-
- smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
-
- if (track_io_timing)
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
- INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_SUBTRACT(io_time, io_start);
- pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
- INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
+ MemSet((char *) bufBlock, 0, BLCKSZ);
}
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
}
}
@@ -1074,7 +1058,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
smgr->smgr_rlocator.backend,
- isExtend,
found);
return BufferDescriptorGetBuffer(bufHdr);
@@ -1617,6 +1600,251 @@ again:
return cur_buf;
}
+
+/*
+ * Limit the number of pins a batch operation may additionally acquire, to
+ * avoid running out of pinnable buffers.
+ *
+ * One additional pin is always allowed, as otherwise the operation likely
+ * cannot be performed at all.
+ *
+ * The number of allowed pins for a backend is computed based on
+ * shared_buffers and the maximum number of connections possible. That's very
+ * pessimistic, but outside of toy-sized shared_buffers it should allow
+ * sufficient pins.
+ */
+static void
+LimitAdditionalPins(uint32 *additional_pins)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ /*
+ * Subtract the approximate number of buffers already pinned by this
+ * backend. We get the number of "overflowed" pins for free, but don't
+ * know the number of pins in PrivateRefCountArray. The cost of
+ * calculating that exactly doesn't seem worth it, so just assume the max.
+ */
+ max_proportional_pins -= PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+ if (max_proportional_pins <= 0)
+ max_proportional_pins = 1;
+
+ if (*additional_pins > max_proportional_pins)
+ *additional_pins = max_proportional_pins;
+}
+
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+ SMgrRelation smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers)
+{
+ BlockNumber first_block;
+
+ LimitAdditionalPins(num_pages);
+
+ /*
+ * FIXME: limit num_pages / buffers based on NBuffers / MaxBackends or
+ * such. Also keep MAX_SIMUL_LWLOCKS in mind.
+ */
+
+ pgBufferUsage.shared_blks_written += *num_pages;
+
+ /*
+ * Acquire victim buffers for extension without holding the extension
+ * lock. Writing out victim buffers is the most expensive part of
+ * extending the relation, particularly when doing so requires WAL
+ * flushes. Zeroing out the buffers is also quite expensive, so do that
+ * before acquiring the extension lock as well.
+ *
+ * These pages are pinned by us and not valid. While we hold the pin
+ * they can't be acquired as victim buffers by another backend.
+ */
+ for (uint32 i = 0; i < *num_pages; i++)
+ {
+ Block buf_block;
+
+ buffers[i] = GetVictimBuffer(strategy);
+ buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ if (!skip_extension_lock)
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ first_block = smgrnblocks(smgr, fork);
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + *num_pages >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ /*
+ * Insert buffers into buffer table, mark as IO_IN_PROGRESS.
+ *
+ * This needs to happen before we extend the relation, because as soon as
+ * we do, other backends can start to read in those pages.
+ */
+ for (int i = 0; i < *num_pages; i++)
+ {
+ Buffer victim_buf = buffers[i];
+ BufferDesc *victim_buf_hdr = GetBufferDescriptor(victim_buf - 1);
+ BufferTag tag;
+ uint32 hash;
+ LWLock *partition_lock;
+ int existing_id;
+
+ InitBufferTag(&tag, &smgr->smgr_rlocator.locator, fork, first_block + i);
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ existing_id = BufTableInsert(&tag, hash, victim_buf_hdr->buf_id);
+
+ /*
+ * We get here only in the corner case where we are trying to extend
+ * the relation but we found a pre-existing buffer. This can happen
+ * because a prior attempt at extending the relation failed, and
+ * because mdread doesn't complain about reads beyond EOF (when
+ * zero_damaged_pages is ON) and so a previous attempt to read a block
+ * beyond EOF could have left a "valid" zero-filled buffer.
+ * Unfortunately, we have also seen this case occurring because of
+ * buggy Linux kernels that sometimes return an lseek(SEEK_END) result
+ * that doesn't account for a recent write. In that situation, the
+ * pre-existing buffer would contain valid data that we don't want to
+ * overwrite. Since the legitimate cases should always have left a
+ * zero-filled buffer, complain if not PageIsNew.
+ */
+ if (existing_id >= 0)
+ {
+ BufferDesc *existing_hdr = GetBufferDescriptor(existing_id);
+ Block buf_block;
+ bool valid;
+
+ /*
+ * Pin the existing buffer before releasing the partition lock,
+ * preventing it from being evicted.
+ */
+ valid = PinBuffer(existing_hdr, strategy);
+
+ LWLockRelease(partition_lock);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused;
+ * let it be found again quickly.
+ */
+ StrategyFreeBuffer(victim_buf_hdr);
+ UnpinBuffer(victim_buf_hdr);
+
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+ buf_block = BufHdrGetBlock(existing_hdr);
+
+ if (valid && !PageIsNew((Page) buf_block))
+ ereport(ERROR,
+ (errmsg("unexpected data beyond EOF in block %u of relation %s",
+ existing_hdr->tag.blockNum, relpath(smgr->smgr_rlocator, fork)),
+ errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
+
+ /*
+ * We *must* do smgr[zero]extend before succeeding, else the page
+ * will not be reserved by the kernel, and the next P_NEW call
+ * will decide to return the same page. Clear the BM_VALID bit,
+ * do StartBufferIO() and proceed.
+ *
+ * Loop to handle the very small possibility that someone re-sets
+ * BM_VALID between our clearing it and StartBufferIO inspecting
+ * it.
+ */
+ do
+ {
+ uint32 buf_state = LockBufHdr(existing_hdr);
+
+ buf_state &= ~BM_VALID;
+ UnlockBufHdr(existing_hdr, buf_state);
+ } while (!StartBufferIO(existing_hdr, true));
+ }
+ else
+ {
+ uint32 buf_state;
+
+ buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
+ UnlockBufHdr(victim_buf_hdr, buf_state);
+
+ LWLockRelease(partition_lock);
+
+ /* XXX: could combine StartBufferIO()'s locked operations with the above */
+ StartBufferIO(victim_buf_hdr, true);
+ }
+ }
+
+ /*
+ * Note: if smgrzeroextend fails, we will end up with buffers that are
+ * allocated but not marked BM_VALID. The next relation extension will
+ * still select the same block number (because the relation didn't get any
+ * longer on disk) and so future attempts to extend the relation will find
+ * the same buffers (if they have not been recycled) but come right back
+ * here to try smgrzeroextend again.
+ *
+ * We don't need to set checksum for all-zero pages.
+ */
+ smgrzeroextend(smgr, fork, first_block, *num_pages, false);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ *
+ * We remove IO_IN_PROGRESS after this, as waking up waiting backends
+ * can take noticeable time.
+ */
+ if (!skip_extension_lock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ for (int i = 0; i < *num_pages; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+
+ if (i < num_locked_pages &&
+ (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK))
+ LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+
+ TerminateBufferIO(buf_hdr, false, BM_VALID);
+ }
+
+ return first_block;
+
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6a3f325ea6d..5e93d8ba839 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -48,6 +48,9 @@ static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
+/* number of local buffers currently pinned (i.e. with LocalRefCount > 0) */
+static int NLocalPinnedBuffers = 0;
+
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
@@ -270,6 +273,132 @@ GetLocalVictimBuffer(void)
return BufferDescriptorGetBuffer(bufHdr);
}
+/* see LimitAdditionalPins() */
+static void
+LimitAdditionalLocalPins(uint32 *additional_pins)
+{
+ uint32 max_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ /*
+ * In contrast to LimitAdditionalPins(), other backends don't play a role
+ * here. We can allow up to NLocBuffer pins in total.
+ */
+ max_pins = (NLocBuffer - NLocalPinnedBuffers);
+
+ if (*additional_pins >= max_pins)
+ *additional_pins = max_pins;
+}
+
+BlockNumber
+BulkExtendLocalRelationBuffered(SMgrRelation smgr,
+ ForkNumber fork,
+ ReadBufferMode mode,
+ uint32 *num_pages,
+ Buffer *buffers)
+{
+ BlockNumber first_block;
+
+ /* Initialize local buffers if first request in this session */
+ if (LocalBufHash == NULL)
+ InitLocalBuffers();
+
+ LimitAdditionalLocalPins(num_pages);
+
+ pgBufferUsage.temp_blks_written += *num_pages;
+
+ for (uint32 i = 0; i < *num_pages; i++)
+ {
+ BufferDesc *buf_hdr;
+ Block buf_block;
+
+ buffers[i] = GetLocalVictimBuffer();
+ buf_hdr = GetLocalBufferDescriptor(-(buffers[i] + 1));
+ buf_block = LocalBufHdrGetBlock(buf_hdr);
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ first_block = smgrnblocks(smgr, fork);
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + *num_pages >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ for (int i = 0; i < *num_pages; i++)
+ {
+ int victim_buf_id;
+ BufferDesc *victim_buf_hdr;
+ BufferTag tag;
+ LocalBufferLookupEnt *hresult;
+ bool found;
+
+ victim_buf_id = -(buffers[i] + 1);
+ victim_buf_hdr = GetLocalBufferDescriptor(victim_buf_id);
+
+ InitBufferTag(&tag, &smgr->smgr_rlocator.locator, fork, first_block + i);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
+ if (found)
+ {
+ BufferDesc *existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ uint32 buf_state;
+
+ UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));
+
+ existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ PinLocalBuffer(existing_hdr, false);
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+
+ buf_state = pg_atomic_read_u32(&existing_hdr->state);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_DIRTY));
+ buf_state &= ~BM_VALID;
+ pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
+ }
+ else
+ {
+ uint32 buf_state = pg_atomic_read_u32(&victim_buf_hdr->state);
+
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&victim_buf_hdr->state, buf_state);
+
+ hresult->id = victim_buf_id;
+ }
+ }
+
+ /* actually extend relation */
+ smgrzeroextend(smgr, fork, first_block, *num_pages, false);
+
+ for (int i = 0; i < *num_pages; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr;
+ uint32 buf_state;
+
+ buf_hdr = GetLocalBufferDescriptor(-(buf + 1));
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ return first_block;
+}
+
/*
* MarkLocalBufferDirty -
* mark a local buffer dirty
@@ -486,6 +615,7 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
+ NLocalPinnedBuffers++;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
@@ -507,9 +637,11 @@ UnpinLocalBuffer(Buffer buffer)
Assert(BufferIsLocal(buffer));
Assert(LocalRefCount[buffid] > 0);
+ Assert(NLocalPinnedBuffers > 0);
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[buffid]--;
+ if (--LocalRefCount[buffid] == 0)
+ NLocalPinnedBuffers--;
}
/*
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 3ebbcf88ebe..12b4d995cfa 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -55,10 +55,12 @@ provider postgresql {
probe sort__start(int, bool, int, int, bool, int);
probe sort__done(bool, long);
- probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
- probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool);
+ probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+ probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
probe buffer__flush__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
+ probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, int);
+ probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, int, BlockNumber);
probe buffer__checkpoint__start(int);
probe buffer__checkpoint__sync__start();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5d622d5147..b58a30ef5a8 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7344,7 +7344,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
</row>
<row>
<entry><literal>buffer-read-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int)</literal></entry>
<entry>Probe that fires when a buffer read is started.
arg0 and arg1 contain the fork and block numbers of the page (but
arg1 will be -1 if this is a relation extension request).
@@ -7352,12 +7352,11 @@ FROM pg_stat_get_backend_idset() AS backendid;
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.</entry>
+ </entry>
</row>
<row>
<entry><literal>buffer-read-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
<entry>Probe that fires when a buffer read is complete.
arg0 and arg1 contain the fork and block numbers of the page (if this
is a relation extension request, arg1 now contains the block number
@@ -7366,9 +7365,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.
- arg7 is true if the buffer was found in the pool, false if not.</entry>
+ arg6 is true if the buffer was found in the pool, false if not.</entry>
</row>
<row>
<entry><literal>buffer-flush-start</literal></entry>
--
2.38.0
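
For reference, a rough sketch of how a bulk caller is expected to use the new
API; the function name, the batch size of 64 and the FSM handling are made up
for illustration, and the conversions in the next patch only use the
single-block ExtendRelationBuffered() wrapper:

static Buffer
extend_in_bulk_sketch(Relation rel, BufferAccessStrategy strategy)
{
    Buffer      buffers[64];
    uint32      num_pages = 64;

    /*
     * num_pages is in/out: BulkExtendRelationBuffered() may shrink it via
     * LimitAdditionalPins(), so look at its value after the call.  With
     * RBM_ZERO_AND_LOCK, only the first num_locked_pages buffers (here: 1)
     * are returned with their content lock held.
     */
    (void) BulkExtendRelationBuffered(rel, NULL,
                                      false,    /* do take the extension lock */
                                      rel->rd_rel->relpersistence,
                                      MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
                                      strategy,
                                      &num_pages,
                                      1,        /* num_locked_pages */
                                      buffers);

    /* buffers[0] is pinned and exclusively locked, ready to be initialized */

    /*
     * The remaining buffers are pinned but unlocked; a real caller would
     * record them in the FSM before releasing them.
     */
    for (uint32 i = 1; i < num_pages; i++)
        ReleaseBuffer(buffers[i]);

    return buffers[0];
}
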
Attachment: v1-0009-Convert-a-few-places-to-ExtendRelationBuffered.patch (text/x-diff; charset=us-ascii)
From 7a6abff9eb31cf812fdfa4804a853091a27c38e7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:18:18 -0700
Subject: [PATCH v1 09/12] Convert a few places to ExtendRelationBuffered
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/access/brin/brin.c | 13 ++++++----
src/backend/access/brin/brin_pageops.c | 4 +++
src/backend/access/brin/brin_revmap.c | 17 ++++--------
src/backend/access/gin/gininsert.c | 14 +++++-----
src/backend/access/gin/ginutil.c | 15 +++--------
src/backend/access/gin/ginvacuum.c | 8 ++++++
src/backend/access/gist/gist.c | 6 +++--
src/backend/access/gist/gistutil.c | 16 +++---------
src/backend/access/gist/gistvacuum.c | 3 +++
src/backend/access/nbtree/nbtpage.c | 36 ++++++++------------------
src/backend/access/nbtree/nbtree.c | 3 +++
src/backend/access/spgist/spgutils.c | 15 +++--------
contrib/bloom/blutils.c | 14 +++-------
13 files changed, 70 insertions(+), 94 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 20b7d65b948..27266b84b0f 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
- meta = ReadBuffer(index, P_NEW);
+ meta = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
- LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);
@@ -896,9 +898,10 @@ brinbuildempty(Relation index)
Buffer metabuf;
/* An empty BRIN index has a metapage only. */
- metabuf =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metabuf = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
/* Initialize and xlog metabuffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index f17aad51b63..8525d6be918 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -730,6 +730,10 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* There's not enough free space in any existing index page,
* according to the FSM: extend the relation to obtain a shiny new
* page.
+ *
+ * XXX: It's likely possible to use RBM_ZERO_AND_LOCK here,
+ * which'd avoid the need to hold the extension lock during buffer
+ * reclaim.
*/
if (!RELATION_IS_LOCAL(irel))
{
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 6e392a551ad..8b7c72f38d3 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -538,7 +538,6 @@ revmap_physical_extend(BrinRevmap *revmap)
BlockNumber mapBlk;
BlockNumber nblocks;
Relation irel = revmap->rm_irel;
- bool needLock = !RELATION_IS_LOCAL(irel);
/*
* Lock the metapage. This locks out concurrent extensions of the revmap,
@@ -570,10 +569,10 @@ revmap_physical_extend(BrinRevmap *revmap)
}
else
{
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buf = ReadBuffer(irel, P_NEW);
+ buf = ExtendRelationBuffered(irel, NULL, false,
+ irel->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
if (BufferGetBlockNumber(buf) != mapBlk)
{
/*
@@ -582,17 +581,11 @@ revmap_physical_extend(BrinRevmap *revmap)
* up and have caller start over. We will have to evacuate that
* page from under whoever is using it.
*/
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buf);
+ UnlockReleaseBuffer(buf);
return;
}
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buf);
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index ea1c4184fbf..795efa199b0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -440,12 +440,14 @@ ginbuildempty(Relation index)
MetaBuffer;
/* An empty GIN index has two pages. */
- MetaBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- RootBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ MetaBuffer = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
+ RootBuffer = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
/* Initialize and xlog metabuffer and root buffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 6df7f2eaeba..12a7b6d03e1 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -298,7 +298,6 @@ Buffer
GinNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -326,16 +325,10 @@ GinNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, GIN_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendRelationBuffered(index, NULL, false,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index b4fa5f6bf81..8ec7b399287 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -736,6 +736,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
needLock = !RELATION_IS_LOCAL(index);
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
npages = RelationGetNumberOfBlocks(index);
@@ -786,6 +790,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
stats->num_pages = RelationGetNumberOfBlocks(index);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 30069f139c7..014abdc53c0 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -133,8 +133,10 @@ gistbuildempty(Relation index)
Buffer buffer;
/* Initialize the root page */
- buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
/* Initialize and xlog buffer */
START_CRIT_SECTION();
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 1532462317b..22d8c49b3f7 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -824,7 +824,6 @@ Buffer
gistNewBuffer(Relation r)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -877,17 +876,10 @@ gistNewBuffer(Relation r)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(r);
-
- if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
-
- buffer = ReadBuffer(r, P_NEW);
- LockBuffer(buffer, GIST_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ buffer = ExtendRelationBuffered(r, NULL, false,
+ r->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 0aa6e58a62f..d19afc1e4b1 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -203,6 +203,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* we must already have processed any tuples due to be moved into such a
* page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 8b96708b3ea..b4367b73630 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -881,7 +881,6 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
else
{
- bool needLock;
Page page;
Assert(access == BT_WRITE);
@@ -962,31 +961,18 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
/*
- * Extend the relation by one page.
- *
- * We have to use a lock to ensure no one else is extending the rel at
- * the same time, else we will both try to initialize the same new
- * page. We can skip locking for new or temp relations, however,
- * since no one else could be accessing them.
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
+ * we risk a race condition against btvacuumscan --- see comments
+ * therein. This forces us to repeat the valgrind request that
+ * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
+ * without introducing a race.
*/
- needLock = !RELATION_IS_LOCAL(rel);
-
- if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
-
- buf = ReadBuffer(rel, P_NEW);
-
- /* Acquire buffer lock on new page */
- _bt_lockbuf(rel, buf, BT_WRITE);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to
- * extend the relation some more. Note that we cannot release this
- * lock before we have buffer lock on the new page, or we risk a race
- * condition against btvacuumscan --- see comments therein.
- */
- if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ buf = ExtendRelationBuffered(rel, NULL, false,
+ rel->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b52eca8f38b..904ec32354f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -969,6 +969,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* write-lock on the left page before it adds a right page, so we must
* already have processed any tuples due to be moved into such a page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 2c661fcf96f..bd0993a0501 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -365,7 +365,6 @@ Buffer
SpGistNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -405,16 +404,10 @@ SpGistNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendRelationBuffered(index, NULL, false,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a434cf93efd..39b5ad47e18 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -353,7 +353,6 @@ Buffer
BloomNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -387,15 +386,10 @@ BloomNewBuffer(Relation index)
}
/* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendRelationBuffered(index, NULL, false,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
--
2.38.0
v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch (text/x-diff; charset=us-ascii)
From 9882dcba636ce45443b0a377efbc87d8a069853d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v1 10/12] heapam: Add num_pages to RelationGetBufferForTuple()
This will be useful to compute the number of pages to extend a relation by.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/access/hio.h | 14 +++++++-
src/backend/access/heap/heapam.c | 60 +++++++++++++++++++++++++++++---
src/backend/access/heap/hio.c | 8 ++++-
3 files changed, 76 insertions(+), 6 deletions(-)
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index bb90c6fad81..a7326e28d9a 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -30,6 +30,17 @@ typedef struct BulkInsertStateData
{
BufferAccessStrategy strategy; /* our BULKWRITE strategy object */
Buffer current_buf; /* current insertion target page */
+
+ /*
+ * State for bulk extensions. Further pages that were unused at the time
+ * of the extension. They might be in use by the time we use them though,
+ * so rechecks are needed.
+ *
+ * FIXME: Perhaps these should live in RelationData instead, alongside the
+ * targetblock?
+ */
+ BlockNumber next_free;
+ BlockNumber last_free;
} BulkInsertStateData;
@@ -38,6 +49,7 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertStateData *bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages);
#endif /* HIO_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index bd4d85041d3..51e3a9fff34 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1977,6 +1977,8 @@ GetBulkInsertState(void)
bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
bistate->current_buf = InvalidBuffer;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
return bistate;
}
@@ -2050,7 +2052,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ 0);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2253,6 +2256,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
return tup;
}
+/*
+ * Helper for heap_multi_insert() that computes the number of full pages
+ * needed to insert the remaining tuples.
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+ size_t page_avail;
+ int npages = 0;
+
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ npages++;
+
+ for (int i = done; i < ntuples; i++)
+ {
+ size_t tup_sz = sizeof(ItemIdData) + MAXALIGN(heaptuples[i]->t_len);
+
+ if (page_avail < tup_sz)
+ {
+ npages++;
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ }
+ page_avail -= tup_sz;
+ }
+
+ return npages;
+}
+
/*
* heap_multi_insert - insert multiple tuples into a heap
*
@@ -2279,6 +2309,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ bool starting_with_empty_page = false;
+ int npages = 0;
+ int npages_used = 0;
/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2329,13 +2362,30 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- bool starting_with_empty_page;
bool all_visible_cleared = false;
bool all_frozen_set = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
+ /*
+ * Compute number of pages needed to insert tuples in the worst
+ * case. This will be used to determine how much to extend the
+ * relation by in RelationGetBufferForTuple(), if needed. If we
+ * filled a prior page from scratch, we can just update our last
+		 * computation, but if we started with a partially filled page we
+		 * recompute from scratch: the number of potentially required pages
+		 * can vary due to tuples needing to fit onto the page, page headers,
+ * etc.
+ */
+ if (ndone == 0 || !starting_with_empty_page)
+ {
+ npages = heap_multi_insert_pages(heaptuples, ndone, ntuples, saveFreeSpace);
+ npages_used = 0;
+ }
+ else
+ npages_used++;
+
/*
* Find buffer where at least the next tuple will fit. If the page is
* all-visible, this will also pin the requisite visibility map page.
@@ -2345,7 +2395,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ npages - npages_used);
page = BufferGetPage(buffer);
starting_with_empty_page = PageGetMaxOffsetNumber(page) == 0;
@@ -3768,7 +3819,8 @@ l2:
/* It doesn't fit, must use RelationGetBufferForTuple. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ 0);
/* We're all done. */
break;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index bc4035a6221..0d43ebb3877 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -275,6 +275,11 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
* Returns pinned and exclusive-locked buffer of a page in given relation
* with free space >= given len.
*
+ * If num_pages is > 1, the relation will be extended by at least that many
+ * pages when we decide to extend the relation. This is more efficient for
+ * callers that know they will need multiple pages
+ * (e.g. heap_multi_insert()).
+ *
* If otherBuffer is not InvalidBuffer, then it references a previously
* pinned buffer of another page in the same relation; on return, this
* buffer will also be exclusive-locked. (This case is used by heap_update;
@@ -333,7 +338,8 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
--
2.38.0
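As a cross-check of heap_multi_insert_pages() above (my arithmetic, not part of
the patch): with BLCKSZ = 8192, SizeOfPageHeaderData = 24 and the default
fillfactor (saveFreeSpace = 0), each page starts with page_avail = 8168 bytes.
A tuple with t_len = 100 costs sizeof(ItemIdData) + MAXALIGN(100) = 4 + 104 =
108 bytes, so 75 such tuples fit per page, and a batch of 1000 of them yields
npages = ceil(1000 / 75) = 14. That is then the basis for the num_pages hint
heap_multi_insert() passes down to RelationGetBufferForTuple().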
v1-0011-hio-Use-BulkExtendRelationBuffered.patch (text/x-diff; charset=us-ascii)
From 23d7d9173e12f7105047c424836f57347c8fb15f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v1 11/12] hio: Use BulkExtendRelationBuffered()
---
src/backend/access/heap/hio.c | 181 +++++++++++++++++++++++++++++++---
1 file changed, 170 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 0d43ebb3877..51fadc9ac4c 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -24,6 +24,7 @@
#include "storage/lmgr.h"
#include "storage/smgr.h"
+#define NEW_EXTEND
/*
* RelationPutHeapTuple - place tuple at specified page
@@ -185,6 +186,8 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
}
}
+#ifndef NEW_EXTEND
+
/*
* Extend a relation by multiple blocks to avoid future contention on the
* relation extension lock. Our goal is to pre-extend the relation by an
@@ -268,6 +271,7 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
*/
FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
}
+#endif
/*
* RelationGetBufferForTuple
@@ -354,6 +358,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
len = MAXALIGN(len); /* be conservative */
+ if (num_pages <= 0)
+ num_pages = 1;
+
/* Bulk insert is not supported for updates, only inserts. */
Assert(otherBuffer == InvalidBuffer || !bistate);
@@ -558,18 +565,46 @@ loop:
ReleaseBuffer(buffer);
}
- /* Without FSM, always fall out of the loop and extend */
- if (!use_fsm)
- break;
+ if (bistate
+ && bistate->next_free != InvalidBlockNumber
+ && bistate->next_free <= bistate->last_free)
+ {
+ /*
+ * We bulk extended the relation before, and there are still some
+ * unused pages from that extension, so we don't need to look in
+ * the FSM for a new page. But do record the free space from the
+ * last page, somebody might insert narrower tuples later.
+ */
+ if (use_fsm)
+ RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
- /*
- * Update FSM as to condition of this page, and ask for another page
- * to try.
- */
- targetBlock = RecordAndGetPageWithFreeSpace(relation,
- targetBlock,
- pageFreeSpace,
- targetFreeSpace);
+ Assert(bistate->last_free != InvalidBlockNumber &&
+ bistate->next_free <= bistate->last_free);
+ targetBlock = bistate->next_free;
+ if (bistate->next_free >= bistate->last_free)
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ else
+ bistate->next_free++;
+ }
+ else if (!use_fsm)
+ {
+ /* Without FSM, always fall out of the loop and extend */
+ break;
+ }
+ else
+ {
+ /*
+ * Update FSM as to condition of this page, and ask for another page
+ * to try.
+ */
+ targetBlock = RecordAndGetPageWithFreeSpace(relation,
+ targetBlock,
+ pageFreeSpace,
+ targetFreeSpace);
+ }
}
/*
@@ -582,6 +617,129 @@ loop:
*/
needLock = !RELATION_IS_LOCAL(relation);
+#ifdef NEW_EXTEND
+ {
+#define MAX_BUFFERS 64
+ Buffer victim_buffers[MAX_BUFFERS];
+ BlockNumber firstBlock = InvalidBlockNumber;
+ BlockNumber firstBlockFSM = InvalidBlockNumber;
+ BlockNumber curBlock;
+ uint32 extend_by_pages;
+ uint32 no_fsm_pages;
+ uint32 waitcount;
+
+ extend_by_pages = num_pages;
+
+ /*
+ * Multiply the number of pages to extend by the number of waiters. Do
+ * this even if we're not using the FSM, as it does relieve
+ * contention. Pages will be found via bistate->next_free.
+ */
+ if (needLock)
+ waitcount = RelationExtensionLockWaiterCount(relation);
+ else
+ waitcount = 0;
+ extend_by_pages += extend_by_pages * waitcount;
+
+ /*
+ * can't extend by more than MAX_BUFFERS, we need to pin them all
+ * concurrently. FIXME: Need an NBuffers / MaxBackends type limit
+ * here.
+ */
+ extend_by_pages = Min(extend_by_pages, MAX_BUFFERS);
+
+ /*
+ * How many of the extended pages not to enter into the FSM.
+ *
+ * Only enter pages that we don't need ourselves into the
+ * FSM. Otherwise every other backend will immediately try to use the
+	 * pages this backend needs itself, causing unnecessary contention.
+ *
+	 * Bulk extended pages are remembered in bistate->next_free. So
+ * without a bistate we can't directly make use of them.
+ *
+ * Never enter the page returned into the FSM, we'll immediately use
+ * it.
+ */
+ if (num_pages > 1 && bistate == NULL)
+ no_fsm_pages = 1;
+ else
+ no_fsm_pages = num_pages;
+
+ if (bistate && bistate->current_buf != InvalidBuffer)
+ {
+ ReleaseBuffer(bistate->current_buf);
+ bistate->current_buf = InvalidBuffer;
+ }
+
+ firstBlock = BulkExtendRelationBuffered(relation,
+ NULL,
+ false,
+ relation->rd_rel->relpersistence,
+ MAIN_FORKNUM,
+ RBM_ZERO_AND_LOCK,
+ bistate ? bistate->strategy : NULL,
+ &extend_by_pages,
+ 1,
+ victim_buffers);
+ /*
+ * Relation is now extended. Make all but the first buffer available
+ * to other backends.
+ *
+ * XXX: We don't necessarily need to release pin / update FSM while
+ * holding the extension lock. But there are some advantages.
+ */
+ curBlock = firstBlock;
+ for (uint32 i = 0; i < extend_by_pages; i++, curBlock++)
+ {
+ Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
+ Assert(BlockNumberIsValid(curBlock));
+
+ /* don't release the pin on the page returned by this function */
+ if (i > 0)
+ ReleaseBuffer(victim_buffers[i]);
+
+ if (i >= no_fsm_pages && use_fsm)
+ {
+ if (firstBlockFSM == InvalidBlockNumber)
+ firstBlockFSM = curBlock;
+
+ RecordPageWithFreeSpace(relation,
+ curBlock,
+ BufferGetPageSize(victim_buffers[i]) - SizeOfPageHeaderData);
+ }
+ }
+
+ if (use_fsm && firstBlockFSM != InvalidBlockNumber)
+ FreeSpaceMapVacuumRange(relation, firstBlockFSM, firstBlock + num_pages);
+
+ if (bistate)
+ {
+ if (extend_by_pages > 1)
+ {
+ bistate->next_free = firstBlock + 1;
+ bistate->last_free = firstBlock + extend_by_pages - 1;
+ }
+ else
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ }
+
+ buffer = victim_buffers[0];
+ if (bistate)
+ {
+ IncrBufferRefCount(buffer);
+ bistate->current_buf = buffer;
+ }
+#if 0
+ ereport(LOG, errmsg("block start %u, size %zu, requested pages: %u, extend_by_pages: %d, waitcount: %d",
+ firstBlock, len, num_pages, extend_by_pages, waitcount),
+ errhidestmt(true), errhidecontext(true));
+#endif
+ }
+#else
/*
* If we need the lock but are not able to acquire it immediately, we'll
* consider extending the relation by multiple blocks at a time to manage
@@ -635,6 +793,7 @@ loop:
*/
if (needLock)
UnlockRelationForExtension(relation, ExclusiveLock);
+#endif
/*
* We need to initialize the empty new page. Double-check that it really
--
2.38.0
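To put numbers on the sizing heuristic in the new hio.c code (my arithmetic,
not from the patch): if heap_multi_insert() asks for num_pages = 4 and
RelationExtensionLockWaiterCount() reports 10 waiters, extend_by_pages becomes
4 + 4 * 10 = 44, which is under the MAX_BUFFERS cap of 64, so the relation is
extended by 44 blocks in one go. The first block is returned to the caller, the
remaining 43 are remembered via bistate->next_free / last_free, and the last 40
of them (those past no_fsm_pages = 4) are also advertised in the FSM.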
v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch (text/x-diff; charset=us-ascii)
From be82f60f1a8097d042674678e177e6c28a4ebdfe Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:41:46 -0700
Subject: [PATCH v1 12/12] bufmgr: debug: Add PrintBuffer[Desc]
Useful for development. Perhaps we should polish these and keep them?
---
src/include/storage/buf_internals.h | 3 ++
src/backend/storage/buffer/bufmgr.c | 49 +++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 34b28eea906..337e94e4b4e 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -390,6 +390,9 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
+extern void PrintBuffer(Buffer buffer, const char *msg);
+extern void PrintBufferDesc(BufferDesc *buf_hdr, const char *msg);
+
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 196b7d1ff25..a77a1205cf8 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3723,6 +3723,55 @@ DropDatabaseBuffers(Oid dbid)
* use only.
* -----------------------------------------------------------------
*/
+
+#include "utils/memutils.h"
+
+void
+PrintBufferDesc(BufferDesc *buf_hdr, const char *msg)
+{
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ uint32 buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ char *path = "";
+ BlockNumber blockno = InvalidBlockNumber;
+
+ CurrentMemoryContext->allowInCritSection = true;
+ if (buf_state & BM_TAG_VALID)
+ {
+ path = relpathbackend(BufTagGetRelFileLocator(&buf_hdr->tag),
+ InvalidBackendId, BufTagGetForkNum(&buf_hdr->tag));
+ blockno = buf_hdr->tag.blockNum;
+ }
+
+ fprintf(stderr, "%d: [%u] msg: %s, rel: %s, block %u: refcount: %u / %u, usagecount: %u, flags:%s%s%s%s%s%s%s%s%s%s\n",
+ MyProcPid,
+ buffer,
+ msg,
+ path,
+ blockno,
+ BUF_STATE_GET_REFCOUNT(buf_state),
+ GetPrivateRefCount(buffer),
+ BUF_STATE_GET_USAGECOUNT(buf_state),
+ buf_state & BM_LOCKED ? " BM_LOCKED" : "",
+ buf_state & BM_DIRTY ? " BM_DIRTY" : "",
+ buf_state & BM_VALID ? " BM_VALID" : "",
+ buf_state & BM_TAG_VALID ? " BM_TAG_VALID" : "",
+ buf_state & BM_IO_IN_PROGRESS ? " BM_IO_IN_PROGRESS" : "",
+ buf_state & BM_IO_ERROR ? " BM_IO_ERROR" : "",
+ buf_state & BM_JUST_DIRTIED ? " BM_JUST_DIRTIED" : "",
+ buf_state & BM_PIN_COUNT_WAITER ? " BM_PIN_COUNT_WAITER" : "",
+ buf_state & BM_CHECKPOINT_NEEDED ? " BM_CHECKPOINT_NEEDED" : "",
+ buf_state & BM_PERMANENT ? " BM_PERMANENT" : ""
+ );
+}
+
+void
+PrintBuffer(Buffer buffer, const char *msg)
+{
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);
+
+ PrintBufferDesc(buf_hdr, msg);
+}
+
#ifdef NOT_USED
void
PrintBufferDescs(void)
--
2.38.0
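For what it's worth, this is roughly how I'd expect the helpers to be used
while chasing a buffer-state problem (a sketch, not part of the patch; rel and
blkno stand in for whatever is under investigation). They can also be called
from gdb, e.g. call PrintBuffer(42, "gdb").

	Buffer		buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
										 RBM_NORMAL, NULL);

	PrintBuffer(buf, "after read");		/* refcounts, usagecount, flags */
	LockBuffer(buf, BUFFER_LOCK_SHARE);
	PrintBuffer(buf, "after share lock");
	UnlockReleaseBuffer(buf);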
On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.
Thanks a lot for this great work. There are 12 patches in this thread,
I believe each of these patches is trying to solve separate problems
and can be reviewed and get committed separately, am I correct?
In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.
Yes, I too have observed this in the past for parallel inserts in CTAS
work - /messages/by-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk+KD2H7ZmRg@mail.gmail.com.
Tackling bulk relation extension problems will unblock the parallel
inserts (in CTAS, COPY) work I believe.
The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer, if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents. Call smgrextend().
Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:
1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS
Makes sense.
I will try to understand and review each patch separately.
Firstly, 0001 avoids an extra loop over the waiters and looks like a
reasonable change; some comments on the patch:
1)
+ int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if
+ * waiting to be woken */
Use macros instead of hard-coded values for better readability?
#define PROC_LW_LOCK_NOT_WAITING 0
#define PROC_LW_LOCK_ON_WAITLIST 1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2
2) The initialization of lwWaiting in twophase.c and proc.c still assigns
false; it should be 0 or the macro (see the sketch after this list).
proc->lwWaiting = false;
MyProc->lwWaiting = false;
3)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+ found = true;
I guess 'found' is a bit meaningless here as we are doing away with
the proclist_foreach_modify loop. We can directly use
MyProc->lwWaiting == 1 and remove 'found'.
4)
if (!MyProc->lwWaiting)
if (!proc->lwWaiting)
Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
1 or PROC_LW_LOCK_ON_WAITLIST or the macro?
5) Is there any specific test case where I can see the benefit of this
patch? If yes, can you please share it here?
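To make 1) and 2) concrete, something along these lines (just a sketch; the
macro names are the ones proposed above and don't exist anywhere yet):

#define PROC_LW_LOCK_NOT_WAITING			0
#define PROC_LW_LOCK_ON_WAITLIST			1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN	2

	/* e.g. in InitProcess() and MarkAsPreparingGuts(), instead of "= false" */
	MyProc->lwWaiting = PROC_LW_LOCK_NOT_WAITING;
	proc->lwWaiting = PROC_LW_LOCK_NOT_WAITING;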
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,
On 2022-10-29 18:33:53 +0530, Bharath Rupireddy wrote:
On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.
Thanks a lot for this great work. There are 12 patches in this thread,
I believe each of these patches is trying to solve separate problems
and can be reviewed and get committed separately, am I correct?
Mostly, yes.
For 0001 I already started
/messages/by-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
to discuss the specific issue.
We don't strictly need v1-0002-aio-Add-some-error-checking-around-pinning.patch
but I did find it useful.
v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch is not used in the patch
series, but I found it quite useful when debugging issues with the patch. A
heck of a lot easier to interpret page flags when they can be printed.
I also think there's some architectural questions that'll influence the number
of patches. E.g. I'm not convinced
v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch is quite the
right spot to track which additional pages should be used. It could very well
instead be alongside ->smgr_targblock. Possibly the best path would instead
be to return the additional pages explicitly to callers of
RelationGetBufferForTuple, but RelationGetBufferForTuple does a bunch of work
around pinning that potentially would need to be repeated in heap_multi_insert().
In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.
Yes, I too have observed this in the past for parallel inserts in CTAS
work - /messages/by-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk+KD2H7ZmRg@mail.gmail.com.
Tackling bulk relation extension problems will unblock the parallel
inserts (in CTAS, COPY) work I believe.
Yea. There's a lot of places the current approach ended up being a bottleneck.
Firstly, 0001 avoids an extra loop over the waiters and looks like a
reasonable change; some comments on the patch:
1)
+ int lwWaiting; /* 0 if not waiting, 1 if on waitlist, 2 if
+ * waiting to be woken */
Use macros instead of hard-coded values for better readability?
#define PROC_LW_LOCK_NOT_WAITING 0
#define PROC_LW_LOCK_ON_WAITLIST 1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2
Yea - this was really more of a prototype patch - I noted that we'd want to
use defines for this in
/messages/by-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
3)
+ proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+ found = true;
I guess 'found' is a bit meaningless here as we are doing away with
the proclist_foreach_modify loop. We can directly use
MyProc->lwWaiting == 1 and remove 'found'.
We can rename it, but I think we still do need it: it's easier to analyze the
logic if the relevant check happens on a value captured while we held the wait
list lock. Probably should do the reset inside the locked section as well.
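To spell the pattern out (only a sketch of the idea, not the actual patch;
'on_waitlist' is a stand-in name for 'found'):

	bool		on_waitlist;

	LWLockWaitListLock(lock);
	on_waitlist = (MyProc->lwWaiting == 1);		/* 1 == on wait list */
	if (on_waitlist)
	{
		proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
		MyProc->lwWaiting = 0;		/* reset while still holding the lock */
	}
	LWLockWaitListUnlock(lock);

	if (on_waitlist)
	{
		/* we dequeued ourselves before being woken */
	}
	else
	{
		/* somebody else already dequeued us; absorb the pending wakeup */
	}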
4)
if (!MyProc->lwWaiting)
if (!proc->lwWaiting)
Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
1 or PROC_LW_LOCK_ON_WAITLIST or the macro?
I think it's better to check it's 0, rather than just != 1.
5) Is there any specific test case where I can see the benefit of this
patch? If yes, can you please share it here?
Yep, see the other thread, there's a pretty easy case there. You can also see
it at extreme client counts with a pgbench -S against a cluster with a smaller
shared_buffers. But the difference is not huge before something like 2048-4096
clients, and then it only occurs occasionally (because you need to end up with
most connections waiting for one of the partitions). So the test case from the
other thread is a lot better.
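(Purely as an illustration of that kind of run, not the exact setup: something
like "pgbench -n -S -M prepared -c 4096 -j 64 -T 60" against a cluster started
with a deliberately small shared_buffers, say 128MB.)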
Greetings,
Andres Freund
On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.
The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
f2857af485a00ab5dbfa2c83af9d83afe4378239 ===
=== applying patch
./v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch
patching file src/include/storage/proc.h
Hunk #1 FAILED at 217.
1 out of 1 hunk FAILED -- saving rejects to file src/include/storage/proc.h.rej
patching file src/backend/storage/lmgr/lwlock.c
Hunk #1 succeeded at 988 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 1047.
Hunk #3 FAILED at 1076.
Hunk #4 FAILED at 1104.
Hunk #5 FAILED at 1117.
Hunk #6 FAILED at 1141.
Hunk #7 FAILED at 1775.
Hunk #8 FAILED at 1790.
7 out of 8 hunks FAILED -- saving rejects to file
src/backend/storage/lmgr/lwlock.c.rej
[1]: http://cfbot.cputube.org/patch_41_3993.log
Regards,
Vignesh
Hi,
On 2023-01-06 11:52:04 +0530, vignesh C wrote:
On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:
The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.
The patch does not apply on top of HEAD as in [1], please post a rebased
patch.
Thanks for letting me know. Updated version attached.
Greetings,
Andres Freund
Attachments:
v2-0001-aio-Add-some-error-checking-around-pinning.patch (text/x-diff; charset=us-ascii)
From dc67e1ff43e550a2ff6a0181995f2f12bbb2a423 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v2 01/14] aio: Add some error checking around pinning.
---
src/include/storage/bufmgr.h | 1 +
src/backend/storage/buffer/bufmgr.c | 42 ++++++++++++++++++++---------
2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 33eadbc1291..3becf32a3c0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -129,6 +129,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
+extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cfa..bfaf141edd7 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1707,6 +1707,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
bool result;
PrivateRefCountEntry *ref;
+ Assert(!BufferIsLocal(b));
+
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)
@@ -1852,6 +1854,8 @@ UnpinBuffer(BufferDesc *buf)
PrivateRefCountEntry *ref;
Buffer b = BufferDescriptorGetBuffer(buf);
+ Assert(!BufferIsLocal(b));
+
/* not moving as we're likely deleting it soon anyway */
ref = GetPrivateRefCountEntry(b, false);
Assert(ref != NULL);
@@ -4209,6 +4213,25 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ GetPrivateRefCount(buffer));
+ }
+}
+
/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
@@ -4236,20 +4259,11 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- if (BufferIsLocal(buffer))
- {
- /* There should be exactly one pin */
- if (LocalRefCount[-buffer - 1] != 1)
- elog(ERROR, "incorrect local pin count: %d",
- LocalRefCount[-buffer - 1]);
- /* Nobody else to wait for */
- return;
- }
+ BufferCheckOneLocalPin(buffer);
- /* There should be exactly one local pin */
- if (GetPrivateRefCount(buffer) != 1)
- elog(ERROR, "incorrect local pin count: %d",
- GetPrivateRefCount(buffer));
+ /* Nobody else to wait for */
+ if (BufferIsLocal(buffer))
+ return;
bufHdr = GetBufferDescriptor(buffer - 1);
@@ -4757,6 +4771,8 @@ LockBufHdr(BufferDesc *desc)
SpinDelayStatus delayStatus;
uint32 old_buf_state;
+ Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
init_local_spin_delay(&delayStatus);
while (true)
--
2.38.0
v2-0002-hio-Release-extension-lock-before-initializing-pa.patch (text/x-diff; charset=us-ascii)
From bb6a65580687d8bb932e2dc26c32e72025d34354 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:28:06 -0700
Subject: [PATCH v2 02/14] hio: Release extension lock before initializing page
/ pinning VM
PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3
started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we
release the page lock. PageInit() zeroes the page, which isn't that cheap, so
deferring it until after the extension lock is released seems like a good idea.
Doing visibilitymap_pin() while holding the extension lock, introduced in
7db0cd2145f2, looks like an accident. Due to the restrictions on
HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems
better to move it out.
---
src/backend/access/heap/hio.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index e152807d2dc..7479212d4e0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -623,6 +623,13 @@ loop:
*/
buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ */
+ if (needLock)
+ UnlockRelationForExtension(relation, ExclusiveLock);
+
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
@@ -647,13 +654,6 @@ loop:
visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
/*
* Lock the other buffer. It's guaranteed to be of a lower page number
* than the new page. To conform with the deadlock prevent rules, we ought
--
2.38.0
v2-0003-Add-smgrzeroextend-FileZero-FileFallocate.patch (text/x-diff; charset=us-ascii)
From 097a56759a7d4ac8352d95b4f23ec98e96bb394f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:25:46 -0700
Subject: [PATCH v2 03/14] Add smgrzeroextend(), FileZero(), FileFallocate()
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, use FileZero() instead, as
using posix_fallocate() for small numbers of blocks is inefficient for some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/fd.h | 3 +
src/include/storage/md.h | 2 +
src/include/storage/smgr.h | 2 +
src/backend/storage/file/fd.c | 105 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 103 +++++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 21 +++++++
6 files changed, 236 insertions(+)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f85de97d083..2c9453aa3f0 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -106,6 +106,9 @@ extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event
extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
+extern int FileZero(File file, off_t offset, off_t len, uint32 wait_event_info);
+extern int FileFallocate(File file, off_t offset, off_t len, uint32 wait_event_info);
+
extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index bcada9ff221..67afd14d7b0 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,6 +28,8 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 56233c4d216..a5806029ce1 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 926d000f2ea..afd05e48cc0 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -93,6 +93,7 @@
#include "common/pg_prng.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/pg_iovec.h"
#include "portability/mem.h"
#include "postmaster/startup.h"
#include "storage/fd.h"
@@ -2205,6 +2206,105 @@ FileSync(File file, uint32 wait_event_info)
return returnCode;
}
+/* So that FileZero() doesn't have to re-zero a block on every call */
+static const PGAlignedBlock zerobuf = {0};
+
+int
+FileZero(File file, off_t offset, off_t len, uint32 wait_event_info)
+{
+ int returnCode;
+ int numblocks;
+ struct iovec iov[PG_IOV_MAX];
+
+ /*
+ * FIXME: Quick-and-dirty implementation, to be replaced by
+ * pg_pwrite_zeros() from
+ * https://postgr.es/m/Y1oc%2BFjiyVjNZa%2BL%40paquier.xyz
+ *
+ * Otherwise it'd not at all be ok to rely on len being a multiple of
+ * BLCKSZ.
+ */
+ Assert((len % BLCKSZ) == 0);
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ numblocks = len / BLCKSZ;
+
+ for (int i = 0; i < Min(numblocks, lengthof(iov)); ++i)
+ {
+ iov[i].iov_base = (char *) zerobuf.data;
+ iov[i].iov_len = BLCKSZ;
+ }
+
+ while (numblocks > 0)
+ {
+ int iovcnt = Min(numblocks, lengthof(iov));
+ off_t seekpos_l = offset;
+ ssize_t ret;
+
+ pgstat_report_wait_start(wait_event_info);
+ ret = pg_pwritev_with_retry(VfdCache[file].fd, iov, iovcnt, seekpos_l);
+ pgstat_report_wait_end();
+
+ if (ret < 0)
+ return -1;
+
+ Assert(ret == iovcnt * BLCKSZ);
+ offset += iovcnt * BLCKSZ;
+ numblocks -= iovcnt;
+ }
+
+ return 0;
+}
+
+/*
+ * Try to reserve file space with posix_fallocate(). If posix_fallocate() is
+ * not implemented on the operating system or fails with EINVAL / EOPNOTSUPP,
+ * use FileZero() instead.
+ *
+ * Note that at least glibc implements posix_fallocate() in userspace if not
+ * implemented by the filesystem. That's not the case for all environments
+ * though.
+ */
+int
+FileFallocate(File file, off_t offset, off_t len, uint32 wait_event_info)
+{
+ int returnCode;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+#ifdef HAVE_POSIX_FALLOCATE
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, len);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+	 * Return in case of a "real" failure; if fallocate is not supported
+	 * (EINVAL / EOPNOTSUPP), fall through to the FileZero() based
+	 * implementation below.
+	 */
+	if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+		return returnCode;
+#endif
+
+ return FileZero(file, offset, len, wait_event_info);
+}
+
off_t
FileSize(File file)
{
@@ -2277,6 +2377,11 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file) < 0)
+ return -1;
+
return VfdCache[file].fd;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 60c9905eff9..2197670f4b0 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
+#include "common/file_utils.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -500,6 +501,108 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+ MdfdVec *v;
+ BlockNumber curblocknum = blocknum;
+ int remblocks = nblocks;
+
+ Assert(nblocks > 0);
+
+ /* This assert is too expensive to have on normally ... */
+#ifdef CHECK_WRITE_VS_EXTEND
+ Assert(blocknum >= mdnblocks(reln, forknum));
+#endif
+
+ /*
+ * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
+ * more --- we mustn't create a block whose number actually is
+ * InvalidBlockNumber or larger.
+ */
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend file \"%s\" beyond %u blocks",
+ relpath(reln->smgr_rlocator, forknum),
+ InvalidBlockNumber)));
+
+ while (remblocks > 0)
+ {
+ int segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+ int segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+ off_t seekpos = (off_t) BLCKSZ * segstartblock;
+ int numblocks;
+
+ if (segendblock > RELSEG_SIZE)
+ segendblock = RELSEG_SIZE;
+
+ numblocks = segendblock - segstartblock;
+
+ v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+
+ Assert(segstartblock < RELSEG_SIZE);
+ Assert(segendblock <= RELSEG_SIZE);
+
+ /*
+ * If available use posix_fallocate() to extend the relation. That's
+ * often more efficient than using write(), as it commonly won't cause
+ * the kernel to allocate page cache space for the extended pages.
+ *
+ * However, we shouldn't use fallocate() for small extensions, it
+ * defeats delayed allocation on some filesystems. Not clear where
+ * that decision should be made though? For now just use a cutoff of
+ * 8, anything between 4 and 8 worked OK in some local testing.
+ */
+ if (numblocks > 8)
+ {
+ int ret;
+
+ ret = FileFallocate(v->mdfd_vfd, seekpos,
+ (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret != 0)
+ {
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space.")));
+ }
+ }
+ else
+ {
+ int ret;
+
+ /*
+ * Even if we don't have fallocate, we can still extend a bit more
+ * efficiently than writing each 8kB block individually.
+			 * FileZero() uses pg_pwritev_with_retry() with a single zeroed
+ * buffer to avoid needing a zeroed buffer for the whole length of
+ * the extension.
+ */
+ ret = FileZero(v->mdfd_vfd, seekpos,
+ (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not extend file \"%s\": %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space.")));
+ }
+
+ if (!skipFsync && !SmgrIsTemp(reln))
+ register_dirty_segment(reln, forknum, v);
+
+ Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+ remblocks -= segendblock - segstartblock;
+ curblocknum += segendblock - segstartblock;
+ }
+}
+
/*
* mdopenfork() -- Open one fork of the specified relation.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 80eb6311e74..2c0d26eabe0 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, char *buffer, bool skipFsync);
+ void (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
@@ -75,6 +77,7 @@ static const f_smgr smgrsw[] = {
.smgr_exists = mdexists,
.smgr_unlink = mdunlink,
.smgr_extend = mdextend,
+ .smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
.smgr_read = mdread,
.smgr_write = mdwrite,
@@ -507,6 +510,24 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
+void
+smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ int nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
+ nblocks, skipFsync);
+
+ /*
+ * Normally we expect this to increase nblocks by nblocks, but if the
+ * cached value isn't as expected, just invalidate it so the next call
+ * asks the kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+}
+
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
*
--
2.38.0
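A minimal usage sketch of the new smgr entry point (mine, not from the patch;
it assumes an already-open SMgrRelation 'reln' and that the caller holds
whatever lock covers extension of that relation):

	/* extend the main fork by 16 zeroed blocks in a single call */
	BlockNumber first_new = smgrnblocks(reln, MAIN_FORKNUM);

	smgrzeroextend(reln, MAIN_FORKNUM, first_new, 16, false);
	/* blocks first_new .. first_new + 15 now exist and read back as zeroes */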
v2-0004-bufmgr-Add-Pin-UnpinLocalBuffer.patch (text/x-diff; charset=us-ascii)
From a9189ec0e28855b0e3bf02f1dfbe820227910b4d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v2 04/14] bufmgr: Add Pin/UnpinLocalBuffer()
So far these were open-coded in quite a few places, without a good reason.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/buf_internals.h | 2 +
src/backend/storage/buffer/bufmgr.c | 30 +++----------
src/backend/storage/buffer/localbuf.c | 62 +++++++++++++++++----------
3 files changed, 46 insertions(+), 48 deletions(-)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index ed8aa2519c0..4b1aeb5fd25 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -413,6 +413,8 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
+extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
+extern void UnpinLocalBuffer(Buffer buffer);
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index bfaf141edd7..678e390ab4d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -644,20 +644,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
/* Is it still valid and holding the right tag? */
if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
{
- /*
- * Bump buffer's ref and usage counts. This is equivalent of
- * PinBuffer for a shared buffer.
- */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner, recent_buffer);
+ PinLocalBuffer(bufHdr, true);
pgBufferUsage.local_blks_hit++;
@@ -1660,8 +1647,7 @@ ReleaseAndReadBuffer(Buffer buffer,
BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) &&
BufTagGetForkNum(&bufHdr->tag) == forkNum)
return buffer;
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[-buffer - 1]--;
+ UnpinLocalBuffer(buffer);
}
else
{
@@ -3938,15 +3924,9 @@ ReleaseBuffer(Buffer buffer)
elog(ERROR, "bad buffer ID: %d", buffer);
if (BufferIsLocal(buffer))
- {
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
-
- Assert(LocalRefCount[-buffer - 1] > 0);
- LocalRefCount[-buffer - 1]--;
- return;
- }
-
- UnpinBuffer(GetBufferDescriptor(buffer - 1));
+ UnpinLocalBuffer(buffer);
+ else
+ UnpinBuffer(GetBufferDescriptor(buffer - 1));
}
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b2720df6eaa..7b6294deef3 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -136,27 +136,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
#endif
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- /* this part is equivalent to PinBuffer for a shared buffer */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
- if (buf_state & BM_VALID)
- *foundPtr = true;
- else
- {
- /* Previous read attempt must have failed; try again */
- *foundPtr = false;
- }
+ *foundPtr = PinLocalBuffer(bufHdr, true);
return bufHdr;
}
@@ -193,9 +174,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
else
{
/* Found a usable buffer */
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
+ PinLocalBuffer(bufHdr, false);
break;
}
}
@@ -483,6 +462,43 @@ InitLocalBuffers(void)
NLocBuffer = nbufs;
}
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+ uint32 buf_state;
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ int bufid = -(buffer + 1);
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ if (LocalRefCount[bufid] == 0)
+ {
+ if (adjust_usagecount &&
+ BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
+ {
+ buf_state += BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+ }
+ LocalRefCount[bufid]++;
+ ResourceOwnerRememberBuffer(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf_hdr));
+
+ return buf_state & BM_VALID;
+}
+
+void
+UnpinLocalBuffer(Buffer buffer)
+{
+ int buffid = -buffer - 1;
+
+ Assert(BufferIsLocal(buffer));
+ Assert(LocalRefCount[buffid] > 0);
+
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
+ LocalRefCount[buffid]--;
+}
+
/*
* GUC check_hook for temp_buffers
*/
--
2.38.0
v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
From e0fb77e2a162385bea4a9a6ce7e31935d76fd161 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:29:57 -0700
Subject: [PATCH v2 05/14] bufmgr: Acquire and clean victim buffer separately
Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. That is particularly painful when
extending relations, where the extension lock has to be held while acquiring a
victim buffer. By separating out the victim buffer acquisition, future commits
will be able to make relation extension scale better.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 570 ++++++++++++++------------
src/backend/storage/buffer/localbuf.c | 115 +++---
2 files changed, 381 insertions(+), 304 deletions(-)
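In outline, the reworked BufferAlloc() still checks the buffer mapping table
first; only on a miss does it call the new GetVictimBuffer() to obtain a
pinned, already-invalidated victim, and then it installs the new mapping while
holding just the new tag's partition lock. Below is a condensed sketch of that
tail end of the function, as an illustration only (not code from the patch);
the initial lookup and the handling of a concurrent insert of the same tag are
elided:

  /*
   * Illustrative sketch only, heavily condensed from the patch below.
   * Collision handling is replaced by an elog() to keep it short.
   */
  static BufferDesc *
  BufferAllocSketch(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
                    BlockNumber blockNum, BufferAccessStrategy strategy,
                    bool *foundPtr)
  {
      BufferTag   newTag;
      uint32      newHash;
      LWLock     *newPartitionLock;
      Buffer      victim_buffer;
      BufferDesc *victim_buf_hdr;
      uint32      victim_buf_state;

      InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
      newHash = BufTableHashCode(&newTag);
      newPartitionLock = BufMappingPartitionLock(newHash);

      /*
       * Acquire a pinned, clean, invalid victim buffer without holding any
       * buffer mapping partition lock.
       */
      victim_buffer = GetVictimBuffer(strategy);
      victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);

      /*
       * Only the new tag's partition lock is needed to install the mapping;
       * the victim's old mapping was already removed by
       * InvalidateVictimBuffer() inside GetVictimBuffer().
       */
      LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
      if (BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id) >= 0)
          elog(ERROR, "collision handling elided in this sketch");

      victim_buf_state = LockBufHdr(victim_buf_hdr);
      victim_buf_hdr->tag = newTag;
      victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
      if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
          victim_buf_state |= BM_PERMANENT;
      UnlockBufHdr(victim_buf_hdr, victim_buf_state);
      LWLockRelease(newPartitionLock);

      *foundPtr = !StartBufferIO(victim_buf_hdr, true);
      return victim_buf_hdr;
  }

Because the victim's mapping entry is gone before the new partition lock is
taken, the old-vs-new partition lock ordering logic of the previous code is no
longer needed at all.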
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 678e390ab4d..b9af8a05989 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -482,6 +482,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr);
+static Buffer GetVictimBuffer(BufferAccessStrategy strategy);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
@@ -1111,14 +1112,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
- uint32 oldFlags;
- int buf_id;
- BufferDesc *buf;
- bool valid;
- uint32 buf_state;
+ int existing_buf_id;
+
+ Buffer victim_buffer;
+ BufferDesc *victim_buf_hdr;
+ uint32 victim_buf_state;
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -1129,15 +1127,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- if (buf_id >= 0)
+ existing_buf_id = BufTableLookup(&newTag, newHash);
+ if (existing_buf_id >= 0)
{
+ BufferDesc *buf;
+ bool valid;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
* buffer pool, and check to see if the correct data has been loaded
* into the buffer.
*/
- buf = GetBufferDescriptor(buf_id);
+ buf = GetBufferDescriptor(existing_buf_id);
valid = PinBuffer(buf, strategy);
@@ -1174,266 +1175,96 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
- /* Loop here in case we have to try another victim buffer */
- for (;;)
+ /*
+ * Acquire a victim buffer. Somebody else might try to do the same, as we
+ * don't hold any conflicting locks. If so we'll have to undo our work
+ * later.
+ */
+ victim_buffer = GetVictimBuffer(strategy);
+ victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);
+
+ /*
+ * Try to make a hashtable entry for the buffer under its new tag. If
+ * somebody else inserted another buffer for the tag, we'll release the
+ * victim buffer we acquired and use the already inserted one.
+ */
+ LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+ existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
+ if (existing_buf_id >= 0)
{
- /*
- * Ensure, while the spinlock's not yet held, that there's a free
- * refcount entry.
- */
- ReservePrivateRefCountEntry();
+ BufferDesc *existing_buf_hdr;
+ bool valid;
/*
- * Select a victim buffer. The buffer is returned with its header
- * spinlock still held!
+ * Got a collision. Someone has already done what we were about to
+ * do. We'll just handle this as if it were found in the buffer pool
+ * in the first place. First, give up the buffer we were planning to
+ * use.
+ *
+ * We could do this after releasing the partition lock, but then we'd
+ * have to call ResourceOwnerEnlargeBuffers() &
+ * ReservePrivateRefCountEntry() before acquiring the lock, for the
+ * rare case of such a collision.
*/
- buf = StrategyGetBuffer(strategy, &buf_state);
+ UnpinBuffer(victim_buf_hdr);
- Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+ /* FIXME: Should we put the victim buffer onto the freelist? */
- /* Must copy buffer flags while we still hold the spinlock */
- oldFlags = buf_state & BUF_FLAG_MASK;
+ /* remaining code should match code at top of routine */
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
- /*
- * If the buffer was dirty, try to write it out. There is a race
- * condition here, in that someone might dirty it after we released it
- * above, or even while we are writing it out (since our share-lock
- * won't prevent hint-bit updates). We will recheck the dirty bit
- * after re-locking the buffer header.
- */
- if (oldFlags & BM_DIRTY)
- {
- /*
- * We need a share-lock on the buffer contents to write it out
- * (else we might write invalid data, eg because someone else is
- * compacting the page contents while we write). We must use a
- * conditional lock acquisition here to avoid deadlock. Even
- * though the buffer was not pinned (and therefore surely not
- * locked) when StrategyGetBuffer returned it, someone else could
- * have pinned and exclusive-locked it by the time we get here. If
- * we try to get the lock unconditionally, we'd block waiting for
- * them; if they later block waiting for us, deadlock ensues.
- * (This has been observed to happen when two backends are both
- * trying to split btree index pages, and the second one just
- * happens to be trying to split the page the first one got from
- * StrategyGetBuffer.)
- */
- if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
- LW_SHARED))
- {
- /*
- * If using a nondefault strategy, and writing the buffer
- * would require a WAL flush, let the strategy decide whether
- * to go ahead and write/reuse the buffer or to choose another
- * victim. We need lock to inspect the page LSN, so this
- * can't be done inside StrategyGetBuffer.
- */
- if (strategy != NULL)
- {
- XLogRecPtr lsn;
+ valid = PinBuffer(existing_buf_hdr, strategy);
- /* Read the LSN while holding buffer header lock */
- buf_state = LockBufHdr(buf);
- lsn = BufferGetLSN(buf);
- UnlockBufHdr(buf, buf_state);
-
- if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf))
- {
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(BufferDescriptorGetContentLock(buf));
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /* OK, do the I/O */
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
-
- FlushBuffer(buf, NULL);
- LWLockRelease(BufferDescriptorGetContentLock(buf));
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext,
- &buf->tag);
-
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
- }
- else
- {
- /*
- * Someone else has locked the buffer, so give it up and loop
- * back to get another one.
- */
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /*
- * To change the association of a valid buffer, we'll need to have
- * exclusive lock on both the old and new mapping partitions.
- */
- if (oldFlags & BM_TAG_VALID)
- {
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
- oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* remember we have no old-partition lock or tag */
- oldPartitionLock = NULL;
- /* keep the compiler quiet about uninitialized variables */
- oldHash = 0;
- }
-
- /*
- * Try to make a hashtable entry for the buffer under its new tag.
- * This could fail because while we were writing someone else
- * allocated another buffer for the same block we want to read in.
- * Note that we have not yet removed the hashtable entry for the old
- * tag.
- */
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
- if (buf_id >= 0)
- {
- /*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
- */
- UnpinBuffer(buf);
-
- /* Can give up that buffer's mapping partition lock now */
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = GetBufferDescriptor(buf_id);
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
- *foundPtr = true;
-
- if (!valid)
- {
- /*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed. We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID. StartBufferIO does it all.
- */
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer
- * must have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
- }
-
- return buf;
- }
-
- /*
- * Need to lock the buffer header too in order to change its tag.
- */
- buf_state = LockBufHdr(buf);
-
- /*
- * Somebody could have pinned or re-dirtied the buffer while we were
- * doing the I/O and making the new hashtable entry. If so, we can't
- * recycle this buffer; we must undo everything we've done and start
- * over with a new victim buffer.
- */
- oldFlags = buf_state & BUF_FLAG_MASK;
- if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
- break;
-
- UnlockBufHdr(buf, buf_state);
- BufTableDelete(&newTag, newHash);
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
+ /* Can release the mapping lock as soon as we've pinned it */
LWLockRelease(newPartitionLock);
- UnpinBuffer(buf);
+
+ *foundPtr = true;
+
+ if (!valid)
+ {
+ /*
+ * We can only get here if (a) someone else is still reading
+ * in the page, or (b) a previous read attempt failed. We
+ * have to wait for any active read attempt to finish, and
+ * then set up our own read attempt if the page is still not
+ * BM_VALID. StartBufferIO does it all.
+ */
+ if (StartBufferIO(existing_buf_hdr, true))
+ {
+ /*
+ * If we get here, previous attempts to read the buffer
+ * must have failed ... but we shall bravely try again.
+ */
+ *foundPtr = false;
+ }
+ }
+
+ return existing_buf_hdr;
}
/*
- * Okay, it's finally safe to rename the buffer.
- *
- * Clearing BM_VALID here is necessary, clearing the dirtybits is just
- * paranoia. We also reset the usage_count since any recency of use of
- * the old content is no longer relevant. (The usage_count starts out at
- * 1 so that the buffer can survive one clock-sweep pass.)
- *
+ * Need to lock the buffer header too in order to change its tag.
+ */
+ victim_buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
+ Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+
+ victim_buf_hdr->tag = newTag;
+
+ /*
* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*/
- buf->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
- BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
- BUF_USAGECOUNT_MASK);
+ victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
- buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
- else
- buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ victim_buf_state |= BM_PERMANENT;
- UnlockBufHdr(buf, buf_state);
-
- if (oldPartitionLock != NULL)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
+ UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);
@@ -1443,12 +1274,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* to read it before we did, so there's nothing left for BufferAlloc() to
* do.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(victim_buf_hdr, true))
*foundPtr = false;
else
*foundPtr = true;
- return buf;
+ return victim_buf_hdr;
}
/*
@@ -1557,6 +1388,239 @@ retry:
StrategyFreeBuffer(buf);
}
+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called with the buffer pinned, but without the buffer header
+ * spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+ uint32 buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+
+ /* can't change while we're holding the pin */
+ if (buf_state & BM_TAG_VALID)
+ {
+ uint32 hash;
+ LWLock *partition_lock;
+ BufferTag tag;
+
+ /* have buffer pinned, so it's safe to read tag without lock */
+ tag = buf_hdr->tag;
+
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ /* lock the buffer header */
+ buf_state = LockBufHdr(buf_hdr);
+
+ Assert(BufferTagsEqual(&buf_hdr->tag, &tag));
+
+ /*
+ * We have the buffer pinned, so nobody else should have been able
+ * to unset this concurrently.
+ */
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ /*
+ * If somebody else pinned the buffer since, or even worse, dirtied it,
+ * give up on this buffer: It's clearly in use.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) != 1 || (buf_state & BM_DIRTY))
+ {
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ UnlockBufHdr(buf_hdr, buf_state);
+ LWLockRelease(partition_lock);
+
+ return false;
+ }
+
+ /*
+ * Clear out the buffer's tag and flags and usagecount. We must do
+ * this to ensure that linear scans of the buffer array don't think
+ * the buffer is valid.
+ *
+ * XXX: This is a pre-existing comment I just moved, but isn't it
+ * entirely bogus with regard to the tag? We can't do anything with
+ * the buffer without taking BM_VALID / BM_TAG_VALID into
+ * account. Likely doesn't matter because we're already dirtying the
+ * cacheline, but still.
+ *
+ */
+ ClearBufferTag(&buf_hdr->tag);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ BufTableDelete(&tag, hash);
+
+ LWLockRelease(partition_lock);
+ }
+
+ Assert(!(buf_state & (BM_DIRTY | BM_VALID | BM_TAG_VALID)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf_hdr->state)) > 0);
+
+ return true;
+}
+
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy)
+{
+ Buffer cur_buf;
+ BufferDesc *cur_buf_hdr = NULL;
+ uint32 cur_buf_state;
+
+ /*
+ * Ensure, while the spinlock's not yet held, that there's a free
+ * refcount entry.
+ */
+ ReservePrivateRefCountEntry();
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+again:
+
+ /*
+ * Select a victim buffer. The buffer is returned with its header
+ * spinlock still held!
+ */
+ cur_buf_hdr = StrategyGetBuffer(strategy, &cur_buf_state);
+ cur_buf = BufferDescriptorGetBuffer(cur_buf_hdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(cur_buf_state) == 0);
+
+ /* Pin the buffer and then release the buffer spinlock */
+ PinBuffer_Locked(cur_buf_hdr);
+
+ /*
+ * We shouldn't have any other pins for this buffer.
+ */
+ BufferCheckOneLocalPin(cur_buf);
+
+ /*
+ * If the buffer was dirty, try to write it out. There is a race
+ * condition here, in that someone might dirty it after we released the
+ * buffer header lock above, or even while we are writing it out (since
+ * our share-lock won't prevent hint-bit updates). We will recheck the
+ * dirty bit after re-locking the buffer header.
+ */
+ if (cur_buf_state & BM_DIRTY)
+ {
+ LWLock *content_lock;
+
+ Assert(cur_buf_state & BM_TAG_VALID);
+ Assert(cur_buf_state & BM_VALID);
+
+ /*
+ * We need a share-lock on the buffer contents to write it out
+ * (else we might write invalid data, eg because someone else is
+ * compacting the page contents while we write). We must use a
+ * conditional lock acquisition here to avoid deadlock. Even
+ * though the buffer was not pinned (and therefore surely not
+ * locked) when StrategyGetBuffer returned it, someone else could
+ * have pinned and exclusive-locked it by the time we get here. If
+ * we try to get the lock unconditionally, we'd block waiting for
+ * them; if they later block waiting for us, deadlock ensues.
+ * (This has been observed to happen when two backends are both
+ * trying to split btree index pages, and the second one just
+ * happens to be trying to split the page the first one got from
+ * StrategyGetBuffer.)
+ */
+ content_lock = BufferDescriptorGetContentLock(cur_buf_hdr);
+ if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /*
+ * Someone else has locked the buffer, so give it up and loop
+ * back to get another one.
+ */
+ UnpinBuffer(cur_buf_hdr);
+ goto again;
+ }
+
+ /*
+ * If using a nondefault strategy, and writing the buffer would
+ * require a WAL flush, let the strategy decide whether to go ahead
+ * and write/reuse the buffer or to choose another victim. We need
+ * lock to inspect the page LSN, so this can't be done inside
+ * StrategyGetBuffer.
+ */
+ if (strategy != NULL)
+ {
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ cur_buf_state = LockBufHdr(cur_buf_hdr);
+ lsn = BufferGetLSN(cur_buf_hdr);
+ UnlockBufHdr(cur_buf_hdr, cur_buf_state);
+
+ if (XLogNeedsFlush(lsn)
+ && StrategyRejectBuffer(strategy, cur_buf_hdr))
+ {
+ LWLockRelease(content_lock);
+ UnpinBuffer(cur_buf_hdr);
+ goto again;
+ }
+ }
+
+ /* OK, do the I/O */
+ /* FIXME: These used the wrong smgr before afaict? */
+ {
+ SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&cur_buf_hdr->tag),
+ InvalidBackendId);
+
+ TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(cur_buf_hdr->tag.forkNum,
+ cur_buf_hdr->tag.blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber);
+
+ FlushBuffer(cur_buf_hdr, smgr);
+ LWLockRelease(content_lock);
+
+ ScheduleBufferTagForWriteback(&BackendWritebackContext,
+ &cur_buf_hdr->tag);
+
+ TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(cur_buf_hdr->tag.forkNum,
+ cur_buf_hdr->tag.blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber);
+ }
+ }
+
+ /*
+ * If the buffer has an entry in the buffer mapping table, delete it. This
+ * can fail because another backend could have pinned or dirtied the
+ * buffer.
+ */
+ if (!InvalidateVictimBuffer(cur_buf_hdr))
+ {
+ UnpinBuffer(cur_buf_hdr);
+ goto again;
+ }
+
+ /* a final set of sanity checks */
+ cur_buf_state = pg_atomic_read_u32(&cur_buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(cur_buf_state) == 1);
+ Assert(!(cur_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+
+ BufferCheckOneLocalPin(cur_buf);
+
+ return cur_buf;
+}
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 7b6294deef3..b1d0c309918 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -44,13 +44,14 @@ BufferDesc *LocalBufferDescriptors = NULL;
Block *LocalBufferBlockPointers = NULL;
int32 *LocalRefCount = NULL;
-static int nextFreeLocalBuf = 0;
+static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
+static Buffer GetLocalVictimBuffer(void);
/*
@@ -112,10 +113,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
BufferDesc *bufHdr;
- int b;
- int trycounter;
+ Buffer victim_buffer;
+ int bufid;
bool found;
- uint32 buf_state;
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -129,23 +129,51 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (hresult)
{
- b = hresult->id;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufid = hresult->id;
+ bufHdr = GetLocalBufferDescriptor(bufid);
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
-#endif
*foundPtr = PinLocalBuffer(bufHdr, true);
- return bufHdr;
+ }
+ else
+ {
+ uint32 buf_state;
+
+ victim_buffer = GetLocalVictimBuffer();
+ bufid = -(victim_buffer + 1);
+ bufHdr = GetLocalBufferDescriptor(bufid);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &newTag, HASH_ENTER, &found);
+ if (found) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ hresult->id = bufid;
+
+ /*
+ * it's all ours now.
+ */
+ bufHdr->tag = newTag;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+ *foundPtr = false;
}
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum,
- -nextFreeLocalBuf - 1);
-#endif
+ return bufHdr;
+}
+
+static Buffer
+GetLocalVictimBuffer(void)
+{
+ int victim_bufid;
+ int trycounter;
+ uint32 buf_state;
+ BufferDesc *bufHdr;
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
* Need to get a new buffer. We use a clock sweep algorithm (essentially
@@ -154,14 +182,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
trycounter = NLocBuffer;
for (;;)
{
- b = nextFreeLocalBuf;
+ victim_bufid = nextFreeLocalBufId;
- if (++nextFreeLocalBuf >= NLocBuffer)
- nextFreeLocalBuf = 0;
+ if (++nextFreeLocalBufId >= NLocBuffer)
+ nextFreeLocalBufId = 0;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufHdr = GetLocalBufferDescriptor(victim_bufid);
- if (LocalRefCount[b] == 0)
+ if (LocalRefCount[victim_bufid] == 0)
{
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -184,6 +212,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
errmsg("no empty local buffer available")));
}
+ /*
+ * lazy memory allocation: allocate space on first use of a buffer.
+ */
+ if (LocalBufHdrGetBlock(bufHdr) == NULL)
+ {
+ /* Set pointer for use by BufferGetBlock() macro */
+ LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
+ }
+
/*
* this buffer is not referenced but it might still be dirty. if that's
* the case, write it out before reusing it!
@@ -213,19 +250,12 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
}
/*
- * lazy memory allocation: allocate space on first use of a buffer.
- */
- if (LocalBufHdrGetBlock(bufHdr) == NULL)
- {
- /* Set pointer for use by BufferGetBlock() macro */
- LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
- }
-
- /*
- * Update the hash table: remove old entry, if any, and make new one.
+ * Remove the victim buffer from the hashtable and mark as invalid.
*/
if (buf_state & BM_TAG_VALID)
{
+ LocalBufferLookupEnt *hresult;
+
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, (void *) &bufHdr->tag,
HASH_REMOVE, NULL);
@@ -233,28 +263,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
elog(ERROR, "local buffer hash table corrupted");
/* mark buffer invalid just in case hash insert fails */
ClearBufferTag(&bufHdr->tag);
- buf_state &= ~(BM_VALID | BM_TAG_VALID);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
}
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, (void *) &newTag, HASH_ENTER, &found);
- if (found) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- hresult->id = b;
-
- /*
- * it's all ours now.
- */
- bufHdr->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
- buf_state |= BM_TAG_VALID;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-
- *foundPtr = false;
- return bufHdr;
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
@@ -423,7 +436,7 @@ InitLocalBuffers(void)
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
- nextFreeLocalBuf = 0;
+ nextFreeLocalBufId = 0;
/* initialize fields that need to start off nonzero */
for (i = 0; i < nbufs; i++)
--
2.38.0
v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
From 523335a6ef6b959d4aa1a666d1a76ba133b8b3c2 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v2 06/14] bufmgr: Support multiple in-progress IOs by using
resowner
---
src/include/storage/bufmgr.h | 2 +-
src/include/utils/resowner_private.h | 5 ++
src/backend/access/transam/xact.c | 4 +-
src/backend/postmaster/autovacuum.c | 1 -
src/backend/postmaster/bgwriter.c | 1 -
src/backend/postmaster/checkpointer.c | 1 -
src/backend/postmaster/walwriter.c | 1 -
src/backend/storage/buffer/bufmgr.c | 86 ++++++++++++---------------
src/backend/utils/resowner/resowner.c | 60 +++++++++++++++++++
9 files changed, 105 insertions(+), 56 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3becf32a3c0..2e1d7540fd0 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -176,7 +176,7 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern void AbortBufferIO(void);
+extern void AbortBufferIO(Buffer buffer);
extern void BufmgrCommit(void);
extern bool BgBufferSync(struct WritebackContext *wb_context);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 1b1f3181b54..ae58438ec76 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -30,6 +30,11 @@ extern void ResourceOwnerEnlargeBuffers(ResourceOwner owner);
extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
+/* support for IO-in-progress management */
+extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
+extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
+extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
+
/* support for local lock management */
extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d85e3139082..2a91ed64a7a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2725,8 +2725,7 @@ AbortTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- /* Clean up buffer I/O and buffer context locks, too */
- AbortBufferIO();
+ /* Clean up buffer context locks, too */
UnlockBuffers();
/* Reset WAL record construction state */
@@ -5086,7 +5085,6 @@ AbortSubTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- AbortBufferIO();
UnlockBuffers();
/* Reset WAL record construction state */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index f5ea381c53e..2d5d1855439 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -526,7 +526,6 @@ AutoVacLauncherMain(int argc, char *argv[])
*/
LWLockReleaseAll();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
/* this is probably dead code, but let's be safe: */
if (AuxProcessResourceOwner)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 69667f0eb4b..1c4244b6ec5 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -167,7 +167,6 @@ BackgroundWriterMain(void)
*/
LWLockReleaseAll();
ConditionVariableCancelSleep();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index de0bbbfa791..c1ed6737df5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -271,7 +271,6 @@ CheckpointerMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 3113e8fbdd5..20e0807c60d 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -163,7 +163,6 @@ WalWriterMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b9af8a05989..0cdeb644e6e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,10 +159,6 @@ int checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
int bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
-/* local state for StartBufferIO and related functions */
-static BufferDesc *InProgressBuf = NULL;
-static bool IsForInput;
-
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
@@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();
@@ -4618,7 +4613,7 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
uint32 buf_state;
- Assert(!InProgressBuf);
+ ResourceOwnerEnlargeBufferIOs(CurrentResourceOwner);
for (;;)
{
@@ -4642,8 +4637,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
buf_state |= BM_IO_IN_PROGRESS;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = buf;
- IsForInput = forInput;
+ ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
return true;
}
@@ -4669,8 +4664,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
{
uint32 buf_state;
- Assert(buf == InProgressBuf);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
@@ -4682,13 +4675,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = NULL;
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
- * AbortBufferIO: Clean up any active buffer I/O after an error.
+ * AbortBufferIO: Clean up active buffer I/O after an error.
*
* All LWLocks we might have held have been released,
* but we haven't yet released buffer pins, so the buffer is still pinned.
@@ -4697,46 +4691,42 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
* possible the error condition wasn't related to the I/O.
*/
void
-AbortBufferIO(void)
+AbortBufferIO(Buffer buf)
{
- BufferDesc *buf = InProgressBuf;
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
- if (buf)
+ buf_state = LockBufHdr(buf_hdr);
+ Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+
+ if (!(buf_state & BM_VALID))
{
- uint32 buf_state;
-
- buf_state = LockBufHdr(buf);
- Assert(buf_state & BM_IO_IN_PROGRESS);
- if (IsForInput)
- {
- Assert(!(buf_state & BM_DIRTY));
-
- /* We'd better not think buffer is valid yet */
- Assert(!(buf_state & BM_VALID));
- UnlockBufHdr(buf, buf_state);
- }
- else
- {
- Assert(buf_state & BM_DIRTY);
- UnlockBufHdr(buf, buf_state);
- /* Issue notice if this is not the first failure... */
- if (buf_state & BM_IO_ERROR)
- {
- /* Buffer is pinned, so we can read tag without spinlock */
- char *path;
-
- path = relpathperm(BufTagGetRelFileLocator(&buf->tag),
- BufTagGetForkNum(&buf->tag));
- ereport(WARNING,
- (errcode(ERRCODE_IO_ERROR),
- errmsg("could not write block %u of %s",
- buf->tag.blockNum, path),
- errdetail("Multiple failures --- write error might be permanent.")));
- pfree(path);
- }
- }
- TerminateBufferIO(buf, false, BM_IO_ERROR);
+ Assert(!(buf_state & BM_DIRTY));
+ UnlockBufHdr(buf_hdr, buf_state);
}
+ else
+ {
+ Assert(buf_state & BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ /* Issue notice if this is not the first failure... */
+ if (buf_state & BM_IO_ERROR)
+ {
+ /* Buffer is pinned, so we can read tag without spinlock */
+ char *path;
+
+ path = relpathperm(BufTagGetRelFileLocator(&buf_hdr->tag),
+ BufTagGetForkNum(&buf_hdr->tag));
+ ereport(WARNING,
+ (errcode(ERRCODE_IO_ERROR),
+ errmsg("could not write block %u of %s",
+ buf_hdr->tag.blockNum, path),
+ errdetail("Multiple failures --- write error might be permanent.")));
+ pfree(path);
+ }
+ }
+
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR);
}
/*
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45d..fccc59b39dd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
/* We have built-in support for remembering: */
ResourceArray bufferarr; /* owned buffers */
+ ResourceArray bufferioarr; /* in-progress buffer IO */
ResourceArray catrefarr; /* catcache references */
ResourceArray catlistrefarr; /* catcache-list pins */
ResourceArray relrefarr; /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
ResourceArrayInit(&(owner->bufferarr), BufferGetDatum(InvalidBuffer));
+ ResourceArrayInit(&(owner->bufferioarr), BufferGetDatum(InvalidBuffer));
ResourceArrayInit(&(owner->catrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->catlistrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->relrefarr), PointerGetDatum(NULL));
@@ -517,6 +519,24 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (phase == RESOURCE_RELEASE_BEFORE_LOCKS)
{
+ /*
+ * Abort failed buffer IO. AbortBufferIO()->TerminateBufferIO() calls
+ * ResourceOwnerForgetBufferIO(), so we just have to iterate till
+ * there are none.
+ *
+ * Needs to be before we release buffer pins.
+ *
+ * During a commit, there shouldn't be any in-progress IO.
+ */
+ while (ResourceArrayGetAny(&(owner->bufferioarr), &foundres))
+ {
+ Buffer res = DatumGetBuffer(foundres);
+
+ if (isCommit)
+ elog(PANIC, "lost track of buffer IO on buffer %u", res);
+ AbortBufferIO(res);
+ }
+
/*
* Release buffer pins. Note that ReleaseBuffer will remove the
* buffer entry from our array, so we just have to iterate till there
@@ -746,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And it better not own any resources, either */
Assert(owner->bufferarr.nitems == 0);
+ Assert(owner->bufferioarr.nitems == 0);
Assert(owner->catrefarr.nitems == 0);
Assert(owner->catlistrefarr.nitems == 0);
Assert(owner->relrefarr.nitems == 0);
@@ -775,6 +796,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And free the object. */
ResourceArrayFree(&(owner->bufferarr));
+ ResourceArrayFree(&(owner->bufferioarr));
ResourceArrayFree(&(owner->catrefarr));
ResourceArrayFree(&(owner->catlistrefarr));
ResourceArrayFree(&(owner->relrefarr));
@@ -976,6 +998,44 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
buffer, owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * buffer IO array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeBufferIOs(ResourceOwner owner)
+{
+ /* We used to allow pinning buffers without a resowner, but no more */
+ Assert(owner != NULL);
+ ResourceArrayEnlarge(&(owner->bufferioarr));
+}
+
+/*
+ * Remember that a buffer IO is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeBufferIOs()
+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+
+/*
+ * Forget that a buffer IO is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ if (!ResourceArrayRemove(&(owner->bufferioarr), BufferGetDatum(buffer)))
+ elog(PANIC, "buffer IO %d is not owned by resource owner %s",
+ buffer, owner->name);
+}
+
/*
* Remember that a Local Lock is owned by a ResourceOwner
*
--
2.38.0
v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch
From 83d6b1997b81caf1f47e4cac87c399f1407fa455 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:44:02 -0700
Subject: [PATCH v2 07/14] bufmgr: Move relation extension handling into
[Bulk]ExtendRelationBuffered()
---
src/include/storage/buf_internals.h | 5 +
src/include/storage/bufmgr.h | 13 +
src/backend/storage/buffer/bufmgr.c | 546 ++++++++++++++++++--------
src/backend/storage/buffer/localbuf.c | 134 ++++++-
src/backend/utils/probes.d | 6 +-
doc/src/sgml/monitoring.sgml | 11 +-
6 files changed, 546 insertions(+), 169 deletions(-)
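For illustration, a hypothetical call site using the new single-block wrapper
could look like the following; this is not part of the patch, and the argument
choices (fork, mode, strategy) are assumptions based on the signature added to
bufmgr.h below:

  /*
   * Hypothetical example, not part of the patch: extend a relation by one
   * block via the new API instead of ReadBuffer(rel, P_NEW).
   */
  static Buffer
  example_extend_by_one_block(Relation rel)
  {
      /*
       * NULL for smgr lets bufmgr.c derive it from the relation.
       * skip_extension_lock = false means the relation extension lock is
       * acquired inside bufmgr.c. RBM_ZERO_AND_LOCK is a guess at a typical
       * mode for a caller that will initialize the page right away.
       */
      return ExtendRelationBuffered(rel, NULL,
                                    false,   /* do take the extension lock */
                                    rel->rd_rel->relpersistence,
                                    MAIN_FORKNUM,
                                    RBM_ZERO_AND_LOCK,
                                    NULL);   /* no buffer access strategy */
  }

The backward-compatibility P_NEW path in ReadBuffer_common() calls the same
wrapper, but with skip_extension_lock = true, since existing P_NEW callers
acquire the relation extension lock themselves.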
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 4b1aeb5fd25..57800254d2d 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -420,6 +420,11 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
+extern BlockNumber BulkExtendLocalRelationBuffered(SMgrRelation smgr,
+ ForkNumber fork,
+ ReadBufferMode mode,
+ uint32 *num_pages,
+ Buffer *buffers);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 2e1d7540fd0..4ecd5399966 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -132,6 +132,19 @@ extern void IncrBufferRefCount(Buffer buffer);
extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
+extern Buffer ExtendRelationBuffered(Relation reln, struct SMgrRelationData *smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber forkNum, ReadBufferMode mode,
+ BufferAccessStrategy strategy);
+extern BlockNumber BulkExtendRelationBuffered(Relation rel, struct SMgrRelationData *smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers);
extern void InitBufferPoolAccess(void);
extern void AtEOXact_Buffers(bool isCommit);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0cdeb644e6e..361ebc3ae26 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
@@ -459,6 +460,15 @@ static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
+static BlockNumber BulkExtendSharedRelationBuffered(Relation rel,
+ SMgrRelation smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers);
static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(BufferDesc *buf);
static void UnpinBuffer(BufferDesc *buf);
@@ -793,6 +803,73 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
mode, strategy, &hit);
}
+/*
+ * Convenience wrapper around BulkExtendRelationBuffered() extending by one
+ * block.
+ */
+Buffer
+ExtendRelationBuffered(Relation rel, struct SMgrRelationData *smgr,
+ bool skip_extension_lock,
+ char relpersistence, ForkNumber forkNum,
+ ReadBufferMode mode, BufferAccessStrategy strategy)
+{
+ Buffer buf;
+ uint32 num_pages = 1;
+
+ BulkExtendRelationBuffered(rel, smgr, skip_extension_lock, relpersistence,
+ forkNum, mode, strategy, &num_pages, num_pages, &buf);
+
+ return buf;
+}
+
+
+BlockNumber
+BulkExtendRelationBuffered(Relation rel,
+ SMgrRelation smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers)
+{
+ BlockNumber first_block;
+
+ Assert(rel != NULL || smgr != NULL);
+ Assert(rel != NULL || skip_extension_lock);
+
+ if (smgr == NULL)
+ smgr = RelationGetSmgr(rel);
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ num_pages);
+
+ if (SmgrIsTemp(smgr))
+ first_block = BulkExtendLocalRelationBuffered(smgr,
+ fork, mode,
+ num_pages, buffers);
+ else
+ first_block = BulkExtendSharedRelationBuffered(rel, smgr,
+ skip_extension_lock, relpersistence,
+ fork, mode, strategy,
+ num_pages, num_locked_pages,
+ buffers);
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ num_pages,
+ first_block);
+
+ return first_block;
+}
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -807,43 +884,32 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferDesc *bufHdr;
Block bufBlock;
bool found;
- bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
*hit = false;
+ /*
+ * Backward compatibility path; most code should use
+ * ExtendRelationBuffered() instead, as acquiring the extension lock
+ * inside ExtendRelationBuffered() scales a lot better.
+ */
+ if (unlikely(blockNum == P_NEW))
+ return ExtendRelationBuffered(NULL, smgr, true, relpersistence, forkNum, mode, strategy);
+
/* Make sure we will have room to remember the buffer pin */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
- isExtend = (blockNum == P_NEW);
-
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend);
-
- /* Substitute proper block number if caller asked for P_NEW */
- if (isExtend)
- {
- blockNum = smgrnblocks(smgr, forkNum);
- /* Fail if relation is already at maximum possible length */
- if (blockNum == P_NEW)
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("cannot extend relation %s beyond %u blocks",
- relpath(smgr->smgr_rlocator, forkNum),
- P_NEW)));
- }
+ smgr->smgr_rlocator.backend);
if (isLocalBuf)
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
if (found)
pgBufferUsage.local_blks_hit++;
- else if (isExtend)
- pgBufferUsage.local_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.local_blks_read++;
@@ -858,8 +924,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
strategy, &found);
if (found)
pgBufferUsage.shared_blks_hit++;
- else if (isExtend)
- pgBufferUsage.shared_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.shared_blks_read++;
@@ -870,168 +934,88 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
- if (!isExtend)
- {
- /* Just need to update stats before we exit */
- *hit = true;
- VacuumPageHit++;
+ /* Just need to update stats before we exit */
+ *hit = true;
+ VacuumPageHit++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageHit;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageHit;
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend,
- found);
-
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
- * locked on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
-
- return BufferDescriptorGetBuffer(bufHdr);
- }
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ found);
/*
- * We get here only in the corner case where we are trying to extend
- * the relation but we found a pre-existing buffer marked BM_VALID.
- * This can happen because mdread doesn't complain about reads beyond
- * EOF (when zero_damaged_pages is ON) and so a previous attempt to
- * read a block beyond EOF could have left a "valid" zero-filled
- * buffer. Unfortunately, we have also seen this case occurring
- * because of buggy Linux kernels that sometimes return an
- * lseek(SEEK_END) result that doesn't account for a recent write. In
- * that situation, the pre-existing buffer would contain valid data
- * that we don't want to overwrite. Since the legitimate case should
- * always have left a zero-filled buffer, complain if not PageIsNew.
+ * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
+ * locked on return.
*/
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (!PageIsNew((Page) bufBlock))
- ereport(ERROR,
- (errmsg("unexpected data beyond EOF in block %u of relation %s",
- blockNum, relpath(smgr->smgr_rlocator, forkNum)),
- errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
-
- /*
- * We *must* do smgrextend before succeeding, else the page will not
- * be reserved by the kernel, and the next P_NEW call will decide to
- * return the same page. Clear the BM_VALID bit, do the StartBufferIO
- * call that BufferAlloc didn't, and proceed.
- */
- if (isLocalBuf)
+ if (!isLocalBuf)
{
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_EXCLUSIVE);
+ else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
}
- else
- {
- /*
- * Loop to handle the very small possibility that someone re-sets
- * BM_VALID between our clearing it and StartBufferIO inspecting
- * it.
- */
- do
- {
- uint32 buf_state = LockBufHdr(bufHdr);
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
- } while (!StartBufferIO(bufHdr, true));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
* if we have gotten to this point, we have allocated a buffer for the
* page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
* if it's a shared buffer.
- *
- * Note: if smgrextend fails, we will end up with a buffer that is
- * allocated but not marked BM_VALID. P_NEW will still select the same
- * block number (because the relation didn't get any longer on disk) and
- * so future attempts to extend the relation will find the same buffer (if
- * it's not been recycled) but come right back here to try smgrextend
- * again.
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (isExtend)
- {
- /* new buffers are zero-filled */
+ /*
+ * Read in the page, unless the caller intends to overwrite it and
+ * just wants us to allocate a buffer.
+ */
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
- /* don't set checksum for all-zero page */
- smgrextend(smgr, forkNum, blockNum, (char *) bufBlock, false);
-
- /*
- * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
- * although we're essentially performing a write. At least on linux
- * doing so defeats the 'delayed allocation' mechanism, leading to
- * increased file fragmentation.
- */
- }
else
{
- /*
- * Read in the page, unless the caller intends to overwrite it and
- * just wants us to allocate a buffer.
- */
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
- else
+ instr_time io_start,
+ io_time;
+
+ if (track_io_timing)
+ INSTR_TIME_SET_CURRENT(io_start);
+
+ smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
+
+ if (track_io_timing)
{
- instr_time io_start,
- io_time;
+ INSTR_TIME_SET_CURRENT(io_time);
+ INSTR_TIME_SUBTRACT(io_time, io_start);
+ pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+ INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+ }
- if (track_io_timing)
- INSTR_TIME_SET_CURRENT(io_start);
-
- smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
-
- if (track_io_timing)
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
{
- INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_SUBTRACT(io_time, io_start);
- pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
- INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
+ MemSet((char *) bufBlock, 0, BLCKSZ);
}
+ else
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
}
}
@@ -1074,7 +1058,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
smgr->smgr_rlocator.backend,
- isExtend,
found);
return BufferDescriptorGetBuffer(bufHdr);
@@ -1617,6 +1600,251 @@ again:
return cur_buf;
}
+
+/*
+ * Limit the number of pins a batch operation may additionally acquire, to
+ * avoid running out of pinnable buffers.
+ *
+ * One additional pin is always allowed, as otherwise the operation likely
+ * cannot be performed at all.
+ *
+ * The number of allowed pins for a backend is computed based on
+ * shared_buffers and the maximum number of connections possible. That's very
+ * pessimistic, but outside of toy-sized shared_buffers it should allow
+ * sufficient pins.
+ */
+static void
+LimitAdditionalPins(uint32 *additional_pins)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ /*
+ * Subtract the approximate number of buffers already pinned by this
+ * backend. We get the number of "overflowed" pins for free, but don't
+ * know the number of pins in PrivateRefCountArray. The cost of
+ * calculating that exactly doesn't seem worth it, so just assume the max.
+ */
+ max_proportional_pins -= PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+ if (max_proportional_pins < 0)
+ max_proportional_pins = 1;
+
+ if (*additional_pins > max_proportional_pins)
+ *additional_pins = max_proportional_pins;
+}
+
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+ SMgrRelation smgr,
+ bool skip_extension_lock,
+ char relpersistence,
+ ForkNumber fork, ReadBufferMode mode,
+ BufferAccessStrategy strategy,
+ uint32 *num_pages,
+ uint32 num_locked_pages,
+ Buffer *buffers)
+{
+ BlockNumber first_block;
+
+ LimitAdditionalPins(num_pages);
+
+ /*
+ * FIXME: limit num_pages / buffers based on NBuffers / MaxBackends or
+ * such. Also keep MAX_SIMUL_LWLOCKS in mind.
+ */
+
+ pgBufferUsage.shared_blks_written += *num_pages;
+
+ /*
+ * Acquire victim buffers for extension without holding extension
+ * lock. Writing out victim buffers is the most expensive part of
+ * extending the relation, particularly when doing so requires WAL
+ * flushes. Zeroing out the buffers is also quite expensive, so do that
+ * before holding the extension lock as well.
+ *
+ * These pages are pinned by us and not valid. While we hold the pin
+ * they can't be acquired as victim buffers by another backend.
+ */
+ for (uint32 i = 0; i < *num_pages; i++)
+ {
+ Block buf_block;
+
+ buffers[i] = GetVictimBuffer(strategy);
+ buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ if (!skip_extension_lock)
+ LockRelationForExtension(rel, ExclusiveLock);
+
+ first_block = smgrnblocks(smgr, fork);
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + *num_pages >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ /*
+ * Insert buffers into buffer table, mark as IO_IN_PROGRESS.
+ *
+ * This needs to happen before we extend the relation, because as soon as
+ * we do, other backends can start to read in those pages.
+ */
+ for (int i = 0; i < *num_pages; i++)
+ {
+ Buffer victim_buf = buffers[i];
+ BufferDesc *victim_buf_hdr = GetBufferDescriptor(victim_buf - 1);
+ BufferTag tag;
+ uint32 hash;
+ LWLock *partition_lock;
+ int existing_id;
+
+ InitBufferTag(&tag, &smgr->smgr_rlocator.locator, fork, first_block + i);
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ existing_id = BufTableInsert(&tag, hash, victim_buf_hdr->buf_id);
+
+ /*
+ * We get here only in the corner case where we are trying to extend
+ * the relation but we found a pre-existing buffer. This can happen
+ * because a prior attempt at extending the relation failed, and
+ * because mdread doesn't complain about reads beyond EOF (when
+ * zero_damaged_pages is ON) and so a previous attempt to read a block
+ * beyond EOF could have left a "valid" zero-filled buffer.
+ * Unfortunately, we have also seen this case occurring because of
+ * buggy Linux kernels that sometimes return an lseek(SEEK_END) result
+ * that doesn't account for a recent write. In that situation, the
+ * pre-existing buffer would contain valid data that we don't want to
+ * overwrite. Since the legitimate cases should always have left a
+ * zero-filled buffer, complain if not PageIsNew.
+ */
+ if (existing_id >= 0)
+ {
+ BufferDesc *existing_hdr = GetBufferDescriptor(existing_id);
+ Block buf_block;
+ bool valid;
+
+ /*
+ * Pin the existing buffer before releasing the partition lock,
+ * preventing it from being evicted.
+ */
+ valid = PinBuffer(existing_hdr, strategy);
+
+ LWLockRelease(partition_lock);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused,
+ * let it be found again quickly.
+ */
+ StrategyFreeBuffer(victim_buf_hdr);
+ UnpinBuffer(victim_buf_hdr);
+
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+ buf_block = BufHdrGetBlock(existing_hdr);
+
+ if (valid && !PageIsNew((Page) buf_block))
+ ereport(ERROR,
+ (errmsg("unexpected data beyond EOF in block %u of relation %s",
+ existing_hdr->tag.blockNum, relpath(smgr->smgr_rlocator, fork)),
+ errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
+
+ /*
+ * We *must* do smgr[zero]extend before succeeding, else the page
+ * will not be reserved by the kernel, and the next P_NEW call
+ * will decide to return the same page. Clear the BM_VALID bit,
+ * do StartBufferIO() and proceed.
+ *
+ * Loop to handle the very small possibility that someone re-sets
+ * BM_VALID between our clearing it and StartBufferIO inspecting
+ * it.
+ */
+ do
+ {
+ uint32 buf_state = LockBufHdr(existing_hdr);
+
+ buf_state &= ~BM_VALID;
+ UnlockBufHdr(existing_hdr, buf_state);
+ } while (!StartBufferIO(existing_hdr, true));
+ }
+ else
+ {
+ uint32 buf_state;
+
+ buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
+ UnlockBufHdr(victim_buf_hdr, buf_state);
+
+ LWLockRelease(partition_lock);
+
+ /* XXX: could combine the locked operations in it with the above */
+ StartBufferIO(victim_buf_hdr, true);
+ }
+ }
+
+ /*
+ * Note: if smgrzeroextend fails, we will end up with buffers that are
+ * allocated but not marked BM_VALID. The next relation extension will
+ * still select the same block number (because the relation didn't get any
+ * longer on disk) and so future attempts to extend the relation will find
+ * the same buffers (if they have not been recycled) but come right back
+ * here to try smgrzeroextend again.
+ *
+ * We don't need to set checksum for all-zero pages.
+ */
+ smgrzeroextend(smgr, fork, first_block, *num_pages, false);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ *
+ * We remove IO_IN_PROGRESS after this, as zeroing the buffer contents and
+ * waking up waiting backends can take noticeable time.
+ */
+ if (!skip_extension_lock)
+ UnlockRelationForExtension(rel, ExclusiveLock);
+
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ for (int i = 0; i < *num_pages; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+
+ if (i < num_locked_pages &&
+ (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK))
+ LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+
+ TerminateBufferIO(buf_hdr, false, BM_VALID);
+ }
+
+ return first_block;
+
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index b1d0c309918..0b5bc0017f1 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -48,6 +48,9 @@ static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
+/* number of local buffers pinned at least once */
+static int NLocalPinnedBuffers = 0;
+
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
@@ -270,6 +273,132 @@ GetLocalVictimBuffer(void)
return BufferDescriptorGetBuffer(bufHdr);
}
+/* see LimitAdditionalPins() */
+static void
+LimitAdditionalLocalPins(uint32 *additional_pins)
+{
+ uint32 max_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ /*
+ * In contrast to LimitAdditionalPins() other backends don't play a role
+ * here. We can allow up to NLocBuffer pins in total.
+ */
+ max_pins = (NLocBuffer - NLocalPinnedBuffers);
+
+ if (*additional_pins >= max_pins)
+ *additional_pins = max_pins;
+}
+
+BlockNumber
+BulkExtendLocalRelationBuffered(SMgrRelation smgr,
+ ForkNumber fork,
+ ReadBufferMode mode,
+ uint32 *num_pages,
+ Buffer *buffers)
+{
+ BlockNumber first_block;
+
+ /* Initialize local buffers if first request in this session */
+ if (LocalBufHash == NULL)
+ InitLocalBuffers();
+
+ LimitAdditionalLocalPins(num_pages);
+
+ pgBufferUsage.temp_blks_written += *num_pages;
+
+ for (uint32 i = 0; i < *num_pages; i++)
+ {
+ BufferDesc *buf_hdr;
+ Block buf_block;
+
+ buffers[i] = GetLocalVictimBuffer();
+ buf_hdr = GetLocalBufferDescriptor(-(buffers[i] + 1));
+ buf_block = LocalBufHdrGetBlock(buf_hdr);
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ first_block = smgrnblocks(smgr, fork);
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + *num_pages >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ for (int i = 0; i < *num_pages; i++)
+ {
+ int victim_buf_id;
+ BufferDesc *victim_buf_hdr;
+ BufferTag tag;
+ LocalBufferLookupEnt *hresult;
+ bool found;
+
+ victim_buf_id = -(buffers[i] + 1);
+ victim_buf_hdr = GetLocalBufferDescriptor(victim_buf_id);
+
+ InitBufferTag(&tag, &smgr->smgr_rlocator.locator, fork, first_block + i);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
+ if (found)
+ {
+ BufferDesc *existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ uint32 buf_state;
+
+ UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));
+
+ existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ PinLocalBuffer(existing_hdr, false);
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+
+ buf_state = pg_atomic_read_u32(&existing_hdr->state);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_DIRTY));
+ buf_state &= ~BM_VALID;
+ pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
+ }
+ else
+ {
+ uint32 buf_state = pg_atomic_read_u32(&victim_buf_hdr->state);
+
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&victim_buf_hdr->state, buf_state);
+
+ hresult->id = victim_buf_id;
+ }
+ }
+
+ /* actually extend relation */
+ smgrzeroextend(smgr, fork, first_block, *num_pages, false);
+
+ for (int i = 0; i < *num_pages; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr;
+ uint32 buf_state;
+
+ buf_hdr = GetLocalBufferDescriptor(-(buf + 1));
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ return first_block;
+}
+
/*
* MarkLocalBufferDirty -
* mark a local buffer dirty
@@ -486,6 +615,7 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
+ NLocalPinnedBuffers++;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
@@ -507,9 +637,11 @@ UnpinLocalBuffer(Buffer buffer)
Assert(BufferIsLocal(buffer));
Assert(LocalRefCount[buffid] > 0);
+ Assert(NLocalPinnedBuffers > 0);
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[buffid]--;
+ if (--LocalRefCount[buffid] == 0)
+ NLocalPinnedBuffers--;
}
/*
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index c064d679e94..f18a8fbaed0 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -55,10 +55,12 @@ provider postgresql {
probe sort__start(int, bool, int, int, bool, int);
probe sort__done(bool, long);
- probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
- probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool);
+ probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+ probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
probe buffer__flush__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
+ probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, int);
+ probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, int, BlockNumber);
probe buffer__checkpoint__start(int);
probe buffer__checkpoint__sync__start();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index cf220c3bcb4..99a78150814 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7379,7 +7379,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
</row>
<row>
<entry><literal>buffer-read-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int)</literal></entry>
<entry>Probe that fires when a buffer read is started.
arg0 and arg1 contain the fork and block numbers of the page (but
arg1 will be -1 if this is a relation extension request).
@@ -7387,12 +7387,11 @@ FROM pg_stat_get_backend_idset() AS backendid;
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.</entry>
+ </entry>
</row>
<row>
<entry><literal>buffer-read-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
<entry>Probe that fires when a buffer read is complete.
arg0 and arg1 contain the fork and block numbers of the page (if this
is a relation extension request, arg1 now contains the block number
@@ -7401,9 +7400,7 @@ FROM pg_stat_get_backend_idset() AS backendid;
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.
- arg7 is true if the buffer was found in the pool, false if not.</entry>
+ arg6 is true if the buffer was found in the pool, false if not.</entry>
</row>
<row>
<entry><literal>buffer-flush-start</literal></entry>
--
2.38.0
v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch
From 2ffc7854ecd27a188d8e6d3dd3191e6fb93f6e61 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:18:18 -0700
Subject: [PATCH v2 08/14] Convert a few places to ExtendRelationBuffered
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/access/brin/brin.c | 13 ++++++----
src/backend/access/brin/brin_pageops.c | 4 +++
src/backend/access/brin/brin_revmap.c | 17 ++++--------
src/backend/access/gin/gininsert.c | 14 +++++-----
src/backend/access/gin/ginutil.c | 15 +++--------
src/backend/access/gin/ginvacuum.c | 8 ++++++
src/backend/access/gist/gist.c | 6 +++--
src/backend/access/gist/gistutil.c | 16 +++---------
src/backend/access/gist/gistvacuum.c | 3 +++
src/backend/access/nbtree/nbtpage.c | 36 ++++++++------------------
src/backend/access/nbtree/nbtree.c | 3 +++
src/backend/access/spgist/spgutils.c | 15 +++--------
contrib/bloom/blutils.c | 14 +++-------
13 files changed, 70 insertions(+), 94 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index de1427a1e0e..1810f7ebfef 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
- meta = ReadBuffer(index, P_NEW);
+ meta = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
- LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);
@@ -896,9 +898,10 @@ brinbuildempty(Relation index)
Buffer metabuf;
/* An empty BRIN index has a metapage only. */
- metabuf =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metabuf = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
/* Initialize and xlog metabuffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index ad5a89bd051..b578d259545 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -730,6 +730,10 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* There's not enough free space in any existing index page,
* according to the FSM: extend the relation to obtain a shiny new
* page.
+ *
+ * XXX: It's likely possible to use RBM_ZERO_AND_LOCK here,
+ * which'd avoid the need to hold the extension lock during buffer
+ * reclaim.
*/
if (!RELATION_IS_LOCAL(irel))
{
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 7fc5226bf74..7ae9cecf43d 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -538,7 +538,6 @@ revmap_physical_extend(BrinRevmap *revmap)
BlockNumber mapBlk;
BlockNumber nblocks;
Relation irel = revmap->rm_irel;
- bool needLock = !RELATION_IS_LOCAL(irel);
/*
* Lock the metapage. This locks out concurrent extensions of the revmap,
@@ -570,10 +569,10 @@ revmap_physical_extend(BrinRevmap *revmap)
}
else
{
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buf = ReadBuffer(irel, P_NEW);
+ buf = ExtendRelationBuffered(irel, NULL, false,
+ irel->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
if (BufferGetBlockNumber(buf) != mapBlk)
{
/*
@@ -582,17 +581,11 @@ revmap_physical_extend(BrinRevmap *revmap)
* up and have caller start over. We will have to evacuate that
* page from under whoever is using it.
*/
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buf);
+ UnlockReleaseBuffer(buf);
return;
}
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buf);
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d5d748009ea..ea65b460c72 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -440,12 +440,14 @@ ginbuildempty(Relation index)
MetaBuffer;
/* An empty GIN index has two pages. */
- MetaBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- RootBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ MetaBuffer = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
+ RootBuffer = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
/* Initialize and xlog metabuffer and root buffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index e7cc452a8aa..c0362ab384c 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -298,7 +298,6 @@ Buffer
GinNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -326,16 +325,10 @@ GinNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, GIN_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendRelationBuffered(index, NULL, false,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index e5d310d8362..13251d7e07d 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -736,6 +736,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
needLock = !RELATION_IS_LOCAL(index);
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
npages = RelationGetNumberOfBlocks(index);
@@ -786,6 +790,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
stats->num_pages = RelationGetNumberOfBlocks(index);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ba394f08f61..6dfb07a45b7 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -133,8 +133,10 @@ gistbuildempty(Relation index)
Buffer buffer;
/* Initialize the root page */
- buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = ExtendRelationBuffered(index, NULL, true,
+ index->rd_rel->relpersistence,
+ INIT_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
/* Initialize and xlog buffer */
START_CRIT_SECTION();
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index 56451fede10..c8d57f06d20 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -824,7 +824,6 @@ Buffer
gistNewBuffer(Relation r)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -877,17 +876,10 @@ gistNewBuffer(Relation r)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(r);
-
- if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
-
- buffer = ReadBuffer(r, P_NEW);
- LockBuffer(buffer, GIST_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ buffer = ExtendRelationBuffered(r, NULL, false,
+ r->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3f60d3274d2..cc711b04986 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -203,6 +203,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* we must already have processed any tuples due to be moved into such a
* page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 3feee28d197..1733f2a18ed 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -881,7 +881,6 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
else
{
- bool needLock;
Page page;
Assert(access == BT_WRITE);
@@ -962,31 +961,18 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
/*
- * Extend the relation by one page.
- *
- * We have to use a lock to ensure no one else is extending the rel at
- * the same time, else we will both try to initialize the same new
- * page. We can skip locking for new or temp relations, however,
- * since no one else could be accessing them.
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
+ * we risk a race condition against btvacuumscan --- see comments
+ * therein. This forces us to repeat the valgrind request that
+ * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
+ * without introducing a race.
*/
- needLock = !RELATION_IS_LOCAL(rel);
-
- if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
-
- buf = ReadBuffer(rel, P_NEW);
-
- /* Acquire buffer lock on new page */
- _bt_lockbuf(rel, buf, BT_WRITE);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to
- * extend the relation some more. Note that we cannot release this
- * lock before we have buffer lock on the new page, or we risk a race
- * condition against btvacuumscan --- see comments therein.
- */
- if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ buf = ExtendRelationBuffered(rel, NULL, false,
+ rel->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 1cc88da032d..383f18a999e 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -969,6 +969,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* write-lock on the left page before it adds a right page, so we must
* already have processed any tuples due to be moved into such a page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 3761f2c193b..f459b45eaa9 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -365,7 +365,6 @@ Buffer
SpGistNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -405,16 +404,10 @@ SpGistNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendRelationBuffered(index, NULL, false,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a6d9f09f315..f700e7c9f0b 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -353,7 +353,6 @@ Buffer
BloomNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -387,15 +386,10 @@ BloomNewBuffer(Relation index)
}
/* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendRelationBuffered(index, NULL, false,
+ index->rd_rel->relpersistence,
+ MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+ NULL);
return buffer;
}
--
2.38.0
v2-0009-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch
From ede000ac82133434ebe293506d5d434f6f360904 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v2 09/14] heapam: Add num_pages to RelationGetBufferForTuple()
This will be useful to compute the number of pages to extend a relation by.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/access/hio.h | 14 +++++++-
src/backend/access/heap/heapam.c | 60 +++++++++++++++++++++++++++++---
src/backend/access/heap/hio.c | 8 ++++-
3 files changed, 76 insertions(+), 6 deletions(-)
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 3f20b585326..dd61462d988 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -30,6 +30,17 @@ typedef struct BulkInsertStateData
{
BufferAccessStrategy strategy; /* our BULKWRITE strategy object */
Buffer current_buf; /* current insertion target page */
+
+ /*
+ * State for bulk extensions. Further pages that were unused at the time
+ * of the extension. They might be in use by the time we use them though,
+ * so rechecks are needed.
+ *
+ * FIXME: Perhaps these should live in RelationData instead, alongside the
+ * targetblock?
+ */
+ BlockNumber next_free;
+ BlockNumber last_free;
} BulkInsertStateData;
@@ -38,6 +49,7 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertStateData *bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages);
#endif /* HIO_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 63c4f01f0fd..e73a750892c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1979,6 +1979,8 @@ GetBulkInsertState(void)
bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
bistate->current_buf = InvalidBuffer;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
return bistate;
}
@@ -2052,7 +2054,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ 0);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2255,6 +2258,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
return tup;
}
+/*
+ * Helper for heap_multi_insert() that computes the number of pages needed
+ * to insert the remaining tuples, in the worst case.
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+ size_t page_avail;
+ int npages = 0;
+
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ npages++;
+
+ for (int i = done; i < ntuples; i++)
+ {
+ size_t tup_sz = sizeof(ItemIdData) + MAXALIGN(heaptuples[i]->t_len);
+
+ if (page_avail < tup_sz)
+ {
+ npages++;
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ }
+ page_avail -= tup_sz;
+ }
+
+ return npages;
+}
+
/*
* heap_multi_insert - insert multiple tuples into a heap
*
@@ -2281,6 +2311,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ bool starting_with_empty_page = false;
+ int npages = 0;
+ int npages_used = 0;
/* currently not needed (thus unsupported) for heap_multi_insert() */
Assert(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2331,13 +2364,30 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- bool starting_with_empty_page;
bool all_visible_cleared = false;
bool all_frozen_set = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
+ /*
+ * Compute number of pages needed to insert tuples in the worst
+ * case. This will be used to determine how much to extend the
+ * relation by in RelationGetBufferForTuple(), if needed. If we
+ * filled a prior page from scratch, we can just update our last
+ * computation, but if we started with a partially filled page we
+ * recompute from scratch, since the number of potentially required pages
+ * can vary due to tuples needing to fit onto the page, page headers,
+ * etc.
+ */
+ if (ndone == 0 || !starting_with_empty_page)
+ {
+ npages = heap_multi_insert_pages(heaptuples, ndone, ntuples, saveFreeSpace);
+ npages_used = 0;
+ }
+ else
+ npages_used++;
+
/*
* Find buffer where at least the next tuple will fit. If the page is
* all-visible, this will also pin the requisite visibility map page.
@@ -2347,7 +2397,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ npages - npages_used);
page = BufferGetPage(buffer);
starting_with_empty_page = PageGetMaxOffsetNumber(page) == 0;
@@ -3770,7 +3821,8 @@ l2:
/* It doesn't fit, must use RelationGetBufferForTuple. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ 0);
/* We're all done. */
break;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7479212d4e0..65886839e70 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -275,6 +275,11 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
* Returns pinned and exclusive-locked buffer of a page in given relation
* with free space >= given len.
*
+ * If num_pages is > 1, the relation will be extended by at least that many
+ * pages when we decide to extend the relation. This is more efficient for
+ * callers that know they will need multiple pages
+ * (e.g. heap_multi_insert()).
+ *
* If otherBuffer is not InvalidBuffer, then it references a previously
* pinned buffer of another page in the same relation; on return, this
* buffer will also be exclusive-locked. (This case is used by heap_update;
@@ -333,7 +338,8 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
--
2.38.0
v2-0010-hio-Use-BulkExtendRelationBuffered.patch
From 57b2af7bd09b8baa5e0d0b0cca07816902b9e759 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v2 10/14] hio: Use BulkExtendRelationBuffered()
---
src/backend/access/heap/hio.c | 181 +++++++++++++++++++++++++++++++---
1 file changed, 170 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e70..f40439d2778 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -24,6 +24,7 @@
#include "storage/lmgr.h"
#include "storage/smgr.h"
+#define NEW_EXTEND
/*
* RelationPutHeapTuple - place tuple at specified page
@@ -185,6 +186,8 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
}
}
+#ifndef NEW_EXTEND
+
/*
* Extend a relation by multiple blocks to avoid future contention on the
* relation extension lock. Our goal is to pre-extend the relation by an
@@ -268,6 +271,7 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
*/
FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
}
+#endif
/*
* RelationGetBufferForTuple
@@ -354,6 +358,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
len = MAXALIGN(len); /* be conservative */
+ if (num_pages <= 0)
+ num_pages = 1;
+
/* Bulk insert is not supported for updates, only inserts. */
Assert(otherBuffer == InvalidBuffer || !bistate);
@@ -558,18 +565,46 @@ loop:
ReleaseBuffer(buffer);
}
- /* Without FSM, always fall out of the loop and extend */
- if (!use_fsm)
- break;
+ if (bistate
+ && bistate->next_free != InvalidBlockNumber
+ && bistate->next_free <= bistate->last_free)
+ {
+ /*
+ * We bulk extended the relation before, and there are still some
+ * unused pages from that extension, so we don't need to look in
+ * the FSM for a new page. But do record the free space from the
+ * last page, somebody might insert narrower tuples later.
+ */
+ if (use_fsm)
+ RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
- /*
- * Update FSM as to condition of this page, and ask for another page
- * to try.
- */
- targetBlock = RecordAndGetPageWithFreeSpace(relation,
- targetBlock,
- pageFreeSpace,
- targetFreeSpace);
+ Assert(bistate->last_free != InvalidBlockNumber &&
+ bistate->next_free <= bistate->last_free);
+ targetBlock = bistate->next_free;
+ if (bistate->next_free >= bistate->last_free)
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ else
+ bistate->next_free++;
+ }
+ else if (!use_fsm)
+ {
+ /* Without FSM, always fall out of the loop and extend */
+ break;
+ }
+ else
+ {
+ /*
+ * Update FSM as to condition of this page, and ask for another page
+ * to try.
+ */
+ targetBlock = RecordAndGetPageWithFreeSpace(relation,
+ targetBlock,
+ pageFreeSpace,
+ targetFreeSpace);
+ }
}
/*
@@ -582,6 +617,129 @@ loop:
*/
needLock = !RELATION_IS_LOCAL(relation);
+#ifdef NEW_EXTEND
+ {
+#define MAX_BUFFERS 64
+ Buffer victim_buffers[MAX_BUFFERS];
+ BlockNumber firstBlock = InvalidBlockNumber;
+ BlockNumber firstBlockFSM = InvalidBlockNumber;
+ BlockNumber curBlock;
+ uint32 extend_by_pages;
+ uint32 no_fsm_pages;
+ uint32 waitcount;
+
+ extend_by_pages = num_pages;
+
+ /*
+ * Multiply the number of pages to extend by the number of waiters. Do
+ * this even if we're not using the FSM, as it does relieve
+ * contention. Pages will be found via bistate->next_free.
+ */
+ if (needLock)
+ waitcount = RelationExtensionLockWaiterCount(relation);
+ else
+ waitcount = 0;
+ extend_by_pages += extend_by_pages * waitcount;
+
+ /*
+ * Can't extend by more than MAX_BUFFERS, as we need to pin them all
+ * concurrently. FIXME: Need an NBuffers / MaxBackends type limit
+ * here.
+ */
+ extend_by_pages = Min(extend_by_pages, MAX_BUFFERS);
+
+ /*
+ * How many of the extended pages not to enter into the FSM.
+ *
+ * Only enter pages that we don't need ourselves into the
+ * FSM. Otherwise every other backend will immediately try to use the
+ * pages this backend needs itself, causing unnecessary contention.
+ *
+ * Bulk extended pages are remembered in bistate->next_free. So
+ * without a bistate we can't directly make use of them.
+ *
+ * Never enter the page returned into the FSM, we'll immediately use
+ * it.
+ */
+ if (num_pages > 1 && bistate == NULL)
+ no_fsm_pages = 1;
+ else
+ no_fsm_pages = num_pages;
+
+ if (bistate && bistate->current_buf != InvalidBuffer)
+ {
+ ReleaseBuffer(bistate->current_buf);
+ bistate->current_buf = InvalidBuffer;
+ }
+
+ firstBlock = BulkExtendRelationBuffered(relation,
+ NULL,
+ false,
+ relation->rd_rel->relpersistence,
+ MAIN_FORKNUM,
+ RBM_ZERO_AND_LOCK,
+ bistate ? bistate->strategy : NULL,
+ &extend_by_pages,
+ 1,
+ victim_buffers);
+ /*
+ * Relation is now extended. Make all but the first buffer available
+ * to other backends.
+ *
+ * XXX: We don't necessarily need to release pin / update FSM while
+ * holding the extension lock. But there are some advantages.
+ */
+ curBlock = firstBlock;
+ for (uint32 i = 0; i < extend_by_pages; i++, curBlock++)
+ {
+ Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
+ Assert(BlockNumberIsValid(curBlock));
+
+ /* don't release the pin on the page returned by this function */
+ if (i > 0)
+ ReleaseBuffer(victim_buffers[i]);
+
+ if (i >= no_fsm_pages && use_fsm)
+ {
+ if (firstBlockFSM == InvalidBlockNumber)
+ firstBlockFSM = curBlock;
+
+ RecordPageWithFreeSpace(relation,
+ curBlock,
+ BufferGetPageSize(victim_buffers[i]) - SizeOfPageHeaderData);
+ }
+ }
+
+ if (use_fsm && firstBlockFSM != InvalidBlockNumber)
+ FreeSpaceMapVacuumRange(relation, firstBlockFSM, firstBlock + num_pages);
+
+ if (bistate)
+ {
+ if (extend_by_pages > 1)
+ {
+ bistate->next_free = firstBlock + 1;
+ bistate->last_free = firstBlock + extend_by_pages - 1;
+ }
+ else
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ }
+
+ buffer = victim_buffers[0];
+ if (bistate)
+ {
+ IncrBufferRefCount(buffer);
+ bistate->current_buf = buffer;
+ }
+#if 0
+ ereport(LOG, errmsg("block start %u, size %zu, requested pages: %u, extend_by_pages: %d, waitcount: %d",
+ firstBlock, len, num_pages, extend_by_pages, waitcount),
+ errhidestmt(true), errhidecontext(true));
+#endif
+ }
+#else
/*
* If we need the lock but are not able to acquire it immediately, we'll
* consider extending the relation by multiple blocks at a time to manage
@@ -635,6 +793,7 @@ loop:
*/
if (needLock)
UnlockRelationForExtension(relation, ExclusiveLock);
+#endif
/*
* We need to initialize the empty new page. Double-check that it really
--
2.38.0
v2-0011-bufmgr-debug-Add-PrintBuffer-Desc.patch
From 8b8a853c378c8f409f8d173a54d1da0ba4d047d6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:41:46 -0700
Subject: [PATCH v2 11/14] bufmgr: debug: Add PrintBuffer[Desc]
Useful for development. Perhaps we should polish these and keep them?
---
src/include/storage/buf_internals.h | 3 ++
src/backend/storage/buffer/bufmgr.c | 49 +++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 57800254d2d..b651838b61a 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -390,6 +390,9 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
+extern void PrintBuffer(Buffer buffer, const char *msg);
+extern void PrintBufferDesc(BufferDesc *buf_hdr, const char *msg);
+
/* freelist.c */
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
uint32 *buf_state);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 361ebc3ae26..d3134cecf2d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3723,6 +3723,55 @@ DropDatabaseBuffers(Oid dbid)
* use only.
* -----------------------------------------------------------------
*/
+
+#include "utils/memutils.h"
+
+void
+PrintBufferDesc(BufferDesc *buf_hdr, const char *msg)
+{
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ uint32 buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ char *path = "";
+ BlockNumber blockno = InvalidBlockNumber;
+
+ CurrentMemoryContext->allowInCritSection = true;
+ if (buf_state & BM_TAG_VALID)
+ {
+ path = relpathbackend(BufTagGetRelFileLocator(&buf_hdr->tag),
+ InvalidBackendId, BufTagGetForkNum(&buf_hdr->tag));
+ blockno = buf_hdr->tag.blockNum;
+ }
+
+ fprintf(stderr, "%d: [%u] msg: %s, rel: %s, block %u: refcount: %u / %u, usagecount: %u, flags:%s%s%s%s%s%s%s%s%s%s\n",
+ MyProcPid,
+ buffer,
+ msg,
+ path,
+ blockno,
+ BUF_STATE_GET_REFCOUNT(buf_state),
+ GetPrivateRefCount(buffer),
+ BUF_STATE_GET_USAGECOUNT(buf_state),
+ buf_state & BM_LOCKED ? " BM_LOCKED" : "",
+ buf_state & BM_DIRTY ? " BM_DIRTY" : "",
+ buf_state & BM_VALID ? " BM_VALID" : "",
+ buf_state & BM_TAG_VALID ? " BM_TAG_VALID" : "",
+ buf_state & BM_IO_IN_PROGRESS ? " BM_IO_IN_PROGRESS" : "",
+ buf_state & BM_IO_ERROR ? " BM_IO_ERROR" : "",
+ buf_state & BM_JUST_DIRTIED ? " BM_JUST_DIRTIED" : "",
+ buf_state & BM_PIN_COUNT_WAITER ? " BM_PIN_COUNT_WAITER" : "",
+ buf_state & BM_CHECKPOINT_NEEDED ? " BM_CHECKPOINT_NEEDED" : "",
+ buf_state & BM_PERMANENT ? " BM_PERMANENT" : ""
+ );
+}
+
+void
+PrintBuffer(Buffer buffer, const char *msg)
+{
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);
+
+ PrintBufferDesc(buf_hdr, msg);
+}
+
#ifdef NOT_USED
void
PrintBufferDescs(void)
--
2.38.0
On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
Thanks for letting me know. Updated version attached.
I'm not too sure I've qualified for giving a meaningful design review
here, but I have started looking at the patches and so far only made
it as far as 0006.
I noted down the following while reading:
v2-0001:
1. BufferCheckOneLocalPin needs a header comment
v2-0002:
2. The following comment and corresponding code to release the
extension lock has been moved now.
/*
* Release the file-extension lock; it's now OK for someone else to extend
* the relation some more.
*/
I think it's worth detailing out why it's fine to release the
extension lock in the new location. You've added detail to the commit
message but I think you need to do the same in the comments too.
v2-0003
3. FileFallocate() and FileZero() should likely document what they
return, i.e zero on success and non-zero on failure.
4. I'm not quite clear on why you've modified FileGetRawDesc() to call
FileAccess() twice.
v2-0004:
5. Is it worth having two versions of PinLocalBuffer() one to adjust
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?
v2-0005
v2-0006
David
I'll continue reviewing this, but here's some feedback on the first two
patches:
v2-0001-aio-Add-some-error-checking-around-pinning.patch:
I wonder if the extra assertion in LockBufHdr() is worth the overhead.
It won't add anything without assertions, of course, but still. No
objections if you think it's worth it.
v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:
Looks good as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
which zeroes the page, and then we call PageInit to zero the page again.
RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
previously, but with P_NEW, that is always true.
- Heikki
v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
This can be applied separately from the rest of the patches, which is
nice. Some small comments on it:
* Needs a rebase, it conflicted slightly with commit f30d62c2fc.
* GetVictimBuffer needs a comment to explain what it does. In
particular, mention that it returns a buffer that's pinned and known
!BM_TAG_VALID.
* I suggest renaming 'cur_buf' and other such local variables in
GetVictimBuffer to just 'buf'. 'cur' prefix suggests that there is some
other buffer involved too, but there is no 'prev' or 'next' or 'other'
buffer. The old code called it just 'buf' too, and before this patch it
actually was a bit confusing because there were two buffers involved.
But with this patch, GetVictimBuffer only deals with one buffer at a time.
* This FIXME:
/* OK, do the I/O */
/* FIXME: These used the wrong smgr before afaict? */
{
    SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
                                 InvalidBackendId);

    TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
                                              buf_hdr->tag.blockNum,
                                              smgr->smgr_rlocator.locator.spcOid,
                                              smgr->smgr_rlocator.locator.dbOid,
                                              smgr->smgr_rlocator.locator.relNumber);

    FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
    LWLockRelease(content_lock);
    ScheduleBufferTagForWriteback(&BackendWritebackContext, &buf_hdr->tag);

    TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
                                             buf_hdr->tag.blockNum,
                                             smgr->smgr_rlocator.locator.spcOid,
                                             smgr->smgr_rlocator.locator.dbOid,
                                             smgr->smgr_rlocator.locator.relNumber);
}
I believe that was intentional. The probes previously reported the block
and relation whose read *caused* the eviction. It was not just the smgr
but also the blockNum and forkNum that referred to the block that was
being read. There's another pair of probe points,
TRACE_POSTGRESQL_BUFFER_FLUSH_START/DONE, inside FlushBuffer that
indicate the page that is being flushed.
I see that reporting the evicted page is more convenient with this
patch, otherwise you'd need to pass the smgr and blocknum of the page
that's being read to InvalidateVictimBuffer(). IMHO you can just remove
these probe points. We don't need to bend over backwards to maintain
specific probe points.
* InvalidateVictimBuffer reads the buffer header with an atomic read op,
just to check if BM_TAG_VALID is set. If it's not, it does nothing
(except for a few Asserts). But the caller has already read the buffer
header. Consider refactoring it so that the caller checks BM_TAG_VALID,
and only calls InvalidateVictimBuffer if it's set, saving one atomic
read in InvalidateVictimBuffer. I think it would be just as readable, so
no loss there. I doubt the atomic read makes any measurable performance
difference, but it looks redundant.
* I don't understand this comment:
/*
* Clear out the buffer's tag and flags and usagecount. We must do
* this to ensure that linear scans of the buffer array don't think
* the buffer is valid.
*
* XXX: This is a pre-existing comment I just moved, but isn't it
* entirely bogus with regard to the tag? We can't do anything with
* the buffer without taking BM_VALID / BM_TAG_VALID into
* account. Likely doesn't matter because we're already dirtying the
* cacheline, but still.
*
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
UnlockBufHdr(buf_hdr, buf_state);
What exactly is wrong with clearing the tag? What does dirtying the
cacheline have to do with the correctness here?
* pgindent
- Heikki
Hi,
On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
This can be applied separately from the rest of the patches, which is nice.
Some small comments on it:
Thanks for looking at these!
* Needs a rebase, it conflicted slightly with commit f30d62c2fc.
Will work on that.
* GetVictimBuffer needs a comment to explain what it does. In particular,
mention that it returns a buffer that's pinned and known !BM_TAG_VALID.
Will add.
* I suggest renaming 'cur_buf' and other such local variables in
GetVictimBuffer to just 'buf'. 'cur' prefix suggests that there is some
other buffer involved too, but there is no 'prev' or 'next' or 'other'
buffer. The old code called it just 'buf' too, and before this patch it
actually was a bit confusing because there were two buffers involved. But
with this patch, GetVictimBuffer only deals with one buffer at a time.
Hm. Yea. I probably ended up with these names because initially
GetVictimBuffer() wasn't a separate function, and I indeed constantly got
confused by which buffer was referenced.
* This FIXME:
/* OK, do the I/O */
/* FIXME: These used the wrong smgr before afaict? */
{
    SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
                                 InvalidBackendId);

    TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
                                              buf_hdr->tag.blockNum,
                                              smgr->smgr_rlocator.locator.spcOid,
                                              smgr->smgr_rlocator.locator.dbOid,
                                              smgr->smgr_rlocator.locator.relNumber);

    FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
    LWLockRelease(content_lock);
    ScheduleBufferTagForWriteback(&BackendWritebackContext, &buf_hdr->tag);

    TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
                                             buf_hdr->tag.blockNum,
                                             smgr->smgr_rlocator.locator.spcOid,
                                             smgr->smgr_rlocator.locator.dbOid,
                                             smgr->smgr_rlocator.locator.relNumber);
}

I believe that was intentional. The probes previously reported the block and
relation whose read *caused* the eviction. It was not just the smgr but also
the blockNum and forkNum that referred to the block that was being read.
You're probably right. It's certainly not understandable from our docs
though:
<row>
<entry><literal>buffer-write-dirty-start</literal></entry>
<entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
<entry>Probe that fires when a server process begins to write a dirty
buffer. (If this happens often, it implies that
<xref linkend="guc-shared-buffers"/> is too
small or the background writer control parameters need adjustment.)
arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.</entry>
</row>
I see that reporting the evicted page is more convenient with this patch,
otherwise you'd need to pass the smgr and blocknum of the page that's being
read to InvalidateVictimBuffer(). IMHO you can just remove these probe
points. We don't need to bend over backwards to maintain specific probe
points.
Agreed.
* InvalidateVictimBuffer reads the buffer header with an atomic read op,
just to check if BM_TAG_VALID is set.
It's not a real atomic op, in the sense of being a special instruction. It does
force the compiler to actually read from memory, but that's it.
But you're right, even that is unnecessary. I think it ended up that way
because I also wanted the full buf_hdr, and it seemed somewhat error prone to
pass in both.
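
For illustration, the caller-side shape you're suggesting would be roughly the
following (just a sketch, assuming InvalidateVictimBuffer() keeps taking the
buffer header, as in the patch):

    /* in GetVictimBuffer(), after pinning the buffer */
    buf_state = pg_atomic_read_u32(&buf_hdr->state);

    /* only buffers with a valid tag have a mapping table entry to remove */
    if (buf_state & BM_TAG_VALID)
        InvalidateVictimBuffer(buf_hdr);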
* I don't understand this comment:
/*
* Clear out the buffer's tag and flags and usagecount. We must do
* this to ensure that linear scans of the buffer array don't think
* the buffer is valid.
*
* XXX: This is a pre-existing comment I just moved, but isn't it
* entirely bogus with regard to the tag? We can't do anything with
* the buffer without taking BM_VALID / BM_TAG_VALID into
* account. Likely doesn't matter because we're already dirtying the
* cacheline, but still.
*
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the
cacheline have to do with the correctness here?
There's nothing wrong with clearing out the tag, but I don't think it's a hard
requirement today, and certainly not for the reason stated above.
Validity of the buffer isn't determined by the tag, it's determined by
BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).
Without either having pinned the buffer, or holding the buffer header
spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
knows that, and re-checks the tag after locking the buffer header
spinlock.
Afaict, there'd be no correctness issue with removing the
ClearBufferTag(). There would be an efficiency issue though, because when
encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
which'd find that BM_[TAG_]VALID isn't set, and not do anything.
Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further. Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.
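
Concretely, the pre-check I have in mind is just something like this at the
top of the per-buffer loop (sketch only; the surrounding loop, matching and
locking are the existing code):

    for (int i = 0; i < NBuffers; i++)
    {
        BufferDesc *buf_hdr = GetBufferDescriptor(i);
        uint32      buf_state = pg_atomic_read_u32(&buf_hdr->state);

        /*
         * Buffers that never had a tag set can't belong to any of the
         * target relations; skip them without taking the header spinlock.
         * The read is racy, but the tag is re-checked under LockBufHdr()
         * further down, as the existing code already does.
         */
        if (!(buf_state & BM_TAG_VALID))
            continue;

        /* ... existing per-relation matching and InvalidateBuffer() ... */
    }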
Greetings,
Andres Freund
On 2023-02-11 13:36:51 -0800, Andres Freund wrote:
Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further. Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.
There's a small regression for a single relation, but after that it's a clear
benefit.
32GB shared buffers, empty. The test creates N new relations and then rolls
back.
                   tps        tps
num relations      HEAD       precheck
1                  46.11      45.22
2                  43.24      44.87
4                  35.14      44.20
8                  28.72      42.79
I don't understand the regression at 1, TBH. I think it must be a random code
layout issue, because the same pre-check in DropRelationBuffers() (exercised
via TRUNCATE of a newly created relation) shows a tiny speedup.
Hi,
On 2023-02-10 18:38:50 +0200, Heikki Linnakangas wrote:
I'll continue reviewing this, but here's some feedback on the first two
patches:

v2-0001-aio-Add-some-error-checking-around-pinning.patch:
I wonder if the extra assertion in LockBufHdr() is worth the overhead. It
won't add anything without assertions, of course, but still. No objections
if you think it's worth it.
It's so easy to get confused about local/non-local buffers, that I think it is
useful. I think we really need to consider cleaning up the separation
further. Having half the code for local buffers in bufmgr.c and the other half
in localbuf.c, without a scheme that I can recognize, is not a good scheme.
It bothers me somewhat that ConditionalLockBufferForCleanup() silently accepts
multiple pins by the current backend. That's the right thing for
e.g. heap_page_prune_opt(), but for something like lazy_scan_heap() it's
not. And yes, I did encounter a bug hidden by that when making vacuumlazy use
AIO as part of that patchset. That's why I made BufferCheckOneLocalPin()
externally visible.
v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:
Looks good as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
which zeroes the page, and then we call PageInit to zero the page again.
RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
previously, but with P_NEW, that is always true.
It is quite silly, and it shows up noticeably in profiles. The zeroing is
definitely needed in other places calling PageInit(), though. I suspect we
should have a PageInitZeroed() or such, that asserts the page is zero, but
otherwise skips it.
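
Roughly, something like PageInit() minus the MemSet() (sketch only;
PageInitZeroed is just a placeholder name and the assertion could be made
stricter, e.g. checking every byte under USE_ASSERT_CHECKING):

    void
    PageInitZeroed(Page page, Size pageSize, Size specialSize)
    {
        PageHeader  p = (PageHeader) page;

        /* the caller promises the page is already all-zeroes */
        Assert(PageIsNew(page));

        specialSize = MAXALIGN(specialSize);
        Assert(pageSize == BLCKSZ);
        Assert(pageSize > specialSize + SizeOfPageHeaderData);

        p->pd_lower = SizeOfPageHeaderData;
        p->pd_upper = pageSize - specialSize;
        p->pd_special = pageSize - specialSize;
        PageSetPageSizeAndVersion(page, pageSize, PG_PAGE_LAYOUT_VERSION);
    }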
Seems independent enough from this series, that I'd probably tackle it
separately? If you prefer, I'm ok with adding a patch to this series instead,
though.
Greetings,
Andres Freund
Hi,
On 2023-01-20 13:40:55 +1300, David Rowley wrote:
On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:
Thanks for letting me know. Updated version attached.
I'm not too sure I've qualified for giving a meaningful design review
here, but I have started looking at the patches and so far only made
it as far as 0006.
Thanks!
I noted down the following while reading:
v2-0001:
1. BufferCheckOneLocalPin needs a header comment
v2-0002:
2. The following comment and corresponding code to release the
extension lock has been moved now.

/*
 * Release the file-extension lock; it's now OK for someone else to extend
 * the relation some more.
 */

I think it's worth detailing out why it's fine to release the
extension lock in the new location. You've added detail to the commit
message but I think you need to do the same in the comments too.
Will do.
v2-0003
3. FileFallocate() and FileZero() should likely document what they
return, i.e zero on success and non-zero on failure.
I guess I just tried to fit in with the rest of the file :)
4. I'm not quite clear on why you've modified FileGetRawDesc() to call
FileAccess() twice.
I do not have the faintest idea what happened there... Will fix.
v2-0004:
5. Is it worth having two versions of PinLocalBuffer() one to adjust
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?
I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.
Thanks,
Andres Freund
Hi,
On 2023-02-11 14:25:06 -0800, Andres Freund wrote:
On 2023-01-20 13:40:55 +1300, David Rowley wrote:
v2-0004:
5. Is it worth having two versions of PinLocalBuffer() one to adjust
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?

I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.
Ah, no, we need it for the return value. No current users of
PinLocalBuffer(adjust_usagecount = false)
need the return value, but I don't think that will necessarily stay the case.
I'm somewhat inclined to not duplicate it, but if you think it's worth it,
I'll do that.
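For context, here is a rough reconstruction of the shape being discussed (a sketch, not the patch verbatim; details may differ from the posted version). The pg_atomic_read_u32() also feeds the BM_VALID return value, which is why it can't simply be skipped when adjust_usagecount is false:

bool
PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
{
    uint32      buf_state;
    Buffer      buffer = BufferDescriptorGetBuffer(buf_hdr);
    int         bufid = -buffer - 1;

    /* needed for the return value even if we leave the usagecount alone */
    buf_state = pg_atomic_read_u32(&buf_hdr->state);

    if (LocalRefCount[bufid] == 0)
    {
        if (adjust_usagecount &&
            BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
        {
            buf_state += BUF_USAGECOUNT_ONE;
            /* local buffers are not accessed concurrently */
            pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
        }
    }
    LocalRefCount[bufid]++;
    ResourceOwnerRememberBuffer(CurrentResourceOwner, buffer);

    return buf_state & BM_VALID;
}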
Greetings,
Andres Freund
On 11/02/2023 23:36, Andres Freund wrote:
On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:
* I don't understand this comment:
/*
* Clear out the buffer's tag and flags and usagecount. We must do
* this to ensure that linear scans of the buffer array don't think
* the buffer is valid.
*
* XXX: This is a pre-existing comment I just moved, but isn't it
* entirely bogus with regard to the tag? We can't do anything with
* the buffer without taking BM_VALID / BM_TAG_VALID into
* account. Likely doesn't matter because we're already dirtying the
* cacheline, but still.
*
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the
cacheline have to do with the correctness here?

There's nothing wrong with clearing out the tag, but I don't think it's a hard
requirement today, and certainly not for the reason stated above.

Validity of the buffer isn't determined by the tag, it's determined by
BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).

Without either having pinned the buffer, or holding the buffer header
spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
knows that, and re-checks the tag after locking the buffer header
spinlock.

Afaict, there'd be no correctness issue with removing the
ClearBufferTag(). There would be an efficiency issue though, because when
encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
which'd find that BM_[TAG_]VALID isn't set, and not do anything.
Okay, gotcha.
Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further. Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.
Depends on what percentage of buffers are valid, I guess. If all buffers
are valid, checking BM_TAG_VALID first would lose. In practice, I doubt
it makes any measurable difference either way.
Since we're micro-optimizing, I noticed that
BufTagMatchesRelFileLocator() compares the fields in order "spcOid,
dbOid, relNumber". Before commit 82ac34db20, we used
RelFileLocatorEquals(), which has this comment:
/*
 * Note: RelFileLocatorEquals and RelFileLocatorBackendEquals compare relNumber
 * first since that is most likely to be different in two unequal
 * RelFileLocators. It is probably redundant to compare spcOid if the other
 * fields are found equal, but do it anyway to be sure. Likewise for checking
 * the backend ID in RelFileLocatorBackendEquals.
 */
So we lost that micro-optimization. Should we reorder the checks in
BufTagMatchesRelFileLocator()?
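For concreteness, the reordered check might look like this (illustration only, following the field order of the current BufferTag and RelFileLocator):

static inline bool
BufTagMatchesRelFileLocator(const BufferTag *tag,
                            const RelFileLocator *rlocator)
{
    /* compare relNumber first, as it is the field most likely to differ */
    return (tag->relNumber == rlocator->relNumber) &&
        (tag->dbOid == rlocator->dbOid) &&
        (tag->spcOid == rlocator->spcOid);
}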
- Heikki
v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
This looks straightforward. My only concern is that it changes the order
that things happen at abort. Currently, AbortBufferIO() is called very
early in AbortTransaction(), and this patch moves it much later. I don't
see any immediate problems from that, but it feels scary.
@@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();
Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
elog(FATAL)? Do we need to worry about that?
- Heikki
v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+                                 SMgrRelation smgr,
+                                 bool skip_extension_lock,
+                                 char relpersistence,
+                                 ForkNumber fork, ReadBufferMode mode,
+                                 BufferAccessStrategy strategy,
+                                 uint32 *num_pages,
+                                 uint32 num_locked_pages,
+                                 Buffer *buffers)
Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.
v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index de1427a1e0e..1810f7ebfef 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * whole relation will be rolled back.
 	 */
 
-	meta = ReadBuffer(index, P_NEW);
+	meta = ExtendRelationBuffered(index, NULL, true,
+								  index->rd_rel->relpersistence,
+								  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+								  NULL);
 	Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
-	LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
 
 	brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
 					   BRIN_CURRENT_VERSION);
Since we're changing the API anyway, how about introducing a new
function for this case where we extend the relation but we know what block
number we're going to get? This pattern of using P_NEW and asserting the
result has always felt awkward to me.
-	buf = ReadBuffer(irel, P_NEW);
+	buf = ExtendRelationBuffered(irel, NULL, false,
+								 irel->rd_rel->relpersistence,
+								 MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+								 NULL);
These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW).
I'd suggest something like:
buf = ExtendBuffer(rel);
Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
ExtendRelationBuffered?
Is it ever possible to call this without a relcache entry? WAL redo
functions do that with ReadBuffer, but they only extend a relation
implicitly, by replaying a record for a particular block.
All of the above comments are around the BulkExtendRelationBuffered()
function's API. That needs a closer look and a more thought-out design
to make it nice. Aside from that, this approach seems valid.
- Heikki
On 2023-Feb-21, Heikki Linnakangas wrote:
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+                                 SMgrRelation smgr,
+                                 bool skip_extension_lock,
+                                 char relpersistence,
+                                 ForkNumber fork, ReadBufferMode mode,
+                                 BufferAccessStrategy strategy,
+                                 uint32 *num_pages,
+                                 uint32 num_locked_pages,
+                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.
Yeah, I noticed this too. I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated
by the caller, and which holds the input arguments that are common to
both functions, as is sensible.
--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Update: super-fast reaction on the Postgres bugs mailing list. The report
was acknowledged [...], and a fix is under discussion.
The wonders of open-source !"
https://twitter.com/gunnarmorling/status/1596080409259003906
Hi,
On 2023-02-21 17:40:31 +0200, Heikki Linnakangas wrote:
v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
This looks straightforward. My only concern is that it changes the order
that things happen at abort. Currently, AbortBufferIO() is called very early
in AbortTransaction(), and this patch moves it much later. I don't see any
immediate problems from that, but it feels scary.
Yea, it does feel a bit awkward. But I suspect it's actually the right
thing. We've not even adjusted the transaction state at the point we're
calling AbortBufferIO(). And AbortBufferIO() will sometimes allocate memory
for a WARNING, which conceivably could fail - although I don't think that's a
particularly realistic scenario due to TransactionAbortContext (I guess you
could have a large error context stack or such).
Medium term I think we need to move a lot more of the error handling into
resowners. Having a dozen+ places with their own choreographed sigsetjmp()
recovery blocks is error prone as hell. Not to mention tedious.
@@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();

Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
elog(FATAL)? Do we need to worry about that?
We have before_shmem_exit() callbacks that should protect against
that. InitPostgres() registers ShutdownPostgres(), and
CreateAuxProcessResourceOwner() registers
ReleaseAuxProcessResourcesCallback().
I think we'd already be in trouble if we didn't reliably end up doing resowner
cleanup during process exit.
Perhaps ResourceOwnerCreate()/ResourceOwnerDelete() should maintain a list of
"active" resource owners and have a before-exit callback that ensures the list
is empty and PANICs if not? Better a crash restart than hanging because we
didn't release some shared resource.
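A sketch of that idea (names invented for illustration; ResourceOwnerCreate()/ResourceOwnerDelete() would add/remove list entries, not shown):

#include "postgres.h"

#include "lib/ilist.h"
#include "storage/ipc.h"

static dlist_head ActiveResourceOwners = DLIST_STATIC_INIT(ActiveResourceOwners);

/* before_shmem_exit callback: better to PANIC than to hang on a leaked owner */
static void
CheckResourceOwnersReleased(int code, Datum arg)
{
    if (!dlist_is_empty(&ActiveResourceOwners))
        elog(PANIC, "resource owner(s) still active at process exit");
}

/* would be registered once, early during backend startup */
void
RegisterResourceOwnerLeakCheck(void)
{
    before_shmem_exit(CheckResourceOwnersReleased, (Datum) 0);
}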
Greetings,
Andres Freund
Hi,
On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+                                 SMgrRelation smgr,
+                                 bool skip_extension_lock,
+                                 char relpersistence,
+                                 ForkNumber fork, ReadBufferMode mode,
+                                 BufferAccessStrategy strategy,
+                                 uint32 *num_pages,
+                                 uint32 num_locked_pages,
+                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.
Yea. I think this is the part of the patchset I like the least.
The ugliest bit is accepting both rel and smgr. The background to that is that
we need the relation oid to acquire the extension lock. But during crash
recovery we don't have that - which is fine, because we don't need the
extension lock.
We could have two different functions, but that ends up a mess as well, as
we've seen in other cases.
v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index de1427a1e0e..1810f7ebfef 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	 * whole relation will be rolled back.
 	 */
 
-	meta = ReadBuffer(index, P_NEW);
+	meta = ExtendRelationBuffered(index, NULL, true,
+								  index->rd_rel->relpersistence,
+								  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+								  NULL);
 	Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
-	LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
 
 	brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
 					   BRIN_CURRENT_VERSION);

Since we're changing the API anyway, how about introducing a new function
for this case where we extend the relation but we know what block number we're
going to get? This pattern of using P_NEW and asserting the result has
always felt awkward to me.
To me it always felt like a code smell that some code insists on getting
specific block numbers with P_NEW. I guess it's ok for things like
building a new index, but outside of that it feels wrong.
The first case I found just now is revmap_physical_extend(). Which seems to
extend the relation while holding an lwlock. Ugh.
Maybe ExtendRelationBufferedTo() or something like that? With a big comment
saying that users of it are likely bad ;)
-	buf = ReadBuffer(irel, P_NEW);
+	buf = ExtendRelationBuffered(irel, NULL, false,
+								 irel->rd_rel->relpersistence,
+								 MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+								 NULL);

These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). I'd
suggest something like:
I guess. Not sure if it's worth optimizing for brevity all that much here -
there's not that many places extending relations. Several places end up with
less code, actually, because they don't need to care about the extension lock
themselves anymore. I think an ExtendBuffer() that doesn't mention the fork,
etc, ends up being more confusing than helpful.
buf = ExtendBuffer(rel);
Without the relation in the name it just seems confusing to me - the extension
isn't "restricted" to shared_buffers. ReadBuffer() isn't great as a name
either, but it makes a bit more sense at least, it reads into a buffer. And
it's a vastly more frequent operation, so optimizing for density is worth it.
Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
ExtendRelationBuffered?
Hm. That's a good point. Probably not. Perhaps it could be useful to support
RBM_NORMAL as well? But even then, it'd just be a lock release away if we always
used RBM_ZERO_AND_LOCK.
Is it ever possible to call this without a relcache entry? WAL redo
functions do that with ReadBuffer, but they only extend a relation
implicitly, by replaying a record for a particular block.
I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.
All of the above comments are around the BulkExtendRelationBuffered()
function's API. That needs a closer look and a more thought-out design to
make it nice. Aside from that, this approach seems valid.
Thanks for looking! I agree that it can stand a fair bit of polishing...
Greetings,
Andres Freund
On 10/28/22 9:54 PM, Andres Freund wrote:
b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily
the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while is to know how much to extend the
relation by. RelationGetBufferForTuple() drives the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().

For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
Some food for thought: I think it's also completely fine to extend any
relation over a certain size by multiple blocks, regardless of
concurrency. E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't
have a good feel for what algorithm would make sense here; maybe
something along the lines of extend = max(relpages / 2048, 128); if
extend < 8 extend = 1; (presumably extending by just a couple extra
pages doesn't help much without concurrency).
Hi,
On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
On 10/28/22 9:54 PM, Andres Freund wrote:
b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily
the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while is to know how much to extend the
relation by. RelationGetBufferForTuple() drives the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().

For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.

Some food for thought: I think it's also completely fine to extend any
relation over a certain size by multiple blocks, regardless of concurrency.
E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
for what algorithm would make sense here; maybe something along the lines of
extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
extending by just a couple extra pages doesn't help much without
concurrency).
I previously implemented just that. It's not easy to get right. You can easily
end up with several backends each extending the relation by quite a bit, at
the same time (or you re-introduce contention). Which can end up with a
relation being larger by a bunch if data loading stops at some point.
We might want that as well at some point, but the approach implemented in the
patchset is precise and thus always a win, and thus should be the baseline.
Greetings,
Andres Freund
On 2/21/23 3:12 PM, Andres Freund wrote:
Hi,
On 2023-02-21 15:00:15 -0600, Jim Nasby wrote:
Some food for thought: I think it's also completely fine to extend any
relation over a certain size by multiple blocks, regardless of concurrency.
E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't have a good feel
for what algorithm would make sense here; maybe something along the lines of
extend = max(relpages / 2048, 128); if extend < 8 extend = 1; (presumably
extending by just a couple extra pages doesn't help much without
concurrency).

I previously implemented just that. It's not easy to get right. You can easily
end up with several backends each extending the relation by quite a bit, at
the same time (or you re-introduce contention). Which can end up with a
relation being larger by a bunch if data loading stops at some point.

We might want that as well at some point, but the approach implemented in the
patchset is precise and thus always a win, and thus should be the baseline.
Yeah, what I was suggesting would only make sense when there *wasn't*
contention.
On 21/02/2023 21:22, Andres Freund wrote:
On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
Is it ever possible to call this without a relcache entry? WAL redo
functions do that with ReadBuffer, but they only extend a relation
implicitly, by replaying a record for a particular block.

I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.
Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the
gap, and then call ExtendRelationBuffered for the target page. Or the
new ExtendRelationBufferedTo() function that you mentioned.
In the common case that you load a lot of data to a relation extending
it, and then crash, the WAL replay would still extend the relation one
page at a time, which is inefficient. Changing that would need bigger
changes, to WAL-log the relation extension as a separate WAL record, for
example. I don't think we need to solve that right now, it can be
addressed separately later.
- Heikki
Hi,
On 2023-02-22 11:18:57 +0200, Heikki Linnakangas wrote:
On 21/02/2023 21:22, Andres Freund wrote:
On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
Is it ever possible to call this without a relcache entry? WAL redo
functions do that with ReadBuffer, but they only extend a relation
implicitly, by replaying a record for a particular block.

I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.

Hmm, XLogReadBufferExtended() could use smgrzeroextend() to fill the gap,
and then call ExtendRelationBuffered for the target page. Or the new
ExtendRelationBufferedTo() function that you mentioned.
I don't think it's safe to just use smgrzeroextend(). Without the page-level
interlock from the buffer entry, a concurrent reader can read/write the
extended portion of the relation, while we're extending. That can lead to
losing writes.
It also turns out that just doing smgrzeroextend(), without filling s_b, is
often bad for performance, because it may cause reads when trying to fill the
buffers. Although hopefully that's less of an issue during WAL replay, due to
REGBUF_WILL_INIT.
In the common case that you load a lot of data to a relation extending it,
and then crash, the WAL replay would still extend the relation one page at a
time, which is inefficient. Changing that would need bigger changes, to
WAL-log the relation extension as a separate WAL record, for example. I
don't think we need to solve that right now, it can be addressed separately
later.
Yea, that seems indeed something for later.
There's several things we could do without adding WAL logging of relation
extensions themselves.
One relatively easy thing would be to add information about the number of
blocks we're extending by to XLOG_HEAP2_MULTI_INSERT records. Compared to the
insertions themselves that'd barely be noticeable.
A slightly more complicated thing would be to peek ahead in the WAL (we have
infrastructure for that now) and extend by enough for the next few relation
extensions.
Greetings,
Andres Freund
Hi,
On 2023-02-21 11:22:26 -0800, Andres Freund wrote:
On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:
Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
ExtendRelationBuffered?

Hm. That's a good point. Probably not. Perhaps it could be useful to support
RBM_NORMAL as well? But even then, it'd just be a lock release away if we always
used RBM_ZERO_AND_LOCK.
There's a fair number of callers using RBM_NORMAL, via ReadBuffer[Extended]()
right now. While some of them are trivial to convert, others aren't (e.g.,
brin_getinsertbuffer()). So I'm inclined to continue allowing RBM_NORMAL.
Greetings,
Andres Freund
On 21/02/2023 18:33, Alvaro Herrera wrote:
On 2023-Feb-21, Heikki Linnakangas wrote:
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+                                 SMgrRelation smgr,
+                                 bool skip_extension_lock,
+                                 char relpersistence,
+                                 ForkNumber fork, ReadBufferMode mode,
+                                 BufferAccessStrategy strategy,
+                                 uint32 *num_pages,
+                                 uint32 num_locked_pages,
+                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.

Yeah, I noticed this too. I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated
by the caller, and which holds the input arguments that are common to
both functions, as is sensible.
We also do this in freespace.c and visibilitymap.c:
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
}
We could use the new smgrzeroextend function here. But it would be
better to go through the buffer cache, because after this, the last
block, at 'fsm_nblocks', will be read with ReadBuffer() and modified.
We could use BulkExtendSharedRelationBuffered() to extend the relation
and keep the last page locked, but the
BulkExtendSharedRelationBuffered() signature doesn't allow that. It can
return the first N pages locked, but there's no way to return the *last*
page locked.
Perhaps we should decompose this function into several function calls.
Something like:
/* get N victim buffers, pinned and !BM_VALID */
buffers = BeginExtendRelation(int npages);
LockRelationForExtension(rel)
/* Insert buffers into buffer table */
first_blk = smgrnblocks()
for (blk = first_blk; blk < last_blk; blk++)
MapNewBuffer(blk, buffers[i])
/* extend the file on disk */
smgrzeroextend();
UnlockRelationForExtension(rel)
for (blk = first_blk; blk < last_blk; blk++)
{
memset(BufferGetPage(buffers[i]), 0, BLCKSZ);
FinishNewBuffer(buffers[i]);
/* optionally lock the buffer */
LockBuffer(buffers[i]);
}
That's a lot more verbose, of course, but gives the callers the
flexibility. And might even be more readable than one function call with
lots of arguments.
This would expose the concept of a buffer that's mapped but marked as
IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing
details that shouldn't be exposed. On the other hand, it might come
handy. Instead of RBM_ZERO_AND_LOCK mode, for example, it might be handy
to have a function that returns an IO-in-progress buffer that you can
initialize any way you want.
- Heikki
Hi,
On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
We also do this in freespace.c and visibilitymap.c:
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
}

We could use the new smgrzeroextend function here. But it would be better to
go through the buffer cache, because after this, the last block, at
'fsm_nblocks', will be read with ReadBuffer() and modified.
I doubt it's a particularly crucial thing to optimize.
But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to losing a write, if the
concurrent connection writes out a page.
We could use BulkExtendSharedRelationBuffered() to extend the relation and
keep the last page locked, but the BulkExtendSharedRelationBuffered()
signature doesn't allow that. It can return the first N pages locked, but
there's no way to return the *last* page locked.
We can't rely on bulk extending a, potentially, large number of pages in one
go anyway (since we might not be allowed to pin that many pages). So I don't
think requiring locking the last page is a really viable API.
I think for this case I'd just use the ExtendRelationTo() API we were
discussing nearby. Compared to the cost of reducing syscalls / filesystem
overhead to extend the relation, the cost of the buffer mapping lookup doesn't
seem significant. That's different in e.g. the hio.c case, because there we
need a buffer with free space, and concurrent activity could otherwise fill up
the buffer before we can lock it again.
I had started hacking on ExtendRelationTo() when I saw problems with the
existing code that made me hesitate:
/messages/by-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
Perhaps we should decompose this function into several function calls.
Something like:

/* get N victim buffers, pinned and !BM_VALID */
buffers = BeginExtendRelation(int npages);

LockRelationForExtension(rel)

/* Insert buffers into buffer table */
first_blk = smgrnblocks()
for (blk = first_blk; blk < last_blk; blk++)
	MapNewBuffer(blk, buffers[i])

/* extend the file on disk */
smgrzeroextend();

UnlockRelationForExtension(rel)

for (blk = first_blk; blk < last_blk; blk++)
{
	memset(BufferGetPage(buffers[i]), 0, BLCKSZ);
	FinishNewBuffer(buffers[i]);
	/* optionally lock the buffer */
	LockBuffer(buffers[i]);
}

That's a lot more verbose, of course, but gives the callers the flexibility.
And might even be more readable than one function call with lots of
arguments.
To me this seems like a quite bad idea. The amount of complexity this would
expose all over the tree is substantial. Which would also make it harder to
further improve relation extension at a later date. It certainly shouldn't be
the default interface. And I'm not sure I see a promising use case.
This would expose the concept of a buffer that's mapped but marked as
IO-in-progress outside bufmgr.c. On one hand, maybe that's exposing details
that shouldn't be exposed. On the other hand, it might come handy. Instead
of RBM_ZERO_AND_LOCK mode, for example, it might be handy to have a function
that returns an IO-in-progress buffer that you can initialize any way you
want.
I'd much rather encapsulate that in additional functions, or perhaps a
callback that can make decisions about what to do.
Greetings,
Andres Freund
Hi,
On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
On 2023-Feb-21, Heikki Linnakangas wrote:
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+                                 SMgrRelation smgr,
+                                 bool skip_extension_lock,
+                                 char relpersistence,
+                                 ForkNumber fork, ReadBufferMode mode,
+                                 BufferAccessStrategy strategy,
+                                 uint32 *num_pages,
+                                 uint32 num_locked_pages,
+                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.

Yeah, I noticed this too. I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated
by the caller, and which holds the input arguments that are common to
both functions, as is sensible.
I played a fair bit with various options. I ended up not using a struct to
pass most options, but instead went for a flags argument. However, I did use a
struct for passing either relation or smgr.
typedef enum ExtendBufferedFlags {
EB_SKIP_EXTENSION_LOCK = (1 << 0),
EB_IN_RECOVERY = (1 << 1),
EB_CREATE_FORK_IF_NEEDED = (1 << 2),
EB_LOCK_FIRST = (1 << 3),
/* internal flags follow */
EB_RELEASE_PINS = (1 << 4),
} ExtendBufferedFlags;
/*
* To identify the relation - either relation or smgr + relpersistence has to
* be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
* to use the same function for both crash recovery and normal operation.
*/
typedef struct ExtendBufferedWhat
{
Relation rel;
struct SMgrRelationData *smgr;
char relpersistence;
} ExtendBufferedWhat;
#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
/* requires use of EB_SKIP_EXTENSION_LOCK */
#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
ForkNumber forkNum,
BufferAccessStrategy strategy,
uint32 flags);
extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
ForkNumber fork,
BufferAccessStrategy strategy,
uint32 flags,
uint32 extend_by,
Buffer *buffers,
uint32 *extended_by);
extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
ForkNumber fork,
BufferAccessStrategy strategy,
uint32 flags,
BlockNumber extend_to,
ReadBufferMode mode);
As you can see I removed ReadBufferMode from most of the functions (as
suggested by Heikki earlier). When extending by 1/multiple pages, we only need
to know whether to lock or not.
The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
falling back to reading the page normally if there was a concurrent relation
extension.
The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
gnarly, code to do so from vm_extend(), fsm_extend().
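For illustration, a caller under this proposed API might look roughly like this (a sketch only; the flag choice here is just an example):

/* extend a relation's main fork by one block and get the new page locked */
Buffer
example_get_new_block(Relation rel)
{
    return ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM,
                             NULL /* strategy */ ,
                             EB_LOCK_FIRST);
}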
I'm not sure about the function naming pattern. I do like 'By' a lot more than
the Bulk prefix I used before.
What do you think?
Greetings,
Andres Freund
On 27/02/2023 23:45, Andres Freund wrote:
On 2023-02-27 18:06:22 +0200, Heikki Linnakangas wrote:
We also do this in freespace.c and visibilitymap.c:
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
}

We could use the new smgrzeroextend function here. But it would be better to
go through the buffer cache, because after this, the last block, at
'fsm_nblocks', will be read with ReadBuffer() and modified.

I doubt it's a particularly crucial thing to optimize.
Yeah, it won't make any practical difference to performance. I'm more
thinking if we can make this more consistent with other places where we
extend a relation.
But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to losing a write, if the
concurrent connection writes out a page.
fsm_readbuf and vm_readbuf check the relation size first, with
smgrnblocks(), before trying to read the page. So to have a problem, the
smgrnblocks() would have to already return the new size, but the
smgrread() would not return the new contents. I don't think that's
possible, but not sure.
We could use BulkExtendSharedRelationBuffered() to extend the relation and
keep the last page locked, but the BulkExtendSharedRelationBuffered()
signature doesn't allow that. It can return the first N pages locked, but
there's no way to return the *last* page locked.

We can't rely on bulk extending a, potentially, large number of pages in one
go anyway (since we might not be allowed to pin that many pages). So I don't
think requiring locking the last page is a really viable API.

I think for this case I'd just use the ExtendRelationTo() API we were
discussing nearby. Compared to the cost of reducing syscalls / filesystem
overhead to extend the relation, the cost of the buffer mapping lookup doesn't
seem significant. That's different in e.g. the hio.c case, because there we
need a buffer with free space, and concurrent activity could otherwise fill up
the buffer before we can lock it again.
Works for me.
- Heikki
Hi,
On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote:
On 27/02/2023 23:45, Andres Freund wrote:
But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to losing a write, if the
concurrent connection writes out a page.

fsm_readbuf and vm_readbuf check the relation size first, with
smgrnblocks(), before trying to read the page. So to have a problem, the
smgrnblocks() would have to already return the new size, but the smgrread()
would not return the new contents. I don't think that's possible, but not
sure.
I hacked Thomas' program to test torn reads to ftruncate the file on the write
side.
It frequently observes a file size that's not the write size (e.g. reading 4k
when writing an 8k block).
After extending the test to more than one reader, I indeed also see torn
reads. So far all the tears have been at a 4k block boundary. However so far
it always has been *prior* page contents, not 0s.
Greetings,
Andres Freund
Hi,
On 2023-03-01 09:02:00 -0800, Andres Freund wrote:
On 2023-03-01 11:12:35 +0200, Heikki Linnakangas wrote:
On 27/02/2023 23:45, Andres Freund wrote:
But, uh, isn't this code racy? Because this doesn't go through shared buffers,
there's no IO_IN_PROGRESS interlocking against a concurrent reader. We know
that writing pages isn't atomic vs readers. So another connection could
see the new relation size, but a read might return a
partially written state of the page. Which then would cause checksum
failures. And even worse, I think it could lead to losing a write, if the
concurrent connection writes out a page.

fsm_readbuf and vm_readbuf check the relation size first, with
smgrnblocks(), before trying to read the page. So to have a problem, the
smgrnblocks() would have to already return the new size, but the smgrread()
would not return the new contents. I don't think that's possible, but not
sure.

I hacked Thomas' program to test torn reads to ftruncate the file on the write
side.

It frequently observes a file size that's not the write size (e.g. reading 4k
when writing an 8k block).

After extending the test to more than one reader, I indeed also see torn
reads. So far all the tears have been at a 4k block boundary. However so far
it always has been *prior* page contents, not 0s.
On tmpfs the failure rate is much higher, and we also end up reading 0s,
despite never writing them.
I've attached my version of the test program.
ext4: lots of 4k reads with 8k writes, some torn reads at 4k boundaries
xfs: no issues
tmpfs: loads of 4k reads with 8k writes, lots of torn reads reading 0s, some torn reads at 4k boundaries
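For reference, the kind of test being described looks roughly like this (a standalone sketch, not the attached program): one process truncates and rewrites an 8kB block filled with a single byte value, while a reader checks that each read returns only one byte value.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
    int         fd = open("tear_test.dat", O_CREAT | O_RDWR, 0600);
    char        wbuf[BLKSZ];
    char        rbuf[BLKSZ];

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    if (fork() == 0)
    {
        /* writer: truncate, then rewrite the block filled with 'A' or 'B' */
        for (unsigned int i = 0;; i++)
        {
            memset(wbuf, (i % 2) ? 'A' : 'B', BLKSZ);
            if (ftruncate(fd, 0) != 0 ||
                pwrite(fd, wbuf, BLKSZ, 0) != BLKSZ)
            {
                perror("write side");
                _exit(1);
            }
        }
    }

    /* reader: short reads show the size race, mixed contents show a tear */
    for (;;)
    {
        ssize_t     nread = pread(fd, rbuf, BLKSZ, 0);

        for (ssize_t off = 1; off < nread; off++)
            if (rbuf[off] != rbuf[0])
                fprintf(stderr, "torn read at offset %zd\n", off);
    }
}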
Greetings,
Andres Freund
On 01/03/2023 09:33, Andres Freund wrote:
On 2023-02-21 17:33:31 +0100, Alvaro Herrera wrote:
On 2023-Feb-21, Heikki Linnakangas wrote:
+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+                                 SMgrRelation smgr,
+                                 bool skip_extension_lock,
+                                 char relpersistence,
+                                 ForkNumber fork, ReadBufferMode mode,
+                                 BufferAccessStrategy strategy,
+                                 uint32 *num_pages,
+                                 uint32 num_locked_pages,
+                                 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.

Yeah, I noticed this too. I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated
by the caller, and which holds the input arguments that are common to
both functions, as is sensible.

I played a fair bit with various options. I ended up not using a struct to
pass most options, but instead went for a flags argument. However, I did use a
struct for passing either relation or smgr.

typedef enum ExtendBufferedFlags {
EB_SKIP_EXTENSION_LOCK = (1 << 0),
EB_IN_RECOVERY = (1 << 1),
EB_CREATE_FORK_IF_NEEDED = (1 << 2),
EB_LOCK_FIRST = (1 << 3),

/* internal flags follow */
EB_RELEASE_PINS = (1 << 4),
} ExtendBufferedFlags;
Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really
needed? What does EB_LOCK_FIRST do?
/*
* To identify the relation - either relation or smgr + relpersistence has to
* be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
* to use the same function for both crash recovery and normal operation.
*/
typedef struct ExtendBufferedWhat
{
Relation rel;
struct SMgrRelationData *smgr;
char relpersistence;
} ExtendBufferedWhat;

#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
/* requires use of EB_SKIP_EXTENSION_LOCK */
#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
Clever. I'm still not 100% convinced we need the EB_SMGR variant, but
with this we'll have the flexibility in any case.
extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
ForkNumber forkNum,
BufferAccessStrategy strategy,
uint32 flags);
extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
ForkNumber fork,
BufferAccessStrategy strategy,
uint32 flags,
uint32 extend_by,
Buffer *buffers,
uint32 *extended_by);
extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
ForkNumber fork,
BufferAccessStrategy strategy,
uint32 flags,
BlockNumber extend_to,
ReadBufferMode mode);

As you can see I removed ReadBufferMode from most of the functions (as
suggested by Heikki earlier). When extending by 1/multiple pages, we only need
to know whether to lock or not.
Ok, that's better. Still complex and a lot of arguments, but I don't
have any great suggestions on how to improve it.
The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
falling back to reading the page normally if there was a concurrent relation
extension.
Hmm, I think you'll need another return value, to let the caller know if
the relation was extended or not. Or a flag to ereport(ERROR) if the
page already exists, for ginbuild() and friends.
The reason EB_CREATE_FORK_IF_NEEDED exists is to remove the duplicated,
gnarly, code to do so from vm_extend(), fsm_extend().
Makes sense.
I'm not sure about the function naming pattern. I do like 'By' a lot more than
the Bulk prefix I used before.
+1
- Heikki
Hi,
On 2023-03-02 00:04:14 +0200, Heikki Linnakangas wrote:
On 01/03/2023 09:33, Andres Freund wrote:
typedef enum ExtendBufferedFlags {
EB_SKIP_EXTENSION_LOCK = (1 << 0),
EB_IN_RECOVERY = (1 << 1),
EB_CREATE_FORK_IF_NEEDED = (1 << 2),
EB_LOCK_FIRST = (1 << 3),

/* internal flags follow */
EB_RELEASE_PINS = (1 << 4),
} ExtendBufferedFlags;

Is EB_IN_RECOVERY always set when RecoveryInProgress()? Is it really needed?
Right now it's just passed in from the caller. It's at the moment just needed
to know what to pass to smgrcreate(isRedo).
However, XLogReadBufferExtended() doesn't currently use this path, so maybe
it's not actually needed.
What does EB_LOCK_FIRST do?
Lock the first returned buffer, this is basically the replacement for
num_locked_buffers from the earlier version. I think it's likely that either
locking the first, or potentially at some later point locking all buffers, is
all that's needed for ExtendBufferedRelBy().
EB_LOCK_FIRST_BUFFER would perhaps be better?
/*
* To identify the relation - either relation or smgr + relpersistence has to
* be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
* to use the same function for both crash recovery and normal operation.
*/
typedef struct ExtendBufferedWhat
{
Relation rel;
struct SMgrRelationData *smgr;
char relpersistence;
} ExtendBufferedWhat;

#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
/* requires use of EB_SKIP_EXTENSION_LOCK */
#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})

Clever. I'm still not 100% convinced we need the EB_SMGR variant, but with
this we'll have the flexibility in any case.
Hm - how would you use it from XLogReadBufferExtended() without that?
XLogReadBufferExtended() spends a disappointing amount of time in
smgropen(). Quite visible in profiles.
In the plain read case there's at least one smgropen() in XLogReadBufferExtended()
itself, then another in ReadBufferWithoutRelcache(). The extension path right now is
worse - it does one smgropen() for each extended block.
I think we should avoid using ReadBufferWithoutRelcache() in
XLogReadBufferExtended() in the read path as well, but that's for later.
extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
ForkNumber forkNum,
BufferAccessStrategy strategy,
uint32 flags);
extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
ForkNumber fork,
BufferAccessStrategy strategy,
uint32 flags,
uint32 extend_by,
Buffer *buffers,
uint32 *extended_by);
extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
ForkNumber fork,
BufferAccessStrategy strategy,
uint32 flags,
BlockNumber extend_to,
ReadBufferMode mode);

As you can see I removed ReadBufferMode from most of the functions (as
suggested by Heikki earlier). When extending by 1/multiple pages, we only need
to know whether to lock or not.

Ok, that's better. Still complex and a lot of arguments, but I don't have
any great suggestions on how to improve it.
I don't think there are going to be all that many callers of
ExtendBufferedRelBy() and ExtendBufferedRelTo(), most are going to be
ExtendBufferedRel(), I think. So the complexity seems acceptable.
The reason ExtendBufferedRelTo() has a 'mode' argument is that it allows
falling back to reading the page normally if there was a concurrent relation
extension.

Hmm, I think you'll need another return value, to let the caller know if the
relation was extended or not. Or a flag to ereport(ERROR) if the page
already exists, for ginbuild() and friends.
I don't think ginbuild() et al need to use ExtendBufferedRelTo()? A plain
ExtendBufferedRel() should suffice. The places that do need it are
fsm_extend() and vm_extend() - I did end up avoiding the redundant lookup.
But I was wondering about a flag controlling this as well.
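For illustration, an fsm_extend()-style caller might then end up looking roughly like this (a hypothetical sketch using the signatures quoted above; the actual conversion in the patchset may differ):

static void
example_fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
    Buffer      buf;

    /* extend the FSM fork up to fsm_nblocks, creating the fork if needed */
    buf = ExtendBufferedRelTo(EB_REL(rel), FSM_FORKNUM,
                              NULL /* strategy */ ,
                              EB_CREATE_FORK_IF_NEEDED,
                              fsm_nblocks,
                              RBM_ZERO_ON_ERROR);
    ReleaseBuffer(buf);
}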
Attached is my current version of this. Still needs more polishing, including
comments explaining the flags. But I thought it'd be useful to have it out
there.
There's two new patches in the series:
- a patch to not initialize pages in the loop in fsm_extend(), vm_extend()
anymore - we have a check about initializing pages at a later point, so
there doesn't really seem to be a need for it?
- a patch to use the new ExtendBufferedRelTo() in fsm_extend(), vm_extend()
and XLogReadBufferExtended()
In this version I also tried to address some of the other feedback raised in
the thread. One thing I haven't decided what to do about yet is David's
feedback about a version of PinLocalBuffer() that doesn't adjust the
usagecount, which wouldn't need to read the buf_state.
Greetings,
Andres Freund
Attachments:
v4-0001-Revise-pg_pwrite_zeros.patch (text/x-diff)
From 777b5265c296b09dd842e5b93f617bafd0f00a93 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 13 Feb 2023 17:11:29 -0800
Subject: [PATCH v4 01/15] Revise pg_pwrite_zeros()
- add offset parameter
- avoid memset() of zerobuf on every call
- don't initialize the whole IOV array unnecessarily
- don't handle the smaller trailing write in a separate write call
---
src/include/common/file_utils.h | 2 +-
src/common/file_utils.c | 62 ++++++++----------------------
src/backend/access/transam/xlog.c | 2 +-
src/bin/pg_basebackup/walmethods.c | 2 +-
4 files changed, 20 insertions(+), 48 deletions(-)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index bda6d3a5413..b7efa1226d6 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -44,6 +44,6 @@ extern ssize_t pg_pwritev_with_retry(int fd,
int iovcnt,
off_t offset);
-extern ssize_t pg_pwrite_zeros(int fd, size_t size);
+extern ssize_t pg_pwrite_zeros(int fd, size_t size, off_t offset);
#endif /* FILE_UTILS_H */
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 4dae534152f..93b5d42c5d1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -537,62 +537,34 @@ pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)
* is returned with errno set.
*/
ssize_t
-pg_pwrite_zeros(int fd, size_t size)
+pg_pwrite_zeros(int fd, size_t size, off_t offset)
{
- PGAlignedBlock zbuffer; /* worth BLCKSZ */
- size_t zbuffer_sz;
+ static const PGAlignedBlock zbuffer = {0}; /* worth BLCKSZ */
+ void *zerobuf_addr = unconstify(PGAlignedBlock *, &zbuffer)->data;
struct iovec iov[PG_IOV_MAX];
- int blocks;
- size_t remaining_size = 0;
- int i;
- ssize_t written;
+ size_t remaining_size = size;
ssize_t total_written = 0;
- zbuffer_sz = sizeof(zbuffer.data);
-
- /* Zero-fill the buffer. */
- memset(zbuffer.data, 0, zbuffer_sz);
-
- /* Prepare to write out a lot of copies of our zero buffer at once. */
- for (i = 0; i < lengthof(iov); ++i)
- {
- iov[i].iov_base = zbuffer.data;
- iov[i].iov_len = zbuffer_sz;
- }
-
/* Loop, writing as many blocks as we can for each system call. */
- blocks = size / zbuffer_sz;
- remaining_size = size % zbuffer_sz;
- for (i = 0; i < blocks;)
+ while (remaining_size > 0)
{
- int iovcnt = Min(blocks - i, lengthof(iov));
- off_t offset = i * zbuffer_sz;
+ int iovcnt = 0;
+ ssize_t written;
+
+ for (; iovcnt < PG_IOV_MAX && remaining_size > 0; iovcnt++)
+ {
+ size_t this_iov_size = Min(remaining_size, BLCKSZ);
+
+ iov[iovcnt].iov_base = zerobuf_addr;
+ iov[iovcnt].iov_len = this_iov_size;
+ remaining_size -= this_iov_size;
+ }
written = pg_pwritev_with_retry(fd, iov, iovcnt, offset);
-
- if (written < 0)
- return written;
-
- i += iovcnt;
- total_written += written;
- }
-
- /* Now, write the remaining size, if any, of the file with zeros. */
- if (remaining_size > 0)
- {
- /* We'll never write more than one block here */
- int iovcnt = 1;
-
- /* Jump on to the end of previously written blocks */
- off_t offset = i * zbuffer_sz;
-
- iov[0].iov_len = remaining_size;
-
- written = pg_pwritev_with_retry(fd, iov, iovcnt, offset);
-
if (written < 0)
return written;
+ offset += written;
total_written += written;
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d1..786b26054cf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2981,7 +2981,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
* indirect blocks are down on disk. Therefore, fdatasync(2) or
* O_DSYNC will be sufficient to sync future writes to the log file.
*/
- rc = pg_pwrite_zeros(fd, wal_segment_size);
+ rc = pg_pwrite_zeros(fd, wal_segment_size, 0);
if (rc < 0)
save_errno = errno;
diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 54014e2b84d..6d14b988cb6 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -222,7 +222,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
{
ssize_t rc;
- rc = pg_pwrite_zeros(fd, pad_to_size);
+ rc = pg_pwrite_zeros(fd, pad_to_size, 0);
if (rc < 0)
{
--
2.38.0
v4-0002-Add-some-error-checking-around-pinning.patch (text/x-diff)
From 4b1b378cd7d3b1c3f0366ccda0f1092de833774e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v4 02/15] Add some error checking around pinning
---
src/include/storage/bufmgr.h | 1 +
src/backend/storage/buffer/bufmgr.c | 42 ++++++++++++++++++++---------
2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b8a18b8081f..973547a8baf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -134,6 +134,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
+extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68d..55df6a76c9c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1755,6 +1755,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
bool result;
PrivateRefCountEntry *ref;
+ Assert(!BufferIsLocal(b));
+
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)
@@ -1900,6 +1902,8 @@ UnpinBuffer(BufferDesc *buf)
PrivateRefCountEntry *ref;
Buffer b = BufferDescriptorGetBuffer(buf);
+ Assert(!BufferIsLocal(b));
+
/* not moving as we're likely deleting it soon anyway */
ref = GetPrivateRefCountEntry(b, false);
Assert(ref != NULL);
@@ -4282,6 +4286,25 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ GetPrivateRefCount(buffer));
+ }
+}
+
/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
@@ -4309,20 +4332,11 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- if (BufferIsLocal(buffer))
- {
- /* There should be exactly one pin */
- if (LocalRefCount[-buffer - 1] != 1)
- elog(ERROR, "incorrect local pin count: %d",
- LocalRefCount[-buffer - 1]);
- /* Nobody else to wait for */
- return;
- }
+ BufferCheckOneLocalPin(buffer);
- /* There should be exactly one local pin */
- if (GetPrivateRefCount(buffer) != 1)
- elog(ERROR, "incorrect local pin count: %d",
- GetPrivateRefCount(buffer));
+ /* Nobody else to wait for */
+ if (BufferIsLocal(buffer))
+ return;
bufHdr = GetBufferDescriptor(buffer - 1);
@@ -4823,6 +4837,8 @@ LockBufHdr(BufferDesc *desc)
SpinDelayStatus delayStatus;
uint32 old_buf_state;
+ Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
init_local_spin_delay(&delayStatus);
while (true)
--
2.38.0
v4-0003-hio-Release-extension-lock-before-initializing-pa.patch (text/x-diff)
From 53283e0fe821387c56ad0ba222d2983229dbaf6f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:28:06 -0700
Subject: [PATCH v4 03/15] hio: Release extension lock before initializing page
/ pinning VM
PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3
started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we
release the page lock. PageInit() zeroes the page, which isn't that cheap, so
deferring it until after the extension lock is released seems like a good idea.
Doing visibilitymap_pin() while holding the extension lock, introduced in
7db0cd2145f2, looks like an accident. Due to the restrictions on
HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems
better to move it out.
---
src/backend/access/heap/hio.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index e152807d2dc..7479212d4e0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -623,6 +623,13 @@ loop:
*/
buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ */
+ if (needLock)
+ UnlockRelationForExtension(relation, ExclusiveLock);
+
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
@@ -647,13 +654,6 @@ loop:
visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
/*
* Lock the other buffer. It's guaranteed to be of a lower page number
* than the new page. To conform with the deadlock prevent rules, we ought
--
2.38.0
v4-0004-Add-smgrzeroextend-FileZero-FileFallocate.patch (text/x-diff)
From d8b39aa46ac5222828a2dc9c9956beefeb47183a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 27 Feb 2023 17:36:37 -0800
Subject: [PATCH v4 04/15] Add smgrzeroextend(), FileZero(), FileFallocate()
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, use FileZero() instead, as
using posix_fallocate() for small numbers of blocks is inefficient for some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Backpatch:
---
src/include/storage/fd.h | 3 +
src/include/storage/md.h | 2 +
src/include/storage/smgr.h | 2 +
src/backend/storage/file/fd.c | 90 ++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 111 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 28 ++++++++
6 files changed, 236 insertions(+)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f85de97d083..5f300a89f62 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -106,6 +106,9 @@ extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event
extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
+extern int FileZero(File file, off_t amount, off_t offset, uint32 wait_event_info);
+extern int FileFallocate(File file, off_t amount, off_t offset, uint32 wait_event_info);
+
extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 8f32af9ef3d..941879ee6a8 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,6 +28,8 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 0935144f425..a9a179aabac 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index ea690f05c69..77d00296a3a 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2198,6 +2198,89 @@ FileSync(File file, uint32 wait_event_info)
return returnCode;
}
+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+ int returnCode;
+ ssize_t written;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+ pgstat_report_wait_end();
+
+ if (written < 0)
+ return -1;
+ else if (written != amount)
+ {
+ /* if errno is unset, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ return -1;
+ }
+
+ return 0;
+}
+
+/*
+ * Try to reserve file space with posix_fallocate(). If posix_fallocate() is
+ * not implemented on the operating system or fails with EINVAL / EOPNOTSUPP,
+ * use FileZero() instead.
+ *
+ * Note that at least glibc implements posix_fallocate() in userspace if not
+ * implemented by the filesystem. That's not the case for all environments
+ * though.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ *
+ * The argument order, offset before amount, matches FileZero() as well as
+ * posix_fallocate() itself, so the fallback below can be called with the
+ * same argument list.
+ */
+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+ int returnCode;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+ * Return in case of a "real" failure; if fallocate is not supported,
+ * fall through to the FileZero()-backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return returnCode;
+
+#endif
+
+ return FileZero(file, offset, amount, wait_event_info);
+}
+
off_t
FileSize(File file)
{
@@ -2270,6 +2353,10 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file) < 0)
+ return -1;
+
return VfdCache[file].fd;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1feb..59a65a8305c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
+#include "common/file_utils.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -500,6 +501,116 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
+/*
+ * mdzeroextend() -- Add new zeroed out blocks to the specified relation.
+ *
+ * Similar to mdextend(), except the relation can be extended by
+ * multiple blocks at once, and the added blocks will be filled with
+ * zeroes.
+ */
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+ MdfdVec *v;
+ BlockNumber curblocknum = blocknum;
+ int remblocks = nblocks;
+
+ Assert(nblocks > 0);
+
+ /* This assert is too expensive to have on normally ... */
+#ifdef CHECK_WRITE_VS_EXTEND
+ Assert(blocknum >= mdnblocks(reln, forknum));
+#endif
+
+ /*
+ * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
+ * more --- we mustn't create a block whose number actually is
+ * InvalidBlockNumber or larger.
+ */
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend file \"%s\" beyond %u blocks",
+ relpath(reln->smgr_rlocator, forknum),
+ InvalidBlockNumber)));
+
+ while (remblocks > 0)
+ {
+ int segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+ int segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+ off_t seekpos = (off_t) BLCKSZ * segstartblock;
+ int numblocks;
+
+ if (segendblock > RELSEG_SIZE)
+ segendblock = RELSEG_SIZE;
+
+ numblocks = segendblock - segstartblock;
+
+ v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+
+ Assert(segstartblock < RELSEG_SIZE);
+ Assert(segendblock <= RELSEG_SIZE);
+
+ /*
+ * If available and useful, use posix_fallocate() (via FileFallocate())
+ * to extend the relation. That's often more efficient than using
+ * write(), as it commonly won't cause the kernel to allocate page
+ * cache space for the extended pages.
+ *
+ * However, we don't use FileFallocate() for small extensions, as it
+ * defeats delayed allocation on some filesystems. It's not clear where
+ * that decision is best made, though; for now just use a cutoff of 8
+ * blocks (anything between 4 and 8 worked OK in some local testing).
+ */
+ if (numblocks > 8)
+ {
+ int ret;
+
+ ret = FileFallocate(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret != 0)
+ {
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+ }
+ else
+ {
+ int ret;
+
+ /*
+ * Even if we don't want to use fallocate, we can still extend a
+ * bit more efficiently than writing each 8kB block individually.
+ * pg_pwrite_zeros() (via FileZero()) uses
+ * pg_pwritev_with_retry() to avoid multiple writes or needing a
+ * zeroed buffer for the whole length of the extension.
+ */
+ ret = FileZero(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\": %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+
+ if (!skipFsync && !SmgrIsTemp(reln))
+ register_dirty_segment(reln, forknum, v);
+
+ Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+ remblocks -= segendblock - segstartblock;
+ curblocknum += segendblock - segstartblock;
+ }
+}
+
/*
* mdopenfork() -- Open one fork of the specified relation.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e54145..5224ca52592 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+ void (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
@@ -75,6 +77,7 @@ static const f_smgr smgrsw[] = {
.smgr_exists = mdexists,
.smgr_unlink = mdunlink,
.smgr_extend = mdextend,
+ .smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
.smgr_read = mdread,
.smgr_write = mdwrite,
@@ -507,6 +510,31 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
+/*
+ * smgrzeroextend() -- Add new zeroed out blocks to a file.
+ *
+ * Similar to smgrextend(), except the relation can be extended by
+ * multiple blocks at once, and the added blocks will be filled with
+ * zeroes.
+ */
+void
+smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ int nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
+ nblocks, skipFsync);
+
+ /*
+ * Normally we expect this to increase the fork size by nblocks, but if
+ * the cached value isn't as expected, just invalidate it so the next call
+ * asks the kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+}
+
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
*
--
2.38.0
Attachment: v4-0005-bufmgr-Add-Pin-UnpinLocalBuffer.patch (text/x-diff; charset=us-ascii)
From e4e16c84f4f89a9574774857ee1ac3e7ab40ab9e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v4 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
So far these were open-coded in quite a few places, without a good reason.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/buf_internals.h | 2 +
src/backend/storage/buffer/bufmgr.c | 30 +++----------
src/backend/storage/buffer/localbuf.c | 62 +++++++++++++++++----------
3 files changed, 46 insertions(+), 48 deletions(-)
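In case the contract isn't obvious from the converted call sites, here's a
minimal usage sketch of the new pair (the variable names are illustrative):

    /*
     * Pin the local buffer, adjusting the usage count if appropriate. The
     * return value reports whether the buffer already holds valid data
     * (BM_VALID), mirroring PinBuffer() for shared buffers.
     */
    bool        valid = PinLocalBuffer(bufHdr, true);

    if (!valid)
    {
        /* caller needs to (re)read the block into the buffer */
    }

    /* later: drop the pin; note this takes a Buffer, not a BufferDesc * */
    UnpinLocalBuffer(BufferDescriptorGetBuffer(bufHdr));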
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0b448147407..fa5c451b1a9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -415,6 +415,8 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
+extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
+extern void UnpinLocalBuffer(Buffer buffer);
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 55df6a76c9c..d2f9efef723 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -636,20 +636,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
/* Is it still valid and holding the right tag? */
if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
{
- /*
- * Bump buffer's ref and usage counts. This is equivalent of
- * PinBuffer for a shared buffer.
- */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner, recent_buffer);
+ PinLocalBuffer(bufHdr, true);
pgBufferUsage.local_blks_hit++;
@@ -1708,8 +1695,7 @@ ReleaseAndReadBuffer(Buffer buffer,
BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) &&
BufTagGetForkNum(&bufHdr->tag) == forkNum)
return buffer;
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[-buffer - 1]--;
+ UnpinLocalBuffer(buffer);
}
else
{
@@ -4011,15 +3997,9 @@ ReleaseBuffer(Buffer buffer)
elog(ERROR, "bad buffer ID: %d", buffer);
if (BufferIsLocal(buffer))
- {
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
-
- Assert(LocalRefCount[-buffer - 1] > 0);
- LocalRefCount[-buffer - 1]--;
- return;
- }
-
- UnpinBuffer(GetBufferDescriptor(buffer - 1));
+ UnpinLocalBuffer(buffer);
+ else
+ UnpinBuffer(GetBufferDescriptor(buffer - 1));
}
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663d..798c5b93a87 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -145,27 +145,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
#endif
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- /* this part is equivalent to PinBuffer for a shared buffer */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
- if (buf_state & BM_VALID)
- *foundPtr = true;
- else
- {
- /* Previous read attempt must have failed; try again */
- *foundPtr = false;
- }
+ *foundPtr = PinLocalBuffer(bufHdr, true);
return bufHdr;
}
@@ -202,9 +183,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
else
{
/* Found a usable buffer */
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
+ PinLocalBuffer(bufHdr, false);
break;
}
}
@@ -491,6 +470,43 @@ InitLocalBuffers(void)
NLocBuffer = nbufs;
}
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+ uint32 buf_state;
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ int bufid = -(buffer + 1);
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ if (LocalRefCount[bufid] == 0)
+ {
+ if (adjust_usagecount &&
+ BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
+ {
+ buf_state += BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+ }
+ LocalRefCount[bufid]++;
+ ResourceOwnerRememberBuffer(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf_hdr));
+
+ return buf_state & BM_VALID;
+}
+
+void
+UnpinLocalBuffer(Buffer buffer)
+{
+ int buffid = -buffer - 1;
+
+ Assert(BufferIsLocal(buffer));
+ Assert(LocalRefCount[buffid] > 0);
+
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
+ LocalRefCount[buffid]--;
+}
+
/*
* GUC check_hook for temp_buffers
*/
--
2.38.0
Attachment: v4-0006-bufmgr-Remove-buffer-write-dirty-tracepoints.patch (text/x-diff; charset=us-ascii)
From f6d9312cb0dc9d0d10620836b64cc6999ef993d8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:21:57 -0800
Subject: [PATCH v4 06/15] bufmgr: Remove buffer-write-dirty tracepoints
These tracepoints used the relfilenumber / fork / block number of the
to-be-read-in buffer. Some upcoming work would make that more expensive to
provide. The buffer-flush-start/done probes remain and can serve most of the
tracing needs that buffer-write-dirty covered.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 10 ----------
doc/src/sgml/monitoring.sgml | 17 -----------------
2 files changed, 27 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index d2f9efef723..cdd9e92b9fe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1277,21 +1277,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/* OK, do the I/O */
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
-
FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf->tag);
-
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
}
else
{
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d02..77bb13ba0a7 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7758,23 +7758,6 @@ FROM pg_stat_get_backend_idset() AS backendid;
it's typically not actually been written to disk yet.)
The arguments are the same as for <literal>buffer-flush-start</literal>.</entry>
</row>
- <row>
- <entry><literal>buffer-write-dirty-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
- <entry>Probe that fires when a server process begins to write a dirty
- buffer. (If this happens often, it implies that
- <xref linkend="guc-shared-buffers"/> is too
- small or the background writer control parameters need adjustment.)
- arg0 and arg1 contain the fork and block numbers of the page.
- arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
- identifying the relation.</entry>
- </row>
- <row>
- <entry><literal>buffer-write-dirty-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
- <entry>Probe that fires when a dirty-buffer write is complete.
- The arguments are the same as for <literal>buffer-write-dirty-start</literal>.</entry>
- </row>
<row>
<entry><literal>wal-buffer-write-dirty-start</literal></entry>
<entry><literal>()</literal></entry>
--
2.38.0
Attachment: v4-0007-bufmgr-Acquire-and-clean-victim-buffer-separately.patch (text/x-diff; charset=us-ascii)
From f4c3596c63e69a536204cc50551b5b1bbaee7d6d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:26:34 -0800
Subject: [PATCH v4 07/15] bufmgr: Acquire and clean victim buffer separately
Previously we held locks on two buffer mapping partitions at the same time in
order to change the identity of a buffer. Additionally, for relation extension
it is particularly painful to have to hold the extension lock while acquiring
a victim buffer. By separating out the victim buffer acquisition, future
commits will be able to change relation extension to scale better.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 581 ++++++++++++++------------
src/backend/storage/buffer/localbuf.c | 115 ++---
2 files changed, 379 insertions(+), 317 deletions(-)
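Since the diff moves a lot of code around, here's a condensed paraphrase of
the buffer-replacement flow after the patch. It only restates the
BufferAlloc()/GetVictimBuffer() code below, with error handling, pgstat
counting and most of the collision handling elided:

    /* 1) Acquire a pinned, clean, invalid victim buffer. No buffer mapping
     *    partition lock is held while doing so. */
    victim_buffer = GetVictimBuffer(strategy, *io_context);
    victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);

    /* 2) Insert the new tag while holding only the new partition's lock. */
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
    existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);

    if (existing_buf_id >= 0)
    {
        /* lost the race: unpin the victim and use the existing buffer */
        UnpinBuffer(victim_buf_hdr);
    }
    else
    {
        /* 3) Stamp the victim with its new identity. */
        victim_buf_state = LockBufHdr(victim_buf_hdr);
        victim_buf_hdr->tag = newTag;
        victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
        UnlockBufHdr(victim_buf_hdr, victim_buf_state);
    }

    LWLockRelease(newPartitionLock);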
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cdd9e92b9fe..4faa77b1705 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -473,6 +473,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext *io_context);
+static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -1128,18 +1129,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext *io_context)
{
- bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
- uint32 oldFlags;
- int buf_id;
- BufferDesc *buf;
- bool valid;
- uint32 buf_state;
+ int existing_buf_id;
+
+ Buffer victim_buffer;
+ BufferDesc *victim_buf_hdr;
+ uint32 victim_buf_state;
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -1150,15 +1147,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- if (buf_id >= 0)
+ existing_buf_id = BufTableLookup(&newTag, newHash);
+ if (existing_buf_id >= 0)
{
+ BufferDesc *buf;
+ bool valid;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
* buffer pool, and check to see if the correct data has been loaded
* into the buffer.
*/
- buf = GetBufferDescriptor(buf_id);
+ buf = GetBufferDescriptor(existing_buf_id);
valid = PinBuffer(buf, strategy);
@@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*io_context = IOContextForStrategy(strategy);
- /* Loop here in case we have to try another victim buffer */
- for (;;)
+ /*
+ * Acquire a victim buffer. Somebody else might try to do the same, as we
+ * don't hold any conflicting locks. If so, we'll have to undo our work
+ * later.
+ */
+ victim_buffer = GetVictimBuffer(strategy, *io_context);
+ victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);
+
+ /*
+ * Try to make a hashtable entry for the buffer under its new tag. If
+ * somebody else inserted another buffer for the tag, we'll release the
+ * victim buffer we acquired and use the already inserted one.
+ */
+ LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+ existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
+ if (existing_buf_id >= 0)
{
- /*
- * Ensure, while the spinlock's not yet held, that there's a free
- * refcount entry.
- */
- ReservePrivateRefCountEntry();
+ BufferDesc *existing_buf_hdr;
+ bool valid;
/*
- * Select a victim buffer. The buffer is returned with its header
- * spinlock still held!
+ * Got a collision. Someone has already done what we were about to do.
+ * We'll just handle this as if it were found in the buffer pool in
+ * the first place. First, give up the buffer we were planning to
+ * use.
+ *
+ * We could do this after releasing the partition lock, but then we'd
+ * have to call ResourceOwnerEnlargeBuffers() &
+ * ReservePrivateRefCountEntry() before acquiring the lock, for the
+ * rare case of such a collision.
*/
- buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ UnpinBuffer(victim_buf_hdr);
- Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+ /* FIXME: Should we put the victim buffer onto the freelist? */
- /* Must copy buffer flags while we still hold the spinlock */
- oldFlags = buf_state & BUF_FLAG_MASK;
+ /* remaining code should match code at top of routine */
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
- /*
- * If the buffer was dirty, try to write it out. There is a race
- * condition here, in that someone might dirty it after we released it
- * above, or even while we are writing it out (since our share-lock
- * won't prevent hint-bit updates). We will recheck the dirty bit
- * after re-locking the buffer header.
- */
- if (oldFlags & BM_DIRTY)
- {
- /*
- * We need a share-lock on the buffer contents to write it out
- * (else we might write invalid data, eg because someone else is
- * compacting the page contents while we write). We must use a
- * conditional lock acquisition here to avoid deadlock. Even
- * though the buffer was not pinned (and therefore surely not
- * locked) when StrategyGetBuffer returned it, someone else could
- * have pinned and exclusive-locked it by the time we get here. If
- * we try to get the lock unconditionally, we'd block waiting for
- * them; if they later block waiting for us, deadlock ensues.
- * (This has been observed to happen when two backends are both
- * trying to split btree index pages, and the second one just
- * happens to be trying to split the page the first one got from
- * StrategyGetBuffer.)
- */
- if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
- LW_SHARED))
- {
- /*
- * If using a nondefault strategy, and writing the buffer
- * would require a WAL flush, let the strategy decide whether
- * to go ahead and write/reuse the buffer or to choose another
- * victim. We need lock to inspect the page LSN, so this
- * can't be done inside StrategyGetBuffer.
- */
- if (strategy != NULL)
- {
- XLogRecPtr lsn;
+ valid = PinBuffer(existing_buf_hdr, strategy);
- /* Read the LSN while holding buffer header lock */
- buf_state = LockBufHdr(buf);
- lsn = BufferGetLSN(buf);
- UnlockBufHdr(buf, buf_state);
-
- if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf, from_ring))
- {
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(BufferDescriptorGetContentLock(buf));
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /* OK, do the I/O */
- FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
- LWLockRelease(BufferDescriptorGetContentLock(buf));
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext,
- &buf->tag);
- }
- else
- {
- /*
- * Someone else has locked the buffer, so give it up and loop
- * back to get another one.
- */
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /*
- * To change the association of a valid buffer, we'll need to have
- * exclusive lock on both the old and new mapping partitions.
- */
- if (oldFlags & BM_TAG_VALID)
- {
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
- oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* remember we have no old-partition lock or tag */
- oldPartitionLock = NULL;
- /* keep the compiler quiet about uninitialized variables */
- oldHash = 0;
- }
-
- /*
- * Try to make a hashtable entry for the buffer under its new tag.
- * This could fail because while we were writing someone else
- * allocated another buffer for the same block we want to read in.
- * Note that we have not yet removed the hashtable entry for the old
- * tag.
- */
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
- if (buf_id >= 0)
- {
- /*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
- */
- UnpinBuffer(buf);
-
- /* Can give up that buffer's mapping partition lock now */
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = GetBufferDescriptor(buf_id);
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
- *foundPtr = true;
-
- if (!valid)
- {
- /*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed. We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID. StartBufferIO does it all.
- */
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer
- * must have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
- }
-
- return buf;
- }
-
- /*
- * Need to lock the buffer header too in order to change its tag.
- */
- buf_state = LockBufHdr(buf);
-
- /*
- * Somebody could have pinned or re-dirtied the buffer while we were
- * doing the I/O and making the new hashtable entry. If so, we can't
- * recycle this buffer; we must undo everything we've done and start
- * over with a new victim buffer.
- */
- oldFlags = buf_state & BUF_FLAG_MASK;
- if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
- break;
-
- UnlockBufHdr(buf, buf_state);
- BufTableDelete(&newTag, newHash);
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
+ /* Can release the mapping lock as soon as we've pinned it */
LWLockRelease(newPartitionLock);
- UnpinBuffer(buf);
+
+ *foundPtr = true;
+
+ if (!valid)
+ {
+ /*
+ * We can only get here if (a) someone else is still reading in
+ * the page, or (b) a previous read attempt failed. We have to
+ * wait for any active read attempt to finish, and then set up our
+ * own read attempt if the page is still not BM_VALID.
+ * StartBufferIO does it all.
+ */
+ if (StartBufferIO(existing_buf_hdr, true))
+ {
+ /*
+ * If we get here, previous attempts to read the buffer must
+ * have failed ... but we shall bravely try again.
+ */
+ *foundPtr = false;
+ }
+ }
+
+ return existing_buf_hdr;
}
/*
- * Okay, it's finally safe to rename the buffer.
- *
- * Clearing BM_VALID here is necessary, clearing the dirtybits is just
- * paranoia. We also reset the usage_count since any recency of use of
- * the old content is no longer relevant. (The usage_count starts out at
- * 1 so that the buffer can survive one clock-sweep pass.)
- *
+ * Need to lock the buffer header too in order to change its tag.
+ */
+ victim_buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
+ Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+
+ victim_buf_hdr->tag = newTag;
+
+ /*
* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*/
- buf->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
- BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
- BUF_USAGECOUNT_MASK);
+ victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
- buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
- else
- buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ victim_buf_state |= BM_PERMANENT;
- UnlockBufHdr(buf, buf_state);
-
- if (oldPartitionLock != NULL)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
+ UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);
- if (oldFlags & BM_VALID)
- {
- /*
- * When a BufferAccessStrategy is in use, blocks evicted from shared
- * buffers are counted as IOOP_EVICT in the corresponding context
- * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
- * strategy in two cases: 1) while initially claiming buffers for the
- * strategy ring 2) to replace an existing strategy ring buffer
- * because it is pinned or in use and cannot be reused.
- *
- * Blocks evicted from buffers already in the strategy ring are
- * counted as IOOP_REUSE in the corresponding strategy context.
- *
- * At this point, we can accurately count evictions and reuses,
- * because we have successfully claimed the valid buffer. Previously,
- * we may have been forced to release the buffer due to concurrent
- * pinners or erroring out.
- */
- pgstat_count_io_op(IOOBJECT_RELATION, *io_context,
- from_ring ? IOOP_REUSE : IOOP_EVICT);
- }
-
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
* to read it before we did, so there's nothing left for BufferAlloc() to
* do.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(victim_buf_hdr, true))
*foundPtr = false;
else
*foundPtr = true;
- return buf;
+ return victim_buf_hdr;
}
/*
@@ -1595,6 +1413,237 @@ retry:
StrategyFreeBuffer(buf);
}
+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+ uint32 buf_state;
+ uint32 hash;
+ LWLock *partition_lock;
+ BufferTag tag;
+
+ Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+
+ /* have buffer pinned, so it's safe to read tag without lock */
+ tag = buf_hdr->tag;
+
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ /* lock the buffer header */
+ buf_state = LockBufHdr(buf_hdr);
+
+ /*
+ * We have the buffer pinned, so nobody else should have been able to
+ * unset this concurrently.
+ */
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BufferTagsEqual(&buf_hdr->tag, &tag));
+
+ /*
+ * If somebody else pinned the buffer since, or even worse, dirtied
+ * it, give up on this buffer: It's clearly in use.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) != 1 || (buf_state & BM_DIRTY))
+ {
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ UnlockBufHdr(buf_hdr, buf_state);
+ LWLockRelease(partition_lock);
+
+ return false;
+ }
+
+ /*
+ * Clear out the buffer's tag and flags and usagecount. This is not
+ * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+ * doing anything with the buffer. But currently it's beneficial as the
+ * pre-check for several linear scans of shared buffers just checks the
+ * tag.
+ */
+ ClearBufferTag(&buf_hdr->tag);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ /* finally delete buffer from the buffer mapping table */
+ BufTableDelete(&tag, hash);
+
+ LWLockRelease(partition_lock);
+
+ Assert(!(buf_state & (BM_DIRTY | BM_VALID | BM_TAG_VALID)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf_hdr->state)) > 0);
+
+ return true;
+}
+
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+ BufferDesc *buf_hdr;
+ Buffer buf;
+ uint32 buf_state;
+ bool from_ring;
+
+ /*
+ * Ensure, while the spinlock's not yet held, that there's a free refcount
+ * entry.
+ */
+ ReservePrivateRefCountEntry();
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* we return here if a prospective victim buffer gets used concurrently */
+again:
+
+ /*
+ * Select a victim buffer. The buffer is returned with its header
+ * spinlock still held!
+ */
+ buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ buf = BufferDescriptorGetBuffer(buf_hdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+ /* Pin the buffer and then release the buffer spinlock */
+ PinBuffer_Locked(buf_hdr);
+
+ /*
+ * We shouldn't have any other pins for this buffer.
+ */
+ BufferCheckOneLocalPin(buf);
+
+ /*
+ * If the buffer was dirty, try to write it out. There is a race
+ * condition here, in that someone might dirty it after we released the
+ * buffer header lock above, or even while we are writing it out (since
+ * our share-lock won't prevent hint-bit updates). We will recheck the
+ * dirty bit after re-locking the buffer header.
+ */
+ if (buf_state & BM_DIRTY)
+ {
+ LWLock *content_lock;
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_VALID);
+
+ /*
+ * We need a share-lock on the buffer contents to write it out (else
+ * we might write invalid data, eg because someone else is compacting
+ * the page contents while we write). We must use a conditional lock
+ * acquisition here to avoid deadlock. Even though the buffer was not
+ * pinned (and therefore surely not locked) when StrategyGetBuffer
+ * returned it, someone else could have pinned and exclusive-locked it
+ * by the time we get here. If we try to get the lock unconditionally,
+ * we'd block waiting for them; if they later block waiting for us,
+ * deadlock ensues. (This has been observed to happen when two
+ * backends are both trying to split btree index pages, and the second
+ * one just happens to be trying to split the page the first one got
+ * from StrategyGetBuffer.)
+ */
+ content_lock = BufferDescriptorGetContentLock(buf_hdr);
+ if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /*
+ * Someone else has locked the buffer, so give it up and loop back
+ * to get another one.
+ */
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /*
+ * If using a nondefault strategy, and writing the buffer would
+ * require a WAL flush, let the strategy decide whether to go ahead
+ * and write/reuse the buffer or to choose another victim. We need
+ * lock to inspect the page LSN, so this can't be done inside
+ * StrategyGetBuffer.
+ */
+ if (strategy != NULL)
+ {
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ buf_state = LockBufHdr(buf_hdr);
+ lsn = BufferGetLSN(buf_hdr);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ if (XLogNeedsFlush(lsn)
+ && StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+ {
+ LWLockRelease(content_lock);
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+ }
+
+ /* OK, do the I/O */
+ FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+ LWLockRelease(content_lock);
+
+ ScheduleBufferTagForWriteback(&BackendWritebackContext,
+ &buf_hdr->tag);
+ }
+
+
+ if (buf_state & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+ from_ring ? IOOP_REUSE : IOOP_EVICT);
+ }
+
+ /*
+ * If the buffer has an entry in the buffer mapping table, delete it. This
+ * can fail because another backend could have pinned or dirtied the
+ * buffer.
+ */
+ if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+ {
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /* a final set of sanity checks */
+#ifdef USE_ASSERT_CHECKING
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+ Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+
+ BufferCheckOneLocalPin(buf);
+#endif
+
+ return buf;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 798c5b93a87..5b44b0be8b5 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -45,13 +45,14 @@ BufferDesc *LocalBufferDescriptors = NULL;
Block *LocalBufferBlockPointers = NULL;
int32 *LocalRefCount = NULL;
-static int nextFreeLocalBuf = 0;
+static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
+static Buffer GetLocalVictimBuffer(void);
/*
@@ -113,10 +114,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
BufferDesc *bufHdr;
- int b;
- int trycounter;
+ Buffer victim_buffer;
+ int bufid;
bool found;
- uint32 buf_state;
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -138,23 +138,51 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (hresult)
{
- b = hresult->id;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufid = hresult->id;
+ bufHdr = GetLocalBufferDescriptor(bufid);
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
-#endif
*foundPtr = PinLocalBuffer(bufHdr, true);
- return bufHdr;
+ }
+ else
+ {
+ uint32 buf_state;
+
+ victim_buffer = GetLocalVictimBuffer();
+ bufid = -(victim_buffer + 1);
+ bufHdr = GetLocalBufferDescriptor(bufid);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
+ if (found) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ hresult->id = bufid;
+
+ /*
+ * it's all ours now.
+ */
+ bufHdr->tag = newTag;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+ *foundPtr = false;
}
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum,
- -nextFreeLocalBuf - 1);
-#endif
+ return bufHdr;
+}
+
+static Buffer
+GetLocalVictimBuffer(void)
+{
+ int victim_bufid;
+ int trycounter;
+ uint32 buf_state;
+ BufferDesc *bufHdr;
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
* Need to get a new buffer. We use a clock sweep algorithm (essentially
@@ -163,14 +191,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
trycounter = NLocBuffer;
for (;;)
{
- b = nextFreeLocalBuf;
+ victim_bufid = nextFreeLocalBufId;
- if (++nextFreeLocalBuf >= NLocBuffer)
- nextFreeLocalBuf = 0;
+ if (++nextFreeLocalBufId >= NLocBuffer)
+ nextFreeLocalBufId = 0;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufHdr = GetLocalBufferDescriptor(victim_bufid);
- if (LocalRefCount[b] == 0)
+ if (LocalRefCount[victim_bufid] == 0)
{
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -193,6 +221,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
errmsg("no empty local buffer available")));
}
+ /*
+ * lazy memory allocation: allocate space on first use of a buffer.
+ */
+ if (LocalBufHdrGetBlock(bufHdr) == NULL)
+ {
+ /* Set pointer for use by BufferGetBlock() macro */
+ LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
+ }
+
/*
* this buffer is not referenced but it might still be dirty. if that's
* the case, write it out before reusing it!
@@ -223,48 +260,24 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
}
/*
- * lazy memory allocation: allocate space on first use of a buffer.
- */
- if (LocalBufHdrGetBlock(bufHdr) == NULL)
- {
- /* Set pointer for use by BufferGetBlock() macro */
- LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
- }
-
- /*
- * Update the hash table: remove old entry, if any, and make new one.
+ * Remove the victim buffer from the hashtable and mark as invalid.
*/
if (buf_state & BM_TAG_VALID)
{
+ LocalBufferLookupEnt *hresult;
+
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
if (!hresult) /* shouldn't happen */
elog(ERROR, "local buffer hash table corrupted");
/* mark buffer invalid just in case hash insert fails */
ClearBufferTag(&bufHdr->tag);
- buf_state &= ~(BM_VALID | BM_TAG_VALID);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
}
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
- if (found) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- hresult->id = b;
-
- /*
- * it's all ours now.
- */
- bufHdr->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
- buf_state |= BM_TAG_VALID;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-
- *foundPtr = false;
- return bufHdr;
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
@@ -431,7 +444,7 @@ InitLocalBuffers(void)
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
- nextFreeLocalBuf = 0;
+ nextFreeLocalBufId = 0;
/* initialize fields that need to start off nonzero */
for (i = 0; i < nbufs; i++)
--
2.38.0
Attachment: v4-0008-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch (text/x-diff; charset=us-ascii)
From d19867ec061bc3dd49b6343577cc93e23774329b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v4 08/15] bufmgr: Support multiple in-progress IOs by using
resowner
---
src/include/storage/bufmgr.h | 2 +-
src/include/utils/resowner_private.h | 5 ++
src/backend/access/transam/xact.c | 4 +-
src/backend/postmaster/autovacuum.c | 1 -
src/backend/postmaster/bgwriter.c | 1 -
src/backend/postmaster/checkpointer.c | 1 -
src/backend/postmaster/walwriter.c | 1 -
src/backend/storage/buffer/bufmgr.c | 86 ++++++++++++---------------
src/backend/utils/resowner/resowner.c | 60 +++++++++++++++++++
9 files changed, 105 insertions(+), 56 deletions(-)
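The mechanical change is replacing the single InProgressBuf variable with
per-resource-owner tracking. A simplified sketch of the resulting lifecycle,
as seen from inside bufmgr.c (the surrounding read logic is abbreviated for
illustration):

    if (StartBufferIO(buf_hdr, true))
    {
        /* StartBufferIO() remembered this IO in CurrentResourceOwner */
        smgrread(smgr, forkNum, blockNum, BufferGetBlock(buffer));

        /* on success: clears BM_IO_IN_PROGRESS and forgets the IO again */
        TerminateBufferIO(buf_hdr, false, BM_VALID);
    }

    /*
     * On error, resource owner cleanup (RESOURCE_RELEASE_BEFORE_LOCKS) now
     * calls AbortBufferIO() for every remembered buffer, so any number of
     * in-progress IOs can be cleaned up, not just the one InProgressBuf
     * could track.
     */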
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 973547a8baf..c568336b2b3 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -181,7 +181,7 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern void AbortBufferIO(void);
+extern void AbortBufferIO(Buffer buffer);
extern void BufmgrCommit(void);
extern bool BgBufferSync(struct WritebackContext *wb_context);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 1b1f3181b54..ae58438ec76 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -30,6 +30,11 @@ extern void ResourceOwnerEnlargeBuffers(ResourceOwner owner);
extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
+/* support for IO-in-progress management */
+extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
+extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
+extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
+
/* support for local lock management */
extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b8764012607..a0f53e9936a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2725,8 +2725,7 @@ AbortTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- /* Clean up buffer I/O and buffer context locks, too */
- AbortBufferIO();
+ /* Clean up buffer context locks, too */
UnlockBuffers();
/* Reset WAL record construction state */
@@ -5086,7 +5085,6 @@ AbortSubTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- AbortBufferIO();
UnlockBuffers();
/* Reset WAL record construction state */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index ff6149a1793..3540b340b51 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -526,7 +526,6 @@ AutoVacLauncherMain(int argc, char *argv[])
*/
LWLockReleaseAll();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
/* this is probably dead code, but let's be safe: */
if (AuxProcessResourceOwner)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 9bb47da404d..caad642ec93 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -167,7 +167,6 @@ BackgroundWriterMain(void)
*/
LWLockReleaseAll();
ConditionVariableCancelSleep();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index aaad5c52281..ace9893d957 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -271,7 +271,6 @@ CheckpointerMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 513e580c513..65e84be39b9 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -163,7 +163,6 @@ WalWriterMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4faa77b1705..c45ad37ec00 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,10 +159,6 @@ int checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
int bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
-/* local state for StartBufferIO and related functions */
-static BufferDesc *InProgressBuf = NULL;
-static bool IsForInput;
-
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
@@ -2712,7 +2708,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();
@@ -4659,7 +4654,7 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
uint32 buf_state;
- Assert(!InProgressBuf);
+ ResourceOwnerEnlargeBufferIOs(CurrentResourceOwner);
for (;;)
{
@@ -4683,8 +4678,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
buf_state |= BM_IO_IN_PROGRESS;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = buf;
- IsForInput = forInput;
+ ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
return true;
}
@@ -4710,8 +4705,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
{
uint32 buf_state;
- Assert(buf == InProgressBuf);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
@@ -4723,13 +4716,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = NULL;
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
- * AbortBufferIO: Clean up any active buffer I/O after an error.
+ * AbortBufferIO: Clean up active buffer I/O after an error.
*
* All LWLocks we might have held have been released,
* but we haven't yet released buffer pins, so the buffer is still pinned.
@@ -4738,46 +4732,42 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
* possible the error condition wasn't related to the I/O.
*/
void
-AbortBufferIO(void)
+AbortBufferIO(Buffer buf)
{
- BufferDesc *buf = InProgressBuf;
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
- if (buf)
+ buf_state = LockBufHdr(buf_hdr);
+ Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+
+ if (!(buf_state & BM_VALID))
{
- uint32 buf_state;
-
- buf_state = LockBufHdr(buf);
- Assert(buf_state & BM_IO_IN_PROGRESS);
- if (IsForInput)
- {
- Assert(!(buf_state & BM_DIRTY));
-
- /* We'd better not think buffer is valid yet */
- Assert(!(buf_state & BM_VALID));
- UnlockBufHdr(buf, buf_state);
- }
- else
- {
- Assert(buf_state & BM_DIRTY);
- UnlockBufHdr(buf, buf_state);
- /* Issue notice if this is not the first failure... */
- if (buf_state & BM_IO_ERROR)
- {
- /* Buffer is pinned, so we can read tag without spinlock */
- char *path;
-
- path = relpathperm(BufTagGetRelFileLocator(&buf->tag),
- BufTagGetForkNum(&buf->tag));
- ereport(WARNING,
- (errcode(ERRCODE_IO_ERROR),
- errmsg("could not write block %u of %s",
- buf->tag.blockNum, path),
- errdetail("Multiple failures --- write error might be permanent.")));
- pfree(path);
- }
- }
- TerminateBufferIO(buf, false, BM_IO_ERROR);
+ Assert(!(buf_state & BM_DIRTY));
+ UnlockBufHdr(buf_hdr, buf_state);
}
+ else
+ {
+ Assert(buf_state & BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ /* Issue notice if this is not the first failure... */
+ if (buf_state & BM_IO_ERROR)
+ {
+ /* Buffer is pinned, so we can read tag without spinlock */
+ char *path;
+
+ path = relpathperm(BufTagGetRelFileLocator(&buf_hdr->tag),
+ BufTagGetForkNum(&buf_hdr->tag));
+ ereport(WARNING,
+ (errcode(ERRCODE_IO_ERROR),
+ errmsg("could not write block %u of %s",
+ buf_hdr->tag.blockNum, path),
+ errdetail("Multiple failures --- write error might be permanent.")));
+ pfree(path);
+ }
+ }
+
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR);
}
/*
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45d..fccc59b39dd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
/* We have built-in support for remembering: */
ResourceArray bufferarr; /* owned buffers */
+ ResourceArray bufferioarr; /* in-progress buffer IO */
ResourceArray catrefarr; /* catcache references */
ResourceArray catlistrefarr; /* catcache-list pins */
ResourceArray relrefarr; /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
ResourceArrayInit(&(owner->bufferarr), BufferGetDatum(InvalidBuffer));
+ ResourceArrayInit(&(owner->bufferioarr), BufferGetDatum(InvalidBuffer));
ResourceArrayInit(&(owner->catrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->catlistrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->relrefarr), PointerGetDatum(NULL));
@@ -517,6 +519,24 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (phase == RESOURCE_RELEASE_BEFORE_LOCKS)
{
+ /*
+ * Abort failed buffer IO. AbortBufferIO()->TerminateBufferIO() calls
+ * ResourceOwnerForgetBufferIO(), so we just have to iterate till
+ * there are none.
+ *
+ * Needs to be before we release buffer pins.
+ *
+ * During a commit, there shouldn't be any in-progress IO.
+ */
+ while (ResourceArrayGetAny(&(owner->bufferioarr), &foundres))
+ {
+ Buffer res = DatumGetBuffer(foundres);
+
+ if (isCommit)
+ elog(PANIC, "lost track of buffer IO on buffer %u", res);
+ AbortBufferIO(res);
+ }
+
/*
* Release buffer pins. Note that ReleaseBuffer will remove the
* buffer entry from our array, so we just have to iterate till there
@@ -746,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And it better not own any resources, either */
Assert(owner->bufferarr.nitems == 0);
+ Assert(owner->bufferioarr.nitems == 0);
Assert(owner->catrefarr.nitems == 0);
Assert(owner->catlistrefarr.nitems == 0);
Assert(owner->relrefarr.nitems == 0);
@@ -775,6 +796,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And free the object. */
ResourceArrayFree(&(owner->bufferarr));
+ ResourceArrayFree(&(owner->bufferioarr));
ResourceArrayFree(&(owner->catrefarr));
ResourceArrayFree(&(owner->catlistrefarr));
ResourceArrayFree(&(owner->relrefarr));
@@ -976,6 +998,44 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
buffer, owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * buffer array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeBufferIOs(ResourceOwner owner)
+{
+ /* We used to allow pinning buffers without a resowner, but no more */
+ Assert(owner != NULL);
+ ResourceArrayEnlarge(&(owner->bufferioarr));
+}
+
+/*
+ * Remember that a buffer IO is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeBufferIOs()
+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+
+/*
+ * Forget that a buffer IO is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ if (!ResourceArrayRemove(&(owner->bufferioarr), BufferGetDatum(buffer)))
+ elog(PANIC, "buffer IO %d is not owned by resource owner %s",
+ buffer, owner->name);
+}
+
/*
* Remember that a Local Lock is owned by a ResourceOwner
*
--
2.38.0
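
A note on the in-progress-IO tracking added above: the intended pairing is roughly the sketch below. It is not part of the patch, and whether ResourceOwnerRememberBufferIO() is called inside StartBufferIO() or at its call sites is not visible in the hunks quoted here, so its placement below is an assumption; only the enlarge-before-IO and forget-in-TerminateBufferIO() shape is implied by the comments above.

static void
sketch_buffer_io(BufferDesc *buf_hdr)
{
	Buffer		buf = BufferDescriptorGetBuffer(buf_hdr);

	/*
	 * Reserve resowner space before starting the IO, so that running out of
	 * memory cannot hit once we own the in-progress IO.
	 */
	ResourceOwnerEnlargeBufferIOs(CurrentResourceOwner);

	if (StartBufferIO(buf_hdr, true))
	{
		/* assumed placement: from here on, an error triggers AbortBufferIO() */
		ResourceOwnerRememberBufferIO(CurrentResourceOwner, buf);

		/* ... read or zero the page here ... */

		/*
		 * TerminateBufferIO(), also reached via AbortBufferIO(), performs the
		 * matching ResourceOwnerForgetBufferIO().
		 */
		TerminateBufferIO(buf_hdr, false, BM_VALID);
	}
}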
v4-0009-bufmgr-Move-relation-extension-handling-into-Exte.patch (text/x-diff; charset=us-ascii)
From 90351be6c08cc6ad5145b243b763b380fa4585af Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v4 09/15] bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}
---
src/include/pgstat.h | 1 +
src/include/storage/buf_internals.h | 7 +
src/include/storage/bufmgr.h | 49 ++
src/backend/storage/buffer/bufmgr.c | 765 +++++++++++++++++++------
src/backend/storage/buffer/localbuf.c | 156 ++++-
src/backend/utils/activity/pgstat_io.c | 8 +-
src/backend/utils/probes.d | 6 +-
doc/src/sgml/monitoring.sgml | 43 +-
8 files changed, 853 insertions(+), 182 deletions(-)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f43fac09ede..d608280b990 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -507,6 +507,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
BackendType bktype);
extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
extern PgStat_IO *pgstat_fetch_stat_io(void);
extern const char *pgstat_get_io_context_name(IOContext io_context);
extern const char *pgstat_get_io_object_name(IOObject io_object);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index fa5c451b1a9..feca19f5620 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -422,6 +422,13 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
+extern BlockNumber ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index c568336b2b3..4665c82baf2 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -138,6 +138,55 @@ extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
+
+/*
+ * Flags influencing the behaviour of ExtendBufferedRel*.
+ */
+typedef enum ExtendBufferedFlags {
+ EB_SKIP_EXTENSION_LOCK = (1 << 0),
+ EB_IN_RECOVERY = (1 << 1),
+ EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+ EB_LOCK_FIRST = (1 << 3),
+ EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+ /* internal flags follow */
+ EB_LOCK_TARGET = (1 << 5),
+} ExtendBufferedFlags;
+
+/*
+ * To identify the relation - either relation or smgr + relpersistence has to
+ * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
+ * to use the same function for both crash recovery and normal operation.
+ */
+typedef struct ExtendBufferedWhat
+{
+ Relation rel;
+ struct SMgrRelationData *smgr;
+ char relpersistence;
+} ExtendBufferedWhat;
+
+#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
+#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags);
+extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by);
+extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode);
+
+
extern void InitBufferPoolAccess(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index c45ad37ec00..f79e244b74a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
@@ -450,6 +451,22 @@ static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
+static BlockNumber ExtendBufferedRelInternal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
+static BlockNumber ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(BufferDesc *buf);
static void UnpinBuffer(BufferDesc *buf);
@@ -785,6 +802,166 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
mode, strategy, &hit);
}
+/*
+ * Convenience wrapper around ExtendBufferedRelBy() extending by one block.
+ */
+Buffer
+ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags)
+{
+ Buffer buf;
+ uint32 extend_by = 1;
+
+ ExtendBufferedRelBy(eb, forkNum, strategy, flags, extend_by,
+ &buf, &extend_by);
+
+ return buf;
+}
+
+/*
+ * Extend a relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources the relation may end up being extended by a
+ * smaller number of pages. *extended_by is updated to the number of pages
+ * the relation has actually been extended by.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first *extended_by array elements will point to a pinned
+ * buffer.
+ */
+BlockNumber
+ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_by > 0);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ return ExtendBufferedRelInternal(eb, fork, strategy, flags,
+ extend_by, InvalidBlockNumber,
+ buffers, extended_by);
+}
+
+Buffer
+ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode)
+{
+ BlockNumber current_size;
+ uint32 extended_by = 0;
+ Buffer buffer = InvalidBuffer;
+ Buffer buffers[64];
+
+ Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_to != InvalidBlockNumber && extend_to > 0);
+ Assert(mode == RBM_NORMAL || mode == RBM_ZERO_ON_ERROR ||
+ mode == RBM_ZERO_AND_LOCK);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ /*
+ * Create the file first if it doesn't exist. If
+ * smgr_cached_nblocks[fork] is positive then it must exist, no need for
+ * an smgrexists call.
+ */
+ if ((flags & EB_CREATE_FORK_IF_NEEDED) &&
+ (eb.smgr->smgr_cached_nblocks[fork] == 0 ||
+ eb.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) &&
+ !smgrexists(eb.smgr, fork))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* could have been closed while waiting for lock */
+ eb.smgr = RelationGetSmgr(eb.rel);
+
+ /* recheck, fork might have been created concurrently */
+ if (!smgrexists(eb.smgr, fork))
+ smgrcreate(eb.smgr, fork, flags & EB_IN_RECOVERY);
+
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ /*
+ * Estimate how many pages we'll need to extend by. This avoids acquiring
+ * unnecessarily many victim buffers.
+ */
+ current_size = smgrnblocks(eb.smgr, fork);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_TARGET;
+
+ while (current_size < extend_to)
+ {
+ uint32 num_pages = lengthof(buffers);
+ BlockNumber first_block;
+
+ if ((uint64) current_size + num_pages > extend_to)
+ num_pages = extend_to - current_size;
+
+ first_block = ExtendBufferedRelInternal(eb, fork, strategy, flags,
+ num_pages, extend_to,
+ buffers, &extended_by);
+
+ current_size = first_block + extended_by;
+ Assert(current_size <= extend_to);
+ Assert(num_pages != 0 || current_size >= extend_to);
+
+ for (int i = 0; i < extended_by; i++)
+ {
+ if (first_block + i + 1 != extend_to)
+ ReleaseBuffer(buffers[i]);
+ else
+ buffer = buffers[i];
+ }
+ }
+
+ /*
+ * It's possible that another backend concurrently extended the
+ * relation. In that case read the buffer.
+ *
+ * XXX: Should we control this via a flag?
+ */
+ if (buffer == InvalidBuffer)
+ {
+ bool hit;
+
+ Assert(extended_by == 0);
+ buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+ fork, extend_to - 1, mode, strategy,
+ &hit);
+ }
+
+ return buffer;
+}
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +978,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool found;
IOContext io_context;
IOObject io_object;
- bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
*hit = false;
+ /*
+ * Backward compatibility path, most code should use ExtendBufferedRel()
+ * instead, as acquiring the extension lock inside ExtendBufferedRel()
+ * scales a lot better.
+ */
+ if (unlikely(blockNum == P_NEW))
+ {
+ uint32 flags = EB_SKIP_EXTENSION_LOCK;
+
+ Assert(mode == RBM_NORMAL || mode == RBM_ZERO_AND_LOCK);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_FIRST;
+
+ return ExtendBufferedRel(EB_SMGR(smgr, relpersistence),
+ forkNum, strategy, flags);
+ }
+
/* Make sure we will have room to remember the buffer pin */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
- isExtend = (blockNum == P_NEW);
-
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend);
-
- /* Substitute proper block number if caller asked for P_NEW */
- if (isExtend)
- {
- blockNum = smgrnblocks(smgr, forkNum);
- /* Fail if relation is already at maximum possible length */
- if (blockNum == P_NEW)
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("cannot extend relation %s beyond %u blocks",
- relpath(smgr->smgr_rlocator, forkNum),
- P_NEW)));
- }
+ smgr->smgr_rlocator.backend);
if (isLocalBuf)
{
@@ -843,8 +1021,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
- else if (isExtend)
- pgBufferUsage.local_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.local_blks_read++;
@@ -859,8 +1035,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
- else if (isExtend)
- pgBufferUsage.shared_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.shared_blks_read++;
@@ -871,103 +1045,40 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
- if (!isExtend)
- {
- /* Just need to update stats before we exit */
- *hit = true;
- VacuumPageHit++;
+ /* Just need to update stats before we exit */
+ *hit = true;
+ VacuumPageHit++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageHit;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageHit;
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend,
- found);
-
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
- * locked on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
-
- return BufferDescriptorGetBuffer(bufHdr);
- }
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ found);
/*
- * We get here only in the corner case where we are trying to extend
- * the relation but we found a pre-existing buffer marked BM_VALID.
- * This can happen because mdread doesn't complain about reads beyond
- * EOF (when zero_damaged_pages is ON) and so a previous attempt to
- * read a block beyond EOF could have left a "valid" zero-filled
- * buffer. Unfortunately, we have also seen this case occurring
- * because of buggy Linux kernels that sometimes return an
- * lseek(SEEK_END) result that doesn't account for a recent write. In
- * that situation, the pre-existing buffer would contain valid data
- * that we don't want to overwrite. Since the legitimate case should
- * always have left a zero-filled buffer, complain if not PageIsNew.
+ * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
+ * on return.
*/
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (!PageIsNew((Page) bufBlock))
- ereport(ERROR,
- (errmsg("unexpected data beyond EOF in block %u of relation %s",
- blockNum, relpath(smgr->smgr_rlocator, forkNum)),
- errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
-
- /*
- * We *must* do smgrextend before succeeding, else the page will not
- * be reserved by the kernel, and the next P_NEW call will decide to
- * return the same page. Clear the BM_VALID bit, do the StartBufferIO
- * call that BufferAlloc didn't, and proceed.
- */
- if (isLocalBuf)
+ if (!isLocalBuf)
{
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_EXCLUSIVE);
+ else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
}
- else
- {
- /*
- * Loop to handle the very small possibility that someone re-sets
- * BM_VALID between our clearing it and StartBufferIO inspecting
- * it.
- */
- do
- {
- uint32 buf_state = LockBufHdr(bufHdr);
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
- } while (!StartBufferIO(bufHdr, true));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
* if we have gotten to this point, we have allocated a buffer for the
* page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
* if it's a shared buffer.
- *
- * Note: if smgrextend fails, we will end up with a buffer that is
- * allocated but not marked BM_VALID. P_NEW will still select the same
- * block number (because the relation didn't get any longer on disk) and
- * so future attempts to extend the relation will find the same buffer (if
- * it's not been recycled) but come right back here to try smgrextend
- * again.
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
@@ -982,72 +1093,51 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
io_object = IOOBJECT_RELATION;
}
- if (isExtend)
- {
- /* new buffers are zero-filled */
+ /*
+ * Read in the page, unless the caller intends to overwrite it and just
+ * wants us to allocate a buffer.
+ */
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
- /* don't set checksum for all-zero page */
- smgrextend(smgr, forkNum, blockNum, bufBlock, false);
-
- pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
-
- /*
- * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
- * although we're essentially performing a write. At least on linux
- * doing so defeats the 'delayed allocation' mechanism, leading to
- * increased file fragmentation.
- */
- }
else
{
- /*
- * Read in the page, unless the caller intends to overwrite it and
- * just wants us to allocate a buffer.
- */
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
- else
+ instr_time io_start,
+ io_time;
+
+ if (track_io_timing)
+ INSTR_TIME_SET_CURRENT(io_start);
+
+ smgrread(smgr, forkNum, blockNum, bufBlock);
+
+ if (track_io_timing)
{
- instr_time io_start,
- io_time;
+ INSTR_TIME_SET_CURRENT(io_time);
+ INSTR_TIME_SUBTRACT(io_time, io_start);
+ pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+ INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+ }
- if (track_io_timing)
- INSTR_TIME_SET_CURRENT(io_start);
+ pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
- INSTR_TIME_SET_ZERO(io_start);
-
- smgrread(smgr, forkNum, blockNum, bufBlock);
-
- pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
- if (track_io_timing)
- {
- INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_SUBTRACT(io_time, io_start);
- pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
- INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
}
}
@@ -1090,7 +1180,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
smgr->smgr_rlocator.backend,
- isExtend,
found);
return BufferDescriptorGetBuffer(bufHdr);
@@ -1640,6 +1729,350 @@ again:
return buf;
}
+/*
+ * Limit the number of pins a batch operation may additionally acquire, to
+ * avoid running out of pinnable buffers.
+ *
+ * One additional pin is always allowed, as otherwise the operation likely
+ * cannot be performed at all.
+ *
+ * The number of allowed pins for a backend is computed based on
+ * shared_buffers and the maximum number of connections possible. That's very
+ * pessimistic, but outside of toy-sized shared_buffers it should allow
+ * sufficient pins.
+ */
+static void
+LimitAdditionalPins(uint32 *additional_pins)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ /*
+ * Subtract the approximate number of buffers already pinned by this
+ * backend. We get the number of "overflowed" pins for free, but don't
+ * know the number of pins in PrivateRefCountArray. The cost of
+ * calculating that exactly doesn't seem worth it, so just assume the max.
+ */
+ max_proportional_pins -= PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+ if (max_proportional_pins <= 0)
+ max_proportional_pins = 1;
+
+ if (*additional_pins > max_proportional_pins)
+ *additional_pins = max_proportional_pins;
+}
+
+static BlockNumber
+ExtendBufferedRelInternal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ extend_by);
+
+ if (eb.relpersistence == RELPERSISTENCE_TEMP)
+ first_block = ExtendBufferedRelLocal(eb, fork, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ else
+ first_block = ExtendBufferedRelShared(eb, fork, strategy, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ *extended_by = extend_by;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ *extended_by,
+ first_block);
+
+ return first_block;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * shared buffers.
+ */
+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+ IOContext io_context = IOContextForStrategy(strategy);
+
+ LimitAdditionalPins(&extend_by);
+
+ /*
+ * Acquire victim buffers for extension without holding extension lock.
+ * Writing out victim buffers is the most expensive part of extending the
+ * relation, particularly when doing so requires WAL flushes. Zeroing out
+ * the buffers is also quite expensive, so do that before holding the
+ * extension lock as well.
+ *
+ * These pages are pinned by us and not valid. While we hold the pin they
+ * can't be acquired as victim buffers by another backend.
+ */
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ Block buf_block;
+
+ buffers[i] = GetVictimBuffer(strategy, io_context);
+ buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ /*
+ * Lock relation against concurrent extensions, unless requested not to.
+ *
+ * We use the same extension lock for all forks. That's unnecessarily
+ * restrictive, but currently extensions for forks don't happen often
+ * enough to make it worth locking more granularly.
+ *
+ * Note that another backend might have extended the relation by the time
+ * we get the lock.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+ eb.smgr = RelationGetSmgr(eb.rel);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ if (extend_upto != InvalidBlockNumber)
+ {
+ uint32 old_num_pages = extend_by;
+
+ if (first_block > extend_upto)
+ extend_by = 0;
+ else if ((uint64) first_block + extend_by > extend_upto)
+ extend_by = extend_upto - first_block;
+
+ for (uint32 i = extend_by; i < old_num_pages; i++)
+ {
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused; let it
+ * be found again quickly.
+ */
+ StrategyFreeBuffer(buf_hdr);
+ UnpinBuffer(buf_hdr);
+ }
+
+ if (extend_by == 0)
+ {
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ *extended_by = extend_by;
+ return first_block;
+ }
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ /*
+ * Insert buffers into buffer table, mark as IO_IN_PROGRESS.
+ *
+ * This needs to happen before we extend the relation, because as soon as
+ * we do, other backends can start to read in those pages.
+ */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer victim_buf = buffers[i];
+ BufferDesc *victim_buf_hdr = GetBufferDescriptor(victim_buf - 1);
+ BufferTag tag;
+ uint32 hash;
+ LWLock *partition_lock;
+ int existing_id;
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ existing_id = BufTableInsert(&tag, hash, victim_buf_hdr->buf_id);
+
+ /*
+ * We get here only in the corner case where we are trying to extend
+ * the relation but we found a pre-existing buffer. This can happen
+ * because a prior attempt at extending the relation failed, and
+ * because mdread doesn't complain about reads beyond EOF (when
+ * zero_damaged_pages is ON) and so a previous attempt to read a block
+ * beyond EOF could have left a "valid" zero-filled buffer.
+ * Unfortunately, we have also seen this case occurring because of
+ * buggy Linux kernels that sometimes return an lseek(SEEK_END) result
+ * that doesn't account for a recent write. In that situation, the
+ * pre-existing buffer would contain valid data that we don't want to
+ * overwrite. Since the legitimate cases should always have left a
+ * zero-filled buffer, complain if not PageIsNew.
+ */
+ if (existing_id >= 0)
+ {
+ BufferDesc *existing_hdr = GetBufferDescriptor(existing_id);
+ Block buf_block;
+ bool valid;
+
+ /*
+ * Pin the existing buffer before releasing the partition lock,
+ * preventing it from being evicted.
+ */
+ valid = PinBuffer(existing_hdr, strategy);
+
+ LWLockRelease(partition_lock);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused; let it
+ * be found again quickly.
+ */
+ StrategyFreeBuffer(victim_buf_hdr);
+ UnpinBuffer(victim_buf_hdr);
+
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+ buf_block = BufHdrGetBlock(existing_hdr);
+
+ if (valid && !PageIsNew((Page) buf_block))
+ ereport(ERROR,
+ (errmsg("unexpected data beyond EOF in block %u of relation %s",
+ existing_hdr->tag.blockNum, relpath(eb.smgr->smgr_rlocator, fork)),
+ errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
+
+ /*
+ * We *must* do smgr[zero]extend before succeeding, else the page
+ * will not be reserved by the kernel, and the next P_NEW call
+ * will decide to return the same page. Clear the BM_VALID bit,
+ * do StartBufferIO() and proceed.
+ *
+ * Loop to handle the very small possibility that someone re-sets
+ * BM_VALID between our clearing it and StartBufferIO inspecting
+ * it.
+ */
+ do
+ {
+ uint32 buf_state = LockBufHdr(existing_hdr);
+
+ buf_state &= ~BM_VALID;
+ UnlockBufHdr(existing_hdr, buf_state);
+ } while (!StartBufferIO(existing_hdr, true));
+ }
+ else
+ {
+ uint32 buf_state;
+
+ buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (eb.relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
+ UnlockBufHdr(victim_buf_hdr, buf_state);
+
+ LWLockRelease(partition_lock);
+
+ /* XXX: could combine StartBufferIO()'s locked operations with the above */
+ StartBufferIO(victim_buf_hdr, true);
+ }
+ }
+
+ /*
+ * Note: if smgrzeroextend fails, we will end up with buffers that are
+ * allocated but not marked BM_VALID. The next relation extension will
+ * still select the same block number (because the relation didn't get any
+ * longer on disk) and so future attempts to extend the relation will find
+ * the same buffers (if they have not been recycled) but come right back
+ * here to try smgrzeroextend again.
+ *
+ * We don't need to set checksum for all-zero pages.
+ */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ *
+ * We remove IO_IN_PROGRESS after this, as zeroing the buffer contents and
+ * waking up waiting backends can take noticeable time.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ bool lock = false;
+
+ if (flags & EB_LOCK_FIRST && i == 0)
+ lock = true;
+ else if (flags & EB_LOCK_TARGET)
+ {
+ Assert(extend_upto != InvalidBlockNumber);
+ if (first_block + i + 1 == extend_upto)
+ lock = true;
+ }
+
+ if (lock)
+ LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+
+ TerminateBufferIO(buf_hdr, false, BM_VALID);
+ }
+
+ pgBufferUsage.shared_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
+ extend_by);
+
+ *extended_by = extend_by;
+
+ return first_block;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5b44b0be8b5..0528fddf992 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -49,6 +49,9 @@ static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
+/* number of local buffers pinned at least once */
+static int NLocalPinnedBuffers = 0;
+
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
@@ -280,6 +283,154 @@ GetLocalVictimBuffer(void)
return BufferDescriptorGetBuffer(bufHdr);
}
+/* see LimitAdditionalPins() */
+static void
+LimitAdditionalLocalPins(uint32 *additional_pins)
+{
+ uint32 max_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ /*
+ * In contrast to LimitAdditionalPins() other backends don't play a role
+ * here. We can allow up to NLocBuffer pins in total.
+ */
+ max_pins = (NLocBuffer - NLocalPinnedBuffers);
+
+ if (*additional_pins >= max_pins)
+ *additional_pins = max_pins;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * temporary buffers.
+ */
+BlockNumber
+ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ /* Initialize local buffers if first request in this session */
+ if (LocalBufHash == NULL)
+ InitLocalBuffers();
+
+ LimitAdditionalLocalPins(&extend_by);
+
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ BufferDesc *buf_hdr;
+ Block buf_block;
+
+ buffers[i] = GetLocalVictimBuffer();
+ buf_hdr = GetLocalBufferDescriptor(-(buffers[i] + 1));
+ buf_block = LocalBufHdrGetBlock(buf_hdr);
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ if (extend_upto != InvalidBlockNumber)
+ {
+ /*
+ * In contrast to shared relations, nothing could change the relation
+ * size concurrently. Thus we shouldn't end up finding that we don't
+ * need to do anything.
+ */
+ Assert(first_block <= extend_upto);
+
+ Assert((uint64) first_block + extend_by <= extend_upto);
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ int victim_buf_id;
+ BufferDesc *victim_buf_hdr;
+ BufferTag tag;
+ LocalBufferLookupEnt *hresult;
+ bool found;
+
+ victim_buf_id = -(buffers[i] + 1);
+ victim_buf_hdr = GetLocalBufferDescriptor(victim_buf_id);
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
+ if (found)
+ {
+ BufferDesc *existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ uint32 buf_state;
+
+ UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));
+
+ existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ PinLocalBuffer(existing_hdr, false);
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+
+ buf_state = pg_atomic_read_u32(&existing_hdr->state);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_DIRTY));
+ buf_state &= ~BM_VALID;
+ pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
+ }
+ else
+ {
+ uint32 buf_state = pg_atomic_read_u32(&victim_buf_hdr->state);
+
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&victim_buf_hdr->state, buf_state);
+
+ hresult->id = victim_buf_id;
+ }
+ }
+
+ /* actually extend relation */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr;
+ uint32 buf_state;
+
+ buf_hdr = GetLocalBufferDescriptor(-(buf + 1));
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ *extended_by = extend_by;
+
+ pgBufferUsage.temp_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
+ extend_by);
+
+ return first_block;
+}
+
/*
* MarkLocalBufferDirty -
* mark a local buffer dirty
@@ -494,6 +645,7 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
+ NLocalPinnedBuffers++;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
@@ -515,9 +667,11 @@ UnpinLocalBuffer(Buffer buffer)
Assert(BufferIsLocal(buffer));
Assert(LocalRefCount[buffid] > 0);
+ Assert(NLocalPinnedBuffers > 0);
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[buffid]--;
+ if (--LocalRefCount[buffid] == 0)
+ NLocalPinnedBuffers--;
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d5546101..f2f6eae8031 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -64,13 +64,19 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
void
pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ pgstat_count_io_op_n(io_object, io_context, io_op, 1);
+}
+
+void
+pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt)
{
Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES);
Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES);
Assert((unsigned int) io_op < IOOP_NUM_TYPES);
Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
- PendingIOStats.data[io_object][io_context][io_op]++;
+ PendingIOStats.data[io_object][io_context][io_op] += cnt;
have_iostats = true;
}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index c064d679e94..fd3df2f7900 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -55,10 +55,12 @@ provider postgresql {
probe sort__start(int, bool, int, int, bool, int);
probe sort__done(bool, long);
- probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
- probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool);
+ probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+ probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
probe buffer__flush__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
+ probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+ probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
probe buffer__checkpoint__start(int);
probe buffer__checkpoint__sync__start();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 77bb13ba0a7..62a7dd2785c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7713,33 +7713,52 @@ FROM pg_stat_get_backend_idset() AS backendid;
<entry>Probe that fires when the two-phase portion of a checkpoint is
complete.</entry>
</row>
+ <row>
+ <entry><literal>buffer-extend-start</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int)</literal></entry>
+ <entry>Probe that fires when a relation extension starts.
+ arg0 contains the fork to be extended. arg1, arg2, and arg3 contain the
+ tablespace, database, and relation OIDs identifying the relation. arg4
+ is the ID of the backend which created the temporary relation for a
+ local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the caller would like to extend
+ by.</entry>
+ </row>
+ <row>
+ <entry><literal>buffer-extend-done</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber)</literal></entry>
+ <entry>Probe that fires when a relation extension is complete.
+ arg0 contains the fork that was extended. arg1, arg2, and arg3 contain
+ the tablespace, database, and relation OIDs identifying the relation.
+ arg4 is the ID of the backend which created the temporary relation for
+ a local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the relation was extended by; this
+ can be less than the number requested in
+ <literal>buffer-extend-start</literal> due to resource
+ constraints. arg6 contains the BlockNumber of the first new
+ block.</entry>
+ </row>
<row>
<entry><literal>buffer-read-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int)</literal></entry>
<entry>Probe that fires when a buffer read is started.
- arg0 and arg1 contain the fork and block numbers of the page (but
- arg1 will be -1 if this is a relation extension request).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.</entry>
+ </entry>
</row>
<row>
<entry><literal>buffer-read-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
<entry>Probe that fires when a buffer read is complete.
- arg0 and arg1 contain the fork and block numbers of the page (if this
- is a relation extension request, arg1 now contains the block number
- of the newly added block).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.
- arg7 is true if the buffer was found in the pool, false if not.</entry>
+ arg6 is true if the buffer was found in the pool, false if not.</entry>
</row>
<row>
<entry><literal>buffer-flush-start</literal></entry>
--
2.38.0
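
To make the shape of the new API concrete, callers end up looking roughly like the sketch below. The function names here are made up for illustration; the real conversions are in the next patch.

/* one new page, returned pinned and exclusively locked */
static Buffer
sketch_new_locked_page(Relation rel)
{
	return ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
							 EB_LOCK_FIRST);
}

/* bulk extension; may get fewer pages than requested if pins are scarce */
static void
sketch_bulk_extend(Relation rel)
{
	Buffer		buffers[16];
	uint32		extended_by = 0;
	BlockNumber first_block;

	first_block = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM, NULL,
									  0 /* flags */ ,
									  lengthof(buffers), buffers,
									  &extended_by);

	elog(DEBUG1, "extended from block %u by %u blocks",
		 first_block, extended_by);

	/* buffers are returned pinned, but not locked (no EB_LOCK_* flag) */
	for (uint32 i = 0; i < extended_by; i++)
		ReleaseBuffer(buffers[i]);
}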
v4-0010-Convert-a-few-places-to-ExtendBufferedRel.patch (text/x-diff; charset=us-ascii)
From 188abd0e2f1158be92c92b735f9d0ddbc08dc798 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:18:18 -0700
Subject: [PATCH v4 10/15] Convert a few places to ExtendBufferedRel
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/access/brin/brin.c | 9 +++----
src/backend/access/brin/brin_pageops.c | 4 +++
src/backend/access/brin/brin_revmap.c | 15 +++---------
src/backend/access/gin/gininsert.c | 10 +++-----
src/backend/access/gin/ginutil.c | 13 ++--------
src/backend/access/gin/ginvacuum.c | 8 ++++++
src/backend/access/gist/gist.c | 4 +--
src/backend/access/gist/gistutil.c | 14 ++---------
src/backend/access/gist/gistvacuum.c | 3 +++
src/backend/access/nbtree/nbtpage.c | 34 +++++++-------------------
src/backend/access/nbtree/nbtree.c | 3 +++
src/backend/access/spgist/spgutils.c | 13 ++--------
contrib/bloom/blutils.c | 12 ++-------
13 files changed, 48 insertions(+), 94 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index b5a5fa7b334..b283e987152 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -836,9 +836,9 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
- meta = ReadBuffer(index, P_NEW);
+ meta = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
- LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);
@@ -903,9 +903,8 @@ brinbuildempty(Relation index)
Buffer metabuf;
/* An empty BRIN index has a metapage only. */
- metabuf =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metabuf = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index ad5a89bd051..b578d259545 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -730,6 +730,10 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* There's not enough free space in any existing index page,
* according to the FSM: extend the relation to obtain a shiny new
* page.
+ *
+ * XXX: It's likely possible to use RBM_ZERO_AND_LOCK here,
+ * which'd avoid the need to hold the extension lock during buffer
+ * reclaim.
*/
if (!RELATION_IS_LOCAL(irel))
{
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 7fc5226bf74..f4271ba31c9 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -538,7 +538,6 @@ revmap_physical_extend(BrinRevmap *revmap)
BlockNumber mapBlk;
BlockNumber nblocks;
Relation irel = revmap->rm_irel;
- bool needLock = !RELATION_IS_LOCAL(irel);
/*
* Lock the metapage. This locks out concurrent extensions of the revmap,
@@ -570,10 +569,8 @@ revmap_physical_extend(BrinRevmap *revmap)
}
else
{
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buf = ReadBuffer(irel, P_NEW);
+ buf = ExtendBufferedRel(EB_REL(irel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
if (BufferGetBlockNumber(buf) != mapBlk)
{
/*
@@ -582,17 +579,11 @@ revmap_physical_extend(BrinRevmap *revmap)
* up and have caller start over. We will have to evacuate that
* page from under whoever is using it.
*/
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buf);
+ UnlockReleaseBuffer(buf);
return;
}
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buf);
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d5d748009ea..be1841de7bf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -440,12 +440,10 @@ ginbuildempty(Relation index)
MetaBuffer;
/* An empty GIN index has two pages. */
- MetaBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- RootBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ MetaBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+ RootBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer and root buffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index f05128ecf50..be2cf78d0d4 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -298,7 +298,6 @@ Buffer
GinNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -326,16 +325,8 @@ GinNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, GIN_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index e5d310d8362..13251d7e07d 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -736,6 +736,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
needLock = !RELATION_IS_LOCAL(index);
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
npages = RelationGetNumberOfBlocks(index);
@@ -786,6 +790,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
stats->num_pages = RelationGetNumberOfBlocks(index);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ba394f08f61..3865955c394 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -133,8 +133,8 @@ gistbuildempty(Relation index)
Buffer buffer;
/* Initialize the root page */
- buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_SKIP_EXTENSION_LOCK | EB_LOCK_FIRST);
/* Initialize and xlog buffer */
START_CRIT_SECTION();
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index b4d843a0ff1..446fcb3cc5e 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -824,7 +824,6 @@ Buffer
gistNewBuffer(Relation r)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -877,17 +876,8 @@ gistNewBuffer(Relation r)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(r);
-
- if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
-
- buffer = ReadBuffer(r, P_NEW);
- LockBuffer(buffer, GIST_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(r), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3f60d3274d2..cc711b04986 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -203,6 +203,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* we must already have processed any tuples due to be moved into such a
* page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 3feee28d197..c4f16b3d2b7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -881,7 +881,6 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
else
{
- bool needLock;
Page page;
Assert(access == BT_WRITE);
@@ -962,31 +961,16 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
/*
- * Extend the relation by one page.
- *
- * We have to use a lock to ensure no one else is extending the rel at
- * the same time, else we will both try to initialize the same new
- * page. We can skip locking for new or temp relations, however,
- * since no one else could be accessing them.
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
+ * we risk a race condition against btvacuumscan --- see comments
+ * therein. This forces us to repeat the valgrind request that
+ * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
+ * without introducing a race.
*/
- needLock = !RELATION_IS_LOCAL(rel);
-
- if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
-
- buf = ReadBuffer(rel, P_NEW);
-
- /* Acquire buffer lock on new page */
- _bt_lockbuf(rel, buf, BT_WRITE);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to
- * extend the relation some more. Note that we cannot release this
- * lock before we have buffer lock on the new page, or we risk a race
- * condition against btvacuumscan --- see comments therein.
- */
- if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3f7b541e9d2..bc184813a70 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -969,6 +969,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* write-lock on the left page before it adds a right page, so we must
* already have processed any tuples due to be moved into such a page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 3761f2c193b..0775b7d2618 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -365,7 +365,6 @@ Buffer
SpGistNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -405,16 +404,8 @@ SpGistNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a6d9f09f315..d935ed8fbdf 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -353,7 +353,6 @@ Buffer
BloomNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -387,15 +386,8 @@ BloomNewBuffer(Relation index)
}
/* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
--
2.38.0
v4-0011-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch (text/x-diff; charset=us-ascii)
From d1000bd37bb287491e6e3200f486be718db33ad4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v4 11/15] heapam: Add num_pages to RelationGetBufferForTuple()
This will be useful to compute the number of pages to extend a relation by.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/access/hio.h | 14 +++++++-
src/backend/access/heap/heapam.c | 59 +++++++++++++++++++++++++++++---
src/backend/access/heap/hio.c | 8 ++++-
3 files changed, 75 insertions(+), 6 deletions(-)
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 3f20b585326..7f877817597 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -30,6 +30,17 @@ typedef struct BulkInsertStateData
{
BufferAccessStrategy strategy; /* our BULKWRITE strategy object */
Buffer current_buf; /* current insertion target page */
+
+ /*
+ * State for bulk extensions. Further pages that were unused at the time
+ * of the extension. They might be in use by the time we use them though,
+ * so rechecks are needed.
+ *
+ * FIXME: Perhaps these should live in RelationData instead, alongside the
+ * targetblock?
+ */
+ BlockNumber next_free;
+ BlockNumber last_free;
} BulkInsertStateData;
@@ -38,6 +49,7 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertStateData *bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages);
#endif /* HIO_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 7eb79cee58d..2eb7879a016 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1774,6 +1774,8 @@ GetBulkInsertState(void)
bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
bistate->current_buf = InvalidBuffer;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
return bistate;
}
@@ -1847,7 +1849,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ 0);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2050,6 +2053,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
return tup;
}
+/*
+ * Helper for heap_multi_insert() that computes the number of full pages
+ * that inserting the remaining heaptuples requires.
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+ size_t page_avail;
+ int npages = 0;
+
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ npages++;
+
+ for (int i = done; i < ntuples; i++)
+ {
+ size_t tup_sz = sizeof(ItemIdData) + MAXALIGN(heaptuples[i]->t_len);
+
+ if (page_avail < tup_sz)
+ {
+ npages++;
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ }
+ page_avail -= tup_sz;
+ }
+
+ return npages;
+}
+
/*
* heap_multi_insert - insert multiple tuples into a heap
*
@@ -2076,6 +2106,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ bool starting_with_empty_page = false;
+ int npages = 0;
+ int npages_used = 0;
/* currently not needed (thus unsupported) for heap_multi_insert() */
Assert(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2126,13 +2159,29 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- bool starting_with_empty_page;
bool all_visible_cleared = false;
bool all_frozen_set = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
+ /*
+ * Compute number of pages needed to insert tuples in the worst case.
+ * This will be used to determine how much to extend the relation by
+ * in RelationGetBufferForTuple(), if needed. If we filled a prior
+ * page from scratch, we can just update our last computation, but if
+ * we started with a partially filled page, recompute from scratch, as
+ * the number of potentially required pages can vary due to tuples
+ * needing to fit onto the page, page headers etc.
+ */
+ if (ndone == 0 || !starting_with_empty_page)
+ {
+ npages = heap_multi_insert_pages(heaptuples, ndone, ntuples, saveFreeSpace);
+ npages_used = 0;
+ }
+ else
+ npages_used++;
+
/*
* Find buffer where at least the next tuple will fit. If the page is
* all-visible, this will also pin the requisite visibility map page.
@@ -2142,7 +2191,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ npages - npages_used);
page = BufferGetPage(buffer);
starting_with_empty_page = PageGetMaxOffsetNumber(page) == 0;
@@ -3566,7 +3616,8 @@ l2:
/* It doesn't fit, must use RelationGetBufferForTuple. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ 0);
/* We're all done. */
break;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7479212d4e0..65886839e70 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -275,6 +275,11 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
* Returns pinned and exclusive-locked buffer of a page in given relation
* with free space >= given len.
*
+ * If num_pages is > 1, the relation will be extended by at least that many
+ * pages when we decide to extend the relation. This is more efficient for
+ * callers that know they will need multiple pages
+ * (e.g. heap_multi_insert()).
+ *
* If otherBuffer is not InvalidBuffer, then it references a previously
* pinned buffer of another page in the same relation; on return, this
* buffer will also be exclusive-locked. (This case is used by heap_update;
@@ -333,7 +338,8 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
--
2.38.0
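For illustration only, here is a minimal, hypothetical sketch of the same
worst-case estimate that heap_multi_insert_pages() above performs, simplified
to equally sized tuples and ignoring saveFreeSpace; the function name and the
simplification are mine, not part of the patch:

static int
estimate_pages_for_fixed_tuples(int ntuples, Size tuple_len)
{
	/* worst-case on-page footprint of one tuple: line pointer + aligned data */
	Size		tup_sz = sizeof(ItemIdData) + MAXALIGN(tuple_len);
	Size		page_avail = BLCKSZ - SizeOfPageHeaderData;
	/* assumes tuple_len fits on a page, as heap tuples must */
	int			tuples_per_page = page_avail / tup_sz;

	/* round up: a partially filled last page still needs to be allocated */
	return (ntuples + tuples_per_page - 1) / tuples_per_page;
}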
v4-0012-hio-Use-ExtendBufferedRelBy.patch (text/x-diff)
From 4ee7be145f977d23672b8e6066d3967895f21eef Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v4 12/15] hio: Use ExtendBufferedRelBy()
---
src/backend/access/heap/hio.c | 287 +++++++++++++++++-----------------
1 file changed, 147 insertions(+), 140 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e70..48cfcff975f 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -185,90 +185,6 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
}
}
-/*
- * Extend a relation by multiple blocks to avoid future contention on the
- * relation extension lock. Our goal is to pre-extend the relation by an
- * amount which ramps up as the degree of contention ramps up, but limiting
- * the result to some sane overall value.
- */
-static void
-RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
-{
- BlockNumber blockNum,
- firstBlock = InvalidBlockNumber;
- int extraBlocks;
- int lockWaiters;
-
- /* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
- if (lockWaiters <= 0)
- return;
-
- /*
- * It might seem like multiplying the number of lock waiters by as much as
- * 20 is too aggressive, but benchmarking revealed that smaller numbers
- * were insufficient. 512 is just an arbitrary cap to prevent
- * pathological results.
- */
- extraBlocks = Min(512, lockWaiters * 20);
-
- do
- {
- Buffer buffer;
- Page page;
- Size freespace;
-
- /*
- * Extend by one page. This should generally match the main-line
- * extension code in RelationGetBufferForTuple, except that we hold
- * the relation extension lock throughout, and we don't immediately
- * initialize the page (see below).
- */
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
- page = BufferGetPage(buffer);
-
- if (!PageIsNew(page))
- elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- BufferGetBlockNumber(buffer),
- RelationGetRelationName(relation));
-
- /*
- * Add the page to the FSM without initializing. If we were to
- * initialize here, the page would potentially get flushed out to disk
- * before we add any useful content. There's no guarantee that that'd
- * happen before a potential crash, so we need to deal with
- * uninitialized pages anyway, thus avoid the potential for
- * unnecessary writes.
- */
-
- /* we'll need this info below */
- blockNum = BufferGetBlockNumber(buffer);
- freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
-
- UnlockReleaseBuffer(buffer);
-
- /* Remember first block number thus added. */
- if (firstBlock == InvalidBlockNumber)
- firstBlock = blockNum;
-
- /*
- * Immediately update the bottom level of the FSM. This has a good
- * chance of making this page visible to other concurrently inserting
- * backends, and we want that to happen without delay.
- */
- RecordPageWithFreeSpace(relation, blockNum, freespace);
- }
- while (--extraBlocks > 0);
-
- /*
- * Updating the upper levels of the free space map is too expensive to do
- * for every block, but it's worth doing once at the end to make sure that
- * subsequent insertion activity sees all of those nifty free pages we
- * just inserted.
- */
- FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
-}
-
/*
* RelationGetBufferForTuple
*
@@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
len = MAXALIGN(len); /* be conservative */
+ if (num_pages <= 0)
+ num_pages = 1;
+
/* Bulk insert is not supported for updates, only inserts. */
Assert(otherBuffer == InvalidBuffer || !bistate);
@@ -558,18 +477,46 @@ loop:
ReleaseBuffer(buffer);
}
- /* Without FSM, always fall out of the loop and extend */
- if (!use_fsm)
- break;
+ if (bistate
+ && bistate->next_free != InvalidBlockNumber
+ && bistate->next_free <= bistate->last_free)
+ {
+ /*
+ * We bulk extended the relation before, and there are still some
+ * unused pages from that extension, so we don't need to look in
+ * the FSM for a new page. But do record the free space from the
+ * last page, somebody might insert narrower tuples later.
+ */
+ if (use_fsm)
+ RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
- /*
- * Update FSM as to condition of this page, and ask for another page
- * to try.
- */
- targetBlock = RecordAndGetPageWithFreeSpace(relation,
- targetBlock,
- pageFreeSpace,
- targetFreeSpace);
+ Assert(bistate->last_free != InvalidBlockNumber &&
+ bistate->next_free <= bistate->last_free);
+ targetBlock = bistate->next_free;
+ if (bistate->next_free >= bistate->last_free)
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ else
+ bistate->next_free++;
+ }
+ else if (!use_fsm)
+ {
+ /* Without FSM, always fall out of the loop and extend */
+ break;
+ }
+ else
+ {
+ /*
+ * Update FSM as to condition of this page, and ask for another
+ * page to try.
+ */
+ targetBlock = RecordAndGetPageWithFreeSpace(relation,
+ targetBlock,
+ pageFreeSpace,
+ targetFreeSpace);
+ }
}
/*
@@ -582,60 +529,120 @@ loop:
*/
needLock = !RELATION_IS_LOCAL(relation);
- /*
- * If we need the lock but are not able to acquire it immediately, we'll
- * consider extending the relation by multiple blocks at a time to manage
- * contention on the relation extension lock. However, this only makes
- * sense if we're using the FSM; otherwise, there's no point.
- */
- if (needLock)
{
- if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+#define MAX_BUFFERS 64
+ Buffer victim_buffers[MAX_BUFFERS];
+ BlockNumber firstBlock = InvalidBlockNumber;
+ BlockNumber firstBlockFSM = InvalidBlockNumber;
+ BlockNumber curBlock;
+ uint32 extend_by_pages;
+ uint32 no_fsm_pages;
+ uint32 waitcount;
+
+ extend_by_pages = num_pages;
+
+ /*
+ * Multiply the number of pages to extend by the number of waiters. Do
+ * this even if we're not using the FSM, as it does relieve
+ * contention. Pages will be found via bistate->next_free.
+ */
+ if (needLock)
+ waitcount = RelationExtensionLockWaiterCount(relation);
+ else
+ waitcount = 0;
+ extend_by_pages += extend_by_pages * waitcount;
+
+ /*
+ * Can't extend by more than MAX_BUFFERS, as we need to pin them all
+ * concurrently. FIXME: Need an NBuffers / MaxBackends type limit
+ * here.
+ */
+ extend_by_pages = Min(extend_by_pages, MAX_BUFFERS);
+
+ /*
+ * How many of the extended pages not to enter into the FSM.
+ *
+ * Only enter pages that we don't need ourselves into the FSM.
+ * Otherwise every other backend will immediately try to use the pages
+ * this backend needs itself, causing unnecessary contention.
+ *
+ * Bulk extended pages are remembered in bistate->next_free. So
+ * without a bistate we can't directly make use of them.
+ *
+ * Never enter the page returned into the FSM, we'll immediately use
+ * it.
+ */
+ if (num_pages > 1 && bistate == NULL)
+ no_fsm_pages = 1;
+ else
+ no_fsm_pages = num_pages;
+
+ if (bistate && bistate->current_buf != InvalidBuffer)
{
- /* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ ReleaseBuffer(bistate->current_buf);
+ bistate->current_buf = InvalidBuffer;
+ }
- /*
- * Check if some other backend has extended a block for us while
- * we were waiting on the lock.
- */
- targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
+ firstBlock = ExtendBufferedRelBy(EB_REL(relation), MAIN_FORKNUM,
+ bistate ? bistate->strategy : NULL,
+ EB_LOCK_FIRST,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
- /*
- * If some other waiter has already extended the relation, we
- * don't need to do so; just use the existing freespace.
- */
- if (targetBlock != InvalidBlockNumber)
+ /*
+ * Relation is now extended. Make all but the first buffer available
+ * to other backends.
+ *
+ * XXX: We don't necessarily need to release pin / update FSM while
+ * holding the extension lock. But there are some advantages.
+ */
+ curBlock = firstBlock;
+ for (uint32 i = 0; i < extend_by_pages; i++, curBlock++)
+ {
+ Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
+ Assert(BlockNumberIsValid(curBlock));
+
+ /* don't release the pin on the page returned by this function */
+ if (i > 0)
+ ReleaseBuffer(victim_buffers[i]);
+
+ if (i >= no_fsm_pages && use_fsm)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
- goto loop;
- }
+ if (firstBlockFSM == InvalidBlockNumber)
+ firstBlockFSM = curBlock;
- /* Time to bulk-extend. */
- RelationAddExtraBlocks(relation, bistate);
+ RecordPageWithFreeSpace(relation,
+ curBlock,
+ BufferGetPageSize(victim_buffers[i]) - SizeOfPageHeaderData);
+ }
+ }
+
+ if (use_fsm && firstBlockFSM != InvalidBlockNumber)
+ FreeSpaceMapVacuumRange(relation, firstBlockFSM, firstBlock + num_pages);
+
+ if (bistate)
+ {
+ if (extend_by_pages > 1)
+ {
+ bistate->next_free = firstBlock + 1;
+ bistate->last_free = firstBlock + extend_by_pages - 1;
+ }
+ else
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ }
+
+ buffer = victim_buffers[0];
+ if (bistate)
+ {
+ IncrBufferRefCount(buffer);
+ bistate->current_buf = buffer;
}
}
- /*
- * In addition to whatever extension we performed above, we always add at
- * least one block to satisfy our own request.
- *
- * XXX This does an lseek - rather expensive - but at the moment it is the
- * only way to accurately determine how many blocks are in a relation. Is
- * it worth keeping an accurate file length in shared memory someplace,
- * rather than relying on the kernel to do it for us?
- */
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
--
2.38.0
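For readers following along, here is a condensed, hypothetical sketch of the
new extension flow in hio.c, using ExtendBufferedRelBy() with the same
arguments the patch passes (EB_REL, EB_LOCK_FIRST). Bistate handling, the
FSM-skip logic and error paths are omitted, and extend_relation_sketch()
itself is not part of the patch:

#define SKETCH_MAX_BUFFERS 64

static Buffer
extend_relation_sketch(Relation relation, uint32 want_pages)
{
	Buffer		victim_buffers[SKETCH_MAX_BUFFERS];
	uint32		extended_by = Min(want_pages, SKETCH_MAX_BUFFERS);
	BlockNumber first_block;

	/* extend in bulk; the buffer for the first new block comes back locked */
	first_block = ExtendBufferedRelBy(EB_REL(relation), MAIN_FORKNUM,
									  NULL, /* no strategy */
									  EB_LOCK_FIRST,
									  extended_by,
									  victim_buffers,
									  &extended_by);

	/* keep the first page for ourselves, hand the rest to other backends */
	for (uint32 i = 1; i < extended_by; i++)
	{
		ReleaseBuffer(victim_buffers[i]);
		RecordPageWithFreeSpace(relation, first_block + i,
								BLCKSZ - SizeOfPageHeaderData);
	}
	if (extended_by > 1)
		FreeSpaceMapVacuumRange(relation, first_block + 1,
								first_block + extended_by);

	return victim_buffers[0];
}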
v4-0013-bufmgr-debug-Add-PrintBuffer-Desc.patch (text/x-diff)
From e64466076efce2926eb3450c58168da3b08b7505 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:41:46 -0700
Subject: [PATCH v4 13/15] bufmgr: debug: Add PrintBuffer[Desc]
Useful for development. Perhaps we should polish these and keep them?
---
src/include/storage/buf_internals.h | 3 ++
src/backend/storage/buffer/bufmgr.c | 49 +++++++++++++++++++++++++++++
2 files changed, 52 insertions(+)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index feca19f5620..3d66d24958b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -391,6 +391,9 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
extern void IssuePendingWritebacks(WritebackContext *context);
extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
+extern void PrintBuffer(Buffer buffer, const char *msg);
+extern void PrintBufferDesc(BufferDesc *buf_hdr, const char *msg);
+
/* freelist.c */
extern IOContext IOContextForStrategy(BufferAccessStrategy bas);
extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f79e244b74a..645061f153f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3974,6 +3974,55 @@ DropDatabaseBuffers(Oid dbid)
* use only.
* -----------------------------------------------------------------
*/
+
+#include "utils/memutils.h"
+
+void
+PrintBufferDesc(BufferDesc *buf_hdr, const char *msg)
+{
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ uint32 buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ char *path = "";
+ BlockNumber blockno = InvalidBlockNumber;
+
+ CurrentMemoryContext->allowInCritSection = true;
+ if (buf_state & BM_TAG_VALID)
+ {
+ path = relpathbackend(BufTagGetRelFileLocator(&buf_hdr->tag),
+ InvalidBackendId, BufTagGetForkNum(&buf_hdr->tag));
+ blockno = buf_hdr->tag.blockNum;
+ }
+
+ fprintf(stderr, "%d: [%u] msg: %s, rel: %s, block %u: refcount: %u / %u, usagecount: %u, flags:%s%s%s%s%s%s%s%s%s%s\n",
+ MyProcPid,
+ buffer,
+ msg,
+ path,
+ blockno,
+ BUF_STATE_GET_REFCOUNT(buf_state),
+ GetPrivateRefCount(buffer),
+ BUF_STATE_GET_USAGECOUNT(buf_state),
+ buf_state & BM_LOCKED ? " BM_LOCKED" : "",
+ buf_state & BM_DIRTY ? " BM_DIRTY" : "",
+ buf_state & BM_VALID ? " BM_VALID" : "",
+ buf_state & BM_TAG_VALID ? " BM_TAG_VALID" : "",
+ buf_state & BM_IO_IN_PROGRESS ? " BM_IO_IN_PROGRESS" : "",
+ buf_state & BM_IO_ERROR ? " BM_IO_ERROR" : "",
+ buf_state & BM_JUST_DIRTIED ? " BM_JUST_DIRTIED" : "",
+ buf_state & BM_PIN_COUNT_WAITER ? " BM_PIN_COUNT_WAITER" : "",
+ buf_state & BM_CHECKPOINT_NEEDED ? " BM_CHECKPOINT_NEEDED" : "",
+ buf_state & BM_PERMANENT ? " BM_PERMANENT" : ""
+ );
+}
+
+void
+PrintBuffer(Buffer buffer, const char *msg)
+{
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);
+
+ PrintBufferDesc(buf_hdr, msg);
+}
+
#ifdef NOT_USED
void
PrintBufferDescs(void)
--
2.38.0
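As a usage illustration (not part of the patch), the helpers could be dropped
into a suspect code path while chasing a pin-count or state-flag problem; the
wrapper below is hypothetical:

static void
debug_lock_buffer(Buffer buffer)
{
	/* caller must already hold a pin on the buffer */
	PrintBuffer(buffer, "before lock");
	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
	PrintBuffer(buffer, "locked");
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
}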
v4-0014-WIP-Don-t-initialize-page-in-vm-fsm-_extend-not-n.patch (text/x-diff)
From 73d17acec1f2c87cd73bd3a84335c06436404e4f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 28 Feb 2023 20:56:32 -0800
Subject: [PATCH v4 14/15] WIP: Don't initialize page in {vm,fsm}_extend(), not
needed
---
src/backend/access/heap/visibilitymap.c | 6 +-----
src/backend/storage/freespace/freespace.c | 5 +----
2 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 74ff01bb172..709213d0d97 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -623,11 +623,9 @@ static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
BlockNumber vm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
-
/*
* We use the relation extension lock to lock out other backends trying to
* extend the visibility map at the same time. It also locks out extension
@@ -663,8 +661,6 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
vm_nblocks_now++;
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 3e9693b293b..90c529958e7 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,10 +608,9 @@ static void
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
/*
* We use the relation extension lock to lock out other backends trying to
@@ -649,8 +648,6 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
--
2.38.0
v4-0015-Convert-a-few-places-to-ExtendBufferedRelTo.patch (text/x-diff)
From 8a22aa9070562245de83a819b93d3b4bdcac6bc7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:08:17 -0800
Subject: [PATCH v4 15/15] Convert a few places to ExtendBufferedRelTo
---
src/backend/access/heap/visibilitymap.c | 80 +++++++---------------
src/backend/access/transam/xlogutils.c | 29 ++------
src/backend/storage/freespace/freespace.c | 83 +++++++----------------
3 files changed, 55 insertions(+), 137 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 709213d0d97..60b96f5df80 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -126,7 +126,7 @@
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
-static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static Buffer vm_extend(Relation rel, BlockNumber vm_nblocks);
/*
@@ -575,22 +575,29 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. It's always safe to clear bits, so it's better to clear
+ * corrupt pages than error out.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with vm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
{
if (extend)
- vm_extend(rel, blkno + 1);
+ buf = vm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
+ RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
- * always safe to clear bits, so it's better to clear corrupt pages than
- * error out.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -603,8 +610,6 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
- RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -619,51 +624,16 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* Ensure that the visibility map fork is at least vm_nblocks long, extending
* it if necessary with zeroed pages.
*/
-static void
+static Buffer
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
- BlockNumber vm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
+ Buffer buf;
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the visibility map at the same time. It also locks out extension
- * of the main fork, unnecessarily, but extending the visibility map
- * happens seldom enough that it doesn't seem worthwhile to have a
- * separate lock tag type for it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, VISIBILITYMAP_FORKNUM))
- smgrcreate(reln, VISIBILITYMAP_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
- vm_nblocks_now = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);
-
- /* Now extend the file */
- while (vm_nblocks_now < vm_nblocks)
- {
- smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
- vm_nblocks_now++;
- }
+ buf = ExtendBufferedRelTo(EB_REL(rel), VISIBILITYMAP_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ vm_nblocks,
+ RBM_ZERO_ON_ERROR);
/*
* Send a shared-inval message to force other backends to close any smgr
@@ -672,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* to keep checking for creation or extension of the file, which happens
* infrequently.
*/
- CacheInvalidateSmgr(reln->smgr_rlocator);
+ CacheInvalidateSmgr(RelationGetSmgr(rel)->smgr_rlocator);
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return buf;
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 2c28956b1aa..a9c9f2dcac2 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -524,28 +524,13 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
/* OK to extend the file */
/* we do this in recovery only - no rel-extension lock needed */
Assert(InRecovery);
- buffer = InvalidBuffer;
- do
- {
- if (buffer != InvalidBuffer)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- }
- buffer = ReadBufferWithoutRelcache(rlocator, forknum,
- P_NEW, mode, NULL, true);
- }
- while (BufferGetBlockNumber(buffer) < blkno);
- /* Handle the corner case that P_NEW returns non-consecutive pages */
- if (BufferGetBlockNumber(buffer) != blkno)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno,
- mode, NULL, true);
- }
+ buffer = ExtendBufferedRelTo(EB_SMGR(smgr, RELPERSISTENCE_PERMANENT),
+ forknum,
+ NULL,
+ EB_IN_RECOVERY |
+ EB_SKIP_EXTENSION_LOCK,
+ blkno + 1,
+ mode);
}
recent_buffer_fast_path:
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 90c529958e7..2face615d07 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -98,7 +98,7 @@ static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot);
static BlockNumber fsm_logical_to_physical(FSMAddress addr);
static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend);
-static void fsm_extend(Relation rel, BlockNumber fsm_nblocks);
+static Buffer fsm_extend(Relation rel, BlockNumber fsm_nblocks);
/* functions to convert amount of free space to a FSM category */
static uint8 fsm_space_avail_to_cat(Size avail);
@@ -558,24 +558,30 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. The FSM information is not accurate anyway, so it's better
+ * to clear corrupt pages than error out. Since the FSM changes are not
+ * WAL-logged, the so-called torn page problem on crash can lead to pages
+ * with corrupt headers, for example.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with fsm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[FSM_FORKNUM])
{
if (extend)
- fsm_extend(rel, blkno + 1);
+ buf = fsm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. The FSM
- * information is not accurate anyway, so it's better to clear corrupt
- * pages than error out. Since the FSM changes are not WAL-logged, the
- * so-called torn page problem on crash can lead to pages with corrupt
- * headers, for example.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -588,7 +594,6 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -604,56 +609,14 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* it if necessary with empty pages. And by empty, I mean pages filled
* with zeros, meaning there's no free space.
*/
-static void
+static Buffer
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
- BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
-
-
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the FSM at the same time. It also locks out extension of the
- * main fork, unnecessarily, but extending the FSM happens seldom enough
- * that it doesn't seem worthwhile to have a separate lock tag type for
- * it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, FSM_FORKNUM))
- smgrcreate(reln, FSM_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- fsm_nblocks_now = smgrnblocks(reln, FSM_FORKNUM);
-
- /* Extend as needed. */
- while (fsm_nblocks_now < fsm_nblocks)
- {
- smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
- pg.data, false);
- fsm_nblocks_now++;
- }
-
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return ExtendBufferedRelTo(EB_REL(rel), FSM_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ fsm_nblocks,
+ RBM_ZERO_ON_ERROR);
}
/*
--
2.38.0
Hi,
On 2022-10-28 19:54:20 -0700, Andres Freund wrote:
I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremly IO miss heavy workloads, more below.server "base" configuration:
max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to ringbuffer).

To make it a fair comparison, HEAD includes the lwlock-waitqueue fix as well.
s_b=24GB
test: unlogged_small_files, format: text, files: 1024, 9015MB total
clients   HEAD (s)  HEAD (tbl-MB/s)  patch (s)  patch (tbl-MB/s)  no_fsm (s)  no_fsm (tbl-MB/s)
      1      58.63              207      50.22               242       54.35                224
      2      32.67              372      25.82               472       27.30                446
      4      22.53              540      13.30               916       14.33                851
      8      15.14              804       7.43              1640        7.48               1632
     16      14.69              829       4.79              2544        4.50               2718
     32      15.28              797       4.41              2763        3.32               3710
     64      15.34              794       5.22              2334        3.06               4061
    128      15.49              786       4.97              2452        3.13               3926
    256      15.85              768       5.02              2427        3.26               3769
    512      16.02              760       5.29              2303        3.54               3471
I just spent a few hours trying to reproduce these benchmark results. For the
longest time I could not get the numbers for *HEAD* to even get close to the
above, while the numbers for the patch were very close.
I was worried it was a performance regression in HEAD etc. But no, same git
commit as back then produces the same issue.
As it turns out, I somehow screwed up my benchmark tooling, and I did not set
the CPU "scaling_governor" and "energy_performance_preference" to
"performance". In a crazy turn of events, that approximately makes no
difference with the patch applied, and a 2x difference for HEAD.
I suspect this is some pathological interaction with heavy lock contention
(likely the CPU dropping to a lower frequency / deeper power state while
waiting, which then takes longer to recover from once the lock is released).
As the lock contention is drastically reduced with the patch, that effect is
no longer visible.
After fixing the performance scaling issue, the results are quite close to the
above numbers again...
Aargh, I want my afternoon back.
Greetings,
Andres Freund
Hi,
Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
relation extension related code in hio.c back into its own function.
While reviewing the hio.c code, I did realize that too much stuff is done
while holding the buffer lock. See also the pre-existing issue
/messages/by-id/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
Greetings,
Andres Freund
Attachments:
v5-0001-Add-some-error-checking-around-pinning.patch (text/x-diff)
From d0e47591a10e5bb2745a366fce2a51d9964f28cb Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v5 01/14] Add some error checking around pinning
---
src/include/storage/bufmgr.h | 1 +
src/backend/storage/buffer/bufmgr.c | 42 ++++++++++++++++++++---------
2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b8a18b8081f..973547a8baf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -134,6 +134,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
+extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 95212a39416..fa20fab5a29 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1755,6 +1755,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
bool result;
PrivateRefCountEntry *ref;
+ Assert(!BufferIsLocal(b));
+
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)
@@ -1900,6 +1902,8 @@ UnpinBuffer(BufferDesc *buf)
PrivateRefCountEntry *ref;
Buffer b = BufferDescriptorGetBuffer(buf);
+ Assert(!BufferIsLocal(b));
+
/* not moving as we're likely deleting it soon anyway */
ref = GetPrivateRefCountEntry(b, false);
Assert(ref != NULL);
@@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ GetPrivateRefCount(buffer));
+ }
+}
+
/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
@@ -4310,20 +4333,11 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- if (BufferIsLocal(buffer))
- {
- /* There should be exactly one pin */
- if (LocalRefCount[-buffer - 1] != 1)
- elog(ERROR, "incorrect local pin count: %d",
- LocalRefCount[-buffer - 1]);
- /* Nobody else to wait for */
- return;
- }
+ BufferCheckOneLocalPin(buffer);
- /* There should be exactly one local pin */
- if (GetPrivateRefCount(buffer) != 1)
- elog(ERROR, "incorrect local pin count: %d",
- GetPrivateRefCount(buffer));
+ /* Nobody else to wait for */
+ if (BufferIsLocal(buffer))
+ return;
bufHdr = GetBufferDescriptor(buffer - 1);
@@ -4824,6 +4838,8 @@ LockBufHdr(BufferDesc *desc)
SpinDelayStatus delayStatus;
uint32 old_buf_state;
+ Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
init_local_spin_delay(&delayStatus);
while (true)
--
2.38.0
v5-0002-hio-Release-extension-lock-before-initializing-pa.patch (text/x-diff)
From fcd9364e3bbd93410d748bd4a308a40b93cc6451 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:28:06 -0700
Subject: [PATCH v5 02/14] hio: Release extension lock before initializing page
/ pinning VM
PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3
started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we
release the page lock. PageInit() zeroes the page, which isn't that cheap, so
deferring it until after the extension lock is released seems like a good idea.
Doing visibilitymap_pin() while holding the extension lock, introduced in
7db0cd2145f2, looks like an accident. Due to the restrictions on
HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems
better to move it out.
---
src/backend/access/heap/hio.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index e152807d2dc..7479212d4e0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -623,6 +623,13 @@ loop:
*/
buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ */
+ if (needLock)
+ UnlockRelationForExtension(relation, ExclusiveLock);
+
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
@@ -647,13 +654,6 @@ loop:
visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
/*
* Lock the other buffer. It's guaranteed to be of a lower page number
* than the new page. To conform with the deadlock prevent rules, we ought
--
2.38.0
v5-0003-Add-smgrzeroextend-FileZero-FileFallocate.patch (text/x-diff)
From 5aa4e216299176ce0f7e0e54e7d97269b418c8d0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 27 Feb 2023 17:36:37 -0800
Subject: [PATCH v5 03/14] Add smgrzeroextend(), FileZero(), FileFallocate()
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, FileZero() is used instead,
as posix_fallocate() for small numbers of blocks is inefficient on some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Backpatch:
---
src/include/storage/fd.h | 3 +
src/include/storage/md.h | 2 +
src/include/storage/smgr.h | 2 +
src/backend/storage/file/fd.c | 90 ++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 111 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 28 ++++++++
6 files changed, 236 insertions(+)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f85de97d083..daceafd4732 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -106,6 +106,9 @@ extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event
extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
+extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+
extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 8f32af9ef3d..941879ee6a8 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,6 +28,8 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 0935144f425..a9a179aabac 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9fd8444ed4d..c34ed41d52e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2206,6 +2206,92 @@ FileSync(File file, uint32 wait_event_info)
return returnCode;
}
+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+ int returnCode;
+ ssize_t written;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+ pgstat_report_wait_end();
+
+ if (written < 0)
+ return -1;
+ else if (written != amount)
+ {
+ /* if errno is unset, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ return -1;
+ }
+
+ return 0;
+}
+
+/*
+ * Try to reserve file space with posix_fallocate(). If posix_fallocate() is
+ * not implemented on the operating system or fails with EINVAL / EOPNOTSUPP,
+ * use FileZero() instead.
+ *
+ * Note that at least glibc implements posix_fallocate() in userspace if not
+ * implemented by the filesystem. That's not the case for all environments
+ * though.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ *
+ * Even though posix_fallocate has the offset before the length argument, it
+ * seems better to keep the argument order consistent with most of the other
+ * functions in this file.
+ */
+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+ int returnCode;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+ * Return in case of a "real" failure; if fallocate is not supported,
+ * fall through to the FileZero() backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return returnCode;
+#endif
+
+ return FileZero(file, offset, amount, wait_event_info);
+}
+
off_t
FileSize(File file)
{
@@ -2278,6 +2364,10 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file) < 0)
+ return -1;
+
return VfdCache[file].fd;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1feb..59a65a8305c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
+#include "common/file_utils.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -500,6 +501,116 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
+/*
+ * mdzeroextend() -- Add new zeroed out blocks to the specified relation.
+ *
+ * Similar to mdextend(), except the relation can be extended by
+ * multiple blocks at once, and the added blocks will be filled with
+ * zeroes.
+ */
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+ MdfdVec *v;
+ BlockNumber curblocknum = blocknum;
+ int remblocks = nblocks;
+
+ Assert(nblocks > 0);
+
+ /* This assert is too expensive to have on normally ... */
+#ifdef CHECK_WRITE_VS_EXTEND
+ Assert(blocknum >= mdnblocks(reln, forknum));
+#endif
+
+ /*
+ * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
+ * more --- we mustn't create a block whose number actually is
+ * InvalidBlockNumber or larger.
+ */
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend file \"%s\" beyond %u blocks",
+ relpath(reln->smgr_rlocator, forknum),
+ InvalidBlockNumber)));
+
+ while (remblocks > 0)
+ {
+ int segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+ int segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+ off_t seekpos = (off_t) BLCKSZ * segstartblock;
+ int numblocks;
+
+ if (segendblock > RELSEG_SIZE)
+ segendblock = RELSEG_SIZE;
+
+ numblocks = segendblock - segstartblock;
+
+ v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+
+ Assert(segstartblock < RELSEG_SIZE);
+ Assert(segendblock <= RELSEG_SIZE);
+
+ /*
+ * If available and useful, use posix_fallocate() (via FileFallocate())
+ * to extend the relation. That's often more efficient than using
+ * write(), as it commonly won't cause the kernel to allocate page
+ * cache space for the extended pages.
+ *
+ * However, we don't use FileFallocate() for small extensions, as it
+ * defeats delayed allocation on some filesystems. Not clear where
+ * that decision should be made though? For now just use a cutoff of
+ * 8, anything between 4 and 8 worked OK in some local testing.
+ */
+ if (numblocks > 8)
+ {
+ int ret;
+
+ ret = FileFallocate(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret != 0)
+ {
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+ }
+ else
+ {
+ int ret;
+
+ /*
+ * Even if we don't want to use fallocate, we can still extend a
+ * bit more efficiently than writing each 8kB block individually.
+ * pg_pwrite_zeroes() (via FileZero()) uses
+ * pg_pwritev_with_retry() to avoid multiple writes or needing a
+ * zeroed buffer for the whole length of the extension.
+ */
+ ret = FileZero(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\": %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+
+ if (!skipFsync && !SmgrIsTemp(reln))
+ register_dirty_segment(reln, forknum, v);
+
+ Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+ remblocks -= segendblock - segstartblock;
+ curblocknum += segendblock - segstartblock;
+ }
+}
+
/*
* mdopenfork() -- Open one fork of the specified relation.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e54145..5224ca52592 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+ void (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
@@ -75,6 +77,7 @@ static const f_smgr smgrsw[] = {
.smgr_exists = mdexists,
.smgr_unlink = mdunlink,
.smgr_extend = mdextend,
+ .smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
.smgr_read = mdread,
.smgr_write = mdwrite,
@@ -507,6 +510,31 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
+/*
+ * smgrzeroextend() -- Add new zeroed out blocks to a file.
+ *
+ * Similar to smgrextend(), except the relation can be extended by
+ * multiple blocks at once, and the added blocks will be filled with
+ * zeroes.
+ */
+void
+smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ int nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
+ nblocks, skipFsync);
+
+ /*
+ * Normally we expect this to increase the fork size by nblocks, but if
+ * the cached value isn't as expected, just invalidate it so the next call
+ * asks the kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+}
+
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
*
--
2.38.0
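As a usage sketch (not part of the patch set), a caller wanting to append
nblocks zeroed blocks at the current end of a fork might do something like the
following; append_zeroed_blocks() is hypothetical, smgrnblocks() is existing
API, smgrzeroextend() is the function added above:

static void
append_zeroed_blocks(SMgrRelation reln, ForkNumber forknum, int nblocks)
{
	BlockNumber first_new = smgrnblocks(reln, forknum);

	/*
	 * Larger extensions will typically be served by posix_fallocate() via
	 * FileFallocate(); small ones fall back to FileZero().
	 */
	smgrzeroextend(reln, forknum, first_new, nblocks, false);
}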
v5-0004-bufmgr-Add-Pin-UnpinLocalBuffer.patch (text/x-diff)
From 4216dadd94583dac85f29c6763bb8df540abe457 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v5 04/14] bufmgr: Add Pin/UnpinLocalBuffer()
So far these were open-coded in quite a few places, without a good reason.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/storage/buf_internals.h | 2 +
src/include/storage/bufmgr.h | 2 +-
src/backend/storage/buffer/bufmgr.c | 36 +++-------------
src/backend/storage/buffer/localbuf.c | 62 +++++++++++++++++----------
4 files changed, 49 insertions(+), 53 deletions(-)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0b448147407..fa5c451b1a9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -415,6 +415,8 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
+extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
+extern void UnpinLocalBuffer(Buffer buffer);
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 973547a8baf..d30733c65a1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -134,7 +134,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
-extern void BufferCheckOneLocalPin(Buffer buffer);
+extern void BufferCheckWePinOnce(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa20fab5a29..6f50dbd212e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -636,20 +636,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
/* Is it still valid and holding the right tag? */
if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
{
- /*
- * Bump buffer's ref and usage counts. This is equivalent of
- * PinBuffer for a shared buffer.
- */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner, recent_buffer);
+ PinLocalBuffer(bufHdr, true);
pgBufferUsage.local_blks_hit++;
@@ -1708,8 +1695,7 @@ ReleaseAndReadBuffer(Buffer buffer,
BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) &&
BufTagGetForkNum(&bufHdr->tag) == forkNum)
return buffer;
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[-buffer - 1]--;
+ UnpinLocalBuffer(buffer);
}
else
{
@@ -4012,15 +3998,9 @@ ReleaseBuffer(Buffer buffer)
elog(ERROR, "bad buffer ID: %d", buffer);
if (BufferIsLocal(buffer))
- {
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
-
- Assert(LocalRefCount[-buffer - 1] > 0);
- LocalRefCount[-buffer - 1]--;
- return;
- }
-
- UnpinBuffer(GetBufferDescriptor(buffer - 1));
+ UnpinLocalBuffer(buffer);
+ else
+ UnpinBuffer(GetBufferDescriptor(buffer - 1));
}
/*
@@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
}
void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)
{
if (BufferIsLocal(buffer))
{
- /* There should be exactly one pin */
if (LocalRefCount[-buffer - 1] != 1)
elog(ERROR, "incorrect local pin count: %d",
LocalRefCount[-buffer - 1]);
}
else
{
- /* There should be exactly one local pin */
if (GetPrivateRefCount(buffer) != 1)
elog(ERROR, "incorrect local pin count: %d",
GetPrivateRefCount(buffer));
@@ -4333,7 +4311,7 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- BufferCheckOneLocalPin(buffer);
+ BufferCheckWePinOnce(buffer);
/* Nobody else to wait for */
if (BufferIsLocal(buffer))
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663d..798c5b93a87 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -145,27 +145,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
#endif
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- /* this part is equivalent to PinBuffer for a shared buffer */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
- if (buf_state & BM_VALID)
- *foundPtr = true;
- else
- {
- /* Previous read attempt must have failed; try again */
- *foundPtr = false;
- }
+ *foundPtr = PinLocalBuffer(bufHdr, true);
return bufHdr;
}
@@ -202,9 +183,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
else
{
/* Found a usable buffer */
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
+ PinLocalBuffer(bufHdr, false);
break;
}
}
@@ -491,6 +470,43 @@ InitLocalBuffers(void)
NLocBuffer = nbufs;
}
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+ uint32 buf_state;
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ int bufid = -(buffer + 1);
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ if (LocalRefCount[bufid] == 0)
+ {
+ if (adjust_usagecount &&
+ BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
+ {
+ buf_state += BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+ }
+ LocalRefCount[bufid]++;
+ ResourceOwnerRememberBuffer(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf_hdr));
+
+ return buf_state & BM_VALID;
+}
+
+void
+UnpinLocalBuffer(Buffer buffer)
+{
+ int buffid = -buffer - 1;
+
+ Assert(BufferIsLocal(buffer));
+ Assert(LocalRefCount[buffid] > 0);
+
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
+ LocalRefCount[buffid]--;
+}
+
/*
* GUC check_hook for temp_buffers
*/
--
2.38.0
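For reference, here is a minimal sketch of the calling convention for the two
new localbuf.c helpers added above. The function below is made up for
illustration and is not part of any of the patches; PinLocalBuffer() and
UnpinLocalBuffer() are the routines introduced in the patch:

static void
example_use_local_buffer(BufferDesc *buf_hdr)
{
    Buffer      buf = BufferDescriptorGetBuffer(buf_hdr);

    /* make sure the resource owner can remember the pin */
    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

    /* pin; "true" means bump the usagecount, as a lookup hit would */
    if (!PinLocalBuffer(buf_hdr, true))
    {
        /* BM_VALID not set: the caller has to read the block in, as before */
    }

    /* ... use the buffer ... */

    /* drops LocalRefCount[] and the resource owner entry in one place */
    UnpinLocalBuffer(buf);
}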
v5-0005-bufmgr-Remove-buffer-write-dirty-tracepoints.patch
From ef5aa196aff618585c9cb1466623c8207a643713 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:21:57 -0800
Subject: [PATCH v5 05/14] bufmgr: Remove buffer-write-dirty tracepoints
The tracepoints were using the relfilenumber / fork / block of the to-be-read-in
buffer. Some upcoming work would make that more expensive to provide. We still
have buffer-flush-start/done, which can serve most tracing needs that
buffer-write-dirty could serve.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 10 ----------
doc/src/sgml/monitoring.sgml | 17 -----------------
2 files changed, 27 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f50dbd212e..3d0683593fd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1277,21 +1277,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/* OK, do the I/O */
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
-
FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf->tag);
-
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
}
else
{
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c809ff1ba4a..b2ccd8d7fef 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7806,23 +7806,6 @@ FROM pg_stat_get_backend_idset() AS backendid;
it's typically not actually been written to disk yet.)
The arguments are the same as for <literal>buffer-flush-start</literal>.</entry>
</row>
- <row>
- <entry><literal>buffer-write-dirty-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
- <entry>Probe that fires when a server process begins to write a dirty
- buffer. (If this happens often, it implies that
- <xref linkend="guc-shared-buffers"/> is too
- small or the background writer control parameters need adjustment.)
- arg0 and arg1 contain the fork and block numbers of the page.
- arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
- identifying the relation.</entry>
- </row>
- <row>
- <entry><literal>buffer-write-dirty-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
- <entry>Probe that fires when a dirty-buffer write is complete.
- The arguments are the same as for <literal>buffer-write-dirty-start</literal>.</entry>
- </row>
<row>
<entry><literal>wal-buffer-write-dirty-start</literal></entry>
<entry><literal>()</literal></entry>
--
2.38.0
v5-0006-bufmgr-Acquire-and-clean-victim-buffer-separately.patch
From 063d53b677eea7bc3a9ff2cb37d4b7fe05b52fea Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:26:34 -0800
Subject: [PATCH v5 06/14] bufmgr: Acquire and clean victim buffer separately
Previously we held the buffer mapping partition locks for both the old and the
new buffer tag at the same time when changing the identity of a buffer. That is
particularly painful for relation extension, where a victim buffer currently has
to be acquired while holding the relation extension lock. By separating out the
victim buffer acquisition, future commits will be able to make relation
extension scale better.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/storage/buffer/bufmgr.c | 581 ++++++++++++++------------
src/backend/storage/buffer/localbuf.c | 115 ++---
2 files changed, 379 insertions(+), 317 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3d0683593fd..02281303440 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -473,6 +473,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext *io_context);
+static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -1128,18 +1129,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext *io_context)
{
- bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
- uint32 oldFlags;
- int buf_id;
- BufferDesc *buf;
- bool valid;
- uint32 buf_state;
+ int existing_buf_id;
+
+ Buffer victim_buffer;
+ BufferDesc *victim_buf_hdr;
+ uint32 victim_buf_state;
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -1150,15 +1147,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- if (buf_id >= 0)
+ existing_buf_id = BufTableLookup(&newTag, newHash);
+ if (existing_buf_id >= 0)
{
+ BufferDesc *buf;
+ bool valid;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
* buffer pool, and check to see if the correct data has been loaded
* into the buffer.
*/
- buf = GetBufferDescriptor(buf_id);
+ buf = GetBufferDescriptor(existing_buf_id);
valid = PinBuffer(buf, strategy);
@@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*io_context = IOContextForStrategy(strategy);
- /* Loop here in case we have to try another victim buffer */
- for (;;)
+ /*
+ * Acquire a victim buffer. Somebody else might try to do the same, as we
+ * don't hold any conflicting locks. If so, we'll have to undo our work
+ * later.
+ */
+ victim_buffer = GetVictimBuffer(strategy, *io_context);
+ victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);
+
+ /*
+ * Try to make a hashtable entry for the buffer under its new tag. If
+ * somebody else inserted another buffer for the tag, we'll release the
+ * victim buffer we acquired and use the already inserted one.
+ */
+ LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+ existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
+ if (existing_buf_id >= 0)
{
- /*
- * Ensure, while the spinlock's not yet held, that there's a free
- * refcount entry.
- */
- ReservePrivateRefCountEntry();
+ BufferDesc *existing_buf_hdr;
+ bool valid;
/*
- * Select a victim buffer. The buffer is returned with its header
- * spinlock still held!
+ * Got a collision. Someone has already done what we were about to do.
+ * We'll just handle this as if it were found in the buffer pool in
+ * the first place. First, give up the buffer we were planning to
+ * use.
+ *
+ * We could do this after releasing the partition lock, but then we'd
+ * have to call ResourceOwnerEnlargeBuffers() &
+ * ReservePrivateRefCountEntry() before acquiring the lock, for the
+ * rare case of such a collision.
*/
- buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ UnpinBuffer(victim_buf_hdr);
- Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+ /* FIXME: Should we put the victim buffer onto the freelist? */
- /* Must copy buffer flags while we still hold the spinlock */
- oldFlags = buf_state & BUF_FLAG_MASK;
+ /* remaining code should match code at top of routine */
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
- /*
- * If the buffer was dirty, try to write it out. There is a race
- * condition here, in that someone might dirty it after we released it
- * above, or even while we are writing it out (since our share-lock
- * won't prevent hint-bit updates). We will recheck the dirty bit
- * after re-locking the buffer header.
- */
- if (oldFlags & BM_DIRTY)
- {
- /*
- * We need a share-lock on the buffer contents to write it out
- * (else we might write invalid data, eg because someone else is
- * compacting the page contents while we write). We must use a
- * conditional lock acquisition here to avoid deadlock. Even
- * though the buffer was not pinned (and therefore surely not
- * locked) when StrategyGetBuffer returned it, someone else could
- * have pinned and exclusive-locked it by the time we get here. If
- * we try to get the lock unconditionally, we'd block waiting for
- * them; if they later block waiting for us, deadlock ensues.
- * (This has been observed to happen when two backends are both
- * trying to split btree index pages, and the second one just
- * happens to be trying to split the page the first one got from
- * StrategyGetBuffer.)
- */
- if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
- LW_SHARED))
- {
- /*
- * If using a nondefault strategy, and writing the buffer
- * would require a WAL flush, let the strategy decide whether
- * to go ahead and write/reuse the buffer or to choose another
- * victim. We need lock to inspect the page LSN, so this
- * can't be done inside StrategyGetBuffer.
- */
- if (strategy != NULL)
- {
- XLogRecPtr lsn;
+ valid = PinBuffer(existing_buf_hdr, strategy);
- /* Read the LSN while holding buffer header lock */
- buf_state = LockBufHdr(buf);
- lsn = BufferGetLSN(buf);
- UnlockBufHdr(buf, buf_state);
-
- if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf, from_ring))
- {
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(BufferDescriptorGetContentLock(buf));
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /* OK, do the I/O */
- FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
- LWLockRelease(BufferDescriptorGetContentLock(buf));
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext,
- &buf->tag);
- }
- else
- {
- /*
- * Someone else has locked the buffer, so give it up and loop
- * back to get another one.
- */
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /*
- * To change the association of a valid buffer, we'll need to have
- * exclusive lock on both the old and new mapping partitions.
- */
- if (oldFlags & BM_TAG_VALID)
- {
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
- oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* remember we have no old-partition lock or tag */
- oldPartitionLock = NULL;
- /* keep the compiler quiet about uninitialized variables */
- oldHash = 0;
- }
-
- /*
- * Try to make a hashtable entry for the buffer under its new tag.
- * This could fail because while we were writing someone else
- * allocated another buffer for the same block we want to read in.
- * Note that we have not yet removed the hashtable entry for the old
- * tag.
- */
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
- if (buf_id >= 0)
- {
- /*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
- */
- UnpinBuffer(buf);
-
- /* Can give up that buffer's mapping partition lock now */
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = GetBufferDescriptor(buf_id);
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
- *foundPtr = true;
-
- if (!valid)
- {
- /*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed. We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID. StartBufferIO does it all.
- */
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer
- * must have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
- }
-
- return buf;
- }
-
- /*
- * Need to lock the buffer header too in order to change its tag.
- */
- buf_state = LockBufHdr(buf);
-
- /*
- * Somebody could have pinned or re-dirtied the buffer while we were
- * doing the I/O and making the new hashtable entry. If so, we can't
- * recycle this buffer; we must undo everything we've done and start
- * over with a new victim buffer.
- */
- oldFlags = buf_state & BUF_FLAG_MASK;
- if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
- break;
-
- UnlockBufHdr(buf, buf_state);
- BufTableDelete(&newTag, newHash);
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
+ /* Can release the mapping lock as soon as we've pinned it */
LWLockRelease(newPartitionLock);
- UnpinBuffer(buf);
+
+ *foundPtr = true;
+
+ if (!valid)
+ {
+ /*
+ * We can only get here if (a) someone else is still reading in
+ * the page, or (b) a previous read attempt failed. We have to
+ * wait for any active read attempt to finish, and then set up our
+ * own read attempt if the page is still not BM_VALID.
+ * StartBufferIO does it all.
+ */
+ if (StartBufferIO(existing_buf_hdr, true))
+ {
+ /*
+ * If we get here, previous attempts to read the buffer must
+ * have failed ... but we shall bravely try again.
+ */
+ *foundPtr = false;
+ }
+ }
+
+ return existing_buf_hdr;
}
/*
- * Okay, it's finally safe to rename the buffer.
- *
- * Clearing BM_VALID here is necessary, clearing the dirtybits is just
- * paranoia. We also reset the usage_count since any recency of use of
- * the old content is no longer relevant. (The usage_count starts out at
- * 1 so that the buffer can survive one clock-sweep pass.)
- *
+ * Need to lock the buffer header too in order to change its tag.
+ */
+ victim_buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
+ Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+
+ victim_buf_hdr->tag = newTag;
+
+ /*
* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*/
- buf->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
- BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
- BUF_USAGECOUNT_MASK);
+ victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
- buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
- else
- buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ victim_buf_state |= BM_PERMANENT;
- UnlockBufHdr(buf, buf_state);
-
- if (oldPartitionLock != NULL)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
+ UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);
- if (oldFlags & BM_VALID)
- {
- /*
- * When a BufferAccessStrategy is in use, blocks evicted from shared
- * buffers are counted as IOOP_EVICT in the corresponding context
- * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
- * strategy in two cases: 1) while initially claiming buffers for the
- * strategy ring 2) to replace an existing strategy ring buffer
- * because it is pinned or in use and cannot be reused.
- *
- * Blocks evicted from buffers already in the strategy ring are
- * counted as IOOP_REUSE in the corresponding strategy context.
- *
- * At this point, we can accurately count evictions and reuses,
- * because we have successfully claimed the valid buffer. Previously,
- * we may have been forced to release the buffer due to concurrent
- * pinners or erroring out.
- */
- pgstat_count_io_op(IOOBJECT_RELATION, *io_context,
- from_ring ? IOOP_REUSE : IOOP_EVICT);
- }
-
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
* to read it before we did, so there's nothing left for BufferAlloc() to
* do.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(victim_buf_hdr, true))
*foundPtr = false;
else
*foundPtr = true;
- return buf;
+ return victim_buf_hdr;
}
/*
@@ -1595,6 +1413,237 @@ retry:
StrategyFreeBuffer(buf);
}
+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+ uint32 buf_state;
+ uint32 hash;
+ LWLock *partition_lock;
+ BufferTag tag;
+
+ Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+
+ /* have buffer pinned, so it's safe to read tag without lock */
+ tag = buf_hdr->tag;
+
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ /* lock the buffer header */
+ buf_state = LockBufHdr(buf_hdr);
+
+ /*
+ * We have the buffer pinned, so nobody else should have been able to unset
+ * this concurrently.
+ */
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BufferTagsEqual(&buf_hdr->tag, &tag));
+
+ /*
+ * If somebody else pinned the buffer since, or even worse, dirtied
+ * it, give up on this buffer: It's clearly in use.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) != 1 || (buf_state & BM_DIRTY))
+ {
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ UnlockBufHdr(buf_hdr, buf_state);
+ LWLockRelease(partition_lock);
+
+ return false;
+ }
+
+ /*
+ * Clear out the buffer's tag and flags and usagecount. This is not
+ * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+ * doing anything with the buffer. But currently it's beneficial as the
+ * pre-check for several linear scans of shared buffers just checks the
+ * tag.
+ */
+ ClearBufferTag(&buf_hdr->tag);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ /* finally delete buffer from the buffer mapping table */
+ BufTableDelete(&tag, hash);
+
+ LWLockRelease(partition_lock);
+
+ Assert(!(buf_state & (BM_DIRTY | BM_VALID | BM_TAG_VALID)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf_hdr->state)) > 0);
+
+ return true;
+}
+
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+ BufferDesc *buf_hdr;
+ Buffer buf;
+ uint32 buf_state;
+ bool from_ring;
+
+ /*
+ * Ensure, while the spinlock's not yet held, that there's a free refcount
+ * entry.
+ */
+ ReservePrivateRefCountEntry();
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* we return here if a prospective victim buffer gets used concurrently */
+again:
+
+ /*
+ * Select a victim buffer. The buffer is returned with its header
+ * spinlock still held!
+ */
+ buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ buf = BufferDescriptorGetBuffer(buf_hdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+ /* Pin the buffer and then release the buffer spinlock */
+ PinBuffer_Locked(buf_hdr);
+
+ /*
+ * We shouldn't have any other pins for this buffer.
+ */
+ BufferCheckWePinOnce(buf);
+
+ /*
+ * If the buffer was dirty, try to write it out. There is a race
+ * condition here, in that someone might dirty it after we released the
+ * buffer header lock above, or even while we are writing it out (since
+ * our share-lock won't prevent hint-bit updates). We will recheck the
+ * dirty bit after re-locking the buffer header.
+ */
+ if (buf_state & BM_DIRTY)
+ {
+ LWLock *content_lock;
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_VALID);
+
+ /*
+ * We need a share-lock on the buffer contents to write it out (else
+ * we might write invalid data, eg because someone else is compacting
+ * the page contents while we write). We must use a conditional lock
+ * acquisition here to avoid deadlock. Even though the buffer was not
+ * pinned (and therefore surely not locked) when StrategyGetBuffer
+ * returned it, someone else could have pinned and exclusive-locked it
+ * by the time we get here. If we try to get the lock unconditionally,
+ * we'd block waiting for them; if they later block waiting for us,
+ * deadlock ensues. (This has been observed to happen when two
+ * backends are both trying to split btree index pages, and the second
+ * one just happens to be trying to split the page the first one got
+ * from StrategyGetBuffer.)
+ */
+ content_lock = BufferDescriptorGetContentLock(buf_hdr);
+ if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /*
+ * Someone else has locked the buffer, so give it up and loop back
+ * to get another one.
+ */
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /*
+ * If using a nondefault strategy, and writing the buffer would
+ * require a WAL flush, let the strategy decide whether to go ahead
+ * and write/reuse the buffer or to choose another victim. We need
+ * lock to inspect the page LSN, so this can't be done inside
+ * StrategyGetBuffer.
+ */
+ if (strategy != NULL)
+ {
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ buf_state = LockBufHdr(buf_hdr);
+ lsn = BufferGetLSN(buf_hdr);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ if (XLogNeedsFlush(lsn)
+ && StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+ {
+ LWLockRelease(content_lock);
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+ }
+
+ /* OK, do the I/O */
+ FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+ LWLockRelease(content_lock);
+
+ ScheduleBufferTagForWriteback(&BackendWritebackContext,
+ &buf_hdr->tag);
+ }
+
+ if (buf_state & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+ from_ring ? IOOP_REUSE : IOOP_EVICT);
+ }
+
+ /*
+ * If the buffer has an entry in the buffer mapping table, delete it. This
+ * can fail because another backend could have pinned or dirtied the
+ * buffer.
+ */
+ if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+ {
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /* a final set of sanity checks */
+#ifdef USE_ASSERT_CHECKING
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+ Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+
+ BufferCheckWePinOnce(buf);
+#endif
+
+ return buf;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 798c5b93a87..5b44b0be8b5 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -45,13 +45,14 @@ BufferDesc *LocalBufferDescriptors = NULL;
Block *LocalBufferBlockPointers = NULL;
int32 *LocalRefCount = NULL;
-static int nextFreeLocalBuf = 0;
+static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
+static Buffer GetLocalVictimBuffer(void);
/*
@@ -113,10 +114,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
BufferDesc *bufHdr;
- int b;
- int trycounter;
+ Buffer victim_buffer;
+ int bufid;
bool found;
- uint32 buf_state;
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -138,23 +138,51 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (hresult)
{
- b = hresult->id;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufid = hresult->id;
+ bufHdr = GetLocalBufferDescriptor(bufid);
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
-#endif
*foundPtr = PinLocalBuffer(bufHdr, true);
- return bufHdr;
+ }
+ else
+ {
+ uint32 buf_state;
+
+ victim_buffer = GetLocalVictimBuffer();
+ bufid = -(victim_buffer + 1);
+ bufHdr = GetLocalBufferDescriptor(bufid);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
+ if (found) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ hresult->id = bufid;
+
+ /*
+ * it's all ours now.
+ */
+ bufHdr->tag = newTag;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+ *foundPtr = false;
}
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum,
- -nextFreeLocalBuf - 1);
-#endif
+ return bufHdr;
+}
+
+static Buffer
+GetLocalVictimBuffer(void)
+{
+ int victim_bufid;
+ int trycounter;
+ uint32 buf_state;
+ BufferDesc *bufHdr;
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
* Need to get a new buffer. We use a clock sweep algorithm (essentially
@@ -163,14 +191,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
trycounter = NLocBuffer;
for (;;)
{
- b = nextFreeLocalBuf;
+ victim_bufid = nextFreeLocalBufId;
- if (++nextFreeLocalBuf >= NLocBuffer)
- nextFreeLocalBuf = 0;
+ if (++nextFreeLocalBufId >= NLocBuffer)
+ nextFreeLocalBufId = 0;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufHdr = GetLocalBufferDescriptor(victim_bufid);
- if (LocalRefCount[b] == 0)
+ if (LocalRefCount[victim_bufid] == 0)
{
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -193,6 +221,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
errmsg("no empty local buffer available")));
}
+ /*
+ * lazy memory allocation: allocate space on first use of a buffer.
+ */
+ if (LocalBufHdrGetBlock(bufHdr) == NULL)
+ {
+ /* Set pointer for use by BufferGetBlock() macro */
+ LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
+ }
+
/*
* this buffer is not referenced but it might still be dirty. if that's
* the case, write it out before reusing it!
@@ -223,48 +260,24 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
}
/*
- * lazy memory allocation: allocate space on first use of a buffer.
- */
- if (LocalBufHdrGetBlock(bufHdr) == NULL)
- {
- /* Set pointer for use by BufferGetBlock() macro */
- LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
- }
-
- /*
- * Update the hash table: remove old entry, if any, and make new one.
+ * Remove the victim buffer from the hashtable and mark as invalid.
*/
if (buf_state & BM_TAG_VALID)
{
+ LocalBufferLookupEnt *hresult;
+
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
if (!hresult) /* shouldn't happen */
elog(ERROR, "local buffer hash table corrupted");
/* mark buffer invalid just in case hash insert fails */
ClearBufferTag(&bufHdr->tag);
- buf_state &= ~(BM_VALID | BM_TAG_VALID);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
}
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
- if (found) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- hresult->id = b;
-
- /*
- * it's all ours now.
- */
- bufHdr->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
- buf_state |= BM_TAG_VALID;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-
- *foundPtr = false;
- return bufHdr;
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
@@ -431,7 +444,7 @@ InitLocalBuffers(void)
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
- nextFreeLocalBuf = 0;
+ nextFreeLocalBufId = 0;
/* initialize fields that need to start off nonzero */
for (i = 0; i < nbufs; i++)
--
2.38.0
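To make the change to BufferAlloc()'s locking protocol in the above patch
easier to review, here is a condensed outline of the old and the new sequence.
This is pseudocode written for this mail, not text taken from the patch:

/*
 * BufferAlloc(), before:
 *    StrategyGetBuffer()            pick a victim, pin it
 *    flush the victim if dirty
 *    lock the old (if any) and new mapping partitions, both at once
 *    BufTableInsert(new tag)        collision -> unpin victim, use existing
 *    recheck the victim is still unpinned and clean, else undo and retry
 *    set the new tag, BufTableDelete(old tag)
 *    release both partition locks
 *
 * BufferAlloc(), after:
 *    GetVictimBuffer()
 *        StrategyGetBuffer()        pick a victim, pin it
 *        flush the victim if dirty
 *        InvalidateVictimBuffer()   lock only the victim's own partition,
 *                                   BufTableDelete(old tag), unlock; retry
 *                                   if somebody pinned or dirtied it
 *    lock the new mapping partition (only one partition lock held)
 *    BufTableInsert(new tag)        collision -> unpin victim, use existing
 *    set the victim's tag and flags
 *    release the new mapping partition lock
 */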
v5-0007-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch
From bb5e72f3c1dbea32199f92ec077c9b2c3bf3401b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v5 07/14] bufmgr: Support multiple in-progress IOs by using
resowner
---
src/include/storage/bufmgr.h | 2 +-
src/include/utils/resowner_private.h | 5 ++
src/backend/access/transam/xact.c | 4 +-
src/backend/postmaster/autovacuum.c | 1 -
src/backend/postmaster/bgwriter.c | 1 -
src/backend/postmaster/checkpointer.c | 1 -
src/backend/postmaster/walwriter.c | 1 -
src/backend/storage/buffer/bufmgr.c | 86 ++++++++++++---------------
src/backend/utils/resowner/resowner.c | 60 +++++++++++++++++++
9 files changed, 105 insertions(+), 56 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d30733c65a1..9ba25521ea1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -181,7 +181,7 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern void AbortBufferIO(void);
+extern void AbortBufferIO(Buffer buffer);
extern void BufmgrCommit(void);
extern bool BgBufferSync(struct WritebackContext *wb_context);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 1b1f3181b54..ae58438ec76 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -30,6 +30,11 @@ extern void ResourceOwnerEnlargeBuffers(ResourceOwner owner);
extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
+/* support for IO-in-progress management */
+extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
+extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
+extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
+
/* support for local lock management */
extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b8764012607..a0f53e9936a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2725,8 +2725,7 @@ AbortTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- /* Clean up buffer I/O and buffer context locks, too */
- AbortBufferIO();
+ /* Clean up buffer context locks, too */
UnlockBuffers();
/* Reset WAL record construction state */
@@ -5086,7 +5085,6 @@ AbortSubTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- AbortBufferIO();
UnlockBuffers();
/* Reset WAL record construction state */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 585d28148ca..e9ba0dc17cd 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -526,7 +526,6 @@ AutoVacLauncherMain(int argc, char *argv[])
*/
LWLockReleaseAll();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
/* this is probably dead code, but let's be safe: */
if (AuxProcessResourceOwner)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 9bb47da404d..caad642ec93 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -167,7 +167,6 @@ BackgroundWriterMain(void)
*/
LWLockReleaseAll();
ConditionVariableCancelSleep();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index aaad5c52281..ace9893d957 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -271,7 +271,6 @@ CheckpointerMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 513e580c513..65e84be39b9 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -163,7 +163,6 @@ WalWriterMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 02281303440..41153f99bc2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,10 +159,6 @@ int checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
int bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
-/* local state for StartBufferIO and related functions */
-static BufferDesc *InProgressBuf = NULL;
-static bool IsForInput;
-
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
@@ -2712,7 +2708,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();
@@ -4658,7 +4653,7 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
uint32 buf_state;
- Assert(!InProgressBuf);
+ ResourceOwnerEnlargeBufferIOs(CurrentResourceOwner);
for (;;)
{
@@ -4682,8 +4677,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
buf_state |= BM_IO_IN_PROGRESS;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = buf;
- IsForInput = forInput;
+ ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
return true;
}
@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
{
uint32 buf_state;
- Assert(buf == InProgressBuf);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
@@ -4722,13 +4715,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = NULL;
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
- * AbortBufferIO: Clean up any active buffer I/O after an error.
+ * AbortBufferIO: Clean up active buffer I/O after an error.
*
* All LWLocks we might have held have been released,
* but we haven't yet released buffer pins, so the buffer is still pinned.
@@ -4737,46 +4731,42 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
* possible the error condition wasn't related to the I/O.
*/
void
-AbortBufferIO(void)
+AbortBufferIO(Buffer buf)
{
- BufferDesc *buf = InProgressBuf;
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
- if (buf)
+ buf_state = LockBufHdr(buf_hdr);
+ Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+
+ if (!(buf_state & BM_VALID))
{
- uint32 buf_state;
-
- buf_state = LockBufHdr(buf);
- Assert(buf_state & BM_IO_IN_PROGRESS);
- if (IsForInput)
- {
- Assert(!(buf_state & BM_DIRTY));
-
- /* We'd better not think buffer is valid yet */
- Assert(!(buf_state & BM_VALID));
- UnlockBufHdr(buf, buf_state);
- }
- else
- {
- Assert(buf_state & BM_DIRTY);
- UnlockBufHdr(buf, buf_state);
- /* Issue notice if this is not the first failure... */
- if (buf_state & BM_IO_ERROR)
- {
- /* Buffer is pinned, so we can read tag without spinlock */
- char *path;
-
- path = relpathperm(BufTagGetRelFileLocator(&buf->tag),
- BufTagGetForkNum(&buf->tag));
- ereport(WARNING,
- (errcode(ERRCODE_IO_ERROR),
- errmsg("could not write block %u of %s",
- buf->tag.blockNum, path),
- errdetail("Multiple failures --- write error might be permanent.")));
- pfree(path);
- }
- }
- TerminateBufferIO(buf, false, BM_IO_ERROR);
+ Assert(!(buf_state & BM_DIRTY));
+ UnlockBufHdr(buf_hdr, buf_state);
}
+ else
+ {
+ Assert(buf_state & BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ /* Issue notice if this is not the first failure... */
+ if (buf_state & BM_IO_ERROR)
+ {
+ /* Buffer is pinned, so we can read tag without spinlock */
+ char *path;
+
+ path = relpathperm(BufTagGetRelFileLocator(&buf_hdr->tag),
+ BufTagGetForkNum(&buf_hdr->tag));
+ ereport(WARNING,
+ (errcode(ERRCODE_IO_ERROR),
+ errmsg("could not write block %u of %s",
+ buf_hdr->tag.blockNum, path),
+ errdetail("Multiple failures --- write error might be permanent.")));
+ pfree(path);
+ }
+ }
+
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR);
}
/*
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45d..fccc59b39dd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
/* We have built-in support for remembering: */
ResourceArray bufferarr; /* owned buffers */
+ ResourceArray bufferioarr; /* in-progress buffer IO */
ResourceArray catrefarr; /* catcache references */
ResourceArray catlistrefarr; /* catcache-list pins */
ResourceArray relrefarr; /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
ResourceArrayInit(&(owner->bufferarr), BufferGetDatum(InvalidBuffer));
+ ResourceArrayInit(&(owner->bufferioarr), BufferGetDatum(InvalidBuffer));
ResourceArrayInit(&(owner->catrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->catlistrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->relrefarr), PointerGetDatum(NULL));
@@ -517,6 +519,24 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (phase == RESOURCE_RELEASE_BEFORE_LOCKS)
{
+ /*
+ * Abort failed buffer IO. AbortBufferIO()->TerminateBufferIO() calls
+ * ResourceOwnerForgetBufferIO(), so we just have to iterate till
+ * there are none.
+ *
+ * Needs to be before we release buffer pins.
+ *
+ * During a commit, there shouldn't be any in-progress IO.
+ */
+ while (ResourceArrayGetAny(&(owner->bufferioarr), &foundres))
+ {
+ Buffer res = DatumGetBuffer(foundres);
+
+ if (isCommit)
+ elog(PANIC, "lost track of buffer IO on buffer %u", res);
+ AbortBufferIO(res);
+ }
+
/*
* Release buffer pins. Note that ReleaseBuffer will remove the
* buffer entry from our array, so we just have to iterate till there
@@ -746,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And it better not own any resources, either */
Assert(owner->bufferarr.nitems == 0);
+ Assert(owner->bufferioarr.nitems == 0);
Assert(owner->catrefarr.nitems == 0);
Assert(owner->catlistrefarr.nitems == 0);
Assert(owner->relrefarr.nitems == 0);
@@ -775,6 +796,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And free the object. */
ResourceArrayFree(&(owner->bufferarr));
+ ResourceArrayFree(&(owner->bufferioarr));
ResourceArrayFree(&(owner->catrefarr));
ResourceArrayFree(&(owner->catlistrefarr));
ResourceArrayFree(&(owner->relrefarr));
@@ -976,6 +998,44 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
buffer, owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * buffer IO array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeBufferIOs(ResourceOwner owner)
+{
+ /* We used to allow pinning buffers without a resowner, but no more */
+ Assert(owner != NULL);
+ ResourceArrayEnlarge(&(owner->bufferioarr));
+}
+
+/*
+ * Remember that a buffer IO is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeBufferIOs()
+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+
+/*
+ * Forget that a buffer IO is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ if (!ResourceArrayRemove(&(owner->bufferioarr), BufferGetDatum(buffer)))
+ elog(PANIC, "buffer IO %d is not owned by resource owner %s",
+ buffer, owner->name);
+}
+
/*
* Remember that a Local Lock is owned by a ResourceOwner
*
--
2.38.0
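Before the next patch: to show how the new interface ends up being used, here
is a sketch of a hypothetical caller of ExtendBufferedRelBy(). The function
below and its surrounding logic are made up for illustration; the
ExtendBufferedRelBy() signature, the EB_REL() macro and the EB_LOCK_FIRST /
EB_SKIP_EXTENSION_LOCK flags are the ones declared in the patch that follows.

static Buffer
example_bulk_extend(Relation rel, uint32 npages, BlockNumber *first_block)
{
    Buffer      buffers[64];
    uint32      extended_by = 0;

    Assert(npages > 0 && npages <= lengthof(buffers));

    /*
     * Request npages new pinned buffers. The first one is returned
     * exclusively locked because of EB_LOCK_FIRST. The relation extension
     * lock is acquired and released inside bufmgr.c (unless
     * EB_SKIP_EXTENSION_LOCK is passed).
     */
    *first_block = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM,
                                       NULL,    /* no buffer access strategy */
                                       EB_LOCK_FIRST, npages,
                                       buffers, &extended_by);

    /* the relation may have been extended by fewer pages than requested */
    for (uint32 i = 1; i < extended_by; i++)
        ReleaseBuffer(buffers[i]);

    /* buffers[0] is pinned and exclusively locked; initialize that page */
    return buffers[0];
}

The P_NEW / RBM_ZERO_AND_LOCK path in ReadBuffer_common() goes through the
same code, via the single-block wrapper ExtendBufferedRel().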
v5-0008-bufmgr-Move-relation-extension-handling-into-Exte.patch
From 8f256d4f260921a93ad9b0f66d4a97218036059d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v5 08/14] bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}
---
src/include/pgstat.h | 1 +
src/include/storage/buf_internals.h | 7 +
src/include/storage/bufmgr.h | 65 ++
src/backend/storage/buffer/bufmgr.c | 782 +++++++++++++++++++------
src/backend/storage/buffer/localbuf.c | 156 ++++-
src/backend/utils/activity/pgstat_io.c | 8 +-
src/backend/utils/probes.d | 6 +-
doc/src/sgml/monitoring.sgml | 43 +-
8 files changed, 886 insertions(+), 182 deletions(-)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c2bae8358a2..4aac1115664 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -500,6 +500,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
BackendType bktype);
extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
extern PgStat_IO *pgstat_fetch_stat_io(void);
extern const char *pgstat_get_io_context_name(IOContext io_context);
extern const char *pgstat_get_io_object_name(IOObject io_object);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index fa5c451b1a9..feca19f5620 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -422,6 +422,13 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
+extern BlockNumber ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 9ba25521ea1..da0abf43150 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -60,6 +60,53 @@ typedef struct PrefetchBufferResult
bool initiated_io; /* If true, a miss resulting in async I/O */
} PrefetchBufferResult;
+/*
+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+ /*
+ * Don't acquire extension lock. This is safe only if the relation isn't
+ * shared, an access exclusive lock is held or if this is the startup
+ * process.
+ */
+ EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+ /* Is this extension part of recovery? */
+ EB_PERFORMING_RECOVERY = (1 << 1),
+
+ /*
+ * Should the fork be created if it does not currently exist? This likely
+ * only ever makes sense for relation forks.
+ */
+ EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+ /* Should the first (possibly only) return buffer be returned locked? */
+ EB_LOCK_FIRST = (1 << 3),
+
+ /* Should the smgr size cache be cleared? */
+ EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+ /* internal flags follow */
+ EB_LOCK_TARGET = (1 << 5),
+} ExtendBufferedFlags;
+
+/*
+ * To identify the relation - either relation or smgr + relpersistence has to
+ * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
+ * to use the same function for both crash recovery and normal operation.
+ */
+typedef struct ExtendBufferedWhat
+{
+ Relation rel;
+ struct SMgrRelationData *smgr;
+ char relpersistence;
+} ExtendBufferedWhat;
+
+#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
+#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
@@ -138,6 +185,24 @@ extern void BufferCheckWePinOnce(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
+extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags);
+extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by);
+extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode);
+
extern void InitBufferPoolAccess(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 41153f99bc2..699eab5243d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
@@ -450,6 +451,22 @@ static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
+static BlockNumber ExtendBufferedRelCommon(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
+static BlockNumber ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(BufferDesc *buf);
static void UnpinBuffer(BufferDesc *buf);
@@ -785,6 +802,179 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
mode, strategy, &hit);
}
+/*
+ * Convenience wrapper around ExtendBufferedRelBy() extending by one block.
+ */
+Buffer
+ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags)
+{
+ Buffer buf;
+ uint32 extend_by = 1;
+
+ ExtendBufferedRelBy(eb, forkNum, strategy, flags, extend_by,
+ &buf, &extend_by);
+
+ return buf;
+}
+
+/*
+ * Extend relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources, the relation may end up being extended by a
+ * smaller number of pages (unless an error is thrown, always by at least one
+ * page). *extended_by is updated to the number of pages the relation has been
+ * extended by.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first extend_by array elements will point to a pinned
+ * buffer.
+ *
+ * If EB_LOCK_FIRST is part of flags, the first returned buffer is
+ * locked. This is useful for callers that want a buffer that is guaranteed to
+ * be empty.
+ */
+BlockNumber
+ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_by > 0);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ return ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ extend_by, InvalidBlockNumber,
+ buffers, extended_by);
+}
+
+/*
+ * Extend the relation so it is at least extend_to blocks large, read buffer
+ * (extend_to - 1).
+ *
+ * This is useful for callers that want to write a specific page, regardless
+ * of the current size of the relation (e.g. for the visibilitymap and for
+ * crash recovery).
+ */
+Buffer
+ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode)
+{
+ BlockNumber current_size;
+ uint32 extended_by = 0;
+ Buffer buffer = InvalidBuffer;
+ Buffer buffers[64];
+
+ Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_to != InvalidBlockNumber && extend_to > 0);
+ Assert(mode == RBM_NORMAL || mode == RBM_ZERO_ON_ERROR ||
+ mode == RBM_ZERO_AND_LOCK);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ /*
+ * If desired, create the file if it doesn't exist. If
+ * smgr_cached_nblocks[fork] is positive then it must exist, no need for
+ * an smgrexists call.
+ */
+ if ((flags & EB_CREATE_FORK_IF_NEEDED) &&
+ (eb.smgr->smgr_cached_nblocks[fork] == 0 ||
+ eb.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) &&
+ !smgrexists(eb.smgr, fork))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* could have been closed while waiting for lock */
+ eb.smgr = RelationGetSmgr(eb.rel);
+
+ /* recheck, fork might have been created concurrently */
+ if (!smgrexists(eb.smgr, fork))
+ smgrcreate(eb.smgr, fork, flags & EB_PERFORMING_RECOVERY);
+
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ /*
+ * Estimate how many pages we'll need to extend by. This avoids acquiring
+ * unnecessarily many victim buffers.
+ */
+ current_size = smgrnblocks(eb.smgr, fork);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_TARGET;
+
+ while (current_size < extend_to)
+ {
+ uint32 num_pages = lengthof(buffers);
+ BlockNumber first_block;
+
+ if ((uint64) current_size + num_pages > extend_to)
+ num_pages = extend_to - current_size;
+
+ first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ num_pages, extend_to,
+ buffers, &extended_by);
+
+ current_size = first_block + extended_by;
+ Assert(current_size <= extend_to);
+ Assert(num_pages != 0 || current_size >= extend_to);
+
+ for (int i = 0; i < extended_by; i++)
+ {
+ if (first_block + i != extend_to - 1)
+ ReleaseBuffer(buffers[i]);
+ else
+ buffer = buffers[i];
+ }
+ }
+
+ /*
+ * It's possible that another backend concurrently extended the
+ * relation. In that case read the buffer.
+ *
+ * XXX: Should we control this via a flag?
+ */
+ if (buffer == InvalidBuffer)
+ {
+ bool hit;
+
+ Assert(extended_by == 0);
+ buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+ fork, extend_to - 1, mode, strategy,
+ &hit);
+ }
+
+ return buffer;
+}
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool found;
IOContext io_context;
IOObject io_object;
- bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
*hit = false;
+ /*
+ * Backward compatibility path; most code should use
+ * ExtendBufferedRel() instead, as acquiring the extension lock
+ * inside ExtendBufferedRel() scales a lot better.
+ */
+ if (unlikely(blockNum == P_NEW))
+ {
+ uint32 flags = EB_SKIP_EXTENSION_LOCK;
+
+ Assert(mode == RBM_NORMAL || mode == RBM_ZERO_AND_LOCK);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_FIRST;
+
+ return ExtendBufferedRel(EB_SMGR(smgr, relpersistence),
+ forkNum, strategy, flags);
+ }
+
/* Make sure we will have room to remember the buffer pin */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
- isExtend = (blockNum == P_NEW);
-
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend);
-
- /* Substitute proper block number if caller asked for P_NEW */
- if (isExtend)
- {
- blockNum = smgrnblocks(smgr, forkNum);
- /* Fail if relation is already at maximum possible length */
- if (blockNum == P_NEW)
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("cannot extend relation %s beyond %u blocks",
- relpath(smgr->smgr_rlocator, forkNum),
- P_NEW)));
- }
+ smgr->smgr_rlocator.backend);
if (isLocalBuf)
{
@@ -843,8 +1034,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
- else if (isExtend)
- pgBufferUsage.local_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.local_blks_read++;
@@ -859,8 +1048,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
- else if (isExtend)
- pgBufferUsage.shared_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.shared_blks_read++;
@@ -871,103 +1058,40 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
- if (!isExtend)
- {
- /* Just need to update stats before we exit */
- *hit = true;
- VacuumPageHit++;
+ /* Just need to update stats before we exit */
+ *hit = true;
+ VacuumPageHit++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageHit;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageHit;
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend,
- found);
-
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
- * locked on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
-
- return BufferDescriptorGetBuffer(bufHdr);
- }
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ found);
/*
- * We get here only in the corner case where we are trying to extend
- * the relation but we found a pre-existing buffer marked BM_VALID.
- * This can happen because mdread doesn't complain about reads beyond
- * EOF (when zero_damaged_pages is ON) and so a previous attempt to
- * read a block beyond EOF could have left a "valid" zero-filled
- * buffer. Unfortunately, we have also seen this case occurring
- * because of buggy Linux kernels that sometimes return an
- * lseek(SEEK_END) result that doesn't account for a recent write. In
- * that situation, the pre-existing buffer would contain valid data
- * that we don't want to overwrite. Since the legitimate case should
- * always have left a zero-filled buffer, complain if not PageIsNew.
+ * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
+ * on return.
*/
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (!PageIsNew((Page) bufBlock))
- ereport(ERROR,
- (errmsg("unexpected data beyond EOF in block %u of relation %s",
- blockNum, relpath(smgr->smgr_rlocator, forkNum)),
- errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
-
- /*
- * We *must* do smgrextend before succeeding, else the page will not
- * be reserved by the kernel, and the next P_NEW call will decide to
- * return the same page. Clear the BM_VALID bit, do the StartBufferIO
- * call that BufferAlloc didn't, and proceed.
- */
- if (isLocalBuf)
+ if (!isLocalBuf)
{
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_EXCLUSIVE);
+ else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
}
- else
- {
- /*
- * Loop to handle the very small possibility that someone re-sets
- * BM_VALID between our clearing it and StartBufferIO inspecting
- * it.
- */
- do
- {
- uint32 buf_state = LockBufHdr(bufHdr);
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
- } while (!StartBufferIO(bufHdr, true));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
* if we have gotten to this point, we have allocated a buffer for the
* page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
* if it's a shared buffer.
- *
- * Note: if smgrextend fails, we will end up with a buffer that is
- * allocated but not marked BM_VALID. P_NEW will still select the same
- * block number (because the relation didn't get any longer on disk) and
- * so future attempts to extend the relation will find the same buffer (if
- * it's not been recycled) but come right back here to try smgrextend
- * again.
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
@@ -982,72 +1106,51 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
io_object = IOOBJECT_RELATION;
}
- if (isExtend)
- {
- /* new buffers are zero-filled */
+ /*
+ * Read in the page, unless the caller intends to overwrite it and just
+ * wants us to allocate a buffer.
+ */
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
- /* don't set checksum for all-zero page */
- smgrextend(smgr, forkNum, blockNum, bufBlock, false);
-
- pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
-
- /*
- * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
- * although we're essentially performing a write. At least on linux
- * doing so defeats the 'delayed allocation' mechanism, leading to
- * increased file fragmentation.
- */
- }
else
{
- /*
- * Read in the page, unless the caller intends to overwrite it and
- * just wants us to allocate a buffer.
- */
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
- else
+ instr_time io_start,
+ io_time;
+
+ if (track_io_timing)
+ INSTR_TIME_SET_CURRENT(io_start);
+
+ smgrread(smgr, forkNum, blockNum, bufBlock);
+
+ if (track_io_timing)
{
- instr_time io_start,
- io_time;
+ INSTR_TIME_SET_CURRENT(io_time);
+ INSTR_TIME_SUBTRACT(io_time, io_start);
+ pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+ INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+ }
- if (track_io_timing)
- INSTR_TIME_SET_CURRENT(io_start);
+ pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
- INSTR_TIME_SET_ZERO(io_start);
-
- smgrread(smgr, forkNum, blockNum, bufBlock);
-
- pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
- if (track_io_timing)
- {
- INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_SUBTRACT(io_time, io_start);
- pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
- INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
}
}
@@ -1090,7 +1193,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
smgr->smgr_rlocator.backend,
- isExtend,
found);
return BufferDescriptorGetBuffer(bufHdr);
@@ -1640,6 +1742,354 @@ again:
return buf;
}
+/*
+ * Limit the number of pins a batch operation may additionally acquire, to
+ * avoid running out of pinnable buffers.
+ *
+ * One additional pin is always allowed, as otherwise the operation likely
+ * cannot be performed at all.
+ *
+ * The number of allowed pins for a backend is computed based on
+ * shared_buffers and the maximum number of connections possible. That's very
+ * pessimistic, but outside of toy-sized shared_buffers it should allow
+ * sufficient pins.
+ */
+static void
+LimitAdditionalPins(uint32 *additional_pins)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ /*
+ * Subtract the approximate number of buffers already pinned by this
+ * backend. We get the number of "overflowed" pins for free, but don't
+ * know the number of pins in PrivateRefCountArray. The cost of
+ * calculating that exactly doesn't seem worth it, so just assume the max.
+ */
+ max_proportional_pins -= PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+ if (max_proportional_pins < 0)
+ max_proportional_pins = 1;
+
+ if (*additional_pins > max_proportional_pins)
+ *additional_pins = max_proportional_pins;
+}
+
+/*
+ * Logic shared between ExtendBufferedRelBy() and ExtendBufferedRelTo(), just to
+ * avoid duplicating the tracing and relpersistence related logic.
+ */
+static BlockNumber
+ExtendBufferedRelCommon(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ extend_by);
+
+ if (eb.relpersistence == RELPERSISTENCE_TEMP)
+ first_block = ExtendBufferedRelLocal(eb, fork, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ else
+ first_block = ExtendBufferedRelShared(eb, fork, strategy, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ *extended_by = extend_by;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ *extended_by,
+ first_block);
+
+ return first_block;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * shared buffers.
+ */
+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+ IOContext io_context = IOContextForStrategy(strategy);
+
+ LimitAdditionalPins(&extend_by);
+
+ /*
+ * Acquire victim buffers for extension without holding extension lock.
+ * Writing out victim buffers is the most expensive part of extending the
+ * relation, particularly when doing so requires WAL flushes. Zeroing out
+ * the buffers is also quite expensive, so do that before holding the
+ * extension lock as well.
+ *
+ * These pages are pinned by us and not valid. While we hold the pin they
+ * can't be acquired as victim buffers by another backend.
+ */
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ Block buf_block;
+
+ buffers[i] = GetVictimBuffer(strategy, io_context);
+ buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ /*
+ * Lock relation against concurrent extensions, unless requested not to.
+ *
+ * We use the same extension lock for all forks. That's unnecessarily
+ * restrictive, but currently extensions for forks don't happen often
+ * enough to make it worth locking more granularly.
+ *
+ * Note that another backend might have extended the relation by the time
+ * we get the lock.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+ eb.smgr = RelationGetSmgr(eb.rel);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ if (extend_upto != InvalidBlockNumber)
+ {
+ uint32 old_num_pages = extend_by;
+
+ if (first_block > extend_upto)
+ extend_by = 0;
+ else if ((uint64) first_block + extend_by > extend_upto)
+ extend_by = extend_upto - first_block;
+
+ for (uint32 i = extend_by; i < old_num_pages; i++)
+ {
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused;
+ * let it be found again quickly.
+ */
+ StrategyFreeBuffer(buf_hdr);
+ UnpinBuffer(buf_hdr);
+ }
+
+ if (extend_by == 0)
+ {
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ *extended_by = extend_by;
+ return first_block;
+ }
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ /*
+ * Insert buffers into buffer table, mark as IO_IN_PROGRESS.
+ *
+ * This needs to happen before we extend the relation, because as soon as
+ * we do, other backends can start to read in those pages.
+ */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer victim_buf = buffers[i];
+ BufferDesc *victim_buf_hdr = GetBufferDescriptor(victim_buf - 1);
+ BufferTag tag;
+ uint32 hash;
+ LWLock *partition_lock;
+ int existing_id;
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ existing_id = BufTableInsert(&tag, hash, victim_buf_hdr->buf_id);
+
+ /*
+ * We get here only in the corner case where we are trying to extend
+ * the relation but we found a pre-existing buffer. This can happen
+ * because a prior attempt at extending the relation failed, and
+ * because mdread doesn't complain about reads beyond EOF (when
+ * zero_damaged_pages is ON) and so a previous attempt to read a block
+ * beyond EOF could have left a "valid" zero-filled buffer.
+ * Unfortunately, we have also seen this case occurring because of
+ * buggy Linux kernels that sometimes return an lseek(SEEK_END) result
+ * that doesn't account for a recent write. In that situation, the
+ * pre-existing buffer would contain valid data that we don't want to
+ * overwrite. Since the legitimate cases should always have left a
+ * zero-filled buffer, complain if not PageIsNew.
+ */
+ if (existing_id >= 0)
+ {
+ BufferDesc *existing_hdr = GetBufferDescriptor(existing_id);
+ Block buf_block;
+ bool valid;
+
+ /*
+ * Pin the existing buffer before releasing the partition lock,
+ * preventing it from being evicted.
+ */
+ valid = PinBuffer(existing_hdr, strategy);
+
+ LWLockRelease(partition_lock);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused;
+ * let it be found again quickly.
+ */
+ StrategyFreeBuffer(victim_buf_hdr);
+ UnpinBuffer(victim_buf_hdr);
+
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+ buf_block = BufHdrGetBlock(existing_hdr);
+
+ if (valid && !PageIsNew((Page) buf_block))
+ ereport(ERROR,
+ (errmsg("unexpected data beyond EOF in block %u of relation %s",
+ existing_hdr->tag.blockNum, relpath(eb.smgr->smgr_rlocator, fork)),
+ errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
+
+ /*
+ * We *must* do smgr[zero]extend before succeeding, else the page
+ * will not be reserved by the kernel, and the next P_NEW call
+ * will decide to return the same page. Clear the BM_VALID bit,
+ * do StartBufferIO() and proceed.
+ *
+ * Loop to handle the very small possibility that someone re-sets
+ * BM_VALID between our clearing it and StartBufferIO inspecting
+ * it.
+ */
+ do
+ {
+ uint32 buf_state = LockBufHdr(existing_hdr);
+
+ buf_state &= ~BM_VALID;
+ UnlockBufHdr(existing_hdr, buf_state);
+ } while (!StartBufferIO(existing_hdr, true));
+ }
+ else
+ {
+ uint32 buf_state;
+
+ buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (eb.relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
+ UnlockBufHdr(victim_buf_hdr, buf_state);
+
+ LWLockRelease(partition_lock);
+
+ /* XXX: could combine the locked operations in StartBufferIO() with the above */
+ StartBufferIO(victim_buf_hdr, true);
+ }
+ }
+
+ /*
+ * Note: if smgrzeroextend fails, we will end up with buffers that are
+ * allocated but not marked BM_VALID. The next relation extension will
+ * still select the same block number (because the relation didn't get any
+ * longer on disk) and so future attempts to extend the relation will find
+ * the same buffers (if they have not been recycled) but come right back
+ * here to try smgrzeroextend again.
+ *
+ * We don't need to set checksum for all-zero pages.
+ */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ *
+ * We remove IO_IN_PROGRESS after this, as waking up waiting backends can
+ * take noticeable time.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ bool lock = false;
+
+ if (flags & EB_LOCK_FIRST && i == 0)
+ lock = true;
+ else if (flags & EB_LOCK_TARGET)
+ {
+ Assert(extend_upto != InvalidBlockNumber);
+ if (first_block + i + 1 == extend_upto)
+ lock = true;
+ }
+
+ if (lock)
+ LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+
+ TerminateBufferIO(buf_hdr, false, BM_VALID);
+ }
+
+ pgBufferUsage.shared_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
+ extend_by);
+
+ *extended_by = extend_by;
+
+ return first_block;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5b44b0be8b5..0528fddf992 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -49,6 +49,9 @@ static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
+/* number of local buffers pinned at least once */
+static int NLocalPinnedBuffers = 0;
+
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
@@ -280,6 +283,154 @@ GetLocalVictimBuffer(void)
return BufferDescriptorGetBuffer(bufHdr);
}
+/* see LimitAdditionalPins() */
+static void
+LimitAdditionalLocalPins(uint32 *additional_pins)
+{
+ uint32 max_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ /*
+ * In contrast to LimitAdditionalPins(), other backends don't play a role
+ * here. We can allow up to NLocBuffer pins in total.
+ */
+ max_pins = (NLocBuffer - NLocalPinnedBuffers);
+
+ if (*additional_pins >= max_pins)
+ *additional_pins = max_pins;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * temporary buffers.
+ */
+BlockNumber
+ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ /* Initialize local buffers if first request in this session */
+ if (LocalBufHash == NULL)
+ InitLocalBuffers();
+
+ LimitAdditionalLocalPins(&extend_by);
+
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ BufferDesc *buf_hdr;
+ Block buf_block;
+
+ buffers[i] = GetLocalVictimBuffer();
+ buf_hdr = GetLocalBufferDescriptor(-(buffers[i] + 1));
+ buf_block = LocalBufHdrGetBlock(buf_hdr);
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ if (extend_upto != InvalidBlockNumber)
+ {
+ /*
+ * In contrast to shared relations, nothing could change the relation
+ * size concurrently. Thus we shouldn't end up finding that we don't
+ * need to do anything.
+ */
+ Assert(first_block <= extend_upto);
+
+ Assert((uint64) first_block + extend_by <= extend_upto);
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ int victim_buf_id;
+ BufferDesc *victim_buf_hdr;
+ BufferTag tag;
+ LocalBufferLookupEnt *hresult;
+ bool found;
+
+ victim_buf_id = -(buffers[i] + 1);
+ victim_buf_hdr = GetLocalBufferDescriptor(victim_buf_id);
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
+ if (found)
+ {
+ BufferDesc *existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ uint32 buf_state;
+
+ UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));
+
+ existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ PinLocalBuffer(existing_hdr, false);
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+
+ buf_state = pg_atomic_read_u32(&existing_hdr->state);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_DIRTY));
+ buf_state &= ~BM_VALID;
+ pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
+ }
+ else
+ {
+ uint32 buf_state = pg_atomic_read_u32(&victim_buf_hdr->state);
+
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&victim_buf_hdr->state, buf_state);
+
+ hresult->id = victim_buf_id;
+ }
+ }
+
+ /* actually extend relation */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr;
+ uint32 buf_state;
+
+ buf_hdr = GetLocalBufferDescriptor(-(buf + 1));
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ *extended_by = extend_by;
+
+ pgBufferUsage.temp_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
+ extend_by);
+
+ return first_block;
+}
+
/*
* MarkLocalBufferDirty -
* mark a local buffer dirty
@@ -494,6 +645,7 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
+ NLocalPinnedBuffers++;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
@@ -515,9 +667,11 @@ UnpinLocalBuffer(Buffer buffer)
Assert(BufferIsLocal(buffer));
Assert(LocalRefCount[buffid] > 0);
+ Assert(NLocalPinnedBuffers > 0);
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[buffid]--;
+ if (--LocalRefCount[buffid] == 0)
+ NLocalPinnedBuffers--;
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d5546101..f2f6eae8031 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -64,13 +64,19 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
void
pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ pgstat_count_io_op_n(io_object, io_context, io_op, 1);
+}
+
+void
+pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt)
{
Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES);
Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES);
Assert((unsigned int) io_op < IOOP_NUM_TYPES);
Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
- PendingIOStats.data[io_object][io_context][io_op]++;
+ PendingIOStats.data[io_object][io_context][io_op] += cnt;
have_iostats = true;
}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index c064d679e94..fd3df2f7900 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -55,10 +55,12 @@ provider postgresql {
probe sort__start(int, bool, int, int, bool, int);
probe sort__done(bool, long);
- probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
- probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool);
+ probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+ probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
probe buffer__flush__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
+ probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+ probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
probe buffer__checkpoint__start(int);
probe buffer__checkpoint__sync__start();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b2ccd8d7fef..00df5fcd9ba 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7761,33 +7761,52 @@ FROM pg_stat_get_backend_idset() AS backendid;
<entry>Probe that fires when the two-phase portion of a checkpoint is
complete.</entry>
</row>
+ <row>
+ <entry><literal>buffer-extend-start</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int)</literal></entry>
+ <entry>Probe that fires when a relation extension starts.
+ arg0 contains the fork to be extended. arg1, arg2, and arg3 contain the
+ tablespace, database, and relation OIDs identifying the relation. arg4
+ is the ID of the backend which created the temporary relation for a
+ local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the caller would like to extend
+ by.</entry>
+ </row>
+ <row>
+ <entry><literal>buffer-extend-done</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber)</literal></entry>
+ <entry>Probe that fires when a relation extension is complete.
+ arg0 contains the fork to be extended. arg1, arg2, and arg3 contain the
+ tablespace, database, and relation OIDs identifying the relation. arg4
+ is the ID of the backend which created the temporary relation for a
+ local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the relation was extended by; this
+ can be less than the number requested in the corresponding
+ <literal>buffer-extend-start</literal> probe due to resource
+ constraints. arg6 contains the BlockNumber of the first new
+ block.</entry>
+ </row>
<row>
<entry><literal>buffer-read-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int)</literal></entry>
<entry>Probe that fires when a buffer read is started.
- arg0 and arg1 contain the fork and block numbers of the page (but
- arg1 will be -1 if this is a relation extension request).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.</entry>
+ </entry>
</row>
<row>
<entry><literal>buffer-read-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
<entry>Probe that fires when a buffer read is complete.
- arg0 and arg1 contain the fork and block numbers of the page (if this
- is a relation extension request, arg1 now contains the block number
- of the newly added block).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.
- arg7 is true if the buffer was found in the pool, false if not.</entry>
+ arg6 is true if the buffer was found in the pool, false if not.</entry>
</row>
<row>
<entry><literal>buffer-flush-start</literal></entry>
--
2.38.0
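For illustration only (not part of any patch in this series): a minimal sketch of what a caller of the new API looks like, mirroring the conversions in the follow-up patch below. ExtendBufferedRel(), EB_REL() and the EB_* flags are the additions from the patch above; example_new_buffer() is a made-up name.

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

static Buffer
example_new_buffer(Relation rel)
{
	/*
	 * Returns a pinned buffer for a newly added block; EB_LOCK_FIRST also
	 * returns it exclusively locked. The relation extension lock is acquired
	 * and released inside bufmgr.c, around a much narrower stretch of work
	 * than the old LockRelationForExtension() + ReadBuffer(rel, P_NEW)
	 * sequence.
	 */
	return ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
							 EB_LOCK_FIRST);
}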
v5-0009-Convert-a-few-places-to-ExtendBufferedRel.patch
From 9f92ad2d5e09fab6d2348ba24fcf33c6d1fc7572 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:18:18 -0700
Subject: [PATCH v5 09/14] Convert a few places to ExtendBufferedRel
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/access/brin/brin.c | 9 +++----
src/backend/access/brin/brin_pageops.c | 4 +++
src/backend/access/brin/brin_revmap.c | 15 +++---------
src/backend/access/gin/gininsert.c | 10 +++-----
src/backend/access/gin/ginutil.c | 13 ++--------
src/backend/access/gin/ginvacuum.c | 8 ++++++
src/backend/access/gist/gist.c | 4 +--
src/backend/access/gist/gistutil.c | 14 ++---------
src/backend/access/gist/gistvacuum.c | 3 +++
src/backend/access/nbtree/nbtpage.c | 34 +++++++-------------------
src/backend/access/nbtree/nbtree.c | 3 +++
src/backend/access/spgist/spgutils.c | 13 ++--------
contrib/bloom/blutils.c | 12 ++-------
13 files changed, 48 insertions(+), 94 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 53e4721a54e..41bf950a4af 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -837,9 +837,9 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
- meta = ReadBuffer(index, P_NEW);
+ meta = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
- LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);
@@ -904,9 +904,8 @@ brinbuildempty(Relation index)
Buffer metabuf;
/* An empty BRIN index has a metapage only. */
- metabuf =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metabuf = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index ad5a89bd051..b578d259545 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -730,6 +730,10 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* There's not enough free space in any existing index page,
* according to the FSM: extend the relation to obtain a shiny new
* page.
+ *
+ * XXX: It's likely possible to use RBM_ZERO_AND_LOCK here,
+ * which'd avoid the need to hold the extension lock during buffer
+ * reclaim.
*/
if (!RELATION_IS_LOCAL(irel))
{
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 7fc5226bf74..f4271ba31c9 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -538,7 +538,6 @@ revmap_physical_extend(BrinRevmap *revmap)
BlockNumber mapBlk;
BlockNumber nblocks;
Relation irel = revmap->rm_irel;
- bool needLock = !RELATION_IS_LOCAL(irel);
/*
* Lock the metapage. This locks out concurrent extensions of the revmap,
@@ -570,10 +569,8 @@ revmap_physical_extend(BrinRevmap *revmap)
}
else
{
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buf = ReadBuffer(irel, P_NEW);
+ buf = ExtendBufferedRel(EB_REL(irel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
if (BufferGetBlockNumber(buf) != mapBlk)
{
/*
@@ -582,17 +579,11 @@ revmap_physical_extend(BrinRevmap *revmap)
* up and have caller start over. We will have to evacuate that
* page from under whoever is using it.
*/
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buf);
+ UnlockReleaseBuffer(buf);
return;
}
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buf);
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d5d748009ea..be1841de7bf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -440,12 +440,10 @@ ginbuildempty(Relation index)
MetaBuffer;
/* An empty GIN index has two pages. */
- MetaBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- RootBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ MetaBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+ RootBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer and root buffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 03fec1704e9..3c6f87236c5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -299,7 +299,6 @@ Buffer
GinNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -327,16 +326,8 @@ GinNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, GIN_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index e5d310d8362..13251d7e07d 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -736,6 +736,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
needLock = !RELATION_IS_LOCAL(index);
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
npages = RelationGetNumberOfBlocks(index);
@@ -786,6 +790,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
stats->num_pages = RelationGetNumberOfBlocks(index);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ea72bcce1bc..d3539e96d87 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -134,8 +134,8 @@ gistbuildempty(Relation index)
Buffer buffer;
/* Initialize the root page */
- buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_SKIP_EXTENSION_LOCK | EB_LOCK_FIRST);
/* Initialize and xlog buffer */
START_CRIT_SECTION();
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index b4d843a0ff1..446fcb3cc5e 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -824,7 +824,6 @@ Buffer
gistNewBuffer(Relation r)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -877,17 +876,8 @@ gistNewBuffer(Relation r)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(r);
-
- if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
-
- buffer = ReadBuffer(r, P_NEW);
- LockBuffer(buffer, GIST_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(r), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3f60d3274d2..cc711b04986 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -203,6 +203,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* we must already have processed any tuples due to be moved into such a
* page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 3feee28d197..c4f16b3d2b7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -881,7 +881,6 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
else
{
- bool needLock;
Page page;
Assert(access == BT_WRITE);
@@ -962,31 +961,16 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
/*
- * Extend the relation by one page.
- *
- * We have to use a lock to ensure no one else is extending the rel at
- * the same time, else we will both try to initialize the same new
- * page. We can skip locking for new or temp relations, however,
- * since no one else could be accessing them.
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
+ * we risk a race condition against btvacuumscan --- see comments
+ * therein. This forces us to repeat the valgrind request that
+ * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
+ * without introducing a race.
*/
- needLock = !RELATION_IS_LOCAL(rel);
-
- if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
-
- buf = ReadBuffer(rel, P_NEW);
-
- /* Acquire buffer lock on new page */
- _bt_lockbuf(rel, buf, BT_WRITE);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to
- * extend the relation some more. Note that we cannot release this
- * lock before we have buffer lock on the new page, or we risk a race
- * condition against btvacuumscan --- see comments therein.
- */
- if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a68dd07534f..86010443269 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -970,6 +970,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* write-lock on the left page before it adds a right page, so we must
* already have processed any tuples due to be moved into such a page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4e7ff1d1603..190e4f76a9e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -366,7 +366,6 @@ Buffer
SpGistNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -406,16 +405,8 @@ SpGistNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a6d9f09f315..d935ed8fbdf 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -353,7 +353,6 @@ Buffer
BloomNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -387,15 +386,8 @@ BloomNewBuffer(Relation index)
}
/* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
--
2.38.0
v5-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch
From 58cf78333a8fcb15f45a2b60174c8fee989b785b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v5 10/14] heapam: Add num_pages to RelationGetBufferForTuple()
This will be useful to compute the number of pages to extend a relation by.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/access/hio.h | 14 +++++++-
src/backend/access/heap/heapam.c | 59 +++++++++++++++++++++++++++++---
src/backend/access/heap/hio.c | 8 ++++-
3 files changed, 75 insertions(+), 6 deletions(-)
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 3f20b585326..7f877817597 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -30,6 +30,17 @@ typedef struct BulkInsertStateData
{
BufferAccessStrategy strategy; /* our BULKWRITE strategy object */
Buffer current_buf; /* current insertion target page */
+
+ /*
+ * State for bulk extensions. Further pages that were unused at the time
+ * of the extension. They might be in use by the time we use them though,
+ * so rechecks are needed.
+ *
+ * FIXME: Perhaps these should live in RelationData instead, alongside the
+ * targetblock?
+ */
+ BlockNumber next_free;
+ BlockNumber last_free;
} BulkInsertStateData;
@@ -38,6 +49,7 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertStateData *bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages);
#endif /* HIO_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8abc101c8cb..e1ca6938cd0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1774,6 +1774,8 @@ GetBulkInsertState(void)
bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
bistate->current_buf = InvalidBuffer;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
return bistate;
}
@@ -1847,7 +1849,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ 0);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2050,6 +2053,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
return tup;
}
+/*
+ * Helper for heap_multi_insert() that computes the number of full pages needed to insert the remaining tuples
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+ size_t page_avail;
+ int npages = 0;
+
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ npages++;
+
+ for (int i = done; i < ntuples; i++)
+ {
+ size_t tup_sz = sizeof(ItemIdData) + MAXALIGN(heaptuples[i]->t_len);
+
+ if (page_avail < tup_sz)
+ {
+ npages++;
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ }
+ page_avail -= tup_sz;
+ }
+
+ return npages;
+}
+
/*
* heap_multi_insert - insert multiple tuples into a heap
*
@@ -2076,6 +2106,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ bool starting_with_empty_page = false;
+ int npages = 0;
+ int npages_used = 0;
/* currently not needed (thus unsupported) for heap_multi_insert() */
Assert(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2126,13 +2159,29 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- bool starting_with_empty_page;
bool all_visible_cleared = false;
bool all_frozen_set = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
+ /*
+ * Compute number of pages needed to insert tuples in the worst case.
+ * This will be used to determine how much to extend the relation by
+ * in RelationGetBufferForTuple(), if needed. If we filled a prior
+ * page from scratch, we can just update our last computation, but if
+ * we started with a partially filled page, recompute from scratch, as
+ * the number of potentially required pages can vary due to tuples
+ * needing to fit onto the page, page headers, etc.
+ */
+ if (ndone == 0 || !starting_with_empty_page)
+ {
+ npages = heap_multi_insert_pages(heaptuples, ndone, ntuples, saveFreeSpace);
+ npages_used = 0;
+ }
+ else
+ npages_used++;
+
/*
* Find buffer where at least the next tuple will fit. If the page is
* all-visible, this will also pin the requisite visibility map page.
@@ -2142,7 +2191,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ npages - npages_used);
page = BufferGetPage(buffer);
starting_with_empty_page = PageGetMaxOffsetNumber(page) == 0;
@@ -3576,7 +3626,8 @@ l2:
/* It doesn't fit, must use RelationGetBufferForTuple. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ 0);
/* We're all done. */
break;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7479212d4e0..65886839e70 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -275,6 +275,11 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
* Returns pinned and exclusive-locked buffer of a page in given relation
* with free space >= given len.
*
+ * If num_pages is > 1, the relation will be extended by at least that many
+ * pages when we decide to extend the relation. This is more efficient for
+ * callers that know they will need multiple pages
+ * (e.g. heap_multi_insert()).
+ *
* If otherBuffer is not InvalidBuffer, then it references a previously
* pinned buffer of another page in the same relation; on return, this
* buffer will also be exclusive-locked. (This case is used by heap_update;
@@ -333,7 +338,8 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
--
2.38.0
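To make the worst-case page computation in heap_multi_insert_pages() above concrete, here is a small standalone sketch of the same arithmetic (not part of the patch). It assumes the usual 8192-byte block size, a 24-byte page header and 4-byte line pointers; the tuple count and sizes are made up.

#include <stdio.h>

#define EXAMPLE_BLCKSZ 8192
#define EXAMPLE_PAGE_HEADER 24	/* SizeOfPageHeaderData on typical builds */
#define EXAMPLE_LINE_POINTER 4	/* sizeof(ItemIdData) */

/* worst-case number of pages needed for ntuples equally sized tuples */
static int
pages_needed(int ntuples, size_t tuple_len, size_t reserved_free_space)
{
	size_t	page_avail = EXAMPLE_BLCKSZ - EXAMPLE_PAGE_HEADER - reserved_free_space;
	size_t	tup_sz = EXAMPLE_LINE_POINTER + tuple_len;	/* tuple_len assumed MAXALIGNed */
	int		npages = 1;

	for (int i = 0; i < ntuples; i++)
	{
		if (page_avail < tup_sz)
		{
			npages++;
			page_avail = EXAMPLE_BLCKSZ - EXAMPLE_PAGE_HEADER - reserved_free_space;
		}
		page_avail -= tup_sz;
	}
	return npages;
}

int
main(void)
{
	/* 100 tuples of 512 bytes: 15 fit per page, so this prints 7 */
	printf("%d\n", pages_needed(100, 512, 0));
	return 0;
}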
v5-0011-hio-Use-ExtendBufferedRelBy.patch
From a803931d1790c178eee0c7ee0c099e453edd782c Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v5 11/14] hio: Use ExtendBufferedRelBy()
---
src/backend/access/heap/hio.c | 332 ++++++++++++++++++++--------------
1 file changed, 194 insertions(+), 138 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e70..40f53d7177e 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -186,89 +186,176 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
}
/*
- * Extend a relation by multiple blocks to avoid future contention on the
- * relation extension lock. Our goal is to pre-extend the relation by an
- * amount which ramps up as the degree of contention ramps up, but limiting
- * the result to some sane overall value.
+ * Extend the relation. By multiple pages, if beneficial.
+ *
+ * If the caller needs multiple pages (num_pages > 1), we always try to extend
+ * by at least that much.
+ *
+ * If there is contention on the extension lock, we don't just extend "for
+ * ourselves", but we try to help others. We can do so by adding empty pages
+ * into the FSM. Typically there is no contention when we can't use the FSM.
+ *
+ * We do have to limit the number of pages to extend by to some value, as the
+ * buffers for all the extended pages need to, temporarily, be pinned. For now
+ * we define MAX_BUFFERS_TO_EXTEND_BY to be 64 buffers; it's hard to see
+ * benefits with higher numbers. That is partly because copyfrom.c's
+ * MAX_BUFFERED_TUPLES / MAX_BUFFERED_BYTES prevent larger multi_inserts.
*/
-static void
-RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+static Buffer
+RelationAddBlocks(Relation relation, BulkInsertState bistate,
+ int num_pages, bool use_fsm)
{
- BlockNumber blockNum,
- firstBlock = InvalidBlockNumber;
- int extraBlocks;
- int lockWaiters;
-
- /* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
- if (lockWaiters <= 0)
- return;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+ BlockNumber firstBlock = InvalidBlockNumber;
+ BlockNumber firstBlockFSM = InvalidBlockNumber;
+ uint32 extend_by_pages;
+ uint32 not_in_fsm_pages;
+ BlockNumber curBlock;
+ uint32 waitcount;
+ Buffer buffer;
/*
- * It might seem like multiplying the number of lock waiters by as much as
- * 20 is too aggressive, but benchmarking revealed that smaller numbers
- * were insufficient. 512 is just an arbitrary cap to prevent
- * pathological results.
+ * Determine how many pages to try to extend by.
*/
- extraBlocks = Min(512, lockWaiters * 20);
-
- do
+ if (bistate == NULL && !use_fsm)
{
- Buffer buffer;
- Page page;
- Size freespace;
-
/*
- * Extend by one page. This should generally match the main-line
- * extension code in RelationGetBufferForTuple, except that we hold
- * the relation extension lock throughout, and we don't immediately
- * initialize the page (see below).
+ * If we have no bistate and can't use the FSM, we can't bulk
+ * extend - there'd be no way to find the additional pages.
*/
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
- page = BufferGetPage(buffer);
-
- if (!PageIsNew(page))
- elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- BufferGetBlockNumber(buffer),
- RelationGetRelationName(relation));
-
- /*
- * Add the page to the FSM without initializing. If we were to
- * initialize here, the page would potentially get flushed out to disk
- * before we add any useful content. There's no guarantee that that'd
- * happen before a potential crash, so we need to deal with
- * uninitialized pages anyway, thus avoid the potential for
- * unnecessary writes.
- */
-
- /* we'll need this info below */
- blockNum = BufferGetBlockNumber(buffer);
- freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
-
- UnlockReleaseBuffer(buffer);
-
- /* Remember first block number thus added. */
- if (firstBlock == InvalidBlockNumber)
- firstBlock = blockNum;
-
- /*
- * Immediately update the bottom level of the FSM. This has a good
- * chance of making this page visible to other concurrently inserting
- * backends, and we want that to happen without delay.
- */
- RecordPageWithFreeSpace(relation, blockNum, freespace);
+ extend_by_pages = 1;
+ }
+ else
+ {
+ /*
+ * Try to extend at least by the number of pages the caller needs. We
+ * can remember the additional pages (either via FSM or bistate).
+ */
+ extend_by_pages = num_pages;
+
+ if (!RELATION_IS_LOCAL(relation))
+ waitcount = RelationExtensionLockWaiterCount(relation);
+ else
+ waitcount = 0;
+
+ /*
+ * Multiply the number of pages to extend by the number of waiters. Do
+ * this even if we're not using the FSM, as it still relieves
+ * contention, by deferring the next time this backend needs to
+ * extend. In that case the extended pages will be found via
+ * bistate->next_free.
+ */
+ extend_by_pages += extend_by_pages * waitcount;
+
+ /*
+ * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, as we need to
+ * pin all of the resulting buffers concurrently.
+ */
+ extend_by_pages = Min(extend_by_pages, MAX_BUFFERS_TO_EXTEND_BY);
}
- while (--extraBlocks > 0);
/*
- * Updating the upper levels of the free space map is too expensive to do
- * for every block, but it's worth doing once at the end to make sure that
- * subsequent insertion activity sees all of those nifty free pages we
- * just inserted.
+ * How many of the extended pages should be entered into the FSM?
+ *
+ * If we have a bistate, only enter pages that we don't need ourselves
+ * into the FSM. Otherwise every other backend will immediately try to
+ * use the pages this backend needs itself, causing unnecessary
+ * contention. If we don't have a bistate, we can't avoid the FSM.
+ *
+ * Never enter the page returned into the FSM, we'll immediately use it.
*/
- FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
+ if (num_pages > 1 && bistate == NULL)
+ not_in_fsm_pages = 1;
+ else
+ not_in_fsm_pages = num_pages;
+
+ /* prepare to put another buffer into the bistate */
+ if (bistate && bistate->current_buf != InvalidBuffer)
+ {
+ ReleaseBuffer(bistate->current_buf);
+ bistate->current_buf = InvalidBuffer;
+ }
+
+ /*
+ * Extend the relation. We ask for the first returned page to be locked,
+ * so that we are sure that nobody has inserted into the page
+ * concurrently.
+ *
+ * With the current MAX_BUFFERS_TO_EXTEND_BY there's no danger of
+ * [auto]vacuum trying to truncate later pages as REL_TRUNCATE_MINIMUM is
+ * way larger.
+ */
+ firstBlock = ExtendBufferedRelBy(EB_REL(relation), MAIN_FORKNUM,
+ bistate ? bistate->strategy : NULL,
+ EB_LOCK_FIRST,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+ /* the buffer the function will return */
+ buffer = victim_buffers[0];
+
+ /*
+ * Relation is now extended. Release pins on all buffers, except for the
+ * first (which we'll return). If we decided to put pages into the FSM,
+ * we can do that as part of the same loop.
+ *
+ * FIXME: Figure out how to better deal with doing this operation while
+ * holding a buffer lock. Likely we can just release the buffer lock (to
+ * reacquire it later) after initializing the page - the target page isn't
+ * in the FSM, so it's going to be very rare that it's found.
+ */
+ curBlock = firstBlock;
+ for (uint32 i = 0; i < extend_by_pages; i++, curBlock++)
+ {
+ Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
+ Assert(BlockNumberIsValid(curBlock));
+
+ /* don't release the pin on the page returned by this function */
+ if (i > 0)
+ ReleaseBuffer(victim_buffers[i]);
+
+ if (i >= not_in_fsm_pages && use_fsm)
+ {
+ if (firstBlockFSM == InvalidBlockNumber)
+ firstBlockFSM = curBlock;
+
+ RecordPageWithFreeSpace(relation,
+ curBlock,
+ BufferGetPageSize(victim_buffers[i]) - SizeOfPageHeaderData);
+ }
+ }
+
+ if (use_fsm && firstBlockFSM != InvalidBlockNumber)
+ FreeSpaceMapVacuumRange(relation, firstBlockFSM, firstBlock + extend_by_pages);
+
+ if (bistate)
+ {
+ /*
+ * Remember the pages we extended by so we can use them without
+ * looking into the FSM.
+ */
+ if (extend_by_pages > 1)
+ {
+ bistate->next_free = firstBlock + 1;
+ bistate->last_free = firstBlock + extend_by_pages - 1;
+ }
+ else
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+
+ /* maintain bistate->current_buf */
+ IncrBufferRefCount(buffer);
+ bistate->current_buf = buffer;
+ }
+
+ return buffer;
+#undef MAX_BUFFERS_TO_EXTEND_BY
}
+
/*
* RelationGetBufferForTuple
*
@@ -350,10 +437,12 @@ RelationGetBufferForTuple(Relation relation, Size len,
targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
- bool needLock;
len = MAXALIGN(len); /* be conservative */
+ if (num_pages <= 0)
+ num_pages = 1;
+
/* Bulk insert is not supported for updates, only inserts. */
Assert(otherBuffer == InvalidBuffer || !bistate);
@@ -558,83 +647,50 @@ loop:
ReleaseBuffer(buffer);
}
- /* Without FSM, always fall out of the loop and extend */
- if (!use_fsm)
- break;
-
- /*
- * Update FSM as to condition of this page, and ask for another page
- * to try.
- */
- targetBlock = RecordAndGetPageWithFreeSpace(relation,
- targetBlock,
- pageFreeSpace,
- targetFreeSpace);
- }
-
- /*
- * Have to extend the relation.
- *
- * We have to use a lock to ensure no one else is extending the rel at the
- * same time, else we will both try to initialize the same new page. We
- * can skip locking for new or temp relations, however, since no one else
- * could be accessing them.
- */
- needLock = !RELATION_IS_LOCAL(relation);
-
- /*
- * If we need the lock but are not able to acquire it immediately, we'll
- * consider extending the relation by multiple blocks at a time to manage
- * contention on the relation extension lock. However, this only makes
- * sense if we're using the FSM; otherwise, there's no point.
- */
- if (needLock)
- {
- if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ if (bistate
+ && bistate->next_free != InvalidBlockNumber
+ && bistate->next_free <= bistate->last_free)
{
- /* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
-
/*
- * Check if some other backend has extended a block for us while
- * we were waiting on the lock.
+ * We bulk extended the relation before, and there are still some
+ * unused pages from that extension, so we don't need to look in
+ * the FSM for a new page. But do record the free space from the
+ * last page, somebody might insert narrower tuples later.
*/
- targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
+ if (use_fsm)
+ RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
- /*
- * If some other waiter has already extended the relation, we
- * don't need to do so; just use the existing freespace.
- */
- if (targetBlock != InvalidBlockNumber)
+ Assert(bistate->last_free != InvalidBlockNumber &&
+ bistate->next_free <= bistate->last_free);
+ targetBlock = bistate->next_free;
+ if (bistate->next_free >= bistate->last_free)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
- goto loop;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
}
-
- /* Time to bulk-extend. */
- RelationAddExtraBlocks(relation, bistate);
+ else
+ bistate->next_free++;
+ }
+ else if (!use_fsm)
+ {
+ /* Without FSM, always fall out of the loop and extend */
+ break;
+ }
+ else
+ {
+ /*
+ * Update FSM as to condition of this page, and ask for another
+ * page to try.
+ */
+ targetBlock = RecordAndGetPageWithFreeSpace(relation,
+ targetBlock,
+ pageFreeSpace,
+ targetFreeSpace);
}
}
- /*
- * In addition to whatever extension we performed above, we always add at
- * least one block to satisfy our own request.
- *
- * XXX This does an lseek - rather expensive - but at the moment it is the
- * only way to accurately determine how many blocks are in a relation. Is
- * it worth keeping an accurate file length in shared memory someplace,
- * rather than relying on the kernel to do it for us?
- */
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ /* Have to extend the relation */
+ buffer = RelationAddBlocks(relation, bistate, num_pages, use_fsm);
/*
* We need to initialize the empty new page. Double-check that it really
--
2.38.0
v5-0012-WIP-Don-t-initialize-page-in-vm-fsm-_extend-not-n.patch
From 751d65ff1f8394f0513e556dd7bf331c4a1c4943 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 28 Feb 2023 20:56:32 -0800
Subject: [PATCH v5 12/14] WIP: Don't initialize page in {vm,fsm}_extend(), not
needed
---
src/backend/access/heap/visibilitymap.c | 6 +-----
src/backend/storage/freespace/freespace.c | 5 +----
2 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 74ff01bb172..709213d0d97 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -623,11 +623,9 @@ static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
BlockNumber vm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
-
/*
* We use the relation extension lock to lock out other backends trying to
* extend the visibility map at the same time. It also locks out extension
@@ -663,8 +661,6 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
vm_nblocks_now++;
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 3e9693b293b..90c529958e7 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,10 +608,9 @@ static void
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
/*
* We use the relation extension lock to lock out other backends trying to
@@ -649,8 +648,6 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
--
2.38.0
v5-0013-Convert-a-few-places-to-ExtendBufferedRelTo.patch
From 6d60144e1deab40782ba5df4e53622e33fd4ed23 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:08:17 -0800
Subject: [PATCH v5 13/14] Convert a few places to ExtendBufferedRelTo
---
src/backend/access/heap/visibilitymap.c | 80 +++++++---------------
src/backend/access/transam/xlogutils.c | 29 ++------
src/backend/storage/freespace/freespace.c | 83 +++++++----------------
3 files changed, 55 insertions(+), 137 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 709213d0d97..60b96f5df80 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -126,7 +126,7 @@
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
-static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static Buffer vm_extend(Relation rel, BlockNumber vm_nblocks);
/*
@@ -575,22 +575,29 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. It's always safe to clear bits, so it's better to clear
+ * corrupt pages than error out.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with vm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
{
if (extend)
- vm_extend(rel, blkno + 1);
+ buf = vm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
+ RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
- * always safe to clear bits, so it's better to clear corrupt pages than
- * error out.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -603,8 +610,6 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
- RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -619,51 +624,16 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* Ensure that the visibility map fork is at least vm_nblocks long, extending
* it if necessary with zeroed pages.
*/
-static void
+static Buffer
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
- BlockNumber vm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
+ Buffer buf;
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the visibility map at the same time. It also locks out extension
- * of the main fork, unnecessarily, but extending the visibility map
- * happens seldom enough that it doesn't seem worthwhile to have a
- * separate lock tag type for it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, VISIBILITYMAP_FORKNUM))
- smgrcreate(reln, VISIBILITYMAP_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
- vm_nblocks_now = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);
-
- /* Now extend the file */
- while (vm_nblocks_now < vm_nblocks)
- {
- smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
- vm_nblocks_now++;
- }
+ buf = ExtendBufferedRelTo(EB_REL(rel), VISIBILITYMAP_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ vm_nblocks,
+ RBM_ZERO_ON_ERROR);
/*
* Send a shared-inval message to force other backends to close any smgr
@@ -672,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* to keep checking for creation or extension of the file, which happens
* infrequently.
*/
- CacheInvalidateSmgr(reln->smgr_rlocator);
+ CacheInvalidateSmgr(RelationGetSmgr(rel)->smgr_rlocator);
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return buf;
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 2c28956b1aa..e174a2a8919 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -524,28 +524,13 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
/* OK to extend the file */
/* we do this in recovery only - no rel-extension lock needed */
Assert(InRecovery);
- buffer = InvalidBuffer;
- do
- {
- if (buffer != InvalidBuffer)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- }
- buffer = ReadBufferWithoutRelcache(rlocator, forknum,
- P_NEW, mode, NULL, true);
- }
- while (BufferGetBlockNumber(buffer) < blkno);
- /* Handle the corner case that P_NEW returns non-consecutive pages */
- if (BufferGetBlockNumber(buffer) != blkno)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno,
- mode, NULL, true);
- }
+ buffer = ExtendBufferedRelTo(EB_SMGR(smgr, RELPERSISTENCE_PERMANENT),
+ forknum,
+ NULL,
+ EB_PERFORMING_RECOVERY |
+ EB_SKIP_EXTENSION_LOCK,
+ blkno + 1,
+ mode);
}
recent_buffer_fast_path:
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 90c529958e7..2face615d07 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -98,7 +98,7 @@ static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot);
static BlockNumber fsm_logical_to_physical(FSMAddress addr);
static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend);
-static void fsm_extend(Relation rel, BlockNumber fsm_nblocks);
+static Buffer fsm_extend(Relation rel, BlockNumber fsm_nblocks);
/* functions to convert amount of free space to a FSM category */
static uint8 fsm_space_avail_to_cat(Size avail);
@@ -558,24 +558,30 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. The FSM information is not accurate anyway, so it's better
+ * to clear corrupt pages than error out. Since the FSM changes are not
+ * WAL-logged, the so-called torn page problem on crash can lead to pages
+ * with corrupt headers, for example.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with fsm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[FSM_FORKNUM])
{
if (extend)
- fsm_extend(rel, blkno + 1);
+ buf = fsm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. The FSM
- * information is not accurate anyway, so it's better to clear corrupt
- * pages than error out. Since the FSM changes are not WAL-logged, the
- * so-called torn page problem on crash can lead to pages with corrupt
- * headers, for example.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -588,7 +594,6 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -604,56 +609,14 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* it if necessary with empty pages. And by empty, I mean pages filled
* with zeros, meaning there's no free space.
*/
-static void
+static Buffer
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
- BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
-
-
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the FSM at the same time. It also locks out extension of the
- * main fork, unnecessarily, but extending the FSM happens seldom enough
- * that it doesn't seem worthwhile to have a separate lock tag type for
- * it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, FSM_FORKNUM))
- smgrcreate(reln, FSM_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- fsm_nblocks_now = smgrnblocks(reln, FSM_FORKNUM);
-
- /* Extend as needed. */
- while (fsm_nblocks_now < fsm_nblocks)
- {
- smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
- pg.data, false);
- fsm_nblocks_now++;
- }
-
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return ExtendBufferedRelTo(EB_REL(rel), FSM_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ fsm_nblocks,
+ RBM_ZERO_ON_ERROR);
}
/*
--
2.38.0
Hi,
Below is my review of a slightly older version than you just posted --
much of it you may have already addressed.
From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 20 Mar 2023 21:57:40 -0700
Subject: [PATCH v5 01/15] createdb-using-wal-fixes
This could use a more detailed commit message -- I don't really get what
it is doing
From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v5 02/15] Add some error checking around pinning
---
src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
src/include/storage/bufmgr.h | 1 +
2 files changed, 29 insertions(+), 12 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 95212a3941..fa20fab5a2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
I'd rather this be an else if (was already like this, but, no reason not
to change it now)
+ elog(ERROR, "incorrect local pin count: %d",
+ GetPrivateRefCount(buffer));
+ }
+}
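For illustration, the else-if shape I have in mind would be roughly this (a
sketch of the suggested restructuring, not the patch's code):

void
BufferCheckOneLocalPin(Buffer buffer)
{
    if (BufferIsLocal(buffer))
    {
        /* There should be exactly one pin */
        if (LocalRefCount[-buffer - 1] != 1)
            elog(ERROR, "incorrect local pin count: %d",
                 LocalRefCount[-buffer - 1]);
    }
    else if (GetPrivateRefCount(buffer) != 1)
    {
        /* There should be exactly one local pin */
        elog(ERROR, "incorrect local pin count: %d",
             GetPrivateRefCount(buffer));
    }
}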
From 00d3044770478eea31e00fee8d1680f22ca6adde Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 27 Feb 2023 17:36:37 -0800
Subject: [PATCH v5 04/15] Add smgrzeroextend(), FileZero(), FileFallocate()
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9fd8444ed4..c34ed41d52 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2206,6 +2206,92 @@ FileSync(File file, uint32 wait_event_info)
return returnCode;
}
+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+ int returnCode;
+ ssize_t written;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+ pgstat_report_wait_end();
+
+ if (written < 0)
+ return -1;
+ else if (written != amount)
this doesn't need to be an else if
+ {
+ /* if errno is unset, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ return -1;
+ }
+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+ int returnCode;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+ * Return in cases of a "real" failure, if fallocate is not supported,
+ * fall through to the FileZero() backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return returnCode;
I'm pretty sure you can just delete the below if statement
+ if (returnCode == 0 ||
+ (returnCode != EINVAL && returnCode != EINVAL))
+ return returnCode;
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..59a65a8305 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -28,6 +28,7 @@
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
+#include "common/file_utils.h"
#include "miscadmin.h"
#include "pg_trace.h"
#include "pgstat.h"
@@ -500,6 +501,116 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
+/*
+ * mdzeroextend() -- Add ew zeroed out blocks to the specified relation.
not sure what ew is
+ *
+ * Similar to mdrextend(), except the relation can be extended by
mdrextend->mdextend
+ * multiple blocks at once, and that the added blocks will be filled with
I would lose the comma and just say "and the added blocks will be filled..."
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync)
So, I think there are a few too many local variables in here, and it
actually makes it more confusing.
Assuming you would like to keep the input parameters blocknum and
nblocks unmodified for debugging/other reasons, here is a suggested
refactor of this function
Also, I think you can combine the two error cases (I don't know if the
user cares what you were trying to extend the file with). I've done this
below also.
void
mdzeroextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, int nblocks, bool skipFsync)
{
MdfdVec *v;
BlockNumber curblocknum = blocknum;
int remblocks = nblocks;
Assert(nblocks > 0);
/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
Assert(blocknum >= mdnblocks(reln, forknum));
#endif
/*
* If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
* more --- we mustn't create a block whose number actually is
* InvalidBlockNumber or larger.
*/
if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
ereport(ERROR,
(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
errmsg("cannot extend file \"%s\" beyond %u blocks",
relpath(reln->smgr_rlocator, forknum),
InvalidBlockNumber)));
while (remblocks > 0)
{
int segstartblock = curblocknum % ((BlockNumber)
RELSEG_SIZE);
int numblocks = remblocks;
off_t seekpos = (off_t) BLCKSZ * segstartblock;
int ret;
if (segstartblock + remblocks > RELSEG_SIZE)
numblocks = RELSEG_SIZE - segstartblock;
v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync,
EXTENSION_CREATE);
/*
* If available and useful, use posix_fallocate() (via FileAllocate())
* to extend the relation. That's often more efficient than using
* write(), as it commonly won't cause the kernel to allocate page
* cache space for the extended pages.
*
* However, we don't use FileAllocate() for small extensions, as it
* defeats delayed allocation on some filesystems. Not clear where
* that decision should be made though? For now just use a cutoff of
* 8, anything between 4 and 8 worked OK in some local testing.
*/
if (numblocks > 8)
ret = FileFallocate(v->mdfd_vfd,
seekpos, (off_t) BLCKSZ * numblocks,
WAIT_EVENT_DATA_FILE_EXTEND);
else
/*
* Even if we don't want to use fallocate, we can still extend a
* bit more efficiently than writing each 8kB block individually.
* pg_pwrite_zeroes() (via FileZero()) uses
* pg_pwritev_with_retry() to avoid multiple writes or needing a
* zeroed buffer for the whole length of the extension.
*/
ret = FileZero(v->mdfd_vfd,
seekpos, (off_t) BLCKSZ * numblocks,
WAIT_EVENT_DATA_FILE_EXTEND);
if (ret != 0)
ereport(ERROR,
errcode_for_file_access(),
errmsg("could not extend file \"%s\": %m",
FilePathName(v->mdfd_vfd)),
errhint("Check free disk space."));
if (!skipFsync && !SmgrIsTemp(reln))
register_dirty_segment(reln, forknum, v);
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
remblocks -= numblocks;
curblocknum += numblocks;
}
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e5414..5224ca5259 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
+/*
+ * smgrzeroextend() -- Add new zeroed out blocks to a file.
+ *
+ * Similar to smgrextend(), except the relation can be extended by
+ * multiple blocks at once, and that the added blocks will be filled with
+ * zeroes.
+ */
Similar grammatical feedback as mdzeroextend.
From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa20fab5a2..6f50dbd212 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
}
void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)
This name is weird. Who is we?
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..798c5b93a8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+ uint32 buf_state;
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ int bufid = -(buffer + 1);
You do
int buffid = -buffer - 1;
in UnpinLocalBuffer()
They should be consistent.
int bufid = -(buffer + 1);
I think this version is better:
int buffid = -buffer - 1;
Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
undefined behavior while the -buffer - 1 version doesn't.
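A minimal standalone illustration of that difference (hypothetical values,
not PostgreSQL code):

#include <limits.h>
#include <stdio.h>

int
main(void)
{
    int     buffer = INT_MAX;   /* worst case for a signed Buffer-like value */

    /*
     * -buffer - 1 evaluates to INT_MIN and stays in range, whereas
     * -(buffer + 1) would compute INT_MAX + 1 first, which is signed
     * overflow and therefore undefined behavior.
     */
    printf("%d\n", -buffer - 1);
    return 0;
}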
From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:26:34 -0800
Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately
Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations
needing to hold the extension lock while acquiring a victim buffer is
painful. By separating out the victim buffer acquisition, future commits will
be able to change relation extensions to scale better.
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3d0683593f..ea423ae484 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
* to read it before we did, so there's nothing left for BufferAlloc() to
* do.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(victim_buf_hdr, true))
*foundPtr = false;
else
*foundPtr = true;
I know it was already like this, but since you edited the line already,
can we just make this this now?
*foundPtr = !StartBufferIO(victim_buf_hdr, true);
@@ -1595,6 +1413,237 @@ retry:
StrategyFreeBuffer(buf);
}
+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+ /*
+ * Clear out the buffer's tag and flags and usagecount. This is not
+ * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+ * doing anything with the buffer. But currently it's beneficial as the
+ * pre-check for several linear scans of shared buffers just checks the
+ * tag.
I don't really understand the above comment -- mainly the last sentence.
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+ BufferDesc *buf_hdr;
+ Buffer buf;
+ uint32 buf_state;
+ bool from_ring;
+
+ /*
+ * Ensure, while the spinlock's not yet held, that there's a free refcount
+ * entry.
+ */
+ ReservePrivateRefCountEntry();
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* we return here if a prospective victim buffer gets used concurrently */
+again:
Why use goto instead of a loop here (again is the goto label)?
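To spell out the loop-shaped alternative I mean, roughly (a sketch only;
try_claim_victim() is a placeholder for the clock-sweep/invalidation body,
not a real function):

static Buffer
GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
{
    Buffer      buf;

    ReservePrivateRefCountEntry();
    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);

    /* retry until a prospective victim isn't taken concurrently */
    for (;;)
    {
        buf = try_claim_victim(strategy, io_context);
        if (buf != InvalidBuffer)
            break;
        /* otherwise loop back, where the patch does "goto again" */
    }

    return buf;
}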
From a7597b79dffaf96807f4a9beea0a39634530298d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v5 08/15] bufmgr: Support multiple in-progress IOs by using
resowner
Commit message should describe why we couldn't support multiple
in-progress IOs before, I think (e.g. we couldn't be sure that we
cleared IO_IN_PROGRESS if something happened).
@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
{
uint32 buf_state;
I noticed that the comment above TerminateBufferIO() says
* TerminateBufferIO: release a buffer we were doing I/O on
* (Assumptions)
* My process is executing IO for the buffer
Can we still say this is an assumption? What about when it is being
cleaned up after being called from AbortBufferIO()?
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45..fccc59b39d 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
/* We have built-in support for remembering: */
ResourceArray bufferarr; /* owned buffers */
+ ResourceArray bufferioarr; /* in-progress buffer IO */
ResourceArray catrefarr; /* catcache references */
ResourceArray catlistrefarr; /* catcache-list pins */
ResourceArray relrefarr; /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
Maybe worth mentioning in-progress buffer IO in resowner README? I know
it doesn't claim to be exhaustive, so, up to you.
Also, I realize that existing code in this file has the extraneous
parentheses, but maybe it isn't worth staying consistent with that?
as in: &(owner->bufferioarr)
+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+
From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c95b87bca..4e07a5bc48 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+/*
+ * Extend relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources the relation may end up being extended by a
+ * smaller number of pages (unless an error is thrown, always by at least one
+ * page). *extended_by is updated to the number of pages the relation has been
+ * extended to.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first extend_by array elements will point to a pinned
+ * buffer.
+ *
+ * If EB_LOCK_FIRST is part of flags, the first returned buffer is
+ * locked. This is useful for callers that want a buffer that is guaranteed to
+ * be empty.
This should document what the returned BlockNumber is.
Also, instead of having extend_by and extended_by, how about just having
one which is set by the caller to the desired number to extend by and
then overwritten in this function to the value it successfully extended
by.
It would be nice if the function returned the number it extended by
instead of the BlockNumber.
+ */
+BlockNumber
+ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
Can we turn these into !=
Assert((eb.rel != NULL) != (eb.smgr != NULL));
since it is easier to understand.
(it is also in ExtendBufferedRelTo())
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_by > 0);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ return ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ extend_by, InvalidBlockNumber,
+ buffers, extended_by);
+}
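To make the earlier extend_by/extended_by suggestion concrete, a hypothetical
alternative signature could look like this (an illustration only, not what the
patch implements):

/*
 * Returns the number of pages the relation was actually extended by; the
 * block number of the first new page is passed back via *first_block.
 */
extern uint32 ExtendBufferedRelBy(ExtendBufferedWhat eb,
                                  ForkNumber fork,
                                  BufferAccessStrategy strategy,
                                  uint32 flags,
                                  uint32 extend_by,
                                  Buffer *buffers,
                                  BlockNumber *first_block);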
+ * Extend the relation so it is at least extend_to blocks large, read buffer
Use of "read buffer" here is confusing. We only read the block if, after
we try extending the relation, someone else already did so and we have
to read the block they extended in, right?
+ * (extend_to - 1).
+ *
+ * This is useful for callers that want to write a specific page, regardless
+ * of the current size of the relation (e.g. useful for visibilitymap and for
+ * crash recovery).
+ */
+Buffer
+ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode)
+{
+ while (current_size < extend_to)
+ {
Can declare buffers variable here.
+ Buffer buffers[64];
+ uint32 num_pages = lengthof(buffers);
+ BlockNumber first_block;
+
+ if ((uint64) current_size + num_pages > extend_to)
+ num_pages = extend_to - current_size;
+
+ first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ num_pages, extend_to,
+ buffers, &extended_by);
+
+ current_size = first_block + extended_by;
+ Assert(current_size <= extend_to);
+ Assert(num_pages != 0 || current_size >= extend_to);
+
+ for (int i = 0; i < extended_by; i++)
+ {
+ if (first_block + i != extend_to - 1)
Is there a way we could avoid pinning these other buffers to begin with
(e.g. passing a parameter to ExtendBufferedRelCommon())
+ ReleaseBuffer(buffers[i]);
+ else
+ buffer = buffers[i];
+ }
+ }
+ /*
+ * It's possible that another backend concurrently extended the
+ * relation. In that case read the buffer.
+ *
+ * XXX: Should we control this via a flag?
+ */
I feel like there needs to be a more explicit comment about how you
could end up in this situation -- e.g. someone else extends the relation
and so smgrnblocks returns a value that is greater than extend_to, so
buffer stays InvalidBuffer
+ if (buffer == InvalidBuffer)
+ {
+ bool hit;
+
+ Assert(extended_by == 0);
+ buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+ fork, extend_to - 1, mode, strategy,
+ &hit);
+ }
+
+ return buffer;
+}
Do we use compound literals? Here, this could be:
buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
fork, extend_to - 1, mode, strategy,
&(bool) {0});
To eliminate the extraneous hit variable.
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool found;
IOContext io_context;
IOObject io_object;
- bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
*hit = false;
+ /*
+ * Backward compatibility path, most code should use
+ * ExtendRelationBuffered() instead, as acquiring the extension lock
+ * inside ExtendRelationBuffered() scales a lot better.
Think these are old function names in the comment
+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+ IOContext io_context = IOContextForStrategy(strategy);
+
+ LimitAdditionalPins(&extend_by);
+
+ /*
+ * Acquire victim buffers for extension without holding extension lock.
+ * Writing out victim buffers is the most expensive part of extending the
+ * relation, particularly when doing so requires WAL flushes. Zeroing out
+ * the buffers is also quite expensive, so do that before holding the
+ * extension lock as well.
+ *
+ * These pages are pinned by us and not valid. While we hold the pin they
+ * can't be acquired as victim buffers by another backend.
+ */
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ Block buf_block;
+
+ buffers[i] = GetVictimBuffer(strategy, io_context);
+ buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ /*
+ * Lock relation against concurrent extensions, unless requested not to.
+ *
+ * We use the same extension lock for all forks. That's unnecessarily
+ * restrictive, but currently extensions for forks don't happen often
+ * enough to make it worth locking more granularly.
+ *
+ * Note that another backend might have extended the relation by the time
+ * we get the lock.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+ eb.smgr = RelationGetSmgr(eb.rel);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
I don't see this in master, is it new?
+ first_block = smgrnblocks(eb.smgr, fork);
+
The below needs a better comment explaining what it is handling. e.g. if
we end up extending by less than we planned, unpin all of the surplus
victim buffers.
+ if (extend_upto != InvalidBlockNumber)
+ {
+ uint32 old_num_pages = extend_by;
maybe call this something like original_extend_by
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5b44b0be8b..0528fddf99 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
+BlockNumber
+ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ victim_buf_id = -(buffers[i] + 1);
same comment here as before.
+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+ /*
+ * Don't acquire extension lock. This is safe only if the relation isn't
+ * shared, an access exclusive lock is held or if this is the startup
+ * process.
+ */
+ EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+ /* Is this extension part of recovery? */
+ EB_PERFORMING_RECOVERY = (1 << 1),
+
+ /*
+ * Should the fork be created if it does not currently exist? This likely
+ * only ever makes sense for relation forks.
+ */
+ EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+ /* Should the first (possibly only) return buffer be returned locked? */
+ EB_LOCK_FIRST = (1 << 3),
+
+ /* Should the smgr size cache be cleared? */
+ EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+ /* internal flags follow */
I don't understand what this comment means ("internal flags follow")
+ EB_LOCK_TARGET = (1 << 5),
+} ExtendBufferedFlags;
+typedef struct ExtendBufferedWhat
Maybe this should be called like BufferedExtendTarget or something?
+{
+ Relation rel;
+ struct SMgrRelationData *smgr;
+ char relpersistence;
+} ExtendBufferedWhat;
From e4438c0eb87035e4cefd1de89458a8d88c90c0e3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v5 11/15] heapam: Add num_pages to RelationGetBufferForTuple()
This will be useful to compute the number of pages to extend a relation by.
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index cf4b917eb4..500904897d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2050,6 +2053,33 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
return tup;
}
+/*
+ * Helper for heap_multi_insert() that computes the number of full pages s
no space after page before s
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+ size_t page_avail;
+ int npages = 0;
+
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ npages++;
can this not just be this:
size_t page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
int npages = 1;
From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()
---
src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
1 file changed, 146 insertions(+), 139 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e7..48cfcff975 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,
so in RelationGetBufferForTuple() up above where your changes start,
there is this code
/*
* We first try to put the tuple on the same page we last inserted a tuple
* on, as cached in the BulkInsertState or relcache entry. If that
* doesn't work, we ask the Free Space Map to locate a suitable page.
* Since the FSM's info might be out of date, we have to be prepared to
* loop around and retry multiple times. (To insure this isn't an infinite
* loop, we must update the FSM with the correct amount of free space on
* each page that proves not to be suitable.) If the FSM has no record of
* a page with enough free space, we give up and extend the relation.
*
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
if (bistate && bistate->current_buf != InvalidBuffer)
{
targetBlock = BufferGetBlockNumber(bistate->current_buf);
}
else
targetBlock = RelationGetTargetBlock(relation);
if (targetBlock == InvalidBlockNumber && use_fsm)
{
/*
* We have no cached target page, so ask the FSM for an initial
* target.
*/
targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
}
And I was thinking how ReadBufferBI() only has one caller now
(RelationGetBufferForTuple()), and this caller has basically already
checked for the case handled inside of ReadBufferBI() (the code I pasted
above).
/* If we have the desired block already pinned, re-pin and return it */
if (bistate->current_buf != InvalidBuffer)
{
if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
{
/*
* Currently the LOCK variants are only used for extending
* relation, which should never reach this branch.
*/
Assert(mode != RBM_ZERO_AND_LOCK &&
mode != RBM_ZERO_AND_CLEANUP_LOCK);
IncrBufferRefCount(bistate->current_buf);
return bistate->current_buf;
}
/* ... else drop the old buffer */
So, I was thinking maybe there is some way to inline the logic for
ReadBufferBI(), because I think it would feel more streamlined to me.
@@ -558,18 +477,46 @@ loop:
ReleaseBuffer(buffer);
}
Oh, and I forget which commit introduced BulkInsertState->next_free and
last_free, but I remember thinking that it didn't seem to fit with the
other parts of that commit.
- /* Without FSM, always fall out of the loop and extend */
- if (!use_fsm)
- break;
+ if (bistate
+ && bistate->next_free != InvalidBlockNumber
+ && bistate->next_free <= bistate->last_free)
+ {
+ /*
+ * We bulk extended the relation before, and there are still some
+ * unused pages from that extension, so we don't need to look in
+ * the FSM for a new page. But do record the free space from the
+ * last page, somebody might insert narrower tuples later.
+ */
Why couldn't we have found out that we bulk-extended before and get the
block from there up above the while loop?
+ if (use_fsm)
+ RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
- /*
- * Update FSM as to condition of this page, and ask for another page
- * to try.
- */
- targetBlock = RecordAndGetPageWithFreeSpace(relation,
- targetBlock,
- pageFreeSpace,
- targetFreeSpace);
+ Assert(bistate->last_free != InvalidBlockNumber &&
You don't need the below half of the assert.
+ bistate->next_free <= bistate->last_free);
+ targetBlock = bistate->next_free;
+ if (bistate->next_free >= bistate->last_free)
they can only be equal at this point
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ else
+ bistate->next_free++;
+ }
+ else if (!use_fsm)
+ {
+ /* Without FSM, always fall out of the loop and extend */
+ break;
+ }
It would be nice to have a comment explaining why this is in its own
else if instead of breaking earlier (i.e. !use_fsm is still a valid case
in the if branch above it)
+ else
+ {
+ /*
+ * Update FSM as to condition of this page, and ask for another
+ * page to try.
+ */
+ targetBlock = RecordAndGetPageWithFreeSpace(relation,
+ targetBlock,
+ pageFreeSpace,
+ targetFreeSpace);
+ }
we can get rid of needLock and waitcount variables like this
+#define MAX_BUFFERS 64
+ Buffer victim_buffers[MAX_BUFFERS];
+ BlockNumber firstBlock = InvalidBlockNumber;
+ BlockNumber firstBlockFSM = InvalidBlockNumber;
+ BlockNumber curBlock;
+ uint32 extend_by_pages;
+ uint32 no_fsm_pages;
+ uint32 waitcount;
+
+ extend_by_pages = num_pages;
+
+ /*
+ * Multiply the number of pages to extend by the number of waiters. Do
+ * this even if we're not using the FSM, as it does relieve
+ * contention. Pages will be found via bistate->next_free.
+ */
+ if (needLock)
+ waitcount = RelationExtensionLockWaiterCount(relation);
+ else
+ waitcount = 0;
+ extend_by_pages += extend_by_pages * waitcount;
if (!RELATION_IS_LOCAL(relation))
extend_by_pages += extend_by_pages *
RelationExtensionLockWaiterCount(relation);
+
+ /*
+ * can't extend by more than MAX_BUFFERS, we need to pin them all
+ * concurrently. FIXME: Need an NBuffers / MaxBackends type limit
+ * here.
+ */
+ extend_by_pages = Min(extend_by_pages, MAX_BUFFERS);
+
+ /*
+ * How many of the extended pages not to enter into the FSM.
+ *
+ * Only enter pages that we don't need ourselves into the FSM.
+ * Otherwise every other backend will immediately try to use the pages
+ * this backend neds itself, causing unnecessary contention.
+ *
+ * Bulk extended pages are remembered in bistate->next_free_buffer. So
+ * without a bistate we can't directly make use of them.
+ *
+ * Never enter the page returned into the FSM, we'll immediately use
+ * it.
+ */
+ if (num_pages > 1 && bistate == NULL)
+ no_fsm_pages = 1;
+ else
+ no_fsm_pages = num_pages;
this is more clearly this:
no_fsm_pages = bistate == NULL ? 1 : num_pages;
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
+ if (bistate)
+ {
+ if (extend_by_pages > 1)
+ {
+ bistate->next_free = firstBlock + 1;
+ bistate->last_free = firstBlock + extend_by_pages - 1;
+ }
+ else
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+ }
+
+ buffer = victim_buffers[0];
If we move buffer = up, we can have only one if (bistate)
+ if (bistate)
+ {
+ IncrBufferRefCount(buffer);
+ bistate->current_buf = buffer;
+ }
+ }
like this:
buffer = victim_buffers[0];
if (bistate)
{
if (extend_by_pages > 1)
{
bistate->next_free = firstBlock + 1;
bistate->last_free = firstBlock + extend_by_pages - 1;
}
else
{
bistate->next_free = InvalidBlockNumber;
bistate->last_free = InvalidBlockNumber;
}
IncrBufferRefCount(buffer);
bistate->current_buf = buffer;
}
From 6711e45bed59ee07ec277b9462f4745603a3d4a4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:41:46 -0700
Subject: [PATCH v5 15/15] bufmgr: debug: Add PrintBuffer[Desc]
Useful for development. Perhaps we should polish these and keep them?
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4e07a5bc48..0d382cd787 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
+
+ fprintf(stderr, "%d: [%u] msg: %s, rel: %s, block %u: refcount: %u / %u, usagecount: %u, flags:%s%s%s%s%s%s%s%s%s%s\n",
+ MyProcPid,
+ buffer,
+ msg,
+ path,
+ blockno,
+ BUF_STATE_GET_REFCOUNT(buf_state),
+ GetPrivateRefCount(buffer),
+ BUF_STATE_GET_USAGECOUNT(buf_state),
+ buf_state & BM_LOCKED ? " BM_LOCKED" : "",
+ buf_state & BM_DIRTY ? " BM_DIRTY" : "",
+ buf_state & BM_VALID ? " BM_VALID" : "",
+ buf_state & BM_TAG_VALID ? " BM_TAG_VALID" : "",
+ buf_state & BM_IO_IN_PROGRESS ? " BM_IO_IN_PROGRESS" : "",
+ buf_state & BM_IO_ERROR ? " BM_IO_ERROR" : "",
+ buf_state & BM_JUST_DIRTIED ? " BM_JUST_DIRTIED" : "",
+ buf_state & BM_PIN_COUNT_WAITER ? " BM_PIN_COUNT_WAITER" : "",
+ buf_state & BM_CHECKPOINT_NEEDED ? " BM_CHECKPOINT_NEEDED" : "",
+ buf_state & BM_PERMANENT ? " BM_PERMANENT" : ""
+ );
+}
How about this
#define FLAG_DESC(flag) (buf_state & (flag) ? " " #flag : "")
FLAG_DESC(BM_LOCKED),
FLAG_DESC(BM_DIRTY),
FLAG_DESC(BM_VALID),
FLAG_DESC(BM_TAG_VALID),
FLAG_DESC(BM_IO_IN_PROGRESS),
FLAG_DESC(BM_IO_ERROR),
FLAG_DESC(BM_JUST_DIRTIED),
FLAG_DESC(BM_PIN_COUNT_WAITER),
FLAG_DESC(BM_CHECKPOINT_NEEDED),
FLAG_DESC(BM_PERMANENT)
#undef FLAG_DESC
+
+void
+PrintBuffer(Buffer buffer, const char *msg)
+{
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffer - 1);
no need for this variable
+
+ PrintBufferDesc(buf_hdr, msg);
+}
- Melanie
At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
relation extension related code in hio.c back into its own function.
While reviewing the hio.c code, I did realize that too much stuff is done
while holding the buffer lock. See also the pre-existing issue
/messages/by-id/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
0001, 0002 looks fine to me.
0003 adds the new function FileFallocate(), but we already have
AllocateFile(). Although fd.c contains functions with varying word
orders, it could be confusing that closely named functions follow
different naming conventions.
+ /*
+ * Return in cases of a "real" failure, if fallocate is not supported,
+ * fall through to the FileZero() backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return returnCode;
I'm not entirely sure, but man 2 fallocate says that ENOSYS can also be
returned. Some googling indicates that ENOSYS might need the same
treatment as EOPNOTSUPP. However, I'm not clear on why man
posix_fallocate doesn't mention the former.
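If it does, the fallback condition would presumably become something like the
following (a sketch; whether ENOSYS can actually be returned here is exactly
the open question):

    /*
     * Return in case of a "real" failure; if fallocate is not supported
     * (EINVAL, EOPNOTSUPP, or possibly ENOSYS), fall through to the
     * FileZero() based implementation instead.
     */
    if (returnCode != EINVAL &&
        returnCode != EOPNOTSUPP &&
        returnCode != ENOSYS)
        return returnCode;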
+ (returnCode != EINVAL && returnCode != EINVAL))
:)
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file) < 0)
+ return -1;
+
The function's comment is provided below.
* The returned file descriptor will be valid until the file is closed, but
* there are a lot of things that can make that happen. So the caller should
* be careful not to do much of anything else before it finishes using the
* returned file descriptor.
So the responsibility for making sure the file is valid seems to lie
with the callers, although I'm not sure, since there aren't any users of
the function in the tree. I'm unclear as to why FileSize omits the
lruLessRecently != file case. When examining similar functions, such as
FileGetRawFlags and FileGetRawMode, I'm puzzled to find that neither
FileAccess() nor BasicOpenFilePerm() sets the struct members referred to
by those functions. This makes me question the usefulness of these
functions, including FileGetRawDesc(). Regardless, since the patchset
doesn't use FileGetRawDesc(), I don't believe the fix is necessary in
this patch set.
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
I'm not sure it is appropriate to assume InvalidBlockNumber equals
MaxBlockNumber + 1 in this context.
+ int segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+ int segendblock = (curblocknum % ((BlockNumber) RELSEG_SIZE)) + remblocks;
+ off_t seekpos = (off_t) BLCKSZ * segstartblock;
segendblock can be defined as "segstartblock + remblocks", which would
be clearer.
+ * If available and useful, use posix_fallocate() (via FileAllocate())
FileFallocate()?
+ * However, we don't use FileAllocate() for small extensions, as it
+ * defeats delayed allocation on some filesystems. Not clear where
+ * that decision should be made though? For now just use a cutoff of
+ * 8, anything between 4 and 8 worked OK in some local testing.
The choice here is quite similar to the one FileFallocate() makes. However, I'm
not sure FileFallocate() itself should be doing this.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,
On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
Below is my review of a slightly older version than you just posted --
much of it you may have already addressed.
Far from all of it is addressed already - thanks for the review!
From 3a6c3f41000e057bae12ab4431e6bb1c5f3ec4b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 20 Mar 2023 21:57:40 -0700
Subject: [PATCH v5 01/15] createdb-using-wal-fixes
This could use a more detailed commit message -- I don't really get what
it is doing
It's a fix for a bug that I encountered while hacking on this; it has since
been committed.
commit 5df319f3d55d09fadb4f7e4b58c5b476a3aeceb4
Author: Andres Freund <andres@anarazel.de>
Date: 2023-03-20 21:57:40 -0700
Fix memory leak and inefficiency in CREATE DATABASE ... STRATEGY WAL_LOG
RelationCopyStorageUsingBuffer() did not free the strategies used to access
the source / target relation. They memory was released at the end of the
transaction, but when using a template database with a lot of relations, the
temporary leak can become big prohibitively big.
RelationCopyStorageUsingBuffer() acquired the buffer for the target relation
with RBM_NORMAL, therefore requiring a read of a block guaranteed to be
zero. Use RBM_ZERO_AND_LOCK instead.
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: /messages/by-id/20230321070113.o2vqqxogjykwgfrr@awork3.anarazel.de
Backpatch: 15-, where STRATEGY WAL_LOG was introduced
From 6faba69c241fd5513022bb042c33af09d91e84a6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v5 02/15] Add some error checking around pinning
---
src/backend/storage/buffer/bufmgr.c | 40 ++++++++++++++++++++---------
src/include/storage/bufmgr.h | 1 +
2 files changed, 29 insertions(+), 12 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 95212a3941..fa20fab5a2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
I'd rather this be an else if (was already like this, but, no reason not
to change it now)
I don't like that much - it'd break the symmetry between local / non-local.
+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+ int returnCode;
+ ssize_t written;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+ pgstat_report_wait_end();
+
+ if (written < 0)
+ return -1;
+ else if (written != amount)
this doesn't need to be an else if
You mean it could be a "bare" if instead? I don't really think that's clearer.
+ {
+ /* if errno is unset, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ return -1;
+ }
+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+ int returnCode;
+
+ Assert(FileIsValid(file));
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+ * Return in cases of a "real" failure, if fallocate is not supported,
+ * fall through to the FileZero() backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return returnCode;
I'm pretty sure you can just delete the below if statement
+ if (returnCode == 0 ||
+ (returnCode != EINVAL && returnCode != EINVAL))
+ return returnCode;
Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
FileFallocate() succeeded or failed (rather than not being supported)?
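For readers following along, the intended control flow being discussed is roughly the following (a simplified sketch only; FileAccess(), the wait-event bookkeeping and the #ifdef structure of the real function are omitted):

	returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);

	if (returnCode == 0)
		return 0;				/* space reserved, done */

	/* for compatibility with %m printing etc */
	errno = returnCode;

	/* a "real" failure, e.g. ENOSPC: report it to the caller */
	if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
		return returnCode;

	/* fallocate not supported on this filesystem: fall back to zeroing */
	return FileZero(file, offset, amount, wait_event_info);

So only the "not supported" errors are meant to reach the FileZero() fallback; deleting the early return would send every outcome there.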
+/*
+ * mdzeroextend() -- Add ew zeroed out blocks to the specified relation.

not sure what ew is
A hurried new :)
+ *
+ * Similar to mdrextend(), except the relation can be extended by

mdrextend->mdextend

+ * multiple blocks at once, and that the added blocks will be
+ * filled with

I would lose the comma and just say "and the added blocks will be filled..."
Done.
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+			 BlockNumber blocknum, int nblocks, bool skipFsync)

So, I think there are a few too many local variables in here, and it
actually makes it more confusing.
Assuming you would like to keep the input parameters blocknum and
nblocks unmodified for debugging/other reasons, here is a suggested
refactor of this function
I'm mostly adopting this.
Also, I think you can combine the two error cases (I don't know if the
user cares what you were trying to extend the file with).
Hm. I do find it a somewhat useful distinction for figuring out problems - we
haven't used posix_fallocate for files so far, it seems plausible we'd hit
some portability issues. We could make it an errdetail(), I guess?
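A minimal sketch of what the errdetail() variant floated here might look like (hypothetical only, not from the patch; use_fallocate is an assumed local flag recording which method failed):

	ereport(ERROR,
			(errcode_for_file_access(),
			 errmsg("could not extend file \"%s\": %m",
					FilePathName(v->mdfd_vfd)),
			 errdetail("Failed while extending with %s.",
					   use_fallocate ? "posix_fallocate()" : "writes of zeros"),
			 errhint("Check free disk space.")));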
void
mdzeroextend(SMgrRelation reln, ForkNumber forknum,
			 BlockNumber blocknum, int nblocks, bool skipFsync)
{
	MdfdVec    *v;
	BlockNumber curblocknum = blocknum;
	int			remblocks = nblocks;

	Assert(nblocks > 0);

	/* This assert is too expensive to have on normally ... */
#ifdef CHECK_WRITE_VS_EXTEND
	Assert(blocknum >= mdnblocks(reln, forknum));
#endif

	/*
	 * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
	 * more --- we mustn't create a block whose number actually is
	 * InvalidBlockNumber or larger.
	 */
	if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("cannot extend file \"%s\" beyond %u blocks",
						relpath(reln->smgr_rlocator, forknum),
						InvalidBlockNumber)));

	while (remblocks > 0)
	{
		int			segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);

Hm - this shouldn't be an int - that's my fault, not yours...
From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa20fab5a2..6f50dbd212 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
 }

 void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)

This name is weird. Who is we?
The current backend. I.e. the function checks the current backend pins the
buffer exactly once, rather than *any* backend pinning it once.
I now see that BufferIsPinned() is named, IMO, misleadingly, more generally,
even though it also just applies to pins by the current backend.
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..798c5b93a8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c

+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+	uint32		buf_state;
+	Buffer		buffer = BufferDescriptorGetBuffer(buf_hdr);
+	int			bufid = -(buffer + 1);

You do
int buffid = -buffer - 1;
in UnpinLocalBuffer()
They should be consistent.

int bufid = -(buffer + 1);
I think this version is better:
int buffid = -buffer - 1;
Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
undefined behavior while the -buffer - 1 version doesn't.
You are right! Not sure what I was doing there...
Ah - turns out there's pre-existing code in localbuf.c that does it that way :(
See at least MarkLocalBufferDirty().
We really need to wrap this in something, rather than open coding it all over
bufmgr.c/localbuf.c. I really dislike this indexing :(.
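Something like the following inline helper (purely a hypothetical sketch, not part of the patchset) would centralize the conversion and keep the overflow-safe spelling in one place:

/*
 * Hypothetical helper: map a local Buffer (always negative) to its index in
 * LocalRefCount / the local buffer arrays.  Spelled -buffer - 1 rather than
 * -(buffer + 1), since the latter overflows (undefined behavior) when
 * buffer == INT_MAX, as noted above.
 */
static inline int
LocalBufferIndex(Buffer buffer)
{
	Assert(BufferIsLocal(buffer));
	return -buffer - 1;
}

Callers such as PinLocalBuffer(), UnpinLocalBuffer() and MarkLocalBufferDirty() could then share it instead of open coding the arithmetic.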
From a0228218e2ac299aac754eeb5b2be7ddfc56918d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:26:34 -0800
Subject: [PATCH v5 07/15] bufmgr: Acquire and clean victim buffer separately

Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations,
needing to hold the extension lock while acquiring a victim buffer is
painful. By separating out the victim buffer acquisition, future commits will
be able to change relation extensions to scale better.

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3d0683593f..ea423ae484 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,

 	/*
 	 * Buffer contents are currently invalid. Try to obtain the right to
 	 * start I/O. If StartBufferIO returns false, then someone else managed
 	 * to read it before we did, so there's nothing left for BufferAlloc() to
 	 * do.
 	 */
-	if (StartBufferIO(buf, true))
+	if (StartBufferIO(victim_buf_hdr, true))
 		*foundPtr = false;
 	else
 		*foundPtr = true;

I know it was already like this, but since you edited the line already,
can we just make this this now?

*foundPtr = !StartBufferIO(victim_buf_hdr, true);
Hm, I do think it's easier to review if largely unchanged code is just moved
around, rather also rewritten. So I'm hesitant to do that here.
@@ -1595,6 +1413,237 @@ retry:
StrategyFreeBuffer(buf);
 }

+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+	/*
+	 * Clear out the buffer's tag and flags and usagecount. This is not
+	 * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+	 * doing anything with the buffer. But currently it's beneficial as the
+	 * pre-check for several linear scans of shared buffers just checks the
+	 * tag.

I don't really understand the above comment -- mainly the last sentence.
To start with, it's s/checks/check/
"linear scans" is a reference to functions like DropRelationBuffers(), which
iterate over all buffers, and just check the tag for a match. If we leave the
tag around, it'll still work, as InvalidateBuffer() etc will figure out that
the buffer is invalid. But of course that's slower than just skipping the
buffer "early on".
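For illustration, the scan pattern being referred to looks roughly like this (a simplified sketch of a DropRelationBuffers()-style loop; rlocator stands for the relation being dropped and the fork/block filtering of the real code is omitted):

	for (int i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);
		uint32		buf_state;

		/* unlocked pre-check: just compare the tag */
		if (!BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
			continue;

		/* possible match: recheck under the buffer header spinlock */
		buf_state = LockBufHdr(bufHdr);
		if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rlocator))
			InvalidateBuffer(bufHdr);	/* releases the spinlock */
		else
			UnlockBufHdr(bufHdr, buf_state);
	}

Clearing the tag when invalidating a victim buffer lets the cheap, unlocked pre-check skip such buffers without ever taking the header spinlock.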
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+	BufferDesc *buf_hdr;
+	Buffer		buf;
+	uint32		buf_state;
+	bool		from_ring;
+
+	/*
+	 * Ensure, while the spinlock's not yet held, that there's a free refcount
+	 * entry.
+	 */
+	ReservePrivateRefCountEntry();
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+	/* we return here if a prospective victim buffer gets used concurrently */
+again:

Why use goto instead of a loop here (again is the goto label)?
I find it way more readable this way. I'd use a loop if it were the common
case to loop, but it's the rare case, and for that I find the goto more
readable.
@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool
clear_dirty, uint32 set_flag_bits)
{
 	uint32		buf_state;

I noticed that the comment above TerminateBufferIO() says
* TerminateBufferIO: release a buffer we were doing I/O on
* (Assumptions)
 * My process is executing IO for the buffer

Can we still say this is an assumption? What about when it is being
cleaned up after being called from AbortBufferIO()
That hasn't really changed - it was already called by AbortBufferIO().
I think it's still correct, too. We must have marked the IO as being in
progress to get there.
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45..fccc59b39d 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData

 	/* We have built-in support for remembering: */
 	ResourceArray bufferarr;	/* owned buffers */
+	ResourceArray bufferioarr;	/* in-progress buffer IO */
 	ResourceArray catrefarr;	/* catcache references */
 	ResourceArray catlistrefarr;	/* catcache-list pins */
 	ResourceArray relrefarr;	/* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)

Maybe worth mentioning in-progress buffer IO in resowner README? I know
it doesn't claim to be exhaustive, so, up to you.
Hm. Given the few types of resources mentioned in the README, I don't think
it's worth doing so.
Also, I realize that existing code in this file has the extraneous
parentheses, but maybe it isn't worth staying consistent with that?
as in: &(owner->bufferioarr)
I personally don't find it worth being consistent with that, but if you /
others think it is, I'd be ok with adapting to that.
From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c95b87bca..4e07a5bc48 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c

+/*
+ * Extend relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources the relation may end up being extended by a
+ * smaller number of pages (unless an error is thrown, always by at least one
+ * page). *extended_by is updated to the number of pages the relation has been
+ * extended to.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first extend_by array elements will point to a pinned
+ * buffer.
+ *
+ * If EB_LOCK_FIRST is part of flags, the first returned buffer is
+ * locked. This is useful for callers that want a buffer that is guaranteed to
+ * be empty.

This should document what the returned BlockNumber is.
Ok.
Also, instead of having extend_by and extended_by, how about just having
one which is set by the caller to the desired number to extend by and
then overwritten in this function to the value it successfully extended
by.
I had it that way at first - but I think it turned out to be more confusing.
It would be nice if the function returned the number it extended by
instead of the BlockNumber.
It's not actually free to get the block number from a buffer (it causes more
sharing of the BufferDesc cacheline, which then makes modifications of the
cacheline more expensive). We should work on removing all those
BufferGetBlockNumber(). So I don't want to introduce a new function that
requires using BufferGetBlockNumber().
So I don't think this would be an improvement.
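To make the calling convention concrete, a hedged usage sketch follows; the argument order and the way the ExtendBufferedWhat argument is built are assumptions based on the ExtendBufferedRelShared() signature quoted later in this mail, not necessarily the final API:

	Buffer		buffers[16];
	uint32		extended_by = 0;
	BlockNumber first_block;

	/*
	 * Ask for up to 16 new, pinned, zeroed pages; EB_LOCK_FIRST additionally
	 * locks the first returned buffer.  The relation may be extended by
	 * fewer pages; extended_by reports how many were added and the return
	 * value is the block number of the first newly added page.
	 */
	first_block = ExtendBufferedRelBy((ExtendBufferedWhat) {.rel = relation},
									  MAIN_FORKNUM, NULL /* strategy */ ,
									  EB_LOCK_FIRST,
									  lengthof(buffers),
									  buffers, &extended_by);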
+ Assert((eb.rel != NULL) ^ (eb.smgr != NULL));
Can we turn these into !=
Assert((eb.rel != NULL) != (eb.smgr != NULL));
since it is easier to understand.
Done.
+ * Extend the relation so it is at least extend_to blocks large, read buffer
Use of "read buffer" here is confusing. We only read the block if, after
we try extending the relation, someone else already did so and we have
to read the block they extended in, right?
That's one case, yes. I think there's also some unfortunate other case that
I'd like to get rid of. See my archeology at
/messages/by-id/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
+	uint32		num_pages = lengthof(buffers);
+	BlockNumber first_block;
+
+	if ((uint64) current_size + num_pages > extend_to)
+		num_pages = extend_to - current_size;
+
+	first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
+										  num_pages, extend_to,
+										  buffers, &extended_by);
+
+	current_size = first_block + extended_by;
+	Assert(current_size <= extend_to);
+	Assert(num_pages != 0 || current_size >= extend_to);
+
+	for (int i = 0; i < extended_by; i++)
+	{
+		if (first_block + i != extend_to - 1)

Is there a way we could avoid pinning these other buffers to begin with
(e.g. passing a parameter to ExtendBufferedRelCommon())
We can't avoid pinning them. We could make ExtendBufferedRelCommon() release
them though - but I'm not sure that'd be an improvement. I actually had a
flag for that temporarily, but
+	if (buffer == InvalidBuffer)
+	{
+		bool		hit;
+
+		Assert(extended_by == 0);
+		buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+								   fork, extend_to - 1, mode, strategy,
+								   &hit);
+	}
+
+	return buffer;
+}

Do we use compound literals? Here, this could be:
buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
fork, extend_to - 1, mode, strategy,
                               &(bool) {0});

To eliminate the extraneous hit variable.
We do use compound literals in a few places. However, I don't think it's a
good idea to pass a pointer to a temporary. At least I need to look up the
lifetime rules for those every time. And this isn't a huge win, so I wouldn't
go for it here.
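For reference, the lifetime rule being alluded to: a compound literal has automatic storage associated with the enclosing block, so a pointer to it stays valid for the rest of that block (C99 6.5.2.5). A tiny standalone illustration (not PostgreSQL code):

#include <stdbool.h>
#include <stdio.h>

static void
set_flag(bool *flag)
{
	*flag = true;
}

int
main(void)
{
	/*
	 * &(bool) {0} points at an unnamed bool with automatic storage; it is
	 * valid until the end of the enclosing block, so passing it down a call
	 * chain is fine.  It would only dangle if the pointer escaped the block.
	 */
	bool	   *p = &(bool) {0};

	set_flag(p);
	printf("%d\n", *p);			/* prints 1 */
	return 0;
}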
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +991,36 @@ ReadBuffer_common(SMgrRelation smgr, char
relpersistence, ForkNumber forkNum,
bool found;
IOContext io_context;
IOObject io_object;
- bool isExtend;
 	bool		isLocalBuf = SmgrIsTemp(smgr);

 	*hit = false;
+	/*
+	 * Backward compatibility path, most code should use
+	 * ExtendRelationBuffered() instead, as acquiring the extension lock
+	 * inside ExtendRelationBuffered() scales a lot better.

Think these are old function names in the comment
Indeed.
+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+						ForkNumber fork,
+						BufferAccessStrategy strategy,
+						uint32 flags,
+						uint32 extend_by,
+						BlockNumber extend_upto,
+						Buffer *buffers,
+						uint32 *extended_by)
+{
+	BlockNumber first_block;
+	IOContext	io_context = IOContextForStrategy(strategy);
+
+	LimitAdditionalPins(&extend_by);
+
+	/*
+	 * Acquire victim buffers for extension without holding extension lock.
+	 * Writing out victim buffers is the most expensive part of extending the
+	 * relation, particularly when doing so requires WAL flushes. Zeroing out
+	 * the buffers is also quite expensive, so do that before holding the
+	 * extension lock as well.
+	 *
+	 * These pages are pinned by us and not valid. While we hold the pin they
+	 * can't be acquired as victim buffers by another backend.
+	 */
+	for (uint32 i = 0; i < extend_by; i++)
+	{
+		Block		buf_block;
+
+		buffers[i] = GetVictimBuffer(strategy, io_context);
+		buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+		/* new buffers are zero-filled */
+		MemSet((char *) buf_block, 0, BLCKSZ);
+	}
+
+	/*
+	 * Lock relation against concurrent extensions, unless requested not to.
+	 *
+	 * We use the same extension lock for all forks. That's unnecessarily
+	 * restrictive, but currently extensions for forks don't happen often
+	 * enough to make it worth locking more granularly.
+	 *
+	 * Note that another backend might have extended the relation by the time
+	 * we get the lock.
+	 */
+	if (!(flags & EB_SKIP_EXTENSION_LOCK))
+	{
+		LockRelationForExtension(eb.rel, ExclusiveLock);
+		eb.smgr = RelationGetSmgr(eb.rel);
+	}
+
+	/*
+	 * If requested, invalidate size cache, so that smgrnblocks asks the
+	 * kernel.
+	 */
+	if (flags & EB_CLEAR_SIZE_CACHE)
+		eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;

I don't see this in master, is it new?
Not really - it's just elsewhere. See vm_extend() and fsm_extend(). I could
move this part back into "Convert a few places to ExtendBufferedRelTo", but I
don't think that'd be better.
+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+	/*
+	 * Don't acquire extension lock. This is safe only if the relation isn't
+	 * shared, an access exclusive lock is held or if this is the startup
+	 * process.
+	 */
+	EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+	/* Is this extension part of recovery? */
+	EB_PERFORMING_RECOVERY = (1 << 1),
+
+	/*
+	 * Should the fork be created if it does not currently exist? This likely
+	 * only ever makes sense for relation forks.
+	 */
+	EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+	/* Should the first (possibly only) return buffer be returned locked? */
+	EB_LOCK_FIRST = (1 << 3),
+
+	/* Should the smgr size cache be cleared? */
+	EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+	/* internal flags follow */

I don't understand what this comment means ("internal flags follow")
Hm - just that the flags defined subsequently are for internal use, not for
callers to specify.
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+	size_t		page_avail;
+	int			npages = 0;
+
+	page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+	npages++;

can this not just be this:
size_t page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
int npages = 1;
Yes.
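For context, the rest of the function (elided in the quote) just walks the remaining tuples and counts page boundaries, roughly along these lines (a sketch reconstructed from the quoted fragment, not the verbatim patch):

static int
heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples,
						Size saveFreeSpace)
{
	size_t		page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
	int			npages = 1;

	for (int i = done; i < ntuples; i++)
	{
		size_t		tup_sz = sizeof(ItemIdData) + MAXALIGN(heaptuples[i]->t_len);

		if (page_avail < tup_sz)
		{
			/* tuple doesn't fit on the current page, start a new one */
			npages++;
			page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
		}
		page_avail -= tup_sz;
	}

	return npages;
}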
From 5d2be27caf8f4ee8f26841b2aa1674c90bd51754 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 14:14:11 -0700
Subject: [PATCH v5 12/15] hio: Use ExtendBufferedRelBy()
---
src/backend/access/heap/hio.c | 285 +++++++++++++++++-----------------
 1 file changed, 146 insertions(+), 139 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e7..48cfcff975 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -354,6 +270,9 @@ RelationGetBufferForTuple(Relation relation, Size len,

so in RelationGetBufferForTuple() up above where your changes start,
there is this code:

/*
* We first try to put the tuple on the same page we last inserted a tuple
* on, as cached in the BulkInsertState or relcache entry. If that
* doesn't work, we ask the Free Space Map to locate a suitable page.
* Since the FSM's info might be out of date, we have to be prepared to
* loop around and retry multiple times. (To insure this isn't an infinite
* loop, we must update the FSM with the correct amount of free space on
* each page that proves not to be suitable.) If the FSM has no record of
* a page with enough free space, we give up and extend the relation.
*
* When use_fsm is false, we either put the tuple onto the existing target
* page or extend the relation.
*/
if (bistate && bistate->current_buf != InvalidBuffer)
{
targetBlock = BufferGetBlockNumber(bistate->current_buf);
}
else
		targetBlock = RelationGetTargetBlock(relation);

	if (targetBlock == InvalidBlockNumber && use_fsm)
{
/*
* We have no cached target page, so ask the FSM for an initial
* target.
*/
targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
	}

And, I was thinking how, ReadBufferBI() only has one caller now
(RelationGetBufferForTuple()) and, this caller basically already has
checked for the case in the inside of ReadBufferBI() (the code I pasted
above)

	/* If we have the desired block already pinned, re-pin and return it */
if (bistate->current_buf != InvalidBuffer)
{
if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
{
/*
* Currently the LOCK variants are only used for extending
* relation, which should never reach this branch.
*/
Assert(mode != RBM_ZERO_AND_LOCK &&
				   mode != RBM_ZERO_AND_CLEANUP_LOCK);

			IncrBufferRefCount(bistate->current_buf);
return bistate->current_buf;
}
		/* ... else drop the old buffer */

So, I was thinking maybe there is some way to inline the logic for
ReadBufferBI(), because I think it would feel more streamlined to me.
I don't really see how - I'd welcome suggestions?
@@ -558,18 +477,46 @@ loop:
ReleaseBuffer(buffer);
 	}

Oh, and I forget which commit introduced BulkInsertState->next_free and
last_free, but I remember thinking that it didn't seem to fit with the
other parts of that commit.
I'll move it into the one using ExtendBufferedRelBy().
-		/* Without FSM, always fall out of the loop and extend */
-		if (!use_fsm)
-			break;
+		if (bistate
+			&& bistate->next_free != InvalidBlockNumber
+			&& bistate->next_free <= bistate->last_free)
+		{
+			/*
+			 * We bulk extended the relation before, and there are still some
+			 * unused pages from that extension, so we don't need to look in
+			 * the FSM for a new page. But do record the free space from the
+			 * last page, somebody might insert narrower tuples later.
+			 */

Why couldn't we have found out that we bulk-extended before and get the
block from there up above the while loop?
I'm not quite sure I follow - above the loop there might still have been space
on the prior page? We also need the ability to loop if the space has been used
since.
I guess there's an argument for also checking above the loop, but I don't
think that'd currently ever be reachable.
+			{
+				bistate->next_free = InvalidBlockNumber;
+				bistate->last_free = InvalidBlockNumber;
+			}
+			else
+				bistate->next_free++;
+		}
+		else if (!use_fsm)
+		{
+			/* Without FSM, always fall out of the loop and extend */
+			break;
+		}

It would be nice to have a comment explaining why this is in its own
else if instead of breaking earlier (i.e. !use_fsm is still a valid case
in the if branch above it)
I'm not quite following. Breaking where earlier?
Note that that branch is old code, it's just that a new way of getting a page
that potentially has free space is preceding it.
we can get rid of needLock and waitcount variables like this
+#define MAX_BUFFERS 64
+	Buffer		victim_buffers[MAX_BUFFERS];
+	BlockNumber firstBlock = InvalidBlockNumber;
+	BlockNumber firstBlockFSM = InvalidBlockNumber;
+	BlockNumber curBlock;
+	uint32		extend_by_pages;
+	uint32		no_fsm_pages;
+	uint32		waitcount;
+
+	extend_by_pages = num_pages;
+
+	/*
+	 * Multiply the number of pages to extend by the number of waiters. Do
+	 * this even if we're not using the FSM, as it does relieve
+	 * contention. Pages will be found via bistate->next_free.
+	 */
+	if (needLock)
+		waitcount = RelationExtensionLockWaiterCount(relation);
+	else
+		waitcount = 0;
+	extend_by_pages += extend_by_pages * waitcount;

if (!RELATION_IS_LOCAL(relation))
extend_by_pages += extend_by_pages *
RelationExtensionLockWaiterCount(relation);
I guess I find it useful to be able to quickly add logging messages for stuff
like this. I don't think local variables are as bad as you make them out to be
:)
Thanks for the review!
Andres
Hi,
On 2023-03-27 15:32:47 +0900, Kyotaro Horiguchi wrote:
At Sun, 26 Mar 2023 12:26:59 -0700, Andres Freund <andres@anarazel.de> wrote in
Hi,
Attached is v5. Lots of comment polishing, a bit of renaming. I extracted the
relation extension related code in hio.c back into its own function.

While reviewing the hio.c code, I did realize that too much stuff is done
while holding the buffer lock. See also the pre-existing issue
/messages/by-id/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de

0001, 0002 look fine to me.
0003 adds the new function FileFallocate, but we already have
AllocateFile. Although fd.c contains functions with varying word
orders, it could be confusing that closely named functions have
different naming conventions.
The syscall is named fallocate, I don't think we'd gain anything by inventing
a different name for it? Given that there's a number of File$syscall
operations, I think it's clear enough that it just fits into that. Unless you
have a better proposal?
+	/*
+	 * Return in cases of a "real" failure, if fallocate is not supported,
+	 * fall through to the FileZero() backed implementation.
+	 */
+	if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+		return returnCode;

I'm not entirely sure, but man 2 fallocate tells that ENOSYS also can
be returned. Some googling indicates that ENOSYS might need the same
treatment as EOPNOTSUPP. However, I'm not clear on why the
posix_fallocate man page doesn't mention the former.
posix_fallocate() and its errors are specified by posix, I guess. I think
glibc etc will map ENOSYS to EOPNOTSUPP.
I really dislike this bit from the posix_fallocate manpage:
EINVAL offset was less than 0, or len was less than or equal to 0, or the underlying filesystem does not support the operation.
Why oh why would you add the "or .." portion into EINVAL, when there's also
EOPNOTSUPP?
+ (returnCode != EINVAL && returnCode != EINVAL))
:)

 FileGetRawDesc(File file)
 {
 	Assert(FileIsValid(file));
+
+	if (FileAccess(file) < 0)
+		return -1;
+

The function's comment is provided below.
* The returned file descriptor will be valid until the file is closed, but
* there are a lot of things that can make that happen. So the caller should
* be careful not to do much of anything else before it finishes using the
 * returned file descriptor.

So, the responsibility to make sure the file is valid seems to lie
with the callers, although I'm not sure since there aren't any
users of the function in the tree.
Except, as I think you realized as well, external callers *can't* call
FileAccess(), it's static.
I'm unclear as to why FileSize omits the case lruLessRecently != file.
Not quite following - why would FileSize() deal with lruLessRecently itself?
Or do you mean why FileSize() uses FileIsNotOpen() itself, rather than relying
on FileAccess() doing that internally?
When examining similar functions, such as FileGetRawFlags and
FileGetRawMode, I'm puzzled to find that neither FileAccess() nor
BasicOpenFilePerm() sets the struct members referred to by the
functions.
Those aren't involved with the LRU mechanism, IIRC. Note that BasicOpenFilePerm()
returns an actual fd, not a File. So you can't call FileGetRawMode() on it. As
BasicOpenFilePerm() says:
* This is exported for use by places that really want a plain kernel FD,
* but need to be proof against running out of FDs. ...
I don't think FileAccess() needs to set those struct members, that's already
been done in PathNameOpenFilePerm().
This makes me question the usefulness of these functions including
FileGetRawDesc().
It's quite weird that we have FileGetRawDesc(), but don't allow it to be used in
a safe way...
Regardless, since the
patchset doesn't use FileGetRawDesc(), I don't believe the fix is
necessary in this patch set.
Yea. It was used in an earlier version, but not anymore.
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
I'm not sure it is appropriate to assume InvalidBlockNumber equals
MaxBlockNumber + 1 in this context.
Hm. This is just the multi-block equivalent of what mdextend() already
does. It's not pretty, indeed. I'm not sure there's really a better thing to
do for mdzeroextend(), given the mdextend() precedent? mdzeroextend() (just as
mdextend()) will be called with blockNum == InvalidBlockNumber, if you try to
extend past the size limit.
+		 * However, we don't use FileAllocate() for small extensions, as it
+		 * defeats delayed allocation on some filesystems. Not clear where
+		 * that decision should be made though? For now just use a cutoff of
+		 * 8, anything between 4 and 8 worked OK in some local testing.

The choice is quite similar to the one FileFallocate() makes. However, I'm
not sure FileFallocate() itself should be doing this.
I'm not following - there's no such choice in FileFallocate()? Do you mean
that FileFallocate() also falls back to FileZero()? I don't think those are
comparable.
We don't want the code outside of fd.c to have to implement a fallback for
platforms that don't have fallocate (or something similar), that's why
FileFallocate() falls back to FileZero().
Here we care about not calling FileFallocate() in too small increments, when
the relation might extend further. If we somehow knew in mdzeroextend() that
the file won't be extended further, it'd be a good idea to call
FileFallocate(), even if just for a single block - it'd prevent the kernel
from wasting memory for delayed allocation. But unfortunately we don't know
if it's the final size, hence the heuristic.
Does that make sense?
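In code form, the heuristic being defended is roughly the following (a condensed sketch of the segment loop in mdzeroextend(); the variable names v, seekpos and numblocks are taken from the fragments quoted earlier in the thread, and the single combined error message is a simplification):

		int			ret;

		if (numblocks > 8)
		{
			/* big extension: reserve space without dirtying kernel page cache */
			ret = FileFallocate(v->mdfd_vfd,
								seekpos, (off_t) BLCKSZ * numblocks,
								WAIT_EVENT_DATA_FILE_EXTEND);
		}
		else
		{
			/*
			 * small extension: write zeroes instead, so we don't defeat the
			 * filesystem's delayed allocation for a file that will likely
			 * keep growing
			 */
			ret = FileZero(v->mdfd_vfd,
						   seekpos, (off_t) BLCKSZ * numblocks,
						   WAIT_EVENT_DATA_FILE_EXTEND);
		}

		if (ret != 0)
			ereport(ERROR,
					(errcode_for_file_access(),
					 errmsg("could not extend file \"%s\": %m",
							FilePathName(v->mdfd_vfd)),
					 errhint("Check free disk space.")));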
Thanks!
Andres
On Tue, Mar 28, 2023 at 11:47 PM Andres Freund <andres@anarazel.de> wrote:
On 2023-03-26 17:42:45 -0400, Melanie Plageman wrote:
+	{
+		/* if errno is unset, assume problem is no disk space */
+		if (errno == 0)
+			errno = ENOSPC;
+		return -1;
+	}

+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+	int			returnCode;
+
+	Assert(FileIsValid(file));
+
+	returnCode = FileAccess(file);
+	if (returnCode < 0)
+		return returnCode;
+
+	pgstat_report_wait_start(wait_event_info);
+	returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+	pgstat_report_wait_end();
+
+	if (returnCode == 0)
+		return 0;
+
+	/* for compatibility with %m printing etc */
+	errno = returnCode;
+
+	/*
+	 * Return in cases of a "real" failure, if fallocate is not supported,
+	 * fall through to the FileZero() backed implementation.
+	 */
+	if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+		return returnCode;

I'm pretty sure you can just delete the below if statement

+	if (returnCode == 0 ||
+		(returnCode != EINVAL && returnCode != EINVAL))
+		return returnCode;

Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
FileFallocate() succeeded or failed (rather than not being supported)?
Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0,
you already would have returned and if returnCode wasn't EINVAL, you
also already would have returned.
Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains
two identical operands.
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+			 BlockNumber blocknum, int nblocks, bool skipFsync)

So, I think there are a few too many local variables in here, and it
actually makes it more confusing.
Assuming you would like to keep the input parameters blocknum and
nblocks unmodified for debugging/other reasons, here is a suggested
refactor of this function

I'm mostly adopting this.
Also, I think you can combine the two error cases (I don't know if the
user cares what you were trying to extend the file with).

Hm. I do find it a somewhat useful distinction for figuring out problems - we
haven't used posix_fallocate for files so far, it seems plausible we'd hit
some portability issues. We could make it an errdetail(), I guess?
I think that would be clearer.
From ad7cd10a6c340d7f7d0adf26d5e39224dfd8439d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v5 05/15] bufmgr: Add Pin/UnpinLocalBuffer()

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa20fab5a2..6f50dbd212 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
 }

 void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)

This name is weird. Who is we?
The current backend. I.e. the function checks the current backend pins the
buffer exactly once, rather than *any* backend pinning it once.

I now see that BufferIsPinned() is named, IMO, misleadingly, more generally,
even though it also just applies to pins by the current backend.
Maybe there is a way to use "self" instead of a pronoun? But, if you
feel quite strongly about a pronoun, I think "we" implies more than one
backend, so "I" would be better.
@@ -1595,6 +1413,237 @@ retry:
StrategyFreeBuffer(buf);
 }

+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+	/*
+	 * Clear out the buffer's tag and flags and usagecount. This is not
+	 * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+	 * doing anything with the buffer. But currently it's beneficial as the
+	 * pre-check for several linear scans of shared buffers just checks the
+	 * tag.

I don't really understand the above comment -- mainly the last sentence.
To start with, it's s/checks/check/
"linear scans" is a reference to functions like DropRelationBuffers(), which
iterate over all buffers, and just check the tag for a match. If we leave the
tag around, it'll still work, as InvalidateBuffer() etc will figure out that
the buffer is invalid. But of course that's slower than just skipping the
buffer "early on".
Ah. I see the updated comment on your branch and find it to be more clear.
@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool
clear_dirty, uint32 set_flag_bits)
{
 	uint32		buf_state;

I noticed that the comment above TerminateBufferIO() says
* TerminateBufferIO: release a buffer we were doing I/O on
* (Assumptions)
 * My process is executing IO for the buffer

Can we still say this is an assumption? What about when it is being
cleaned up after being called from AbortBufferIO()

That hasn't really changed - it was already called by AbortBufferIO().
I think it's still correct, too. We must have marked the IO as being in
progress to get there.
Oh, no I meant the "my process is executing IO for the buffer" --
couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one
who set it on all the victim buffers)?
Also, I realize that existing code in this file has the extraneous
parentheses, but maybe it isn't worth staying consistent with that?
as in: &(owner->bufferioarr)

I personally don't find it worth being consistent with that, but if you /
others think it is, I'd be ok with adapting to that.
Right, so I'm saying you should remove the extraneous parentheses in the
code you added.
From f26d1fa7e528d04436402aa8f94dc2442999dde3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v5 09/15] bufmgr: Move relation extension handling into
 ExtendBufferedRel{By,To,}

+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+	/*
+	 * Don't acquire extension lock. This is safe only if the relation isn't
+	 * shared, an access exclusive lock is held or if this is the startup
+	 * process.
+	 */
+	EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+	/* Is this extension part of recovery? */
+	EB_PERFORMING_RECOVERY = (1 << 1),
+
+	/*
+	 * Should the fork be created if it does not currently exist? This likely
+	 * only ever makes sense for relation forks.
+	 */
+	EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+	/* Should the first (possibly only) return buffer be returned locked? */
+	EB_LOCK_FIRST = (1 << 3),
+
+	/* Should the smgr size cache be cleared? */
+	EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+	/* internal flags follow */

I don't understand what this comment means ("internal flags follow")
Hm - just that the flags defined subsequently are for internal use, not for
callers to specify.
If EB_LOCK_TARGET is the only one of these, it might be more clear, for
now, to just say "an internal flag" or "for internal use" above
EB_LOCK_TARGET, since it is the only one.
-		/* Without FSM, always fall out of the loop and extend */
-		if (!use_fsm)
-			break;
+		if (bistate
+			&& bistate->next_free != InvalidBlockNumber
+			&& bistate->next_free <= bistate->last_free)
+		{
+			/*
+			 * We bulk extended the relation before, and there are still some
+			 * unused pages from that extension, so we don't need to look in
+			 * the FSM for a new page. But do record the free space from the
+			 * last page, somebody might insert narrower tuples later.
+			 */

Why couldn't we have found out that we bulk-extended before and get the
block from there up above the while loop?

I'm not quite sure I follow - above the loop there might still have been space
on the prior page? We also need the ability to loop if the space has been used
since.

I guess there's an argument for also checking above the loop, but I don't
think that'd currently ever be reachable.
My idea was that directly below this code in RelationGetBufferForTuple():
if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
targetBlock = RelationGetTargetBlock(relation);
We could check bistate->next_free instead of checking the freespace map first.
But, perhaps we couldn't hit this because we would already have set
current_buf if we had a next_free?
- Melanie
Hi,
On 2023-03-29 20:51:04 -0400, Melanie Plageman wrote:
+	if (returnCode == 0)
+		return 0;
+
+	/* for compatibility with %m printing etc */
+	errno = returnCode;
+
+	/*
+	 * Return in cases of a "real" failure, if fallocate is not supported,
+	 * fall through to the FileZero() backed implementation.
+	 */
+	if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+		return returnCode;

I'm pretty sure you can just delete the below if statement

+	if (returnCode == 0 ||
+		(returnCode != EINVAL && returnCode != EINVAL))
+		return returnCode;

Hm. I don't see how - wouldn't that lead us to call FileZero(), even if
FileFallocate() succeeded or failed (rather than not being supported)?

Uh...I'm confused...maybe my eyes aren't working. If returnCode was 0,
you already would have returned and if returnCode wasn't EINVAL, you
also already would have returned.
Not to mention (returnCode != EINVAL && returnCode != EINVAL) contains
two identical operands.
I'm afraid it was not your eyes that weren't working...
 void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)

This name is weird. Who is we?
The current backend. I.e. the function checks the current backend pins the
buffer exactly once, rather than *any* backend pinning it once.

I now see that BufferIsPinned() is named, IMO, misleadingly, more generally,
even though it also just applies to pins by the current backend.

Maybe there is a way to use "self" instead of a pronoun? But, if you
feel quite strongly about a pronoun, I think "we" implies more than one
backend, so "I" would be better.
I have no strong feelings around this in any form :)
@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool
clear_dirty, uint32 set_flag_bits)
{
 	uint32		buf_state;

I noticed that the comment above TerminateBufferIO() says
* TerminateBufferIO: release a buffer we were doing I/O on
* (Assumptions)
 * My process is executing IO for the buffer

Can we still say this is an assumption? What about when it is being
cleaned up after being called from AbortBufferIO()

That hasn't really changed - it was already called by AbortBufferIO().
I think it's still correct, too. We must have marked the IO as being in
progress to get there.

Oh, no I meant the "my process is executing IO for the buffer" --
couldn't another backend clear IO_IN_PROGRESS (i.e. not the original one
who set it on all the victim buffers)?
No. Or at least not yet ;) - with AIO we will... Only the IO issuing backend
currently is allowed to reset IO_IN_PROGRESS.
+	/* Should the smgr size cache be cleared? */
+	EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+	/* internal flags follow */

I don't understand what this comment means ("internal flags follow")
Hm - just that the flags defined subsequently are for internal use, not for
callers to specify.

If EB_LOCK_TARGET is the only one of these, it might be more clear, for
now, to just say "an internal flag" or "for internal use" above
EB_LOCK_TARGET, since it is the only one.
I am quite certain it won't be the only one...
-		/* Without FSM, always fall out of the loop and extend */
-		if (!use_fsm)
-			break;
+		if (bistate
+			&& bistate->next_free != InvalidBlockNumber
+			&& bistate->next_free <= bistate->last_free)
+		{
+			/*
+			 * We bulk extended the relation before, and there are still some
+			 * unused pages from that extension, so we don't need to look in
+			 * the FSM for a new page. But do record the free space from the
+			 * last page, somebody might insert narrower tuples later.
+			 */

Why couldn't we have found out that we bulk-extended before and get the
block from there up above the while loop?

I'm not quite sure I follow - above the loop there might still have been space
on the prior page? We also need the ability to loop if the space has been used
since.

I guess there's an argument for also checking above the loop, but I don't
think that'd currently ever be reachable.

My idea was that directly below this code in RelationGetBufferForTuple():
if (bistate && bistate->current_buf != InvalidBuffer)
targetBlock = BufferGetBlockNumber(bistate->current_buf);
else
	targetBlock = RelationGetTargetBlock(relation);

We could check bistate->next_free instead of checking the freespace map first.
But, perhaps we couldn't hit this because we would already have set
current_buf if we had a next_free?
Correct. I think it might be worth doing a larger refactoring of that function
at some point not too far away...
It's definitely somewhat sad that we spend time locking the buffer, rechecking
pins, etc., for callers like heap_multi_insert() and heap_update() that
already know that the page is full. But that seems like independent enough
that I'd not tackle it now.
Greetings,
Andres Freund
Hi,
Attached is v6. Changes:
- Try to address Melanie and Horiguchi-san's review. I think there's one or
two further things that need to be done
- Avoided inserting newly extended pages into the FSM while holding a buffer
lock. If we need to do so, we now drop the buffer lock and recheck if there
still is space (very very likely). See also [1]. I use the infrastructure
introduced over there in this patchset.
- Lots of comment and commit message polishing. More needed, particularly for
the latter, but ...
- Added a patch to fix the pre-existing undefined behaviour in localbuf.c that
Melanie pointed out. Plan to commit that soon.
- Added a patch to fix some pre-existing DO_DB() format code issues. Plan to
commit that soon.
I did some benchmarking on "bufmgr: Acquire and clean victim buffer
separately" in isolation. For workloads that do a *lot* of reads, that proves
to be a substantial benefit on its own. For the, obviously unrealistically
extreme, workload of N backends doing
SELECT pg_prewarm('pgbench_accounts', 'buffer');
in a scale 100 database (with a 1281MB pgbench_accounts) and shared_buffers of
128MB, I see > 2x gains at 128, 512 clients. Of course realistic workloads
will have much smaller gains, but it's still good to see.
Looking at the patchset, I am mostly happy with the breakdown into individual
commits. However "bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}" is quite large. But I don't quite see how to break
it into smaller pieces without making things awkward (e.g. due to static
functions being unused, or temporarily duplicating the code doing relation
extensions).
Greetings,
Andres Freund
[1]: /messages/by-id/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
Attachments:
v6-0003-hio-Release-extension-lock-before-initializing-pa.patch (text/x-diff)
From 0347709751206a42f54615a601f098eedac63bf6 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:28:06 -0700
Subject: [PATCH v6 03/17] hio: Release extension lock before initializing page
/ pinning VM
PageInit() while holding the extension lock is unnecessary after 0d1fe9f74e3
started to use RBM_ZERO_AND_LOCK - nobody can look at the new page before we
release the page lock. PageInit() zeroes the page, which isn't that cheap, so
deferring it until after the extension lock is released seems like a good idea.
Doing visibilitymap_pin() while holding the extension lock, introduced in
7db0cd2145f2, looks like an accident. Due to the restrictions on
HEAP_INSERT_FROZEN it's unlikely to be a performance issue, but it still seems
better to move it out. We also are doing the visibilitymap_pin() while
holding the buffer lock, which will be fixed in a separate commit.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi
---
src/backend/access/heap/hio.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index e152807d2dc..7479212d4e0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -623,6 +623,13 @@ loop:
*/
buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ */
+ if (needLock)
+ UnlockRelationForExtension(relation, ExclusiveLock);
+
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
@@ -647,13 +654,6 @@ loop:
visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
}
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
/*
* Lock the other buffer. It's guaranteed to be of a lower page number
* than the new page. To conform with the deadlock prevent rules, we ought
--
2.38.0
v6-0004-hio-Don-t-pin-the-VM-while-holding-buffer-lock-wh.patch (text/x-diff)
From 06e1311739979333a3ddb29a908934abdb5d1568 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 28 Mar 2023 18:17:11 -0700
Subject: [PATCH v6 04/17] hio: Don't pin the VM while holding buffer lock
while extending
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
---
src/backend/access/heap/hio.c | 120 +++++++++++++++++++++++-----------
1 file changed, 81 insertions(+), 39 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7479212d4e0..8b3dfa0ccae 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -134,14 +134,17 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1
* must not be InvalidBuffer. If both buffers are specified, block1 must
* be less than block2.
+ *
+ * Returns whether buffer locks were temporarily released.
*/
-static void
+static bool
GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
bool need_to_pin_buffer1;
bool need_to_pin_buffer2;
+ bool released_locks = false;
Assert(BufferIsValid(buffer1));
Assert(buffer2 == InvalidBuffer || block1 <= block2);
@@ -155,9 +158,10 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
&& PageIsAllVisible(BufferGetPage(buffer2))
&& !visibilitymap_pin_ok(block2, *vmbuffer2);
if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
- return;
+ break;
/* We must unlock both buffers before doing any I/O. */
+ released_locks = true;
LockBuffer(buffer1, BUFFER_LOCK_UNLOCK);
if (buffer2 != InvalidBuffer && buffer2 != buffer1)
LockBuffer(buffer2, BUFFER_LOCK_UNLOCK);
@@ -183,6 +187,8 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
|| (need_to_pin_buffer1 && need_to_pin_buffer2))
break;
}
+
+ return released_locks;
}
/*
@@ -345,6 +351,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
BlockNumber targetBlock,
otherBlock;
bool needLock;
+ bool unlockedTargetBuffer;
len = MAXALIGN(len); /* be conservative */
@@ -630,6 +637,9 @@ loop:
if (needLock)
UnlockRelationForExtension(relation, ExclusiveLock);
+ unlockedTargetBuffer = false;
+ targetBlock = BufferGetBlockNumber(buffer);
+
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
@@ -639,75 +649,107 @@ loop:
if (!PageIsNew(page))
elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- BufferGetBlockNumber(buffer),
+ targetBlock,
RelationGetRelationName(relation));
PageInit(page, BufferGetPageSize(buffer), 0);
MarkBufferDirty(buffer);
/*
- * The page is empty, pin vmbuffer to set all_frozen bit.
+ * The page is empty, pin vmbuffer to set all_frozen bit. We don't want to
+ * do IO while the buffer is locked, so we unlock the page first if IO is
+ * needed (necessitating checks below).
*/
if (options & HEAP_INSERT_FROZEN)
{
- Assert(PageGetMaxOffsetNumber(BufferGetPage(buffer)) == 0);
- visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
+ Assert(PageGetMaxOffsetNumber(page) == 0);
+
+ if (!visibilitymap_pin_ok(targetBlock, *vmbuffer))
+ {
+ unlockedTargetBuffer = true;
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(relation, targetBlock, vmbuffer);
+ }
}
/*
- * Lock the other buffer. It's guaranteed to be of a lower page number
- * than the new page. To conform with the deadlock prevent rules, we ought
- * to lock otherBuffer first, but that would give other backends a chance
- * to put tuples on our page. To reduce the likelihood of that, attempt to
- * lock the other buffer conditionally, that's very likely to work.
- * Otherwise we need to lock buffers in the correct order, and retry if
- * the space has been used in the mean time.
+ * If we unlocked the target buffer above, it's unlikely, but possible,
+ * that another backend used space on this page.
+ *
+ * If we didn't, and otherBuffer is valid, we need to lock the other
+ * buffer. It's guaranteed to be of a lower page number than the new page.
+ * To conform with the deadlock prevent rules, we ought to lock
+ * otherBuffer first, but that would give other backends a chance to put
+ * tuples on our page. To reduce the likelihood of that, attempt to lock
+ * the other buffer conditionally, that's very likely to work. Otherwise
+ * we need to lock buffers in the correct order, and retry if the space
+ * has been used in the mean time.
*
* Alternatively, we could acquire the lock on otherBuffer before
* extending the relation, but that'd require holding the lock while
* performing IO, which seems worse than an unlikely retry.
*/
- if (otherBuffer != InvalidBuffer)
+ if (unlockedTargetBuffer)
+ {
+ if (otherBuffer != InvalidBuffer)
+ LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ }
+ else if (otherBuffer != InvalidBuffer)
{
Assert(otherBuffer != buffer);
- targetBlock = BufferGetBlockNumber(buffer);
Assert(targetBlock > otherBlock);
if (unlikely(!ConditionalLockBuffer(otherBuffer)))
{
+ unlockedTargetBuffer = true;
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
+ }
- /*
- * Because the buffers were unlocked for a while, it's possible,
- * although unlikely, that an all-visible flag became set or that
- * somebody used up the available space in the new page. We can use
- * GetVisibilityMapPins to deal with the first case. In the second
- * case, just retry from start.
- */
- GetVisibilityMapPins(relation, otherBuffer, buffer,
- otherBlock, targetBlock, vmbuffer_other,
- vmbuffer);
-
- /*
- * Note that we have to check the available space even if our
- * conditional lock succeeded, because GetVisibilityMapPins might've
- * transiently released lock on the target buffer to acquire a VM pin
- * for the otherBuffer.
- */
- if (len > PageGetHeapFreeSpace(page))
+ /*
+ * If one of the buffers was unlocked, it's possible, although unlikely,
+ * that an all-visible flag became set. We can use GetVisibilityMapPins
+ * to deal with that.
+ */
+ if (unlockedTargetBuffer || otherBuffer != InvalidBuffer)
+ {
+ if (otherBuffer != InvalidBuffer)
{
- LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+ if (GetVisibilityMapPins(relation, otherBuffer, buffer,
+ otherBlock, targetBlock, vmbuffer_other,
+ vmbuffer))
+ unlockedTargetBuffer = true;
+ }
+ else
+ {
+ if (GetVisibilityMapPins(relation, buffer, InvalidBuffer,
+ targetBlock, InvalidBlockNumber,
+ vmbuffer, InvalidBuffer))
+ unlockedTargetBuffer = true;
+ }
+ }
+
+ /*
+ * If the target buffer was temporarily unlocked since the relation
+ * extension, it's possible, although unlikely, that all the space on the
+ * page was already used. If so, we just retry from the start. If we
+ * didn't unlock, something has gone wrong if there's not enough space -
+ * the test at the top should have prevented reaching this case.
+ */
+ pageFreeSpace = PageGetHeapFreeSpace(page);
+ if (len > pageFreeSpace)
+ {
+ if (unlockedTargetBuffer)
+ {
+ if (otherBuffer != InvalidBuffer)
+ LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
UnlockReleaseBuffer(buffer);
goto loop;
}
- }
- else if (len > PageGetHeapFreeSpace(page))
- {
- /* We should not get here given the test at the top */
elog(PANIC, "tuple is too big: size %zu", len);
}
@@ -720,7 +762,7 @@ loop:
* current backend to make more insertions or not, which is probably a
* good bet most of the time. So for now, don't add it to FSM yet.
*/
- RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
+ RelationSetTargetBlock(relation, targetBlock);
return buffer;
}
--
2.38.0
v6-0005-bufmgr-Add-some-error-checking-around-pinning.patch (text/x-diff)
From 79d447105589a4f6b9bae2ff0eba20b8f5fb9ba7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v6 05/17] bufmgr: Add some error checking around pinning
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi
---
src/include/storage/bufmgr.h | 1 +
src/backend/storage/buffer/bufmgr.c | 42 ++++++++++++++++++++---------
2 files changed, 30 insertions(+), 13 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index b8a18b8081f..973547a8baf 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -134,6 +134,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
+extern void BufferCheckOneLocalPin(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 95212a39416..fa20fab5a29 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1755,6 +1755,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
bool result;
PrivateRefCountEntry *ref;
+ Assert(!BufferIsLocal(b));
+
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)
@@ -1900,6 +1902,8 @@ UnpinBuffer(BufferDesc *buf)
PrivateRefCountEntry *ref;
Buffer b = BufferDescriptorGetBuffer(buf);
+ Assert(!BufferIsLocal(b));
+
/* not moving as we're likely deleting it soon anyway */
ref = GetPrivateRefCountEntry(b, false);
Assert(ref != NULL);
@@ -4283,6 +4287,25 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+void
+BufferCheckOneLocalPin(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ /* There should be exactly one pin */
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ /* There should be exactly one local pin */
+ if (GetPrivateRefCount(buffer) != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ GetPrivateRefCount(buffer));
+ }
+}
+
/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
@@ -4310,20 +4333,11 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- if (BufferIsLocal(buffer))
- {
- /* There should be exactly one pin */
- if (LocalRefCount[-buffer - 1] != 1)
- elog(ERROR, "incorrect local pin count: %d",
- LocalRefCount[-buffer - 1]);
- /* Nobody else to wait for */
- return;
- }
+ BufferCheckOneLocalPin(buffer);
- /* There should be exactly one local pin */
- if (GetPrivateRefCount(buffer) != 1)
- elog(ERROR, "incorrect local pin count: %d",
- GetPrivateRefCount(buffer));
+ /* Nobody else to wait for */
+ if (BufferIsLocal(buffer))
+ return;
bufHdr = GetBufferDescriptor(buffer - 1);
@@ -4824,6 +4838,8 @@ LockBufHdr(BufferDesc *desc)
SpinDelayStatus delayStatus;
uint32 old_buf_state;
+ Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
init_local_spin_delay(&delayStatus);
while (true)
--
2.38.0
v6-0006-Add-smgrzeroextend-FileZero-FileFallocate.patch (text/x-diff)
From 2f7302a2232516c4ced6cf6ea43f9368c97278e8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 27 Feb 2023 17:36:37 -0800
Subject: [PATCH v6 06/17] Add smgrzeroextend(), FileZero(), FileFallocate()
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, use FileZero() instead, as
using posix_fallocate() for small numbers of blocks is inefficient for some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
---
src/include/storage/fd.h | 3 +
src/include/storage/md.h | 2 +
src/include/storage/smgr.h | 2 +
src/backend/storage/file/fd.c | 89 ++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 108 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 28 +++++++++
6 files changed, 232 insertions(+)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f85de97d083..daceafd4732 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -106,6 +106,9 @@ extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event
extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
+extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+
extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 8f32af9ef3d..941879ee6a8 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,6 +28,8 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 0935144f425..a9a179aabac 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 2ac365e97cc..9eabdbc589e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2206,6 +2206,94 @@ FileSync(File file, uint32 wait_event_info)
return returnCode;
}
+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+ int returnCode;
+ ssize_t written;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileZero: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
+ file, VfdCache[file].fileName,
+ (int64) offset, (int64) amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+ pgstat_report_wait_end();
+
+ if (written < 0)
+ return -1;
+ else if (written != amount)
+ {
+ /* if errno is unset, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ return -1;
+ }
+
+ return 0;
+}
+
+/*
+ * Try to reserve file space with posix_fallocate(). If posix_fallocate() is
+ * not implemented on the operating system or fails with EINVAL / EOPNOTSUPP,
+ * use FileZero() instead.
+ *
+ * Note that at least glibc implements posix_fallocate() in userspace if not
+ * implemented by the filesystem. That's not the case for all environments
+ * though.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+ int returnCode;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileFallocate: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
+ file, VfdCache[file].fileName,
+ (int64) offset, (int64) amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return -1;
+
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+ * Return in case of a "real" failure; if fallocate is not supported,
+ * fall through to the FileZero() backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return -1;
+#endif
+
+ return FileZero(file, offset, amount, wait_event_info);
+}
+
off_t
FileSize(File file)
{
@@ -2278,6 +2366,7 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
return VfdCache[file].fd;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1feb..51e1599dcf4 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -500,6 +500,114 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
+/*
+ * mdzeroextend() -- Add new zeroed out blocks to the specified relation.
+ *
+ * Similar to mdextend(), except the relation can be extended by multiple
+ * blocks at once and the added blocks will be filled with zeroes.
+ */
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+ MdfdVec *v;
+ BlockNumber curblocknum = blocknum;
+ int remblocks = nblocks;
+
+ Assert(nblocks > 0);
+
+ /* This assert is too expensive to have on normally ... */
+#ifdef CHECK_WRITE_VS_EXTEND
+ Assert(blocknum >= mdnblocks(reln, forknum));
+#endif
+
+ /*
+ * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
+ * more --- we mustn't create a block whose number actually is
+ * InvalidBlockNumber or larger.
+ */
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend file \"%s\" beyond %u blocks",
+ relpath(reln->smgr_rlocator, forknum),
+ InvalidBlockNumber)));
+
+ while (remblocks > 0)
+ {
+ BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+ off_t seekpos = (off_t) BLCKSZ * segstartblock;
+ int numblocks;
+
+ if (segstartblock + remblocks > RELSEG_SIZE)
+ numblocks = RELSEG_SIZE - segstartblock;
+ else
+ numblocks = remblocks;
+
+ v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+
+ Assert(segstartblock < RELSEG_SIZE);
+ Assert(segstartblock + numblocks <= RELSEG_SIZE);
+
+ /*
+ * If available and useful, use posix_fallocate() (via FileFallocate())
+ * to extend the relation. That's often more efficient than using
+ * write(), as it commonly won't cause the kernel to allocate page
+ * cache space for the extended pages.
+ *
+ * However, we don't use FileFallocate() for small extensions, as it
+ * defeats delayed allocation on some filesystems. Not clear where
+ * that decision should be made, though. For now just use a cutoff of
+ * 8, anything between 4 and 8 worked OK in some local testing.
+ */
+ if (numblocks > 8)
+ {
+ int ret;
+
+ ret = FileFallocate(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret != 0)
+ {
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+ }
+ else
+ {
+ int ret;
+
+ /*
+ * Even if we don't want to use fallocate, we can still extend a
+ * bit more efficiently than writing each 8kB block individually.
+ * pg_pwrite_zeros() (via FileZero()) uses
+ * pg_pwritev_with_retry() to avoid multiple writes or needing a
+ * zeroed buffer for the whole length of the extension.
+ */
+ ret = FileZero(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\": %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+
+ if (!skipFsync && !SmgrIsTemp(reln))
+ register_dirty_segment(reln, forknum, v);
+
+ Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+ remblocks -= numblocks;
+ curblocknum += numblocks;
+ }
+}
+
/*
* mdopenfork() -- Open one fork of the specified relation.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e54145..c37c246b77f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+ void (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
@@ -75,6 +77,7 @@ static const f_smgr smgrsw[] = {
.smgr_exists = mdexists,
.smgr_unlink = mdunlink,
.smgr_extend = mdextend,
+ .smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
.smgr_read = mdread,
.smgr_write = mdwrite,
@@ -507,6 +510,31 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
+/*
+ * smgrzeroextend() -- Add new zeroed out blocks to a file.
+ *
+ * Similar to smgrextend(), except the relation can be extended by
+ * multiple blocks at once and the added blocks will be filled with
+ * zeroes.
+ */
+void
+smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ int nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
+ nblocks, skipFsync);
+
+ /*
+ * Normally we expect this to increase the fork size by nblocks, but if
+ * the cached value isn't as expected, just invalidate it so the next call
+ * asks the kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+}
+
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
*
--
2.38.0
Attachment: v6-0007-bufmgr-Add-Pin-UnpinLocalBuffer.patch (text/x-diff)
From 2ba48f017f384c6fa8cd3225ed8ce958a5196f70 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 26 Oct 2022 12:05:07 -0700
Subject: [PATCH v6 07/17] bufmgr: Add Pin/UnpinLocalBuffer()
So far these were open-coded in quite a few places, without a good reason.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
---
src/include/storage/buf_internals.h | 2 +
src/include/storage/bufmgr.h | 2 +-
src/backend/storage/buffer/bufmgr.c | 36 +++-------------
src/backend/storage/buffer/localbuf.c | 62 +++++++++++++++++----------
4 files changed, 49 insertions(+), 53 deletions(-)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0b448147407..fa5c451b1a9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -415,6 +415,8 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
+extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
+extern void UnpinLocalBuffer(Buffer buffer);
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 973547a8baf..d30733c65a1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -134,7 +134,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
-extern void BufferCheckOneLocalPin(Buffer buffer);
+extern void BufferCheckWePinOnce(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fa20fab5a29..6f50dbd212e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -636,20 +636,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
/* Is it still valid and holding the right tag? */
if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
{
- /*
- * Bump buffer's ref and usage counts. This is equivalent of
- * PinBuffer for a shared buffer.
- */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner, recent_buffer);
+ PinLocalBuffer(bufHdr, true);
pgBufferUsage.local_blks_hit++;
@@ -1708,8 +1695,7 @@ ReleaseAndReadBuffer(Buffer buffer,
BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) &&
BufTagGetForkNum(&bufHdr->tag) == forkNum)
return buffer;
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[-buffer - 1]--;
+ UnpinLocalBuffer(buffer);
}
else
{
@@ -4012,15 +3998,9 @@ ReleaseBuffer(Buffer buffer)
elog(ERROR, "bad buffer ID: %d", buffer);
if (BufferIsLocal(buffer))
- {
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
-
- Assert(LocalRefCount[-buffer - 1] > 0);
- LocalRefCount[-buffer - 1]--;
- return;
- }
-
- UnpinBuffer(GetBufferDescriptor(buffer - 1));
+ UnpinLocalBuffer(buffer);
+ else
+ UnpinBuffer(GetBufferDescriptor(buffer - 1));
}
/*
@@ -4288,18 +4268,16 @@ ConditionalLockBuffer(Buffer buffer)
}
void
-BufferCheckOneLocalPin(Buffer buffer)
+BufferCheckWePinOnce(Buffer buffer)
{
if (BufferIsLocal(buffer))
{
- /* There should be exactly one pin */
if (LocalRefCount[-buffer - 1] != 1)
elog(ERROR, "incorrect local pin count: %d",
LocalRefCount[-buffer - 1]);
}
else
{
- /* There should be exactly one local pin */
if (GetPrivateRefCount(buffer) != 1)
elog(ERROR, "incorrect local pin count: %d",
GetPrivateRefCount(buffer));
@@ -4333,7 +4311,7 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- BufferCheckOneLocalPin(buffer);
+ BufferCheckWePinOnce(buffer);
/* Nobody else to wait for */
if (BufferIsLocal(buffer))
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 68b4817c67b..d7181fcf7fc 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -145,27 +145,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
#endif
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- /* this part is equivalent to PinBuffer for a shared buffer */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
- if (buf_state & BM_VALID)
- *foundPtr = true;
- else
- {
- /* Previous read attempt must have failed; try again */
- *foundPtr = false;
- }
+ *foundPtr = PinLocalBuffer(bufHdr, true);
return bufHdr;
}
@@ -202,9 +183,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
else
{
/* Found a usable buffer */
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
+ PinLocalBuffer(bufHdr, false);
break;
}
}
@@ -491,6 +470,43 @@ InitLocalBuffers(void)
NLocBuffer = nbufs;
}
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+ uint32 buf_state;
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ int bufid = -buffer - 1;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ if (LocalRefCount[bufid] == 0)
+ {
+ if (adjust_usagecount &&
+ BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
+ {
+ buf_state += BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+ }
+ LocalRefCount[bufid]++;
+ ResourceOwnerRememberBuffer(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf_hdr));
+
+ return buf_state & BM_VALID;
+}
+
+void
+UnpinLocalBuffer(Buffer buffer)
+{
+ int buffid = -buffer - 1;
+
+ Assert(BufferIsLocal(buffer));
+ Assert(LocalRefCount[buffid] > 0);
+
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
+ LocalRefCount[buffid]--;
+}
+
/*
* GUC check_hook for temp_buffers
*/
--
2.38.0
Attachment: v6-0001-bufmgr-Fix-undefined-behaviour-with-unrealistical.patch (text/x-diff)
From 2abcb6061fb951d5df1abf8565854c5cd6e7b790 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 29 Mar 2023 18:07:17 -0700
Subject: [PATCH v6 01/17] bufmgr: Fix undefined behaviour with,
unrealistically, large temp_buffers
Quoting Melanie:
> Since if buffer is INT_MAX, then the -(buffer + 1) version invokes
> undefined behavior while the -buffer - 1 version doesn't.
All other places were already using the correct version. I (Andres) copied
the code into more places in a patch. Melanie caught it in review, but to
prevent more people from copying the bad code, fix it, even if it is a
theoretical issue.
We really ought to wrap these accesses in a helper function...
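To illustrate the quoted reasoning with a tiny standalone snippet (purely about
the arithmetic, not part of the patch):

#include <limits.h>
#include <stdio.h>

int
main(void)
{
    int     buffer = INT_MAX;   /* hypothetical value at the limit */

    /*
     * -(buffer + 1) first computes INT_MAX + 1, which overflows a signed int
     * and is therefore undefined behaviour.  -buffer - 1 negates INT_MAX
     * (representable) and then subtracts 1, yielding INT_MIN without any
     * overflow.
     */
    int     bufid = -buffer - 1;

    printf("bufid = %d, INT_MIN = %d\n", bufid, INT_MIN);
    return 0;
}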
Reported-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/CAAKRu_aW2SX_LWtwHgfnqYpBrunMLfE9PD6-ioPpkh92XH0qpg@mail.gmail.com
---
src/backend/storage/buffer/localbuf.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663d..68b4817c67b 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -305,7 +305,7 @@ MarkLocalBufferDirty(Buffer buffer)
fprintf(stderr, "LB DIRTY %d\n", buffer);
#endif
- bufid = -(buffer + 1);
+ bufid = -buffer - 1;
Assert(LocalRefCount[bufid] > 0);
--
2.38.0
Attachment: v6-0002-Fix-format-code-in-fd.c-debugging-infrastructure.patch (text/x-diff)
From 8ee100bf1844625052dfdcf9201b52b4c201901a Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 29 Mar 2023 18:51:07 -0700
Subject: [PATCH v6 02/17] Fix format code in fd.c debugging infrastructure
These were not sufficiently adjusted in 2d4f1ba6cfc.
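For readers less familiar with the portability issue, a standalone illustration
of why the casts matter, using plain <inttypes.h> rather than PostgreSQL's
INT64_FORMAT macro and assuming a 64-bit off_t (not part of the patch):

#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>

int
main(void)
{
    off_t   offset = (off_t) 1 << 33;   /* doesn't fit in 32 bits */
    size_t  amount = 8192;

    /*
     * off_t has no dedicated printf length modifier, so cast to a fixed
     * 64-bit type; size_t has one (%zu).  Passing either to a plain %d, as
     * the old format strings effectively did, is undefined behaviour.
     */
    printf("offset %" PRId64 ", amount %zu\n", (int64_t) offset, amount);
    return 0;
}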
---
src/backend/storage/file/fd.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9fd8444ed4d..2ac365e97cc 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -1988,9 +1988,9 @@ FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event_info)
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FilePrefetch: %d (%s) " INT64_FORMAT " %d",
+ DO_DB(elog(LOG, "FilePrefetch: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
file, VfdCache[file].fileName,
- (int64) offset, amount));
+ (int64) offset, (int64) amount));
returnCode = FileAccess(file);
if (returnCode < 0)
@@ -2096,7 +2096,7 @@ FileWrite(File file, const void *buffer, size_t amount, off_t offset,
Assert(FileIsValid(file));
- DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %d %p",
+ DO_DB(elog(LOG, "FileWrite: %d (%s) " INT64_FORMAT " %zu %p",
file, VfdCache[file].fileName,
(int64) offset,
amount, buffer));
--
2.38.0
Attachment: v6-0008-bufmgr-Remove-buffer-write-dirty-tracepoints.patch (text/x-diff)
From c25a3e28fc658f076d31c009e1618f6bcd980299 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:21:57 -0800
Subject: [PATCH v6 08/17] bufmgr: Remove buffer-write-dirty tracepoints
The trace point was using the relfileno / fork / block for the to-be-read-in
buffer. Some upcoming work would make that more expensive to provide. We still
have buffer-flush-start/done, which can serve most tracing needs that
buffer-write-dirty could serve.
Discussion: https://postgr.es/m/f5164e7a-eef6-8972-75a3-8ac622ed0c6e@iki.fi
---
src/backend/storage/buffer/bufmgr.c | 10 ----------
doc/src/sgml/monitoring.sgml | 17 -----------------
2 files changed, 27 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f50dbd212e..3d0683593fd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1277,21 +1277,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
/* OK, do the I/O */
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
-
FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
LWLockRelease(BufferDescriptorGetContentLock(buf));
ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf->tag);
-
- TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber);
}
else
{
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c809ff1ba4a..b2ccd8d7fef 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7806,23 +7806,6 @@ FROM pg_stat_get_backend_idset() AS backendid;
it's typically not actually been written to disk yet.)
The arguments are the same as for <literal>buffer-flush-start</literal>.</entry>
</row>
- <row>
- <entry><literal>buffer-write-dirty-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
- <entry>Probe that fires when a server process begins to write a dirty
- buffer. (If this happens often, it implies that
- <xref linkend="guc-shared-buffers"/> is too
- small or the background writer control parameters need adjustment.)
- arg0 and arg1 contain the fork and block numbers of the page.
- arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
- identifying the relation.</entry>
- </row>
- <row>
- <entry><literal>buffer-write-dirty-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
- <entry>Probe that fires when a dirty-buffer write is complete.
- The arguments are the same as for <literal>buffer-write-dirty-start</literal>.</entry>
- </row>
<row>
<entry><literal>wal-buffer-write-dirty-start</literal></entry>
<entry><literal>()</literal></entry>
--
2.38.0
Attachment: v6-0009-bufmgr-Acquire-and-clean-victim-buffer-separately.patch (text/x-diff)
From 329215a4ccad75cb03910f24ab60cba203b11e5d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 17 Feb 2023 18:26:34 -0800
Subject: [PATCH v6 09/17] bufmgr: Acquire and clean victim buffer separately
Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for extending relations,
needing to hold the extension lock while acquiring a victim buffer is
painful. But it also creates a bottleneck for workloads that just involve
reads.
Now we instead first acquire a victim buffer and write it out, if
necessary. Then we remove that buffer from the old partition with just the old
partition's partition lock held and insert it into the new partition with just
that partition's lock held.
By separating out the victim buffer acquisition, future commits will be able
to change relation extensions to scale better.
On my workstation, a micro-benchmark exercising buffered reads strenuously and
under a lot of concurrency sees a >2x improvement.
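As a toy standalone model of just the locking protocol change (plain pthreads,
hypothetical names; it elides the pinning, cleaning and invalidation of the
victim buffer that makes this ordering safe, so it is an illustration only):

#include <pthread.h>
#include <stdio.h>

#define NUM_PARTITIONS 128

static pthread_mutex_t partition_locks[NUM_PARTITIONS];

static int
partition_for(unsigned hash)
{
    return hash % NUM_PARTITIONS;
}

static void
change_buffer_identity(unsigned old_hash, unsigned new_hash)
{
    int     old_part = partition_for(old_hash);
    int     new_part = partition_for(new_hash);

    /* step 1: the pinned, clean victim is removed from its old partition */
    pthread_mutex_lock(&partition_locks[old_part]);
    /* ... BufTableDelete() equivalent would go here ... */
    pthread_mutex_unlock(&partition_locks[old_part]);

    /* step 2: only now is the new partition locked, for the insert */
    pthread_mutex_lock(&partition_locks[new_part]);
    /* ... BufTableInsert() equivalent would go here ... */
    pthread_mutex_unlock(&partition_locks[new_part]);
}

int
main(void)
{
    for (int i = 0; i < NUM_PARTITIONS; i++)
        pthread_mutex_init(&partition_locks[i], NULL);

    change_buffer_identity(0xdeadbeefU, 0xcafebabeU);
    puts("re-tagged without ever holding two partition locks at once");
    return 0;
}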
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
---
src/backend/storage/buffer/bufmgr.c | 581 ++++++++++++++------------
src/backend/storage/buffer/localbuf.c | 115 ++---
2 files changed, 379 insertions(+), 317 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3d0683593fd..02281303440 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -473,6 +473,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext *io_context);
+static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -1128,18 +1129,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext *io_context)
{
- bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
- uint32 oldFlags;
- int buf_id;
- BufferDesc *buf;
- bool valid;
- uint32 buf_state;
+ int existing_buf_id;
+
+ Buffer victim_buffer;
+ BufferDesc *victim_buf_hdr;
+ uint32 victim_buf_state;
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -1150,15 +1147,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- if (buf_id >= 0)
+ existing_buf_id = BufTableLookup(&newTag, newHash);
+ if (existing_buf_id >= 0)
{
+ BufferDesc *buf;
+ bool valid;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
* buffer pool, and check to see if the correct data has been loaded
* into the buffer.
*/
- buf = GetBufferDescriptor(buf_id);
+ buf = GetBufferDescriptor(existing_buf_id);
valid = PinBuffer(buf, strategy);
@@ -1200,293 +1200,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*io_context = IOContextForStrategy(strategy);
- /* Loop here in case we have to try another victim buffer */
- for (;;)
+ /*
+ * Acquire a victim buffer. Somebody else might try to do the same, we
+ * don't hold any conflicting locks. If so we'll have to undo our work
+ * later.
+ */
+ victim_buffer = GetVictimBuffer(strategy, *io_context);
+ victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);
+
+ /*
+ * Try to make a hashtable entry for the buffer under its new tag. If
+ * somebody else inserted another buffer for the tag, we'll release the
+ * victim buffer we acquired and use the already inserted one.
+ */
+ LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+ existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
+ if (existing_buf_id >= 0)
{
- /*
- * Ensure, while the spinlock's not yet held, that there's a free
- * refcount entry.
- */
- ReservePrivateRefCountEntry();
+ BufferDesc *existing_buf_hdr;
+ bool valid;
/*
- * Select a victim buffer. The buffer is returned with its header
- * spinlock still held!
+ * Got a collision. Someone has already done what we were about to do.
+ * We'll just handle this as if it were found in the buffer pool in
+ * the first place. First, give up the buffer we were planning to
+ * use.
+ *
+ * We could do this after releasing the partition lock, but then we'd
+ * have to call ResourceOwnerEnlargeBuffers() &
+ * ReservePrivateRefCountEntry() before acquiring the lock, for the
+ * rare case of such a collision.
*/
- buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ UnpinBuffer(victim_buf_hdr);
- Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+ /* FIXME: Should we put the victim buffer onto the freelist? */
- /* Must copy buffer flags while we still hold the spinlock */
- oldFlags = buf_state & BUF_FLAG_MASK;
+ /* remaining code should match code at top of routine */
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
- /*
- * If the buffer was dirty, try to write it out. There is a race
- * condition here, in that someone might dirty it after we released it
- * above, or even while we are writing it out (since our share-lock
- * won't prevent hint-bit updates). We will recheck the dirty bit
- * after re-locking the buffer header.
- */
- if (oldFlags & BM_DIRTY)
- {
- /*
- * We need a share-lock on the buffer contents to write it out
- * (else we might write invalid data, eg because someone else is
- * compacting the page contents while we write). We must use a
- * conditional lock acquisition here to avoid deadlock. Even
- * though the buffer was not pinned (and therefore surely not
- * locked) when StrategyGetBuffer returned it, someone else could
- * have pinned and exclusive-locked it by the time we get here. If
- * we try to get the lock unconditionally, we'd block waiting for
- * them; if they later block waiting for us, deadlock ensues.
- * (This has been observed to happen when two backends are both
- * trying to split btree index pages, and the second one just
- * happens to be trying to split the page the first one got from
- * StrategyGetBuffer.)
- */
- if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
- LW_SHARED))
- {
- /*
- * If using a nondefault strategy, and writing the buffer
- * would require a WAL flush, let the strategy decide whether
- * to go ahead and write/reuse the buffer or to choose another
- * victim. We need lock to inspect the page LSN, so this
- * can't be done inside StrategyGetBuffer.
- */
- if (strategy != NULL)
- {
- XLogRecPtr lsn;
+ valid = PinBuffer(existing_buf_hdr, strategy);
- /* Read the LSN while holding buffer header lock */
- buf_state = LockBufHdr(buf);
- lsn = BufferGetLSN(buf);
- UnlockBufHdr(buf, buf_state);
-
- if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf, from_ring))
- {
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(BufferDescriptorGetContentLock(buf));
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /* OK, do the I/O */
- FlushBuffer(buf, NULL, IOOBJECT_RELATION, *io_context);
- LWLockRelease(BufferDescriptorGetContentLock(buf));
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext,
- &buf->tag);
- }
- else
- {
- /*
- * Someone else has locked the buffer, so give it up and loop
- * back to get another one.
- */
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /*
- * To change the association of a valid buffer, we'll need to have
- * exclusive lock on both the old and new mapping partitions.
- */
- if (oldFlags & BM_TAG_VALID)
- {
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
- oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* remember we have no old-partition lock or tag */
- oldPartitionLock = NULL;
- /* keep the compiler quiet about uninitialized variables */
- oldHash = 0;
- }
-
- /*
- * Try to make a hashtable entry for the buffer under its new tag.
- * This could fail because while we were writing someone else
- * allocated another buffer for the same block we want to read in.
- * Note that we have not yet removed the hashtable entry for the old
- * tag.
- */
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
- if (buf_id >= 0)
- {
- /*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
- */
- UnpinBuffer(buf);
-
- /* Can give up that buffer's mapping partition lock now */
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = GetBufferDescriptor(buf_id);
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
- *foundPtr = true;
-
- if (!valid)
- {
- /*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed. We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID. StartBufferIO does it all.
- */
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer
- * must have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
- }
-
- return buf;
- }
-
- /*
- * Need to lock the buffer header too in order to change its tag.
- */
- buf_state = LockBufHdr(buf);
-
- /*
- * Somebody could have pinned or re-dirtied the buffer while we were
- * doing the I/O and making the new hashtable entry. If so, we can't
- * recycle this buffer; we must undo everything we've done and start
- * over with a new victim buffer.
- */
- oldFlags = buf_state & BUF_FLAG_MASK;
- if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
- break;
-
- UnlockBufHdr(buf, buf_state);
- BufTableDelete(&newTag, newHash);
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
+ /* Can release the mapping lock as soon as we've pinned it */
LWLockRelease(newPartitionLock);
- UnpinBuffer(buf);
+
+ *foundPtr = true;
+
+ if (!valid)
+ {
+ /*
+ * We can only get here if (a) someone else is still reading in
+ * the page, or (b) a previous read attempt failed. We have to
+ * wait for any active read attempt to finish, and then set up our
+ * own read attempt if the page is still not BM_VALID.
+ * StartBufferIO does it all.
+ */
+ if (StartBufferIO(existing_buf_hdr, true))
+ {
+ /*
+ * If we get here, previous attempts to read the buffer must
+ * have failed ... but we shall bravely try again.
+ */
+ *foundPtr = false;
+ }
+ }
+
+ return existing_buf_hdr;
}
/*
- * Okay, it's finally safe to rename the buffer.
- *
- * Clearing BM_VALID here is necessary, clearing the dirtybits is just
- * paranoia. We also reset the usage_count since any recency of use of
- * the old content is no longer relevant. (The usage_count starts out at
- * 1 so that the buffer can survive one clock-sweep pass.)
- *
+ * Need to lock the buffer header too in order to change its tag.
+ */
+ victim_buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
+ Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+
+ victim_buf_hdr->tag = newTag;
+
+ /*
* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*/
- buf->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
- BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
- BUF_USAGECOUNT_MASK);
+ victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
- buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
- else
- buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ victim_buf_state |= BM_PERMANENT;
- UnlockBufHdr(buf, buf_state);
-
- if (oldPartitionLock != NULL)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
+ UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);
- if (oldFlags & BM_VALID)
- {
- /*
- * When a BufferAccessStrategy is in use, blocks evicted from shared
- * buffers are counted as IOOP_EVICT in the corresponding context
- * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
- * strategy in two cases: 1) while initially claiming buffers for the
- * strategy ring 2) to replace an existing strategy ring buffer
- * because it is pinned or in use and cannot be reused.
- *
- * Blocks evicted from buffers already in the strategy ring are
- * counted as IOOP_REUSE in the corresponding strategy context.
- *
- * At this point, we can accurately count evictions and reuses,
- * because we have successfully claimed the valid buffer. Previously,
- * we may have been forced to release the buffer due to concurrent
- * pinners or erroring out.
- */
- pgstat_count_io_op(IOOBJECT_RELATION, *io_context,
- from_ring ? IOOP_REUSE : IOOP_EVICT);
- }
-
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
* to read it before we did, so there's nothing left for BufferAlloc() to
* do.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(victim_buf_hdr, true))
*foundPtr = false;
else
*foundPtr = true;
- return buf;
+ return victim_buf_hdr;
}
/*
@@ -1595,6 +1413,237 @@ retry:
StrategyFreeBuffer(buf);
}
+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+ uint32 buf_state;
+ uint32 hash;
+ LWLock *partition_lock;
+ BufferTag tag;
+
+ Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+
+ /* have buffer pinned, so it's safe to read tag without lock */
+ tag = buf_hdr->tag;
+
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ /* lock the buffer header */
+ buf_state = LockBufHdr(buf_hdr);
+
+ /*
+ * We have the buffer pinned, so nobody else should have been able to
+ * unset this concurrently.
+ */
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BufferTagsEqual(&buf_hdr->tag, &tag));
+
+ /*
+ * If somebody else pinned the buffer since, or even worse, dirtied
+ * it, give up on this buffer: It's clearly in use.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) != 1 || (buf_state & BM_DIRTY))
+ {
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ UnlockBufHdr(buf_hdr, buf_state);
+ LWLockRelease(partition_lock);
+
+ return false;
+ }
+
+ /*
+ * Clear out the buffer's tag and flags and usagecount. This is not
+ * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+ * doing anything with the buffer. But currently it's beneficial as the
+ * pre-check for several linear scans of shared buffers just checks the
+ * tag.
+ */
+ ClearBufferTag(&buf_hdr->tag);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ /* finally delete buffer from the buffer mapping table */
+ BufTableDelete(&tag, hash);
+
+ LWLockRelease(partition_lock);
+
+ Assert(!(buf_state & (BM_DIRTY | BM_VALID | BM_TAG_VALID)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf_hdr->state)) > 0);
+
+ return true;
+}
+
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+ BufferDesc *buf_hdr;
+ Buffer buf;
+ uint32 buf_state;
+ bool from_ring;
+
+ /*
+ * Ensure, while the spinlock's not yet held, that there's a free refcount
+ * entry.
+ */
+ ReservePrivateRefCountEntry();
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* we return here if a prospective victim buffer gets used concurrently */
+again:
+
+ /*
+ * Select a victim buffer. The buffer is returned with its header
+ * spinlock still held!
+ */
+ buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ buf = BufferDescriptorGetBuffer(buf_hdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+ /* Pin the buffer and then release the buffer spinlock */
+ PinBuffer_Locked(buf_hdr);
+
+ /*
+ * We shouldn't have any other pins for this buffer.
+ */
+ BufferCheckWePinOnce(buf);
+
+ /*
+ * If the buffer was dirty, try to write it out. There is a race
+ * condition here, in that someone might dirty it after we released the
+ * buffer header lock above, or even while we are writing it out (since
+ * our share-lock won't prevent hint-bit updates). We will recheck the
+ * dirty bit after re-locking the buffer header.
+ */
+ if (buf_state & BM_DIRTY)
+ {
+ LWLock *content_lock;
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_VALID);
+
+ /*
+ * We need a share-lock on the buffer contents to write it out (else
+ * we might write invalid data, eg because someone else is compacting
+ * the page contents while we write). We must use a conditional lock
+ * acquisition here to avoid deadlock. Even though the buffer was not
+ * pinned (and therefore surely not locked) when StrategyGetBuffer
+ * returned it, someone else could have pinned and exclusive-locked it
+ * by the time we get here. If we try to get the lock unconditionally,
+ * we'd block waiting for them; if they later block waiting for us,
+ * deadlock ensues. (This has been observed to happen when two
+ * backends are both trying to split btree index pages, and the second
+ * one just happens to be trying to split the page the first one got
+ * from StrategyGetBuffer.)
+ */
+ content_lock = BufferDescriptorGetContentLock(buf_hdr);
+ if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /*
+ * Someone else has locked the buffer, so give it up and loop back
+ * to get another one.
+ */
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /*
+ * If using a nondefault strategy, and writing the buffer would
+ * require a WAL flush, let the strategy decide whether to go ahead
+ * and write/reuse the buffer or to choose another victim. We need
+ * lock to inspect the page LSN, so this can't be done inside
+ * StrategyGetBuffer.
+ */
+ if (strategy != NULL)
+ {
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ buf_state = LockBufHdr(buf_hdr);
+ lsn = BufferGetLSN(buf_hdr);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ if (XLogNeedsFlush(lsn)
+ && StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+ {
+ LWLockRelease(content_lock);
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+ }
+
+ /* OK, do the I/O */
+ FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+ LWLockRelease(content_lock);
+
+ ScheduleBufferTagForWriteback(&BackendWritebackContext,
+ &buf_hdr->tag);
+ }
+
+
+ if (buf_state & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+ from_ring ? IOOP_REUSE : IOOP_EVICT);
+ }
+
+ /*
+ * If the buffer has an entry in the buffer mapping table, delete it. This
+ * can fail because another backend could have pinned or dirtied the
+ * buffer.
+ */
+ if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+ {
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /* a final set of sanity checks */
+#ifdef USE_ASSERT_CHECKING
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+ Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+
+ BufferCheckWePinOnce(buf);
+#endif
+
+ return buf;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index d7181fcf7fc..715fd682e6f 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -45,13 +45,14 @@ BufferDesc *LocalBufferDescriptors = NULL;
Block *LocalBufferBlockPointers = NULL;
int32 *LocalRefCount = NULL;
-static int nextFreeLocalBuf = 0;
+static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
+static Buffer GetLocalVictimBuffer(void);
/*
@@ -113,10 +114,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
BufferDesc *bufHdr;
- int b;
- int trycounter;
+ Buffer victim_buffer;
+ int bufid;
bool found;
- uint32 buf_state;
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -138,23 +138,51 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (hresult)
{
- b = hresult->id;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufid = hresult->id;
+ bufHdr = GetLocalBufferDescriptor(bufid);
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
-#endif
*foundPtr = PinLocalBuffer(bufHdr, true);
- return bufHdr;
+ }
+ else
+ {
+ uint32 buf_state;
+
+ victim_buffer = GetLocalVictimBuffer();
+ bufid = -victim_buffer - 1;
+ bufHdr = GetLocalBufferDescriptor(bufid);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
+ if (found) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ hresult->id = bufid;
+
+ /*
+ * it's all ours now.
+ */
+ bufHdr->tag = newTag;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+ *foundPtr = false;
}
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum,
- -nextFreeLocalBuf - 1);
-#endif
+ return bufHdr;
+}
+
+static Buffer
+GetLocalVictimBuffer(void)
+{
+ int victim_bufid;
+ int trycounter;
+ uint32 buf_state;
+ BufferDesc *bufHdr;
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
* Need to get a new buffer. We use a clock sweep algorithm (essentially
@@ -163,14 +191,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
trycounter = NLocBuffer;
for (;;)
{
- b = nextFreeLocalBuf;
+ victim_bufid = nextFreeLocalBufId;
- if (++nextFreeLocalBuf >= NLocBuffer)
- nextFreeLocalBuf = 0;
+ if (++nextFreeLocalBufId >= NLocBuffer)
+ nextFreeLocalBufId = 0;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufHdr = GetLocalBufferDescriptor(victim_bufid);
- if (LocalRefCount[b] == 0)
+ if (LocalRefCount[victim_bufid] == 0)
{
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -193,6 +221,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
errmsg("no empty local buffer available")));
}
+ /*
+ * lazy memory allocation: allocate space on first use of a buffer.
+ */
+ if (LocalBufHdrGetBlock(bufHdr) == NULL)
+ {
+ /* Set pointer for use by BufferGetBlock() macro */
+ LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
+ }
+
/*
* this buffer is not referenced but it might still be dirty. if that's
* the case, write it out before reusing it!
@@ -223,48 +260,24 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
}
/*
- * lazy memory allocation: allocate space on first use of a buffer.
- */
- if (LocalBufHdrGetBlock(bufHdr) == NULL)
- {
- /* Set pointer for use by BufferGetBlock() macro */
- LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
- }
-
- /*
- * Update the hash table: remove old entry, if any, and make new one.
+ * Remove the victim buffer from the hashtable and mark as invalid.
*/
if (buf_state & BM_TAG_VALID)
{
+ LocalBufferLookupEnt *hresult;
+
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
if (!hresult) /* shouldn't happen */
elog(ERROR, "local buffer hash table corrupted");
/* mark buffer invalid just in case hash insert fails */
ClearBufferTag(&bufHdr->tag);
- buf_state &= ~(BM_VALID | BM_TAG_VALID);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
}
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
- if (found) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- hresult->id = b;
-
- /*
- * it's all ours now.
- */
- bufHdr->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
- buf_state |= BM_TAG_VALID;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-
- *foundPtr = false;
- return bufHdr;
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
@@ -431,7 +444,7 @@ InitLocalBuffers(void)
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
- nextFreeLocalBuf = 0;
+ nextFreeLocalBufId = 0;
/* initialize fields that need to start off nonzero */
for (i = 0; i < nbufs; i++)
--
2.38.0
Attachment: v6-0010-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch (text/x-diff)
From 23aa6ec4bbff9b57bfb3c11e7e1ea2c755aa314b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v6 10/17] bufmgr: Support multiple in-progress IOs by using
resowner
A future patch will add support for extending relations by multiple blocks at
once. To be concurrency safe, the buffers for those blocks need to be marked
as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from
an IO error for a single buffer. This commit extends that infrastructure to
multiple buffers by using the resource owner infrastructure.
This commit increases the size of the ResourceOwnerData struct, which appears
to have a just about measurable overhead in very extreme workloads. Medium
term we are planning to substantially shrink the size of
ResourceOwnerData. Short term the increase is small enough to not worry about
it for now.
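To sketch the bookkeeping pattern outside of bufmgr.c, here is a toy
standalone model: the names (ResOwner, remember_io() etc.) are hypothetical
and this is not the actual resowner API, just the "remember every in-progress
IO so error cleanup can abort all of them" idea:

#include <stdio.h>
#include <stdlib.h>

typedef struct ResOwner
{
    int    *inprogress;     /* buffers with IO currently in progress */
    int     ninprogress;
    int     capacity;
} ResOwner;

static void
remember_io(ResOwner *owner, int buffer)
{
    if (owner->ninprogress == owner->capacity)
    {
        owner->capacity = owner->capacity ? owner->capacity * 2 : 8;
        owner->inprogress = realloc(owner->inprogress,
                                    owner->capacity * sizeof(int));
        if (owner->inprogress == NULL)
            abort();
    }
    owner->inprogress[owner->ninprogress++] = buffer;
}

static void
forget_io(ResOwner *owner, int buffer)
{
    for (int i = 0; i < owner->ninprogress; i++)
    {
        if (owner->inprogress[i] == buffer)
        {
            owner->inprogress[i] = owner->inprogress[--owner->ninprogress];
            return;
        }
    }
}

/* on error, every remembered IO gets aborted, not just a single static one */
static void
abort_all_io(ResOwner *owner)
{
    while (owner->ninprogress > 0)
    {
        int     buffer = owner->inprogress[owner->ninprogress - 1];

        printf("aborting IO on buffer %d\n", buffer);
        forget_io(owner, buffer);   /* AbortBufferIO(buffer) equivalent */
    }
}

int
main(void)
{
    ResOwner    owner = {0};

    remember_io(&owner, 10);    /* e.g. several buffers of one bulk extension */
    remember_io(&owner, 11);
    remember_io(&owner, 12);
    abort_all_io(&owner);
    free(owner.inprogress);
    return 0;
}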
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de
---
src/include/storage/bufmgr.h | 2 +-
src/include/utils/resowner_private.h | 5 ++
src/backend/access/transam/xact.c | 4 +-
src/backend/postmaster/autovacuum.c | 1 -
src/backend/postmaster/bgwriter.c | 1 -
src/backend/postmaster/checkpointer.c | 1 -
src/backend/postmaster/walwriter.c | 1 -
src/backend/storage/buffer/bufmgr.c | 86 ++++++++++++---------------
src/backend/utils/resowner/resowner.c | 60 +++++++++++++++++++
9 files changed, 105 insertions(+), 56 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index d30733c65a1..9ba25521ea1 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -181,7 +181,7 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern void AbortBufferIO(void);
+extern void AbortBufferIO(Buffer buffer);
extern void BufmgrCommit(void);
extern bool BgBufferSync(struct WritebackContext *wb_context);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 1b1f3181b54..ae58438ec76 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -30,6 +30,11 @@ extern void ResourceOwnerEnlargeBuffers(ResourceOwner owner);
extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
+/* support for IO-in-progress management */
+extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
+extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
+extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
+
/* support for local lock management */
extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b8764012607..a0f53e9936a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2725,8 +2725,7 @@ AbortTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- /* Clean up buffer I/O and buffer context locks, too */
- AbortBufferIO();
+ /* Clean up buffer context locks, too */
UnlockBuffers();
/* Reset WAL record construction state */
@@ -5086,7 +5085,6 @@ AbortSubTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- AbortBufferIO();
UnlockBuffers();
/* Reset WAL record construction state */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 585d28148ca..e9ba0dc17cd 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -526,7 +526,6 @@ AutoVacLauncherMain(int argc, char *argv[])
*/
LWLockReleaseAll();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
/* this is probably dead code, but let's be safe: */
if (AuxProcessResourceOwner)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 9bb47da404d..caad642ec93 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -167,7 +167,6 @@ BackgroundWriterMain(void)
*/
LWLockReleaseAll();
ConditionVariableCancelSleep();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index aaad5c52281..ace9893d957 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -271,7 +271,6 @@ CheckpointerMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 513e580c513..65e84be39b9 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -163,7 +163,6 @@ WalWriterMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 02281303440..41153f99bc2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,10 +159,6 @@ int checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
int bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
-/* local state for StartBufferIO and related functions */
-static BufferDesc *InProgressBuf = NULL;
-static bool IsForInput;
-
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
@@ -2712,7 +2708,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();
@@ -4658,7 +4653,7 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
uint32 buf_state;
- Assert(!InProgressBuf);
+ ResourceOwnerEnlargeBufferIOs(CurrentResourceOwner);
for (;;)
{
@@ -4682,8 +4677,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
buf_state |= BM_IO_IN_PROGRESS;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = buf;
- IsForInput = forInput;
+ ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
return true;
}
@@ -4709,8 +4704,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
{
uint32 buf_state;
- Assert(buf == InProgressBuf);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
@@ -4722,13 +4715,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = NULL;
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
- * AbortBufferIO: Clean up any active buffer I/O after an error.
+ * AbortBufferIO: Clean up active buffer I/O after an error.
*
* All LWLocks we might have held have been released,
* but we haven't yet released buffer pins, so the buffer is still pinned.
@@ -4737,46 +4731,42 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
* possible the error condition wasn't related to the I/O.
*/
void
-AbortBufferIO(void)
+AbortBufferIO(Buffer buf)
{
- BufferDesc *buf = InProgressBuf;
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
- if (buf)
+ buf_state = LockBufHdr(buf_hdr);
+ Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+
+ if (!(buf_state & BM_VALID))
{
- uint32 buf_state;
-
- buf_state = LockBufHdr(buf);
- Assert(buf_state & BM_IO_IN_PROGRESS);
- if (IsForInput)
- {
- Assert(!(buf_state & BM_DIRTY));
-
- /* We'd better not think buffer is valid yet */
- Assert(!(buf_state & BM_VALID));
- UnlockBufHdr(buf, buf_state);
- }
- else
- {
- Assert(buf_state & BM_DIRTY);
- UnlockBufHdr(buf, buf_state);
- /* Issue notice if this is not the first failure... */
- if (buf_state & BM_IO_ERROR)
- {
- /* Buffer is pinned, so we can read tag without spinlock */
- char *path;
-
- path = relpathperm(BufTagGetRelFileLocator(&buf->tag),
- BufTagGetForkNum(&buf->tag));
- ereport(WARNING,
- (errcode(ERRCODE_IO_ERROR),
- errmsg("could not write block %u of %s",
- buf->tag.blockNum, path),
- errdetail("Multiple failures --- write error might be permanent.")));
- pfree(path);
- }
- }
- TerminateBufferIO(buf, false, BM_IO_ERROR);
+ Assert(!(buf_state & BM_DIRTY));
+ UnlockBufHdr(buf_hdr, buf_state);
}
+ else
+ {
+ Assert(buf_state & BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ /* Issue notice if this is not the first failure... */
+ if (buf_state & BM_IO_ERROR)
+ {
+ /* Buffer is pinned, so we can read tag without spinlock */
+ char *path;
+
+ path = relpathperm(BufTagGetRelFileLocator(&buf_hdr->tag),
+ BufTagGetForkNum(&buf_hdr->tag));
+ ereport(WARNING,
+ (errcode(ERRCODE_IO_ERROR),
+ errmsg("could not write block %u of %s",
+ buf_hdr->tag.blockNum, path),
+ errdetail("Multiple failures --- write error might be permanent.")));
+ pfree(path);
+ }
+ }
+
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR);
}
/*
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45d..fccc59b39dd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
/* We have built-in support for remembering: */
ResourceArray bufferarr; /* owned buffers */
+ ResourceArray bufferioarr; /* in-progress buffer IO */
ResourceArray catrefarr; /* catcache references */
ResourceArray catlistrefarr; /* catcache-list pins */
ResourceArray relrefarr; /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
ResourceArrayInit(&(owner->bufferarr), BufferGetDatum(InvalidBuffer));
+ ResourceArrayInit(&(owner->bufferioarr), BufferGetDatum(InvalidBuffer));
ResourceArrayInit(&(owner->catrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->catlistrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->relrefarr), PointerGetDatum(NULL));
@@ -517,6 +519,24 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (phase == RESOURCE_RELEASE_BEFORE_LOCKS)
{
+ /*
+ * Abort failed buffer IO. AbortBufferIO()->TerminateBufferIO() calls
+ * ResourceOwnerForgetBufferIO(), so we just have to iterate till
+ * there are none.
+ *
+ * Needs to be before we release buffer pins.
+ *
+ * During a commit, there shouldn't be any in-progress IO.
+ */
+ while (ResourceArrayGetAny(&(owner->bufferioarr), &foundres))
+ {
+ Buffer res = DatumGetBuffer(foundres);
+
+ if (isCommit)
+ elog(PANIC, "lost track of buffer IO on buffer %u", res);
+ AbortBufferIO(res);
+ }
+
/*
* Release buffer pins. Note that ReleaseBuffer will remove the
* buffer entry from our array, so we just have to iterate till there
@@ -746,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And it better not own any resources, either */
Assert(owner->bufferarr.nitems == 0);
+ Assert(owner->bufferioarr.nitems == 0);
Assert(owner->catrefarr.nitems == 0);
Assert(owner->catlistrefarr.nitems == 0);
Assert(owner->relrefarr.nitems == 0);
@@ -775,6 +796,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And free the object. */
ResourceArrayFree(&(owner->bufferarr));
+ ResourceArrayFree(&(owner->bufferioarr));
ResourceArrayFree(&(owner->catrefarr));
ResourceArrayFree(&(owner->catlistrefarr));
ResourceArrayFree(&(owner->relrefarr));
@@ -976,6 +998,44 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
buffer, owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * buffer IO array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeBufferIOs(ResourceOwner owner)
+{
+ /* We used to allow pinning buffers without a resowner, but no more */
+ Assert(owner != NULL);
+ ResourceArrayEnlarge(&(owner->bufferioarr));
+}
+
+/*
+ * Remember that a buffer IO is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeBufferIOs()
+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+
+/*
+ * Forget that a buffer IO is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ if (!ResourceArrayRemove(&(owner->bufferioarr), BufferGetDatum(buffer)))
+ elog(PANIC, "buffer IO %d is not owned by resource owner %s",
+ buffer, owner->name);
+}
+
/*
* Remember that a Local Lock is owned by a ResourceOwner
*
--
2.38.0
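
To illustrate the buffer-IO-tracking change in the patch above: StartBufferIO()
now remembers the in-progress IO in CurrentResourceOwner and TerminateBufferIO()
forgets it again, so error paths no longer need an explicit AbortBufferIO()
call. A minimal sketch of the resulting calling pattern (not part of the patch;
buf_hdr, smgr, forkNum, blockNum, bufBlock assumed to be set up as in
ReadBuffer_common()):

    if (StartBufferIO(buf_hdr, true))
    {
        /* the in-progress IO is now tracked by CurrentResourceOwner */
        smgrread(smgr, forkNum, blockNum, bufBlock);    /* may ERROR out */

        /* on success this also calls ResourceOwnerForgetBufferIO() */
        TerminateBufferIO(buf_hdr, false, BM_VALID);
    }
    /*
     * If smgrread() throws, ResourceOwnerReleaseInternal() finds the
     * remembered buffer in bufferioarr and calls AbortBufferIO(buffer),
     * instead of relying on the old InProgressBuf global.
     */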
v6-0011-bufmgr-Move-relation-extension-handling-into-Exte.patch
From 628d5702b5cd0bcabe05adcc7b8ae907a2cf20a3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:24:19 -0800
Subject: [PATCH v6 11/17] bufmgr: Move relation extension handling into
ExtendBufferedRel{By,To,}
---
src/include/pgstat.h | 1 +
src/include/storage/buf_internals.h | 7 +
src/include/storage/bufmgr.h | 65 ++
src/backend/storage/buffer/bufmgr.c | 797 +++++++++++++++++++------
src/backend/storage/buffer/localbuf.c | 156 ++++-
src/backend/utils/activity/pgstat_io.c | 8 +-
src/backend/utils/probes.d | 6 +-
doc/src/sgml/monitoring.sgml | 43 +-
8 files changed, 899 insertions(+), 184 deletions(-)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index c2bae8358a2..4aac1115664 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -500,6 +500,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
BackendType bktype);
extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
extern PgStat_IO *pgstat_fetch_stat_io(void);
extern const char *pgstat_get_io_context_name(IOContext io_context);
extern const char *pgstat_get_io_object_name(IOObject io_object);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index fa5c451b1a9..feca19f5620 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -422,6 +422,13 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr, IOContext *io_context);
+extern BlockNumber ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 9ba25521ea1..da0abf43150 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -60,6 +60,53 @@ typedef struct PrefetchBufferResult
bool initiated_io; /* If true, a miss resulting in async I/O */
} PrefetchBufferResult;
+/*
+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+ /*
+ * Don't acquire extension lock. This is safe only if the relation isn't
+ * shared, an access exclusive lock is held or if this is the startup
+ * process.
+ */
+ EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+ /* Is this extension part of recovery? */
+ EB_PERFORMING_RECOVERY = (1 << 1),
+
+ /*
+ * Should the fork be created if it does not currently exist? This likely
+ * only ever makes sense for relation forks.
+ */
+ EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+ /* Should the first (possibly only) return buffer be returned locked? */
+ EB_LOCK_FIRST = (1 << 3),
+
+ /* Should the smgr size cache be cleared? */
+ EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+ /* internal flags follow */
+ EB_LOCK_TARGET = (1 << 5),
+} ExtendBufferedFlags;
+
+/*
+ * To identify the relation - either relation or smgr + relpersistence has to
+ * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
+ * to use the same function for both crash recovery and normal operation.
+ */
+typedef struct ExtendBufferedWhat
+{
+ Relation rel;
+ struct SMgrRelationData *smgr;
+ char relpersistence;
+} ExtendBufferedWhat;
+
+#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
+#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
@@ -138,6 +185,24 @@ extern void BufferCheckWePinOnce(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
+extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags);
+extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by);
+extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode);
+
extern void InitBufferPoolAccess(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 41153f99bc2..b7e466ef85f 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
@@ -450,6 +451,22 @@ static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
+static BlockNumber ExtendBufferedRelCommon(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
+static BlockNumber ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(BufferDesc *buf);
static void UnpinBuffer(BufferDesc *buf);
@@ -785,6 +802,179 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
mode, strategy, &hit);
}
+/*
+ * Convenience wrapper around ExtendBufferedRelBy() extending by one block.
+ */
+Buffer
+ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags)
+{
+ Buffer buf;
+ uint32 extend_by = 1;
+
+ ExtendBufferedRelBy(eb, forkNum, strategy, flags, extend_by,
+ &buf, &extend_by);
+
+ return buf;
+}
+
+/*
+ * Extend relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources the relation may end up being extended by a
+ * smaller number of pages (unless an error is thrown, always by at least one
+ * page). *extended_by is updated to the number of pages the relation has been
+ * extended by.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first *extended_by array elements will point to a pinned
+ * buffer.
+ *
+ * If EB_LOCK_FIRST is part of flags, the first returned buffer is
+ * locked. This is useful for callers that want a buffer that is guaranteed to
+ * be empty.
+ */
+BlockNumber
+ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ Assert((eb.rel != NULL) != (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_by > 0);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ return ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ extend_by, InvalidBlockNumber,
+ buffers, extended_by);
+}
+
+/*
+ * Extend the relation so it is at least extend_to blocks large, read buffer
+ * (extend_to - 1).
+ *
+ * This is useful for callers that want to write a specific page, regardless
+ * of the current size of the relation (e.g. useful for visibilitymap and for
+ * crash recovery).
+ */
+Buffer
+ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode)
+{
+ BlockNumber current_size;
+ uint32 extended_by = 0;
+ Buffer buffer = InvalidBuffer;
+ Buffer buffers[64];
+
+ Assert((eb.rel != NULL) != (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_to != InvalidBlockNumber && extend_to > 0);
+ Assert(mode == RBM_NORMAL || mode == RBM_ZERO_ON_ERROR ||
+ mode == RBM_ZERO_AND_LOCK);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ /*
+ * If desired, create the file if it doesn't exist. If
+ * smgr_cached_nblocks[fork] is positive then it must exist, no need for
+ * an smgrexists call.
+ */
+ if ((flags & EB_CREATE_FORK_IF_NEEDED) &&
+ (eb.smgr->smgr_cached_nblocks[fork] == 0 ||
+ eb.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) &&
+ !smgrexists(eb.smgr, fork))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* could have been closed while waiting for lock */
+ eb.smgr = RelationGetSmgr(eb.rel);
+
+ /* recheck, fork might have been created concurrently */
+ if (!smgrexists(eb.smgr, fork))
+ smgrcreate(eb.smgr, fork, flags & EB_PERFORMING_RECOVERY);
+
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ /*
+ * Estimate how many pages we'll need to extend by. This avoids acquiring
+ * unnecessarily many victim buffers.
+ */
+ current_size = smgrnblocks(eb.smgr, fork);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_TARGET;
+
+ while (current_size < extend_to)
+ {
+ uint32 num_pages = lengthof(buffers);
+ BlockNumber first_block;
+
+ if ((uint64) current_size + num_pages > extend_to)
+ num_pages = extend_to - current_size;
+
+ first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ num_pages, extend_to,
+ buffers, &extended_by);
+
+ current_size = first_block + extended_by;
+ Assert(current_size <= extend_to);
+ Assert(num_pages != 0 || current_size >= extend_to);
+
+ for (int i = 0; i < extended_by; i++)
+ {
+ if (first_block + i != extend_to - 1)
+ ReleaseBuffer(buffers[i]);
+ else
+ buffer = buffers[i];
+ }
+ }
+
+ /*
+ * It's possible that another backend concurrently extended the
+ * relation. In that case read the buffer.
+ *
+ * XXX: Should we control this via a flag?
+ */
+ if (buffer == InvalidBuffer)
+ {
+ bool hit;
+
+ Assert(extended_by == 0);
+ buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+ fork, extend_to - 1, mode, strategy,
+ &hit);
+ }
+
+ return buffer;
+}
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +991,38 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool found;
IOContext io_context;
IOObject io_object;
- bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
*hit = false;
+ /*
+ * Backward compatibility path, most code should use ExtendBufferedRel()
+ * instead, as acquiring the extension lock inside ExtendBufferedRel()
+ * scales a lot better.
+ */
+ if (unlikely(blockNum == P_NEW))
+ {
+ uint32 flags = EB_SKIP_EXTENSION_LOCK;
+
+ Assert(mode == RBM_NORMAL ||
+ mode == RBM_ZERO_AND_LOCK ||
+ mode == RBM_ZERO_ON_ERROR);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_FIRST;
+
+ return ExtendBufferedRel(EB_SMGR(smgr, relpersistence),
+ forkNum, strategy, flags);
+ }
+
/* Make sure we will have room to remember the buffer pin */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
- isExtend = (blockNum == P_NEW);
-
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend);
-
- /* Substitute proper block number if caller asked for P_NEW */
- if (isExtend)
- {
- blockNum = smgrnblocks(smgr, forkNum);
- /* Fail if relation is already at maximum possible length */
- if (blockNum == P_NEW)
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("cannot extend relation %s beyond %u blocks",
- relpath(smgr->smgr_rlocator, forkNum),
- P_NEW)));
- }
+ smgr->smgr_rlocator.backend);
if (isLocalBuf)
{
@@ -843,8 +1036,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found, &io_context);
if (found)
pgBufferUsage.local_blks_hit++;
- else if (isExtend)
- pgBufferUsage.local_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.local_blks_read++;
@@ -859,8 +1050,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
strategy, &found, &io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
- else if (isExtend)
- pgBufferUsage.shared_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.shared_blks_read++;
@@ -871,103 +1060,40 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
- if (!isExtend)
- {
- /* Just need to update stats before we exit */
- *hit = true;
- VacuumPageHit++;
+ /* Just need to update stats before we exit */
+ *hit = true;
+ VacuumPageHit++;
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageHit;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageHit;
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend,
- found);
-
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
- * locked on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
-
- return BufferDescriptorGetBuffer(bufHdr);
- }
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ found);
/*
- * We get here only in the corner case where we are trying to extend
- * the relation but we found a pre-existing buffer marked BM_VALID.
- * This can happen because mdread doesn't complain about reads beyond
- * EOF (when zero_damaged_pages is ON) and so a previous attempt to
- * read a block beyond EOF could have left a "valid" zero-filled
- * buffer. Unfortunately, we have also seen this case occurring
- * because of buggy Linux kernels that sometimes return an
- * lseek(SEEK_END) result that doesn't account for a recent write. In
- * that situation, the pre-existing buffer would contain valid data
- * that we don't want to overwrite. Since the legitimate case should
- * always have left a zero-filled buffer, complain if not PageIsNew.
+ * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
+ * on return.
*/
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (!PageIsNew((Page) bufBlock))
- ereport(ERROR,
- (errmsg("unexpected data beyond EOF in block %u of relation %s",
- blockNum, relpath(smgr->smgr_rlocator, forkNum)),
- errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
-
- /*
- * We *must* do smgrextend before succeeding, else the page will not
- * be reserved by the kernel, and the next P_NEW call will decide to
- * return the same page. Clear the BM_VALID bit, do the StartBufferIO
- * call that BufferAlloc didn't, and proceed.
- */
- if (isLocalBuf)
+ if (!isLocalBuf)
{
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_EXCLUSIVE);
+ else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
}
- else
- {
- /*
- * Loop to handle the very small possibility that someone re-sets
- * BM_VALID between our clearing it and StartBufferIO inspecting
- * it.
- */
- do
- {
- uint32 buf_state = LockBufHdr(bufHdr);
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
- } while (!StartBufferIO(bufHdr, true));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
* if we have gotten to this point, we have allocated a buffer for the
* page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
* if it's a shared buffer.
- *
- * Note: if smgrextend fails, we will end up with a buffer that is
- * allocated but not marked BM_VALID. P_NEW will still select the same
- * block number (because the relation didn't get any longer on disk) and
- * so future attempts to extend the relation will find the same buffer (if
- * it's not been recycled) but come right back here to try smgrextend
- * again.
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
@@ -982,72 +1108,51 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
io_object = IOOBJECT_RELATION;
}
- if (isExtend)
- {
- /* new buffers are zero-filled */
+ /*
+ * Read in the page, unless the caller intends to overwrite it and just
+ * wants us to allocate a buffer.
+ */
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
- /* don't set checksum for all-zero page */
- smgrextend(smgr, forkNum, blockNum, bufBlock, false);
-
- pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
-
- /*
- * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
- * although we're essentially performing a write. At least on linux
- * doing so defeats the 'delayed allocation' mechanism, leading to
- * increased file fragmentation.
- */
- }
else
{
- /*
- * Read in the page, unless the caller intends to overwrite it and
- * just wants us to allocate a buffer.
- */
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
- else
+ instr_time io_start,
+ io_time;
+
+ if (track_io_timing)
+ INSTR_TIME_SET_CURRENT(io_start);
+
+ smgrread(smgr, forkNum, blockNum, bufBlock);
+
+ if (track_io_timing)
{
- instr_time io_start,
- io_time;
+ INSTR_TIME_SET_CURRENT(io_time);
+ INSTR_TIME_SUBTRACT(io_time, io_start);
+ pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+ INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+ }
- if (track_io_timing)
- INSTR_TIME_SET_CURRENT(io_start);
+ pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
- INSTR_TIME_SET_ZERO(io_start);
-
- smgrread(smgr, forkNum, blockNum, bufBlock);
-
- pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
- if (track_io_timing)
- {
- INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_SUBTRACT(io_time, io_start);
- pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
- INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
}
}
@@ -1090,7 +1195,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
smgr->smgr_rlocator.backend,
- isExtend,
found);
return BufferDescriptorGetBuffer(bufHdr);
@@ -1465,8 +1569,8 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
* Clear out the buffer's tag and flags and usagecount. This is not
* strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
* doing anything with the buffer. But currently it's beneficial as the
- * pre-check for several linear scans of shared buffers just checks the
- * tag.
+ * cheaper pre-checks for several linear scans of shared buffers use the
+ * tag (see e.g. FlushDatabaseBuffers()).
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
@@ -1640,6 +1744,363 @@ again:
return buf;
}
+/*
+ * Limit the number of pins a batch operation may additionally acquire, to
+ * avoid running out of pinnable buffers.
+ *
+ * One additional pin is always allowed, as otherwise the operation likely
+ * cannot be performed at all.
+ *
+ * The number of allowed pins for a backend is computed based on
+ * shared_buffers and the maximum number of connections possible. That's very
+ * pessimistic, but outside of toy-sized shared_buffers it should allow
+ * sufficient pins.
+ */
+static void
+LimitAdditionalPins(uint32 *additional_pins)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ /*
+ * Subtract the approximate number of buffers already pinned by this
+ * backend. We get the number of "overflowed" pins for free, but don't
+ * know the number of pins in PrivateRefCountArray. The cost of
+ * calculating that exactly doesn't seem worth it, so just assume the max.
+ */
+ max_proportional_pins -= PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+ if (max_proportional_pins < 0)
+ max_proportional_pins = 1;
+
+ if (*additional_pins > max_proportional_pins)
+ *additional_pins = max_proportional_pins;
+}
+
+/*
+ * Logic shared between ExtendBufferedRelBy() and ExtendBufferedRelTo(). Just to
+ * avoid duplicating the tracing and relpersistence related logic.
+ */
+static BlockNumber
+ExtendBufferedRelCommon(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ extend_by);
+
+ if (eb.relpersistence == RELPERSISTENCE_TEMP)
+ first_block = ExtendBufferedRelLocal(eb, fork, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ else
+ first_block = ExtendBufferedRelShared(eb, fork, strategy, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ *extended_by = extend_by;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ *extended_by,
+ first_block);
+
+ return first_block;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * shared buffers.
+ */
+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+ IOContext io_context = IOContextForStrategy(strategy);
+
+ LimitAdditionalPins(&extend_by);
+
+ /*
+ * Acquire victim buffers for extension without holding extension lock.
+ * Writing out victim buffers is the most expensive part of extending the
+ * relation, particularly when doing so requires WAL flushes. Zeroing out
+ * the buffers is also quite expensive, so do that before holding the
+ * extension lock as well.
+ *
+ * These pages are pinned by us and not valid. While we hold the pin they
+ * can't be acquired as victim buffers by another backend.
+ */
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ Block buf_block;
+
+ buffers[i] = GetVictimBuffer(strategy, io_context);
+ buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ /* in case we need to pin an existing buffer below */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /*
+ * Lock relation against concurrent extensions, unless requested not to.
+ *
+ * We use the same extension lock for all forks. That's unnecessarily
+ * restrictive, but currently extensions for forks don't happen often
+ * enough to make it worth locking more granularly.
+ *
+ * Note that another backend might have extended the relation by the time
+ * we get the lock.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+ eb.smgr = RelationGetSmgr(eb.rel);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ /*
+ * Now that we have the accurate relation size, check if the caller wants
+ * us to extend only up to a specific size. If there were concurrent
+ * extensions, we might have acquired too many buffers and need to release
+ * them.
+ */
+ if (extend_upto != InvalidBlockNumber)
+ {
+ uint32 orig_extend_by = extend_by;
+
+ if (first_block > extend_upto)
+ extend_by = 0;
+ else if ((uint64) first_block + extend_by > extend_upto)
+ extend_by = extend_upto - first_block;
+
+ for (uint32 i = extend_by; i < orig_extend_by; i++)
+ {
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused,
+ * let it be found again quickly
+ */
+ StrategyFreeBuffer(buf_hdr);
+ UnpinBuffer(buf_hdr);
+ }
+
+ if (extend_by == 0)
+ {
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ *extended_by = extend_by;
+ return first_block;
+ }
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ /*
+ * Insert buffers into buffer table, mark as IO_IN_PROGRESS.
+ *
+ * This needs to happen before we extend the relation, because as soon as
+ * we do, other backends can start to read in those pages.
+ */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer victim_buf = buffers[i];
+ BufferDesc *victim_buf_hdr = GetBufferDescriptor(victim_buf - 1);
+ BufferTag tag;
+ uint32 hash;
+ LWLock *partition_lock;
+ int existing_id;
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ existing_id = BufTableInsert(&tag, hash, victim_buf_hdr->buf_id);
+
+ /*
+ * We get here only in the corner case where we are trying to extend
+ * the relation but we found a pre-existing buffer. This can happen
+ * because a prior attempt at extending the relation failed, and
+ * because mdread doesn't complain about reads beyond EOF (when
+ * zero_damaged_pages is ON) and so a previous attempt to read a block
+ * beyond EOF could have left a "valid" zero-filled buffer.
+ * Unfortunately, we have also seen this case occurring because of
+ * buggy Linux kernels that sometimes return an lseek(SEEK_END) result
+ * that doesn't account for a recent write. In that situation, the
+ * pre-existing buffer would contain valid data that we don't want to
+ * overwrite. Since the legitimate cases should always have left a
+ * zero-filled buffer, complain if not PageIsNew.
+ */
+ if (existing_id >= 0)
+ {
+ BufferDesc *existing_hdr = GetBufferDescriptor(existing_id);
+ Block buf_block;
+ bool valid;
+
+ /*
+ * Pin the existing buffer before releasing the partition lock,
+ * preventing it from being evicted.
+ */
+ valid = PinBuffer(existing_hdr, strategy);
+
+ LWLockRelease(partition_lock);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused,
+ * let it be found again quickly
+ */
+ StrategyFreeBuffer(victim_buf_hdr);
+ UnpinBuffer(victim_buf_hdr);
+
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+ buf_block = BufHdrGetBlock(existing_hdr);
+
+ if (valid && !PageIsNew((Page) buf_block))
+ ereport(ERROR,
+ (errmsg("unexpected data beyond EOF in block %u of relation %s",
+ existing_hdr->tag.blockNum, relpath(eb.smgr->smgr_rlocator, fork)),
+ errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
+
+ /*
+ * We *must* do smgr[zero]extend before succeeding, else the page
+ * will not be reserved by the kernel, and the next P_NEW call
+ * will decide to return the same page. Clear the BM_VALID bit,
+ * do StartBufferIO() and proceed.
+ *
+ * Loop to handle the very small possibility that someone re-sets
+ * BM_VALID between our clearing it and StartBufferIO inspecting
+ * it.
+ */
+ do
+ {
+ uint32 buf_state = LockBufHdr(existing_hdr);
+
+ buf_state &= ~BM_VALID;
+ UnlockBufHdr(existing_hdr, buf_state);
+ } while (!StartBufferIO(existing_hdr, true));
+ }
+ else
+ {
+ uint32 buf_state;
+
+ buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (eb.relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
+ UnlockBufHdr(victim_buf_hdr, buf_state);
+
+ LWLockRelease(partition_lock);
+
+ /* XXX: could combine the locked operations in it with the above */
+ StartBufferIO(victim_buf_hdr, true);
+ }
+ }
+
+ /*
+ * Note: if smgrzeroextend fails, we will end up with buffers that are
+ * allocated but not marked BM_VALID. The next relation extension will
+ * still select the same block number (because the relation didn't get any
+ * longer on disk) and so future attempts to extend the relation will find
+ * the same buffers (if they have not been recycled) but come right back
+ * here to try smgrzeroextend again.
+ *
+ * We don't need to set checksum for all-zero pages.
+ */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ *
+ * We remove IO_IN_PROGRESS after this, as waking up waiting backends can
+ * take noticeable time.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ bool lock = false;
+
+ if (flags & EB_LOCK_FIRST && i == 0)
+ lock = true;
+ else if (flags & EB_LOCK_TARGET)
+ {
+ Assert(extend_upto != InvalidBlockNumber);
+ if (first_block + i + 1 == extend_upto)
+ lock = true;
+ }
+
+ if (lock)
+ LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+
+ TerminateBufferIO(buf_hdr, false, BM_VALID);
+ }
+
+ pgBufferUsage.shared_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
+ extend_by);
+
+ *extended_by = extend_by;
+
+ return first_block;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 715fd682e6f..f1937e23c82 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -49,6 +49,9 @@ static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
+/* number of local buffers pinned at least once */
+static int NLocalPinnedBuffers = 0;
+
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
@@ -280,6 +283,154 @@ GetLocalVictimBuffer(void)
return BufferDescriptorGetBuffer(bufHdr);
}
+/* see LimitAdditionalPins() */
+static void
+LimitAdditionalLocalPins(uint32 *additional_pins)
+{
+ uint32 max_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ /*
+ * In contrast to LimitAdditionalPins() other backends don't play a role
+ * here. We can allow up to NLocBuffer pins in total.
+ */
+ max_pins = (NLocBuffer - NLocalPinnedBuffers);
+
+ if (*additional_pins >= max_pins)
+ *additional_pins = max_pins;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * temporary buffers.
+ */
+BlockNumber
+ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ /* Initialize local buffers if first request in this session */
+ if (LocalBufHash == NULL)
+ InitLocalBuffers();
+
+ LimitAdditionalLocalPins(&extend_by);
+
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ BufferDesc *buf_hdr;
+ Block buf_block;
+
+ buffers[i] = GetLocalVictimBuffer();
+ buf_hdr = GetLocalBufferDescriptor(-buffers[i] - 1);
+ buf_block = LocalBufHdrGetBlock(buf_hdr);
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ if (extend_upto != InvalidBlockNumber)
+ {
+ /*
+ * In contrast to shared relations, nothing could change the relation
+ * size concurrently. Thus we shouldn't end up finding that we don't
+ * need to do anything.
+ */
+ Assert(first_block <= extend_upto);
+
+ Assert((uint64) first_block + extend_by <= extend_upto);
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ int victim_buf_id;
+ BufferDesc *victim_buf_hdr;
+ BufferTag tag;
+ LocalBufferLookupEnt *hresult;
+ bool found;
+
+ victim_buf_id = -buffers[i] - 1;
+ victim_buf_hdr = GetLocalBufferDescriptor(victim_buf_id);
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
+ if (found)
+ {
+ BufferDesc *existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ uint32 buf_state;
+
+ UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));
+
+ existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ PinLocalBuffer(existing_hdr, false);
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+
+ buf_state = pg_atomic_read_u32(&existing_hdr->state);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_DIRTY));
+ buf_state &= ~BM_VALID;
+ pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
+ }
+ else
+ {
+ uint32 buf_state = pg_atomic_read_u32(&victim_buf_hdr->state);
+
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&victim_buf_hdr->state, buf_state);
+
+ hresult->id = victim_buf_id;
+ }
+ }
+
+ /* actually extend relation */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr;
+ uint32 buf_state;
+
+ buf_hdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ *extended_by = extend_by;
+
+ pgBufferUsage.temp_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
+ extend_by);
+
+ return first_block;
+}
+
/*
* MarkLocalBufferDirty -
* mark a local buffer dirty
@@ -494,6 +645,7 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
+ NLocalPinnedBuffers++;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
@@ -515,9 +667,11 @@ UnpinLocalBuffer(Buffer buffer)
Assert(BufferIsLocal(buffer));
Assert(LocalRefCount[buffid] > 0);
+ Assert(NLocalPinnedBuffers > 0);
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[buffid]--;
+ if (--LocalRefCount[buffid] == 0)
+ NLocalPinnedBuffers--;
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d5546101..f2f6eae8031 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -64,13 +64,19 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
void
pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ pgstat_count_io_op_n(io_object, io_context, io_op, 1);
+}
+
+void
+pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt)
{
Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES);
Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES);
Assert((unsigned int) io_op < IOOP_NUM_TYPES);
Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
- PendingIOStats.data[io_object][io_context][io_op]++;
+ PendingIOStats.data[io_object][io_context][io_op] += cnt;
have_iostats = true;
}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index c064d679e94..fd3df2f7900 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -55,10 +55,12 @@ provider postgresql {
probe sort__start(int, bool, int, int, bool, int);
probe sort__done(bool, long);
- probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
- probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool);
+ probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+ probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
probe buffer__flush__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
+ probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+ probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
probe buffer__checkpoint__start(int);
probe buffer__checkpoint__sync__start();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index b2ccd8d7fef..00df5fcd9ba 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7761,33 +7761,52 @@ FROM pg_stat_get_backend_idset() AS backendid;
<entry>Probe that fires when the two-phase portion of a checkpoint is
complete.</entry>
</row>
+ <row>
+ <entry><literal>buffer-extend-start</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int)</literal></entry>
+ <entry>Probe that fires when a relation extension starts.
+ arg0 contains the fork to be extended. arg1, arg2, and arg3 contain the
+ tablespace, database, and relation OIDs identifying the relation. arg4
+ is the ID of the backend which created the temporary relation for a
+ local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the caller would like to extend
+ by.</entry>
+ </row>
+ <row>
+ <entry><literal>buffer-extend-done</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber)</literal></entry>
+ <entry>Probe that fires when a relation extension is complete.
+ arg0 contains the fork to be extended. arg1, arg2, and arg3 contain the
+ tablespace, database, and relation OIDs identifying the relation. arg4
+ is the ID of the backend which created the temporary relation for a
+ local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the relation was extended by; this
+ can be less than the number requested in
+ <literal>buffer-extend-start</literal> due to resource
+ constraints. arg6 contains the BlockNumber of the first new
+ block.</entry>
+ </row>
<row>
<entry><literal>buffer-read-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int)</literal></entry>
<entry>Probe that fires when a buffer read is started.
- arg0 and arg1 contain the fork and block numbers of the page (but
- arg1 will be -1 if this is a relation extension request).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.</entry>
+ </entry>
</row>
<row>
<entry><literal>buffer-read-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
<entry>Probe that fires when a buffer read is complete.
- arg0 and arg1 contain the fork and block numbers of the page (if this
- is a relation extension request, arg1 now contains the block number
- of the newly added block).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.
- arg7 is true if the buffer was found in the pool, false if not.</entry>
+ arg6 is true if the buffer was found in the pool, false if not.</entry>
</row>
<row>
<entry><literal>buffer-flush-start</literal></entry>
--
2.38.0
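
As a quick illustration of the API added in the patch above (a sketch only,
mirroring the conversions done in the next patch; my_new_buffer is a made-up
name): extending an index by one page, pinned and exclusively locked, without
taking the relation extension lock manually:

    static Buffer
    my_new_buffer(Relation rel)
    {
        /*
         * ExtendBufferedRel() acquires the extension lock itself (unless
         * EB_SKIP_EXTENSION_LOCK is passed) and returns a pinned buffer for
         * the new, zero-filled block; EB_LOCK_FIRST also locks it exclusively.
         */
        return ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL, EB_LOCK_FIRST);
    }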
v6-0012-Convert-a-few-places-to-ExtendBufferedRel.patch
From 7191f4e284e2c70461c8a0094317d51436ccdf85 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:18:18 -0700
Subject: [PATCH v6 12/17] Convert a few places to ExtendBufferedRel
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/access/brin/brin.c | 9 +++----
src/backend/access/brin/brin_pageops.c | 4 +++
src/backend/access/brin/brin_revmap.c | 15 +++---------
src/backend/access/gin/gininsert.c | 10 +++-----
src/backend/access/gin/ginutil.c | 13 ++--------
src/backend/access/gin/ginvacuum.c | 8 ++++++
src/backend/access/gist/gist.c | 4 +--
src/backend/access/gist/gistutil.c | 14 ++---------
src/backend/access/gist/gistvacuum.c | 3 +++
src/backend/access/nbtree/nbtpage.c | 34 +++++++-------------------
src/backend/access/nbtree/nbtree.c | 3 +++
src/backend/access/spgist/spgutils.c | 13 ++--------
contrib/bloom/blutils.c | 12 ++-------
13 files changed, 48 insertions(+), 94 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 53e4721a54e..41bf950a4af 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -837,9 +837,9 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
- meta = ReadBuffer(index, P_NEW);
+ meta = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
- LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);
@@ -904,9 +904,8 @@ brinbuildempty(Relation index)
Buffer metabuf;
/* An empty BRIN index has a metapage only. */
- metabuf =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metabuf = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index ad5a89bd051..b578d259545 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -730,6 +730,10 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* There's not enough free space in any existing index page,
* according to the FSM: extend the relation to obtain a shiny new
* page.
+ *
+ * XXX: It's likely possible to use RBM_ZERO_AND_LOCK here,
+ * which'd avoid the need to hold the extension lock during buffer
+ * reclaim.
*/
if (!RELATION_IS_LOCAL(irel))
{
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 7fc5226bf74..f4271ba31c9 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -538,7 +538,6 @@ revmap_physical_extend(BrinRevmap *revmap)
BlockNumber mapBlk;
BlockNumber nblocks;
Relation irel = revmap->rm_irel;
- bool needLock = !RELATION_IS_LOCAL(irel);
/*
* Lock the metapage. This locks out concurrent extensions of the revmap,
@@ -570,10 +569,8 @@ revmap_physical_extend(BrinRevmap *revmap)
}
else
{
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buf = ReadBuffer(irel, P_NEW);
+ buf = ExtendBufferedRel(EB_REL(irel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
if (BufferGetBlockNumber(buf) != mapBlk)
{
/*
@@ -582,17 +579,11 @@ revmap_physical_extend(BrinRevmap *revmap)
* up and have caller start over. We will have to evacuate that
* page from under whoever is using it.
*/
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buf);
+ UnlockReleaseBuffer(buf);
return;
}
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buf);
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d5d748009ea..be1841de7bf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -440,12 +440,10 @@ ginbuildempty(Relation index)
MetaBuffer;
/* An empty GIN index has two pages. */
- MetaBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- RootBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ MetaBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+ RootBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer and root buffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 03fec1704e9..3c6f87236c5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -299,7 +299,6 @@ Buffer
GinNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -327,16 +326,8 @@ GinNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, GIN_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index e5d310d8362..13251d7e07d 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -736,6 +736,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
needLock = !RELATION_IS_LOCAL(index);
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
npages = RelationGetNumberOfBlocks(index);
@@ -786,6 +790,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
stats->num_pages = RelationGetNumberOfBlocks(index);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index ea72bcce1bc..d3539e96d87 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -134,8 +134,8 @@ gistbuildempty(Relation index)
Buffer buffer;
/* Initialize the root page */
- buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_SKIP_EXTENSION_LOCK | EB_LOCK_FIRST);
/* Initialize and xlog buffer */
START_CRIT_SECTION();
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index b4d843a0ff1..446fcb3cc5e 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -824,7 +824,6 @@ Buffer
gistNewBuffer(Relation r)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -877,17 +876,8 @@ gistNewBuffer(Relation r)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(r);
-
- if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
-
- buffer = ReadBuffer(r, P_NEW);
- LockBuffer(buffer, GIST_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(r), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3f60d3274d2..cc711b04986 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -203,6 +203,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* we must already have processed any tuples due to be moved into such a
* page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 3feee28d197..c4f16b3d2b7 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -881,7 +881,6 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
else
{
- bool needLock;
Page page;
Assert(access == BT_WRITE);
@@ -962,31 +961,16 @@ _bt_getbuf(Relation rel, BlockNumber blkno, int access)
}
/*
- * Extend the relation by one page.
- *
- * We have to use a lock to ensure no one else is extending the rel at
- * the same time, else we will both try to initialize the same new
- * page. We can skip locking for new or temp relations, however,
- * since no one else could be accessing them.
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
+ * we risk a race condition against btvacuumscan --- see comments
+ * therein. This forces us to repeat the valgrind request that
+ * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
+ * without introducing a race.
*/
- needLock = !RELATION_IS_LOCAL(rel);
-
- if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
-
- buf = ReadBuffer(rel, P_NEW);
-
- /* Acquire buffer lock on new page */
- _bt_lockbuf(rel, buf, BT_WRITE);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to
- * extend the relation some more. Note that we cannot release this
- * lock before we have buffer lock on the new page, or we risk a race
- * condition against btvacuumscan --- see comments therein.
- */
- if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index a68dd07534f..86010443269 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -970,6 +970,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* write-lock on the left page before it adds a right page, so we must
* already have processed any tuples due to be moved into such a page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4e7ff1d1603..190e4f76a9e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -366,7 +366,6 @@ Buffer
SpGistNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -406,16 +405,8 @@ SpGistNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a6d9f09f315..d935ed8fbdf 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -353,7 +353,6 @@ Buffer
BloomNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -387,15 +386,8 @@ BloomNewBuffer(Relation index)
}
/* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
--
2.38.0
v6-0013-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch (text/x-diff)
From 74280f91644259f9be4feca1cb30c48229efd93b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v6 13/17] heapam: Add num_pages to RelationGetBufferForTuple()
This will be useful to compute the number of pages to extend a relation by.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/access/hio.h | 14 +++++++-
src/backend/access/heap/heapam.c | 56 +++++++++++++++++++++++++++++---
src/backend/access/heap/hio.c | 8 ++++-
3 files changed, 72 insertions(+), 6 deletions(-)
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 3f20b585326..a3c49b3025d 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -30,6 +30,17 @@ typedef struct BulkInsertStateData
{
BufferAccessStrategy strategy; /* our BULKWRITE strategy object */
Buffer current_buf; /* current insertion target page */
+
+ /*
+ * State for bulk extensions. Further pages that were unused at the time
+ * of the extension. They might be in use by the time we use them though,
+ * so rechecks are needed.
+ *
+ * XXX: It's possible these should live in RelationData instead, alongside
+ * targetblock.
+ */
+ BlockNumber next_free;
+ BlockNumber last_free;
} BulkInsertStateData;
@@ -38,6 +49,7 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertStateData *bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages);
#endif /* HIO_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8abc101c8cb..5b9f4542359 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1774,6 +1774,8 @@ GetBulkInsertState(void)
bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
bistate->current_buf = InvalidBuffer;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
return bistate;
}
@@ -1847,7 +1849,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ 0);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2050,6 +2053,30 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
return tup;
}
+/*
+ * Helper for heap_multi_insert() that computes the number of full pages
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+ size_t page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ int npages = 1;
+
+ for (int i = done; i < ntuples; i++)
+ {
+ size_t tup_sz = sizeof(ItemIdData) + MAXALIGN(heaptuples[i]->t_len);
+
+ if (page_avail < tup_sz)
+ {
+ npages++;
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ }
+ page_avail -= tup_sz;
+ }
+
+ return npages;
+}
+
/*
* heap_multi_insert - insert multiple tuples into a heap
*
@@ -2076,6 +2103,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ bool starting_with_empty_page = false;
+ int npages = 0;
+ int npages_used = 0;
/* currently not needed (thus unsupported) for heap_multi_insert() */
Assert(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2126,13 +2156,29 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- bool starting_with_empty_page;
bool all_visible_cleared = false;
bool all_frozen_set = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
+ /*
+ * Compute number of pages needed to insert tuples in the worst case.
+ * This will be used to determine how much to extend the relation by
+ * in RelationGetBufferForTuple(), if needed. If we filled a prior
+ * page from scratch, we can just update our last computation, but if
+ we started with a partially filled page, recompute from scratch, as the
+ * number of potentially required pages can vary due to tuples needing
+ * to fit onto the page, page headers etc.
+ */
+ if (ndone == 0 || !starting_with_empty_page)
+ {
+ npages = heap_multi_insert_pages(heaptuples, ndone, ntuples, saveFreeSpace);
+ npages_used = 0;
+ }
+ else
+ npages_used++;
+
/*
* Find buffer where at least the next tuple will fit. If the page is
* all-visible, this will also pin the requisite visibility map page.
@@ -2142,7 +2188,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ npages - npages_used);
page = BufferGetPage(buffer);
starting_with_empty_page = PageGetMaxOffsetNumber(page) == 0;
@@ -3576,7 +3623,8 @@ l2:
/* It doesn't fit, must use RelationGetBufferForTuple. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ 0);
/* We're all done. */
break;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 8b3dfa0ccae..cc913a028e0 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -281,6 +281,11 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
* Returns pinned and exclusive-locked buffer of a page in given relation
* with free space >= given len.
*
+ * If num_pages is > 1, the relation will be extended by at least that many
+ * pages when we decide to extend the relation. This is more efficient for
+ * callers that know they will need multiple pages
+ * (e.g. heap_multi_insert()).
+ *
* If otherBuffer is not InvalidBuffer, then it references a previously
* pinned buffer of another page in the same relation; on return, this
* buffer will also be exclusive-locked. (This case is used by heap_update;
@@ -339,7 +344,8 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
--
2.38.0
v6-0014-hio-Use-ExtendBufferedRelBy.patch (text/x-diff)
From fd89bf0599c7802d3177999590d989180a87a099 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 28 Mar 2023 18:39:10 -0700
Subject: [PATCH v6 14/17] hio: Use ExtendBufferedRelBy()
---
src/backend/access/heap/hio.c | 370 ++++++++++++++++++++--------------
1 file changed, 217 insertions(+), 153 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index cc913a028e0..6d66826d951 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -192,87 +192,197 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
}
/*
- * Extend a relation by multiple blocks to avoid future contention on the
- * relation extension lock. Our goal is to pre-extend the relation by an
- * amount which ramps up as the degree of contention ramps up, but limiting
- * the result to some sane overall value.
+ * Extend the relation. By multiple pages, if beneficial.
+ *
+ * If the caller needs multiple pages (num_pages > 1), we always try to extend
+ * by at least that much.
+ *
+ * If there is contention on the extension lock, we don't just extend "for
+ * ourselves", but we try to help others. We can do so by adding empty pages
+ * into the FSM. Typically there is no contention when we can't use the FSM.
+ *
+ * We do have to limit the number of pages to extend by to some value, as the
+ * buffers for all the extended pages need to, temporarily, be pinned. For now
+ * we define MAX_BUFFERS_TO_EXTEND_BY to be 64 buffers; it's hard to see
+ * benefits with higher numbers. This is partially because copyfrom.c's
+ * MAX_BUFFERED_TUPLES / MAX_BUFFERED_BYTES prevent larger multi_inserts.
*/
-static void
-RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+static Buffer
+RelationAddBlocks(Relation relation, BulkInsertState bistate,
+ int num_pages, bool use_fsm, bool *did_unlock)
{
- BlockNumber blockNum,
- firstBlock = InvalidBlockNumber;
- int extraBlocks;
- int lockWaiters;
-
- /* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
- if (lockWaiters <= 0)
- return;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+ BlockNumber first_block = InvalidBlockNumber;
+ BlockNumber last_block = InvalidBlockNumber;
+ uint32 extend_by_pages;
+ uint32 not_in_fsm_pages;
+ Buffer buffer;
+ Page page;
/*
- * It might seem like multiplying the number of lock waiters by as much as
- * 20 is too aggressive, but benchmarking revealed that smaller numbers
- * were insufficient. 512 is just an arbitrary cap to prevent
- * pathological results.
+ * Determine by how many pages to try to extend.
*/
- extraBlocks = Min(512, lockWaiters * 20);
-
- do
+ if (bistate == NULL && !use_fsm)
{
- Buffer buffer;
- Page page;
- Size freespace;
-
/*
- * Extend by one page. This should generally match the main-line
- * extension code in RelationGetBufferForTuple, except that we hold
- * the relation extension lock throughout, and we don't immediately
- * initialize the page (see below).
+ * If we neither have a bistate nor can use the FSM, we can't bulk
+ * extend - there'd be no way to find the additional pages.
*/
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
- page = BufferGetPage(buffer);
-
- if (!PageIsNew(page))
- elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- BufferGetBlockNumber(buffer),
- RelationGetRelationName(relation));
-
- /*
- * Add the page to the FSM without initializing. If we were to
- * initialize here, the page would potentially get flushed out to disk
- * before we add any useful content. There's no guarantee that that'd
- * happen before a potential crash, so we need to deal with
- * uninitialized pages anyway, thus avoid the potential for
- * unnecessary writes.
- */
-
- /* we'll need this info below */
- blockNum = BufferGetBlockNumber(buffer);
- freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
-
- UnlockReleaseBuffer(buffer);
-
- /* Remember first block number thus added. */
- if (firstBlock == InvalidBlockNumber)
- firstBlock = blockNum;
-
- /*
- * Immediately update the bottom level of the FSM. This has a good
- * chance of making this page visible to other concurrently inserting
- * backends, and we want that to happen without delay.
- */
- RecordPageWithFreeSpace(relation, blockNum, freespace);
+ extend_by_pages = 1;
+ }
+ else
+ {
+ uint32 waitcount;
+
+ /*
+ * Try to extend at least by the number of pages the caller needs. We
+ * can remember the additional pages (either via FSM or bistate).
+ */
+ extend_by_pages = num_pages;
+
+ if (!RELATION_IS_LOCAL(relation))
+ waitcount = RelationExtensionLockWaiterCount(relation);
+ else
+ waitcount = 0;
+
+ /*
+ * Multiply the number of pages to extend by the number of waiters. Do
+ * this even if we're not using the FSM, as it still relieves
+ * contention, by deferring the next time this backend needs to
+ * extend. In that case the extended pages will be found via
+ * bistate->next_free.
+ */
+ extend_by_pages += extend_by_pages * waitcount;
+
+ /*
+ * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
+ * them all concurrently.
+ */
+ extend_by_pages = Min(extend_by_pages, MAX_BUFFERS_TO_EXTEND_BY);
}
- while (--extraBlocks > 0);
/*
- * Updating the upper levels of the free space map is too expensive to do
- * for every block, but it's worth doing once at the end to make sure that
- * subsequent insertion activity sees all of those nifty free pages we
- * just inserted.
+ * How many of the extended pages should be entered into the FSM?
+ *
+ * If we have a bistate, only enter pages that we don't need ourselves
+ * into the FSM. Otherwise every other backend will immediately try to
+ * use the pages this backend needs for itself, causing unnecessary
+ * contention. If we don't have a bistate, we can't avoid the FSM.
+ *
+ * Never enter the returned page into the FSM; we'll immediately use it.
*/
- FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
+ if (num_pages > 1 && bistate == NULL)
+ not_in_fsm_pages = 1;
+ else
+ not_in_fsm_pages = num_pages;
+
+ /* prepare to put another buffer into the bistate */
+ if (bistate && bistate->current_buf != InvalidBuffer)
+ {
+ ReleaseBuffer(bistate->current_buf);
+ bistate->current_buf = InvalidBuffer;
+ }
+
+ /*
+ * Extend the relation. We ask for the first returned page to be locked,
+ * so that we are sure that nobody has inserted into the page
+ * concurrently.
+ *
+ * With the current MAX_BUFFERS_TO_EXTEND_BY there's no danger of
+ * [auto]vacuum trying to truncate later pages as REL_TRUNCATE_MINIMUM is
+ * way larger.
+ */
+ first_block = ExtendBufferedRelBy(EB_REL(relation), MAIN_FORKNUM,
+ bistate ? bistate->strategy : NULL,
+ EB_LOCK_FIRST,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+ buffer = victim_buffers[0]; /* the buffer the function will return */
+ last_block = first_block + (extend_by_pages - 1);
+ Assert(first_block == BufferGetBlockNumber(buffer));
+
+ /*
+ * Relation is now extended. Initialize the page. We do this here, before
+ * potentially releasing the lock on the page, because it allows us to
+ * double check that the page contents are empty (this should never
+ * happen, but if it does we don't want to risk wiping out valid data).
+ */
+ page = BufferGetPage(buffer);
+ if (!PageIsNew(page))
+ elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
+ first_block,
+ RelationGetRelationName(relation));
+
+ PageInit(page, BufferGetPageSize(buffer), 0);
+ MarkBufferDirty(buffer);
+
+ /*
+ * If we decided to put pages into the FSM, release the buffer lock (but
+ * not the pin); we don't want to do IO while holding a buffer lock. This will
+ * necessitate a bit more extensive checking in our caller.
+ */
+ if (use_fsm && not_in_fsm_pages < extend_by_pages)
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ *did_unlock = true;
+ }
+
+ /*
+ * Relation is now extended. Release pins on all buffers, except for the
+ * first (which we'll return). If we decided to put pages into the FSM,
+ * we can do that as part of the same loop.
+ */
+ for (uint32 i = 1; i < extend_by_pages; i++)
+ {
+ BlockNumber curBlock = first_block + i;
+
+ Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
+ Assert(BlockNumberIsValid(curBlock));
+
+ ReleaseBuffer(victim_buffers[i]);
+
+ if (use_fsm && i >= not_in_fsm_pages)
+ {
+ Size freespace = BufferGetPageSize(victim_buffers[i]) -
+ SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(relation, curBlock, freespace);
+ }
+ }
+
+ if (use_fsm && not_in_fsm_pages < extend_by_pages)
+ {
+ BlockNumber first_fsm_block = first_block + not_in_fsm_pages;
+
+ FreeSpaceMapVacuumRange(relation, first_fsm_block, last_block);
+ }
+
+ if (bistate)
+ {
+ /*
+ * Remember the additional pages we extended by, so we can later use
+ * them without looking into the FSM.
+ */
+ if (extend_by_pages > 1)
+ {
+ bistate->next_free = first_block + 1;
+ bistate->last_free = last_block;
+ }
+ else
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+
+ /* maintain bistate->current_buf */
+ IncrBufferRefCount(buffer);
+ bistate->current_buf = buffer;
+ }
+
+ return buffer;
+#undef MAX_BUFFERS_TO_EXTEND_BY
}
/*
@@ -356,11 +466,13 @@ RelationGetBufferForTuple(Relation relation, Size len,
targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
- bool needLock;
bool unlockedTargetBuffer;
len = MAXALIGN(len); /* be conservative */
+ if (num_pages <= 0)
+ num_pages = 1;
+
/* Bulk insert is not supported for updates, only inserts. */
Assert(otherBuffer == InvalidBuffer || !bistate);
@@ -565,102 +677,53 @@ loop:
ReleaseBuffer(buffer);
}
- /* Without FSM, always fall out of the loop and extend */
- if (!use_fsm)
- break;
-
- /*
- * Update FSM as to condition of this page, and ask for another page
- * to try.
- */
- targetBlock = RecordAndGetPageWithFreeSpace(relation,
- targetBlock,
- pageFreeSpace,
- targetFreeSpace);
- }
-
- /*
- * Have to extend the relation.
- *
- * We have to use a lock to ensure no one else is extending the rel at the
- * same time, else we will both try to initialize the same new page. We
- * can skip locking for new or temp relations, however, since no one else
- * could be accessing them.
- */
- needLock = !RELATION_IS_LOCAL(relation);
-
- /*
- * If we need the lock but are not able to acquire it immediately, we'll
- * consider extending the relation by multiple blocks at a time to manage
- * contention on the relation extension lock. However, this only makes
- * sense if we're using the FSM; otherwise, there's no point.
- */
- if (needLock)
- {
- if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ if (bistate && bistate->next_free != InvalidBlockNumber)
{
- /* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ Assert(bistate->next_free <= bistate->last_free);
/*
- * Check if some other backend has extended a block for us while
- * we were waiting on the lock.
+ * We bulk extended the relation before, and there are still some
+ * unused pages from that extension, so we don't need to look in
+ * the FSM for a new page. But do record the free space from the
+ * last page; somebody might insert narrower tuples later.
*/
- targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
+ if (use_fsm)
+ RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
- /*
- * If some other waiter has already extended the relation, we
- * don't need to do so; just use the existing freespace.
- */
- if (targetBlock != InvalidBlockNumber)
+ targetBlock = bistate->next_free;
+ if (bistate->next_free >= bistate->last_free)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
- goto loop;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
}
-
- /* Time to bulk-extend. */
- RelationAddExtraBlocks(relation, bistate);
+ else
+ bistate->next_free++;
+ }
+ else if (!use_fsm)
+ {
+ /* Without FSM, always fall out of the loop and extend */
+ break;
+ }
+ else
+ {
+ /*
+ * Update FSM as to condition of this page, and ask for another
+ * page to try.
+ */
+ targetBlock = RecordAndGetPageWithFreeSpace(relation,
+ targetBlock,
+ pageFreeSpace,
+ targetFreeSpace);
}
}
- /*
- * In addition to whatever extension we performed above, we always add at
- * least one block to satisfy our own request.
- *
- * XXX This does an lseek - rather expensive - but at the moment it is the
- * only way to accurately determine how many blocks are in a relation. Is
- * it worth keeping an accurate file length in shared memory someplace,
- * rather than relying on the kernel to do it for us?
- */
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
+ /* Have to extend the relation */
unlockedTargetBuffer = false;
+ buffer = RelationAddBlocks(relation, bistate, num_pages, use_fsm, &unlockedTargetBuffer);
+
targetBlock = BufferGetBlockNumber(buffer);
-
- /*
- * We need to initialize the empty new page. Double-check that it really
- * is empty (this should never happen, but if it does we don't want to
- * risk wiping out valid data).
- */
page = BufferGetPage(buffer);
- if (!PageIsNew(page))
- elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- targetBlock,
- RelationGetRelationName(relation));
-
- PageInit(page, BufferGetPageSize(buffer), 0);
- MarkBufferDirty(buffer);
-
/*
* The page is empty, pin vmbuffer to set all_frozen bit. We don't want to
* do IO while the buffer is locked, so we unlock the page first if IO is
@@ -672,8 +735,9 @@ loop:
if (!visibilitymap_pin_ok(targetBlock, *vmbuffer))
{
+ if (!unlockedTargetBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
unlockedTargetBuffer = true;
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, targetBlock, vmbuffer);
}
}
--
2.38.0
v6-0015-WIP-Don-t-initialize-page-in-vm-fsm-_extend-not-n.patch (text/x-diff)
From e329dcc9b250fa6cae306e925809299877e1b5cf Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 28 Feb 2023 20:56:32 -0800
Subject: [PATCH v6 15/17] WIP: Don't initialize page in {vm,fsm}_extend(), not
needed
---
src/backend/access/heap/visibilitymap.c | 6 +-----
src/backend/storage/freespace/freespace.c | 5 +----
2 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 74ff01bb172..709213d0d97 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -623,11 +623,9 @@ static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
BlockNumber vm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
-
/*
* We use the relation extension lock to lock out other backends trying to
* extend the visibility map at the same time. It also locks out extension
@@ -663,8 +661,6 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
vm_nblocks_now++;
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 3e9693b293b..90c529958e7 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,10 +608,9 @@ static void
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
/*
* We use the relation extension lock to lock out other backends trying to
@@ -649,8 +648,6 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
--
2.38.0
v6-0016-Convert-a-few-places-to-ExtendBufferedRelTo.patch (text/x-diff)
From dd75c4b1fec6ed30eb7aa88152b3479396855d51 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:08:17 -0800
Subject: [PATCH v6 16/17] Convert a few places to ExtendBufferedRelTo
---
src/backend/access/heap/visibilitymap.c | 80 +++++++---------------
src/backend/access/transam/xlogutils.c | 29 ++------
src/backend/storage/freespace/freespace.c | 83 +++++++----------------
3 files changed, 55 insertions(+), 137 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 709213d0d97..60b96f5df80 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -126,7 +126,7 @@
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
-static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static Buffer vm_extend(Relation rel, BlockNumber vm_nblocks);
/*
@@ -575,22 +575,29 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. It's always safe to clear bits, so it's better to clear
+ * corrupt pages than error out.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with vm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
{
if (extend)
- vm_extend(rel, blkno + 1);
+ buf = vm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
+ RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
- * always safe to clear bits, so it's better to clear corrupt pages than
- * error out.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -603,8 +610,6 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
- RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -619,51 +624,16 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* Ensure that the visibility map fork is at least vm_nblocks long, extending
* it if necessary with zeroed pages.
*/
-static void
+static Buffer
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
- BlockNumber vm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
+ Buffer buf;
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the visibility map at the same time. It also locks out extension
- * of the main fork, unnecessarily, but extending the visibility map
- * happens seldom enough that it doesn't seem worthwhile to have a
- * separate lock tag type for it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, VISIBILITYMAP_FORKNUM))
- smgrcreate(reln, VISIBILITYMAP_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
- vm_nblocks_now = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);
-
- /* Now extend the file */
- while (vm_nblocks_now < vm_nblocks)
- {
- smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
- vm_nblocks_now++;
- }
+ buf = ExtendBufferedRelTo(EB_REL(rel), VISIBILITYMAP_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ vm_nblocks,
+ RBM_ZERO_ON_ERROR);
/*
* Send a shared-inval message to force other backends to close any smgr
@@ -672,7 +642,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* to keep checking for creation or extension of the file, which happens
* infrequently.
*/
- CacheInvalidateSmgr(reln->smgr_rlocator);
+ CacheInvalidateSmgr(RelationGetSmgr(rel)->smgr_rlocator);
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return buf;
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 2c28956b1aa..e174a2a8919 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -524,28 +524,13 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
/* OK to extend the file */
/* we do this in recovery only - no rel-extension lock needed */
Assert(InRecovery);
- buffer = InvalidBuffer;
- do
- {
- if (buffer != InvalidBuffer)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- }
- buffer = ReadBufferWithoutRelcache(rlocator, forknum,
- P_NEW, mode, NULL, true);
- }
- while (BufferGetBlockNumber(buffer) < blkno);
- /* Handle the corner case that P_NEW returns non-consecutive pages */
- if (BufferGetBlockNumber(buffer) != blkno)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno,
- mode, NULL, true);
- }
+ buffer = ExtendBufferedRelTo(EB_SMGR(smgr, RELPERSISTENCE_PERMANENT),
+ forknum,
+ NULL,
+ EB_PERFORMING_RECOVERY |
+ EB_SKIP_EXTENSION_LOCK,
+ blkno + 1,
+ mode);
}
recent_buffer_fast_path:
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 90c529958e7..2face615d07 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -98,7 +98,7 @@ static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot);
static BlockNumber fsm_logical_to_physical(FSMAddress addr);
static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend);
-static void fsm_extend(Relation rel, BlockNumber fsm_nblocks);
+static Buffer fsm_extend(Relation rel, BlockNumber fsm_nblocks);
/* functions to convert amount of free space to a FSM category */
static uint8 fsm_space_avail_to_cat(Size avail);
@@ -558,24 +558,30 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. The FSM information is not accurate anyway, so it's better
+ * to clear corrupt pages than error out. Since the FSM changes are not
+ * WAL-logged, the so-called torn page problem on crash can lead to pages
+ * with corrupt headers, for example.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with fsm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[FSM_FORKNUM])
{
if (extend)
- fsm_extend(rel, blkno + 1);
+ buf = fsm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. The FSM
- * information is not accurate anyway, so it's better to clear corrupt
- * pages than error out. Since the FSM changes are not WAL-logged, the
- * so-called torn page problem on crash can lead to pages with corrupt
- * headers, for example.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -588,7 +594,6 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -604,56 +609,14 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* it if necessary with empty pages. And by empty, I mean pages filled
* with zeros, meaning there's no free space.
*/
-static void
+static Buffer
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
- BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
-
-
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the FSM at the same time. It also locks out extension of the
- * main fork, unnecessarily, but extending the FSM happens seldom enough
- * that it doesn't seem worthwhile to have a separate lock tag type for
- * it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, FSM_FORKNUM))
- smgrcreate(reln, FSM_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- fsm_nblocks_now = smgrnblocks(reln, FSM_FORKNUM);
-
- /* Extend as needed. */
- while (fsm_nblocks_now < fsm_nblocks)
- {
- smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
- pg.data, false);
- fsm_nblocks_now++;
- }
-
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return ExtendBufferedRelTo(EB_REL(rel), FSM_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ fsm_nblocks,
+ RBM_ZERO_ON_ERROR);
}
/*
--
2.38.0
On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
Attached is v6. Changes:
0006:
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate():
%m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
used in FileFallocate().
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2023-03-30 12:28:57 +0700, John Naylor wrote:
On Thu, Mar 30, 2023 at 10:02 AM Andres Freund <andres@anarazel.de> wrote:
Attached is v6. Changes:
0006:
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with posix_fallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
used in FileFallocate().
Fair point. I would however like to see a different error message for the two
ways of extending, at least initially. What about referencing FileFallocate()?
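Something along these lines in mdzeroextend() perhaps (untested sketch, reusing
the numblocks/seekpos/v variables of that function; exact wording up for debate):

	if (numblocks > 8)	/* the FileFallocate() path */
	{
		if (FileFallocate(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks,
						  WAIT_EVENT_DATA_FILE_EXTEND) != 0)
			ereport(ERROR,
					errcode_for_file_access(),
					errmsg("could not extend file \"%s\" with FileFallocate(): %m",
						   FilePathName(v->mdfd_vfd)),
					errhint("Check free disk space."));
	}
	else				/* the FileZero() path */
	{
		if (FileZero(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks,
					 WAIT_EVENT_DATA_FILE_EXTEND) < 0)
			ereport(ERROR,
					errcode_for_file_access(),
					errmsg("could not extend file \"%s\": %m",
						   FilePathName(v->mdfd_vfd)),
					errhint("Check free disk space."));
	}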
Greetings,
Andres Freund
Hi,
On 2023-03-29 20:02:33 -0700, Andres Freund wrote:
Attached is v6. Changes:
Attached is v7. Not much in the way of changes:
- polished a lot of the commit messages
- reordered commits to not be blocked as immediately by
https://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
- Used the new relation extension function in two more places (finding an
independent bug on the way), not sure why I didn't convert those earlier...
I'm planning to push the patches up to the hio.c changes soon, unless somebody
would like me to hold off.
After that I'm planning to wait for a buildfarm cycle, and push the changes
necessary to use bulk extension in hio.c (the main win).
I might split the patch to use ExtendBufferedRelTo() into two, one for
vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
complicated and has more of a complicated history (see [1]).
Greetings,
Andres Freund
[1] https://postgr.es/m/20230223010147.32oir7sb66slqnjk@awork3.anarazel.de
Attachments:
v7-0001-Don-t-initialize-page-in-vm-fsm-_extend-not-neede.patch (text/x-diff)
From e1314898de28d1c7354df6a06eb53e4095f1928f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 28 Feb 2023 20:56:32 -0800
Subject: [PATCH v7 01/14] Don't initialize page in {vm,fsm}_extend(), not
needed
The read path needs to be able to initialize pages anyway, as relation
extensions are not durable. By avoiding initializing pages, we can, in a
future patch, extend the relation by multiple blocks at once.
Using smgrextend() for {vm,fsm}_extend() is not a good idea in general, as at
least one page of the VM/FSM will be read immediately after, always causing a
cache miss, requiring us to read content we just wrote.
Discussion: https://postgr.es/m/20230301223515.pucbj7nb54n4i4nv@awork3.anarazel.de
---
src/backend/access/heap/visibilitymap.c | 6 +-----
src/backend/storage/freespace/freespace.c | 5 +----
2 files changed, 2 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 11e6d0d479f..114d1b42b3e 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -622,11 +622,9 @@ static void
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
BlockNumber vm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
-
/*
* We use the relation extension lock to lock out other backends trying to
* extend the visibility map at the same time. It also locks out extension
@@ -662,8 +660,6 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
/* Now extend the file */
while (vm_nblocks_now < vm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, vm_nblocks_now);
-
smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
vm_nblocks_now++;
}
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 3e9693b293b..90c529958e7 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -608,10 +608,9 @@ static void
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg;
+ PGAlignedBlock pg = {0};
SMgrRelation reln;
- PageInit((Page) pg.data, BLCKSZ, 0);
/*
* We use the relation extension lock to lock out other backends trying to
@@ -649,8 +648,6 @@ fsm_extend(Relation rel, BlockNumber fsm_nblocks)
/* Extend as needed. */
while (fsm_nblocks_now < fsm_nblocks)
{
- PageSetChecksumInplace((Page) pg.data, fsm_nblocks_now);
-
smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
pg.data, false);
fsm_nblocks_now++;
--
2.38.0
v7-0002-Add-smgrzeroextend-FileZero-FileFallocate.patch (text/x-diff)
From 9ad944bd20dc92e6d790b70a056f362d68142847 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 27 Feb 2023 17:36:37 -0800
Subject: [PATCH v7 02/14] Add smgrzeroextend(), FileZero(), FileFallocate()
smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple
blocks. When extending by a small number of blocks, use FileZero() instead, as
using posix_fallocate() for small numbers of blocks is inefficient for some
file systems / operating systems. FileZero() is also used as the fallback for
FileFallocate() on platforms / filesystems that don't support fallocate.
A big advantage of using posix_fallocate() is that it typically won't cause
dirty buffers in the kernel pagecache. So far the most common pattern in our
code is that we smgrextend() a page full of zeroes and put the corresponding
page into shared buffers, from where we later write out the actual contents of
the page. If the kernel, e.g. due to memory pressure or elapsed time, already
wrote back the all-zeroes page, this can lead to doubling the amount of writes
reaching storage.
There are no users of the facility as of this commit. That will follow in
a future commit.
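For illustration, a future caller might look roughly like this (hypothetical
sketch, not part of this patch; reln is an open SMgrRelation):

    /* append 8 zeroed blocks at the current end of the main fork */
    BlockNumber first_new = smgrnblocks(reln, MAIN_FORKNUM);

    smgrzeroextend(reln, MAIN_FORKNUM, first_new, 8, false);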
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
---
src/include/storage/fd.h | 3 +
src/include/storage/md.h | 2 +
src/include/storage/smgr.h | 2 +
src/backend/storage/file/fd.c | 89 ++++++++++++++++++++++++++
src/backend/storage/smgr/md.c | 108 ++++++++++++++++++++++++++++++++
src/backend/storage/smgr/smgr.c | 28 +++++++++
6 files changed, 232 insertions(+)
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index f85de97d083..daceafd4732 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -106,6 +106,9 @@ extern int FilePrefetch(File file, off_t offset, off_t amount, uint32 wait_event
extern int FileRead(File file, void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileWrite(File file, const void *buffer, size_t amount, off_t offset, uint32 wait_event_info);
extern int FileSync(File file, uint32 wait_event_info);
+extern int FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info);
+extern int FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info);
+
extern off_t FileSize(File file);
extern int FileTruncate(File file, off_t offset, uint32 wait_event_info);
extern void FileWriteback(File file, off_t offset, off_t nbytes, uint32 wait_event_info);
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index 8f32af9ef3d..941879ee6a8 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -28,6 +28,8 @@ extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
extern void mdunlink(RelFileLocatorBackend rlocator, ForkNumber forknum, bool isRedo);
extern void mdextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool mdprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 0935144f425..a9a179aabac 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -92,6 +92,8 @@ extern void smgrdosyncall(SMgrRelation *rels, int nrels);
extern void smgrdounlinkall(SMgrRelation *rels, int nrels, bool isRedo);
extern void smgrextend(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+extern void smgrzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
extern bool smgrprefetch(SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
extern void smgrread(SMgrRelation reln, ForkNumber forknum,
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 2ac365e97cc..9eabdbc589e 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -2206,6 +2206,94 @@ FileSync(File file, uint32 wait_event_info)
return returnCode;
}
+/*
+ * Zero a region of the file.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileZero(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+ int returnCode;
+ ssize_t written;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileZero: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
+ file, VfdCache[file].fileName,
+ (int64) offset, (int64) amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return returnCode;
+
+ pgstat_report_wait_start(wait_event_info);
+ written = pg_pwrite_zeros(VfdCache[file].fd, amount, offset);
+ pgstat_report_wait_end();
+
+ if (written < 0)
+ return -1;
+ else if (written != amount)
+ {
+ /* if errno is unset, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ return -1;
+ }
+
+ return 0;
+}
+
+/*
+ * Try to reserve file space with posix_fallocate(). If posix_fallocate() is
+ * not implemented on the operating system or fails with EINVAL / EOPNOTSUPP,
+ * use FileZero() instead.
+ *
+ * Note that at least glibc implements posix_fallocate() in userspace if not
+ * implemented by the filesystem. That's not the case for all environments
+ * though.
+ *
+ * Returns 0 on success, -1 otherwise. In the latter case errno is set to the
+ * appropriate error.
+ */
+int
+FileFallocate(File file, off_t offset, off_t amount, uint32 wait_event_info)
+{
+#ifdef HAVE_POSIX_FALLOCATE
+ int returnCode;
+
+ Assert(FileIsValid(file));
+
+ DO_DB(elog(LOG, "FileFallocate: %d (%s) " INT64_FORMAT " " INT64_FORMAT,
+ file, VfdCache[file].fileName,
+ (int64) offset, (int64) amount));
+
+ returnCode = FileAccess(file);
+ if (returnCode < 0)
+ return -1;
+
+ pgstat_report_wait_start(wait_event_info);
+ returnCode = posix_fallocate(VfdCache[file].fd, offset, amount);
+ pgstat_report_wait_end();
+
+ if (returnCode == 0)
+ return 0;
+
+ /* for compatibility with %m printing etc */
+ errno = returnCode;
+
+ /*
+ * Return in case of a "real" failure; if fallocate is not supported,
+ * fall through to the FileZero()-backed implementation.
+ */
+ if (returnCode != EINVAL && returnCode != EOPNOTSUPP)
+ return -1;
+#endif
+
+ return FileZero(file, offset, amount, wait_event_info);
+}
+
off_t
FileSize(File file)
{
@@ -2278,6 +2366,7 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
return VfdCache[file].fd;
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1feb..1c2d1405f86 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -500,6 +500,114 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
+/*
+ * mdzeroextend() -- Add new zeroed out blocks to the specified relation.
+ *
+ * Similar to mdextend(), except the relation can be extended by multiple
+ * blocks at once and the added blocks will be filled with zeroes.
+ */
+void
+mdzeroextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync)
+{
+ MdfdVec *v;
+ BlockNumber curblocknum = blocknum;
+ int remblocks = nblocks;
+
+ Assert(nblocks > 0);
+
+ /* This assert is too expensive to have on normally ... */
+#ifdef CHECK_WRITE_VS_EXTEND
+ Assert(blocknum >= mdnblocks(reln, forknum));
+#endif
+
+ /*
+ * If a relation manages to grow to 2^32-1 blocks, refuse to extend it any
+ * more --- we mustn't create a block whose number actually is
+ * InvalidBlockNumber or larger.
+ */
+ if ((uint64) blocknum + nblocks >= (uint64) InvalidBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend file \"%s\" beyond %u blocks",
+ relpath(reln->smgr_rlocator, forknum),
+ InvalidBlockNumber)));
+
+ while (remblocks > 0)
+ {
+ BlockNumber segstartblock = curblocknum % ((BlockNumber) RELSEG_SIZE);
+ off_t seekpos = (off_t) BLCKSZ * segstartblock;
+ int numblocks;
+
+ if (segstartblock + remblocks > RELSEG_SIZE)
+ numblocks = RELSEG_SIZE - segstartblock;
+ else
+ numblocks = remblocks;
+
+ v = _mdfd_getseg(reln, forknum, curblocknum, skipFsync, EXTENSION_CREATE);
+
+ Assert(segstartblock < RELSEG_SIZE);
+ Assert(segstartblock + numblocks <= RELSEG_SIZE);
+
+ /*
+ * If available and useful, use posix_fallocate() (via FileFallocate())
+ * to extend the relation. That's often more efficient than using
+ * write(), as it commonly won't cause the kernel to allocate page
+ * cache space for the extended pages.
+ *
+ * However, we don't use FileFallocate() for small extensions, as it
+ * defeats delayed allocation on some filesystems. It's not clear
+ * where that decision should be made, though. For now just use a
+ * cutoff of 8; anything between 4 and 8 worked OK in some local testing.
+ */
+ if (numblocks > 8)
+ {
+ int ret;
+
+ ret = FileFallocate(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret != 0)
+ {
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\" with FileFallocate(): %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+ }
+ else
+ {
+ int ret;
+
+ /*
+ * Even if we don't want to use fallocate, we can still extend a
+ * bit more efficiently than writing each 8kB block individually.
+ * pg_pwrite_zeros() (via FileZero()) uses
+ * pg_pwritev_with_retry() to avoid multiple writes or needing a
+ * zeroed buffer for the whole length of the extension.
+ */
+ ret = FileZero(v->mdfd_vfd,
+ seekpos, (off_t) BLCKSZ * numblocks,
+ WAIT_EVENT_DATA_FILE_EXTEND);
+ if (ret < 0)
+ ereport(ERROR,
+ errcode_for_file_access(),
+ errmsg("could not extend file \"%s\": %m",
+ FilePathName(v->mdfd_vfd)),
+ errhint("Check free disk space."));
+ }
+
+ if (!skipFsync && !SmgrIsTemp(reln))
+ register_dirty_segment(reln, forknum, v);
+
+ Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
+
+ remblocks -= numblocks;
+ curblocknum += numblocks;
+ }
+}
+
/*
* mdopenfork() -- Open one fork of the specified relation.
*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index dc466e54145..c37c246b77f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -50,6 +50,8 @@ typedef struct f_smgr
bool isRedo);
void (*smgr_extend) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum, const void *buffer, bool skipFsync);
+ void (*smgr_zeroextend) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, int nblocks, bool skipFsync);
bool (*smgr_prefetch) (SMgrRelation reln, ForkNumber forknum,
BlockNumber blocknum);
void (*smgr_read) (SMgrRelation reln, ForkNumber forknum,
@@ -75,6 +77,7 @@ static const f_smgr smgrsw[] = {
.smgr_exists = mdexists,
.smgr_unlink = mdunlink,
.smgr_extend = mdextend,
+ .smgr_zeroextend = mdzeroextend,
.smgr_prefetch = mdprefetch,
.smgr_read = mdread,
.smgr_write = mdwrite,
@@ -507,6 +510,31 @@ smgrextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
}
+/*
+ * smgrzeroextend() -- Add new zeroed out blocks to a file.
+ *
+ * Similar to smgrextend(), except the relation can be extended by
+ * multiple blocks at once and the added blocks will be filled with
+ * zeroes.
+ */
+void
+smgrzeroextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ int nblocks, bool skipFsync)
+{
+ smgrsw[reln->smgr_which].smgr_zeroextend(reln, forknum, blocknum,
+ nblocks, skipFsync);
+
+ /*
+ * Normally we expect this to increase the fork size by nblocks, but if
+ * the cached value isn't as expected, just invalidate it so the next call
+ * asks the kernel.
+ */
+ if (reln->smgr_cached_nblocks[forknum] == blocknum)
+ reln->smgr_cached_nblocks[forknum] = blocknum + nblocks;
+ else
+ reln->smgr_cached_nblocks[forknum] = InvalidBlockNumber;
+}
+
/*
* smgrprefetch() -- Initiate asynchronous read of the specified block of a relation.
*
--
2.38.0
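
To make the segment handling in mdzeroextend() above easier to follow, here is
a small standalone sketch (not PostgreSQL code) that models how a multi-block
extension is split at segment boundaries and how FileFallocate() vs. FileZero()
is chosen per chunk. RELSEG_SIZE, BLCKSZ and the cutoff of 8 blocks mirror the
patch; everything else is made up for illustration.

#include <stdio.h>

#define RELSEG_SIZE 131072      /* blocks per 1GB segment at BLCKSZ = 8kB */
#define BLCKSZ 8192
#define FALLOCATE_CUTOFF 8

static void
sketch_zeroextend(unsigned curblocknum, int remblocks)
{
    while (remblocks > 0)
    {
        unsigned    segstartblock = curblocknum % RELSEG_SIZE;
        long long   seekpos = (long long) BLCKSZ * segstartblock;
        int         numblocks;

        /* never extend past the end of the current 1GB segment */
        if (segstartblock + remblocks > RELSEG_SIZE)
            numblocks = RELSEG_SIZE - segstartblock;
        else
            numblocks = remblocks;

        printf("segment %u: offset %lld, %d blocks via %s\n",
               curblocknum / RELSEG_SIZE, seekpos, numblocks,
               numblocks > FALLOCATE_CUTOFF ? "FileFallocate" : "FileZero");

        remblocks -= numblocks;
        curblocknum += numblocks;
    }
}

int
main(void)
{
    /* an extension of 20 blocks straddling a segment boundary */
    sketch_zeroextend(RELSEG_SIZE - 5, 20);
    return 0;
}

The 5 blocks left in the first segment go through FileZero(), the 15 blocks in
the next segment through FileFallocate().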
Attachment: v7-0003-bufmgr-Add-some-more-error-checking-infrastructur.patch (text/x-diff)
From 066202ceb610639602996cd920b571c0108b5710 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Jul 2020 19:06:45 -0700
Subject: [PATCH v7 03/14] bufmgr: Add some more error checking infrastructure
around pinning
This adds a few more assertions against a buffer being local in places we
don't expect, and extracts the check for a buffer being pinned exactly once
from LockBufferForCleanup() into its own function. Later commits will use this
function.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi
---
src/include/storage/bufmgr.h | 1 +
src/backend/storage/buffer/bufmgr.c | 46 +++++++++++++++++++++--------
2 files changed, 34 insertions(+), 13 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 73762cb1ec8..f96dc080211 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -134,6 +134,7 @@ extern void ReleaseBuffer(Buffer buffer);
extern void UnlockReleaseBuffer(Buffer buffer);
extern void MarkBufferDirty(Buffer buffer);
extern void IncrBufferRefCount(Buffer buffer);
+extern void CheckBufferIsPinnedOnce(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7f119cd4b0f..2362423b89d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1735,6 +1735,8 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
bool result;
PrivateRefCountEntry *ref;
+ Assert(!BufferIsLocal(b));
+
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)
@@ -1880,6 +1882,8 @@ UnpinBuffer(BufferDesc *buf)
PrivateRefCountEntry *ref;
Buffer b = BufferDescriptorGetBuffer(buf);
+ Assert(!BufferIsLocal(b));
+
/* not moving as we're likely deleting it soon anyway */
ref = GetPrivateRefCountEntry(b, false);
Assert(ref != NULL);
@@ -4253,6 +4257,29 @@ ConditionalLockBuffer(Buffer buffer)
LW_EXCLUSIVE);
}
+/*
+ * Verify that this backend is pinning the buffer exactly once.
+ *
+ * NOTE: Like in BufferIsPinned(), what we check here is that *this* backend
+ * holds a pin on the buffer. We do not care whether some other backend does.
+ */
+void
+CheckBufferIsPinnedOnce(Buffer buffer)
+{
+ if (BufferIsLocal(buffer))
+ {
+ if (LocalRefCount[-buffer - 1] != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ LocalRefCount[-buffer - 1]);
+ }
+ else
+ {
+ if (GetPrivateRefCount(buffer) != 1)
+ elog(ERROR, "incorrect local pin count: %d",
+ GetPrivateRefCount(buffer));
+ }
+}
+
/*
* LockBufferForCleanup - lock a buffer in preparation for deleting items
*
@@ -4280,20 +4307,11 @@ LockBufferForCleanup(Buffer buffer)
Assert(BufferIsPinned(buffer));
Assert(PinCountWaitBuf == NULL);
- if (BufferIsLocal(buffer))
- {
- /* There should be exactly one pin */
- if (LocalRefCount[-buffer - 1] != 1)
- elog(ERROR, "incorrect local pin count: %d",
- LocalRefCount[-buffer - 1]);
- /* Nobody else to wait for */
- return;
- }
+ CheckBufferIsPinnedOnce(buffer);
- /* There should be exactly one local pin */
- if (GetPrivateRefCount(buffer) != 1)
- elog(ERROR, "incorrect local pin count: %d",
- GetPrivateRefCount(buffer));
+ /* Nobody else to wait for */
+ if (BufferIsLocal(buffer))
+ return;
bufHdr = GetBufferDescriptor(buffer - 1);
@@ -4794,6 +4812,8 @@ LockBufHdr(BufferDesc *desc)
SpinDelayStatus delayStatus;
uint32 old_buf_state;
+ Assert(!BufferIsLocal(BufferDescriptorGetBuffer(desc)));
+
init_local_spin_delay(&delayStatus);
while (true)
--
2.38.0
Attachment: v7-0004-bufmgr-Add-Pin-UnpinLocalBuffer.patch (text/x-diff)
From 1a03baf9b3ef0bd1c82880123fdc11bd92922b27 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 3 Apr 2023 18:40:14 -0700
Subject: [PATCH v7 04/14] bufmgr: Add Pin/UnpinLocalBuffer()
So far these were open-coded in quite a few places, without a good reason.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Reviewed-by: David Rowley <dgrowleyml@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
---
src/include/storage/buf_internals.h | 2 +
src/backend/storage/buffer/bufmgr.c | 30 ++----------
src/backend/storage/buffer/localbuf.c | 67 ++++++++++++++++++---------
3 files changed, 51 insertions(+), 48 deletions(-)
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 2afb9bb3099..970d0090615 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -415,6 +415,8 @@ extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
/* localbuf.c */
+extern bool PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount);
+extern void UnpinLocalBuffer(Buffer buffer);
extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
ForkNumber forkNum,
BlockNumber blockNum);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 2362423b89d..1c3dec487a1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -636,20 +636,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
/* Is it still valid and holding the right tag? */
if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
{
- /*
- * Bump buffer's ref and usage counts. This is equivalent of
- * PinBuffer for a shared buffer.
- */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner, recent_buffer);
+ PinLocalBuffer(bufHdr, true);
pgBufferUsage.local_blks_hit++;
@@ -1688,8 +1675,7 @@ ReleaseAndReadBuffer(Buffer buffer,
BufTagMatchesRelFileLocator(&bufHdr->tag, &relation->rd_locator) &&
BufTagGetForkNum(&bufHdr->tag) == forkNum)
return buffer;
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[-buffer - 1]--;
+ UnpinLocalBuffer(buffer);
}
else
{
@@ -3982,15 +3968,9 @@ ReleaseBuffer(Buffer buffer)
elog(ERROR, "bad buffer ID: %d", buffer);
if (BufferIsLocal(buffer))
- {
- ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
-
- Assert(LocalRefCount[-buffer - 1] > 0);
- LocalRefCount[-buffer - 1]--;
- return;
- }
-
- UnpinBuffer(GetBufferDescriptor(buffer - 1));
+ UnpinLocalBuffer(buffer);
+ else
+ UnpinBuffer(GetBufferDescriptor(buffer - 1));
}
/*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 6f9e7eda57c..940b80d165e 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -137,27 +137,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
#endif
- buf_state = pg_atomic_read_u32(&bufHdr->state);
- /* this part is equivalent to PinBuffer for a shared buffer */
- if (LocalRefCount[b] == 0)
- {
- if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
- {
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
- }
- }
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
- if (buf_state & BM_VALID)
- *foundPtr = true;
- else
- {
- /* Previous read attempt must have failed; try again */
- *foundPtr = false;
- }
+ *foundPtr = PinLocalBuffer(bufHdr, true);
return bufHdr;
}
@@ -194,9 +175,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
else
{
/* Found a usable buffer */
- LocalRefCount[b]++;
- ResourceOwnerRememberBuffer(CurrentResourceOwner,
- BufferDescriptorGetBuffer(bufHdr));
+ PinLocalBuffer(bufHdr, false);
break;
}
}
@@ -484,6 +463,48 @@ InitLocalBuffers(void)
NLocBuffer = nbufs;
}
+/*
+ * XXX: We could have a slightly more efficient version of PinLocalBuffer()
+ * that does not support adjusting the usagecount - but so far it does not
+ * seem worth the trouble.
+ */
+bool
+PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
+{
+ uint32 buf_state;
+ Buffer buffer = BufferDescriptorGetBuffer(buf_hdr);
+ int bufid = -buffer - 1;
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ if (LocalRefCount[bufid] == 0)
+ {
+ if (adjust_usagecount &&
+ BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
+ {
+ buf_state += BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+ }
+ LocalRefCount[bufid]++;
+ ResourceOwnerRememberBuffer(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf_hdr));
+
+ return buf_state & BM_VALID;
+}
+
+void
+UnpinLocalBuffer(Buffer buffer)
+{
+ int buffid = -buffer - 1;
+
+ Assert(BufferIsLocal(buffer));
+ Assert(LocalRefCount[buffid] > 0);
+
+ ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
+ LocalRefCount[buffid]--;
+}
+
/*
* GUC check_hook for temp_buffers
*/
--
2.38.0
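
One detail worth calling out about PinLocalBuffer() above: the usage count is
only bumped when the pin count goes from zero to one, and only if
adjust_usagecount is true; nested pins never bump it. A tiny standalone model
(made-up names, not PostgreSQL code) of just that rule:

#include <stdio.h>

#define BM_MAX_USAGE_COUNT 5

struct toy_local_buf
{
    int         refcount;       /* LocalRefCount[] analogue */
    int         usagecount;
};

static void
toy_pin_local(struct toy_local_buf *b, int adjust_usagecount)
{
    if (b->refcount == 0 &&
        adjust_usagecount &&
        b->usagecount < BM_MAX_USAGE_COUNT)
        b->usagecount++;
    b->refcount++;
}

int
main(void)
{
    struct toy_local_buf b = {0, 0};

    toy_pin_local(&b, 1);       /* first pin: usagecount 0 -> 1 */
    toy_pin_local(&b, 1);       /* nested pin: usagecount stays 1 */
    toy_pin_local(&b, 0);       /* adjust_usagecount = false never bumps */
    printf("refcount=%d usagecount=%d\n", b.refcount, b.usagecount);
    return 0;
}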
Attachment: v7-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch (text/x-diff)
From 520c8133e306258062260ea33bdf24182c5d907d Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 2 Apr 2023 14:45:17 -0700
Subject: [PATCH v7 05/14] bufmgr: Acquire and clean victim buffer separately
Previously we held buffer locks for two buffer mapping partitions at the same
time to change the identity of buffers. Particularly for relation extension,
needing to hold the extension lock while acquiring a victim buffer is
painful. But it also creates a bottleneck for workloads that just involve
reads.
Now we instead first acquire a victim buffer and write it out, if
necessary. Then we remove that buffer from the old partition with just the old
partition's partition lock held and insert it into the new partition with just
that partition's lock held.
By separating out the victim buffer acquisition, future commits will be able
to change relation extensions to scale better.
On my workstation, a micro-benchmark exercising buffered reads strenuously and
under a lot of concurrency sees a >2x improvement.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
---
src/backend/storage/buffer/bufmgr.c | 581 ++++++++++++++------------
src/backend/storage/buffer/localbuf.c | 115 ++---
2 files changed, 379 insertions(+), 317 deletions(-)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1c3dec487a1..497050c83d4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -473,6 +473,7 @@ static BufferDesc *BufferAlloc(SMgrRelation smgr,
BlockNumber blockNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext io_context);
+static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
IOObject io_object, IOContext io_context);
static void FindAndDropRelationBuffers(RelFileLocator rlocator,
@@ -1123,18 +1124,14 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr, IOContext io_context)
{
- bool from_ring;
BufferTag newTag; /* identity of requested block */
uint32 newHash; /* hash value for newTag */
LWLock *newPartitionLock; /* buffer partition lock for it */
- BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
- uint32 oldFlags;
- int buf_id;
- BufferDesc *buf;
- bool valid;
- uint32 buf_state;
+ int existing_buf_id;
+
+ Buffer victim_buffer;
+ BufferDesc *victim_buf_hdr;
+ uint32 victim_buf_state;
/* create a tag so we can lookup the buffer */
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -1145,15 +1142,18 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* see if the block is in the buffer pool already */
LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- if (buf_id >= 0)
+ existing_buf_id = BufTableLookup(&newTag, newHash);
+ if (existing_buf_id >= 0)
{
+ BufferDesc *buf;
+ bool valid;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
* buffer pool, and check to see if the correct data has been loaded
* into the buffer.
*/
- buf = GetBufferDescriptor(buf_id);
+ buf = GetBufferDescriptor(existing_buf_id);
valid = PinBuffer(buf, strategy);
@@ -1190,293 +1190,111 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
LWLockRelease(newPartitionLock);
- /* Loop here in case we have to try another victim buffer */
- for (;;)
+ /*
+ * Acquire a victim buffer. Somebody else might try to do the same, we
+ * don't hold any conflicting locks. If so we'll have to undo our work
+ * later.
+ */
+ victim_buffer = GetVictimBuffer(strategy, io_context);
+ victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);
+
+ /*
+ * Try to make a hashtable entry for the buffer under its new tag. If
+ * somebody else inserted another buffer for the tag, we'll release the
+ * victim buffer we acquired and use the already inserted one.
+ */
+ LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+ existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
+ if (existing_buf_id >= 0)
{
- /*
- * Ensure, while the spinlock's not yet held, that there's a free
- * refcount entry.
- */
- ReservePrivateRefCountEntry();
+ BufferDesc *existing_buf_hdr;
+ bool valid;
/*
- * Select a victim buffer. The buffer is returned with its header
- * spinlock still held!
+ * Got a collision. Someone has already done what we were about to do.
+ * We'll just handle this as if it were found in the buffer pool in
+ * the first place. First, give up the buffer we were planning to
+ * use.
+ *
+ * We could do this after releasing the partition lock, but then we'd
+ * have to call ResourceOwnerEnlargeBuffers() &
+ * ReservePrivateRefCountEntry() before acquiring the lock, for the
+ * rare case of such a collision.
*/
- buf = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ UnpinBuffer(victim_buf_hdr);
- Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+ /* FIXME: Should we put the victim buffer onto the freelist? */
- /* Must copy buffer flags while we still hold the spinlock */
- oldFlags = buf_state & BUF_FLAG_MASK;
+ /* remaining code should match code at top of routine */
- /* Pin the buffer and then release the buffer spinlock */
- PinBuffer_Locked(buf);
+ existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
- /*
- * If the buffer was dirty, try to write it out. There is a race
- * condition here, in that someone might dirty it after we released it
- * above, or even while we are writing it out (since our share-lock
- * won't prevent hint-bit updates). We will recheck the dirty bit
- * after re-locking the buffer header.
- */
- if (oldFlags & BM_DIRTY)
- {
- /*
- * We need a share-lock on the buffer contents to write it out
- * (else we might write invalid data, eg because someone else is
- * compacting the page contents while we write). We must use a
- * conditional lock acquisition here to avoid deadlock. Even
- * though the buffer was not pinned (and therefore surely not
- * locked) when StrategyGetBuffer returned it, someone else could
- * have pinned and exclusive-locked it by the time we get here. If
- * we try to get the lock unconditionally, we'd block waiting for
- * them; if they later block waiting for us, deadlock ensues.
- * (This has been observed to happen when two backends are both
- * trying to split btree index pages, and the second one just
- * happens to be trying to split the page the first one got from
- * StrategyGetBuffer.)
- */
- if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
- LW_SHARED))
- {
- /*
- * If using a nondefault strategy, and writing the buffer
- * would require a WAL flush, let the strategy decide whether
- * to go ahead and write/reuse the buffer or to choose another
- * victim. We need lock to inspect the page LSN, so this
- * can't be done inside StrategyGetBuffer.
- */
- if (strategy != NULL)
- {
- XLogRecPtr lsn;
+ valid = PinBuffer(existing_buf_hdr, strategy);
- /* Read the LSN while holding buffer header lock */
- buf_state = LockBufHdr(buf);
- lsn = BufferGetLSN(buf);
- UnlockBufHdr(buf, buf_state);
-
- if (XLogNeedsFlush(lsn) &&
- StrategyRejectBuffer(strategy, buf, from_ring))
- {
- /* Drop lock/pin and loop around for another buffer */
- LWLockRelease(BufferDescriptorGetContentLock(buf));
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /* OK, do the I/O */
- FlushBuffer(buf, NULL, IOOBJECT_RELATION, io_context);
- LWLockRelease(BufferDescriptorGetContentLock(buf));
-
- ScheduleBufferTagForWriteback(&BackendWritebackContext,
- &buf->tag);
- }
- else
- {
- /*
- * Someone else has locked the buffer, so give it up and loop
- * back to get another one.
- */
- UnpinBuffer(buf);
- continue;
- }
- }
-
- /*
- * To change the association of a valid buffer, we'll need to have
- * exclusive lock on both the old and new mapping partitions.
- */
- if (oldFlags & BM_TAG_VALID)
- {
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
- oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* remember we have no old-partition lock or tag */
- oldPartitionLock = NULL;
- /* keep the compiler quiet about uninitialized variables */
- oldHash = 0;
- }
-
- /*
- * Try to make a hashtable entry for the buffer under its new tag.
- * This could fail because while we were writing someone else
- * allocated another buffer for the same block we want to read in.
- * Note that we have not yet removed the hashtable entry for the old
- * tag.
- */
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
- if (buf_id >= 0)
- {
- /*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
- */
- UnpinBuffer(buf);
-
- /* Can give up that buffer's mapping partition lock now */
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = GetBufferDescriptor(buf_id);
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
- *foundPtr = true;
-
- if (!valid)
- {
- /*
- * We can only get here if (a) someone else is still reading
- * in the page, or (b) a previous read attempt failed. We
- * have to wait for any active read attempt to finish, and
- * then set up our own read attempt if the page is still not
- * BM_VALID. StartBufferIO does it all.
- */
- if (StartBufferIO(buf, true))
- {
- /*
- * If we get here, previous attempts to read the buffer
- * must have failed ... but we shall bravely try again.
- */
- *foundPtr = false;
- }
- }
-
- return buf;
- }
-
- /*
- * Need to lock the buffer header too in order to change its tag.
- */
- buf_state = LockBufHdr(buf);
-
- /*
- * Somebody could have pinned or re-dirtied the buffer while we were
- * doing the I/O and making the new hashtable entry. If so, we can't
- * recycle this buffer; we must undo everything we've done and start
- * over with a new victim buffer.
- */
- oldFlags = buf_state & BUF_FLAG_MASK;
- if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
- break;
-
- UnlockBufHdr(buf, buf_state);
- BufTableDelete(&newTag, newHash);
- if (oldPartitionLock != NULL &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
+ /* Can release the mapping lock as soon as we've pinned it */
LWLockRelease(newPartitionLock);
- UnpinBuffer(buf);
+
+ *foundPtr = true;
+
+ if (!valid)
+ {
+ /*
+ * We can only get here if (a) someone else is still reading in
+ * the page, or (b) a previous read attempt failed. We have to
+ * wait for any active read attempt to finish, and then set up our
+ * own read attempt if the page is still not BM_VALID.
+ * StartBufferIO does it all.
+ */
+ if (StartBufferIO(existing_buf_hdr, true))
+ {
+ /*
+ * If we get here, previous attempts to read the buffer must
+ * have failed ... but we shall bravely try again.
+ */
+ *foundPtr = false;
+ }
+ }
+
+ return existing_buf_hdr;
}
/*
- * Okay, it's finally safe to rename the buffer.
- *
- * Clearing BM_VALID here is necessary, clearing the dirtybits is just
- * paranoia. We also reset the usage_count since any recency of use of
- * the old content is no longer relevant. (The usage_count starts out at
- * 1 so that the buffer can survive one clock-sweep pass.)
- *
+ * Need to lock the buffer header too in order to change its tag.
+ */
+ victim_buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(BUF_STATE_GET_REFCOUNT(victim_buf_state) == 1);
+ Assert(!(victim_buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_IO_IN_PROGRESS)));
+
+ victim_buf_hdr->tag = newTag;
+
+ /*
* Make sure BM_PERMANENT is set for buffers that must be written at every
* checkpoint. Unlogged buffers only need to be written at shutdown
* checkpoints, except for their "init" forks, which need to be treated
* just like permanent relations.
*/
- buf->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
- BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
- BUF_USAGECOUNT_MASK);
+ victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
- buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
- else
- buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ victim_buf_state |= BM_PERMANENT;
- UnlockBufHdr(buf, buf_state);
-
- if (oldPartitionLock != NULL)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
+ UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);
- if (oldFlags & BM_VALID)
- {
- /*
- * When a BufferAccessStrategy is in use, blocks evicted from shared
- * buffers are counted as IOOP_EVICT in the corresponding context
- * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
- * strategy in two cases: 1) while initially claiming buffers for the
- * strategy ring 2) to replace an existing strategy ring buffer
- * because it is pinned or in use and cannot be reused.
- *
- * Blocks evicted from buffers already in the strategy ring are
- * counted as IOOP_REUSE in the corresponding strategy context.
- *
- * At this point, we can accurately count evictions and reuses,
- * because we have successfully claimed the valid buffer. Previously,
- * we may have been forced to release the buffer due to concurrent
- * pinners or erroring out.
- */
- pgstat_count_io_op(IOOBJECT_RELATION, io_context,
- from_ring ? IOOP_REUSE : IOOP_EVICT);
- }
-
/*
* Buffer contents are currently invalid. Try to obtain the right to
* start I/O. If StartBufferIO returns false, then someone else managed
* to read it before we did, so there's nothing left for BufferAlloc() to
* do.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(victim_buf_hdr, true))
*foundPtr = false;
else
*foundPtr = true;
- return buf;
+ return victim_buf_hdr;
}
/*
@@ -1585,6 +1403,237 @@ retry:
StrategyFreeBuffer(buf);
}
+/*
+ * Helper routine for GetVictimBuffer()
+ *
+ * Needs to be called on a buffer with a valid tag, pinned, but without the
+ * buffer header spinlock held.
+ *
+ * Returns true if the buffer can be reused, in which case the buffer is only
+ * pinned by this backend and marked as invalid, false otherwise.
+ */
+static bool
+InvalidateVictimBuffer(BufferDesc *buf_hdr)
+{
+ uint32 buf_state;
+ uint32 hash;
+ LWLock *partition_lock;
+ BufferTag tag;
+
+ Assert(GetPrivateRefCount(BufferDescriptorGetBuffer(buf_hdr)) == 1);
+
+ /* have buffer pinned, so it's safe to read tag without lock */
+ tag = buf_hdr->tag;
+
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ /* lock the buffer header */
+ buf_state = LockBufHdr(buf_hdr);
+
+ /*
+ * We have the buffer pinned nobody else should have been able to unset
+ * this concurrently.
+ */
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BufferTagsEqual(&buf_hdr->tag, &tag));
+
+ /*
+ * If somebody else pinned the buffer since, or even worse, dirtied
+ * it, give up on this buffer: It's clearly in use.
+ */
+ if (BUF_STATE_GET_REFCOUNT(buf_state) != 1 || (buf_state & BM_DIRTY))
+ {
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ UnlockBufHdr(buf_hdr, buf_state);
+ LWLockRelease(partition_lock);
+
+ return false;
+ }
+
+ /*
+ * Clear out the buffer's tag and flags and usagecount. This is not
+ * strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
+ * doing anything with the buffer. But currently it's beneficial as the
+ * pre-check for several linear scans of shared buffers just checks the
+ * tag.
+ */
+ ClearBufferTag(&buf_hdr->tag);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+
+ /* finally delete buffer from the buffer mapping table */
+ BufTableDelete(&tag, hash);
+
+ LWLockRelease(partition_lock);
+
+ Assert(!(buf_state & (BM_DIRTY | BM_VALID | BM_TAG_VALID)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) > 0);
+ Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf_hdr->state)) > 0);
+
+ return true;
+}
+
+static Buffer
+GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
+{
+ BufferDesc *buf_hdr;
+ Buffer buf;
+ uint32 buf_state;
+ bool from_ring;
+
+ /*
+ * Ensure, while the spinlock's not yet held, that there's a free refcount
+ * entry.
+ */
+ ReservePrivateRefCountEntry();
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /* we return here if a prospective victim buffer gets used concurrently */
+again:
+
+ /*
+ * Select a victim buffer. The buffer is returned with its header
+ * spinlock still held!
+ */
+ buf_hdr = StrategyGetBuffer(strategy, &buf_state, &from_ring);
+ buf = BufferDescriptorGetBuffer(buf_hdr);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+
+ /* Pin the buffer and then release the buffer spinlock */
+ PinBuffer_Locked(buf_hdr);
+
+ /*
+ * We shouldn't have any other pins for this buffer.
+ */
+ CheckBufferIsPinnedOnce(buf);
+
+ /*
+ * If the buffer was dirty, try to write it out. There is a race
+ * condition here, in that someone might dirty it after we released the
+ * buffer header lock above, or even while we are writing it out (since
+ * our share-lock won't prevent hint-bit updates). We will recheck the
+ * dirty bit after re-locking the buffer header.
+ */
+ if (buf_state & BM_DIRTY)
+ {
+ LWLock *content_lock;
+
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(buf_state & BM_VALID);
+
+ /*
+ * We need a share-lock on the buffer contents to write it out (else
+ * we might write invalid data, eg because someone else is compacting
+ * the page contents while we write). We must use a conditional lock
+ * acquisition here to avoid deadlock. Even though the buffer was not
+ * pinned (and therefore surely not locked) when StrategyGetBuffer
+ * returned it, someone else could have pinned and exclusive-locked it
+ * by the time we get here. If we try to get the lock unconditionally,
+ * we'd block waiting for them; if they later block waiting for us,
+ * deadlock ensues. (This has been observed to happen when two
+ * backends are both trying to split btree index pages, and the second
+ * one just happens to be trying to split the page the first one got
+ * from StrategyGetBuffer.)
+ */
+ content_lock = BufferDescriptorGetContentLock(buf_hdr);
+ if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+ {
+ /*
+ * Someone else has locked the buffer, so give it up and loop back
+ * to get another one.
+ */
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /*
+ * If using a nondefault strategy, and writing the buffer would
+ * require a WAL flush, let the strategy decide whether to go ahead
+ * and write/reuse the buffer or to choose another victim. We need
+ * lock to inspect the page LSN, so this can't be done inside
+ * StrategyGetBuffer.
+ */
+ if (strategy != NULL)
+ {
+ XLogRecPtr lsn;
+
+ /* Read the LSN while holding buffer header lock */
+ buf_state = LockBufHdr(buf_hdr);
+ lsn = BufferGetLSN(buf_hdr);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ if (XLogNeedsFlush(lsn)
+ && StrategyRejectBuffer(strategy, buf_hdr, from_ring))
+ {
+ LWLockRelease(content_lock);
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+ }
+
+ /* OK, do the I/O */
+ FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
+ LWLockRelease(content_lock);
+
+ ScheduleBufferTagForWriteback(&BackendWritebackContext,
+ &buf_hdr->tag);
+ }
+
+
+ if (buf_state & BM_VALID)
+ {
+ /*
+ * When a BufferAccessStrategy is in use, blocks evicted from shared
+ * buffers are counted as IOOP_EVICT in the corresponding context
+ * (e.g. IOCONTEXT_BULKWRITE). Shared buffers are evicted by a
+ * strategy in two cases: 1) while initially claiming buffers for the
+ * strategy ring 2) to replace an existing strategy ring buffer
+ * because it is pinned or in use and cannot be reused.
+ *
+ * Blocks evicted from buffers already in the strategy ring are
+ * counted as IOOP_REUSE in the corresponding strategy context.
+ *
+ * At this point, we can accurately count evictions and reuses,
+ * because we have successfully claimed the valid buffer. Previously,
+ * we may have been forced to release the buffer due to concurrent
+ * pinners or erroring out.
+ */
+ pgstat_count_io_op(IOOBJECT_RELATION, io_context,
+ from_ring ? IOOP_REUSE : IOOP_EVICT);
+ }
+
+ /*
+ * If the buffer has an entry in the buffer mapping table, delete it. This
+ * can fail because another backend could have pinned or dirtied the
+ * buffer.
+ */
+ if ((buf_state & BM_TAG_VALID) && !InvalidateVictimBuffer(buf_hdr))
+ {
+ UnpinBuffer(buf_hdr);
+ goto again;
+ }
+
+ /* a final set of sanity checks */
+#ifdef USE_ASSERT_CHECKING
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+ Assert(!(buf_state & (BM_TAG_VALID | BM_VALID | BM_DIRTY)));
+
+ CheckBufferIsPinnedOnce(buf);
+#endif
+
+ return buf;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 940b80d165e..c9ba5ee00ff 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -45,13 +45,14 @@ BufferDesc *LocalBufferDescriptors = NULL;
Block *LocalBufferBlockPointers = NULL;
int32 *LocalRefCount = NULL;
-static int nextFreeLocalBuf = 0;
+static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
+static Buffer GetLocalVictimBuffer(void);
/*
@@ -113,10 +114,9 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
BufferTag newTag; /* identity of requested block */
LocalBufferLookupEnt *hresult;
BufferDesc *bufHdr;
- int b;
- int trycounter;
+ Buffer victim_buffer;
+ int bufid;
bool found;
- uint32 buf_state;
InitBufferTag(&newTag, &smgr->smgr_rlocator.locator, forkNum, blockNum);
@@ -130,23 +130,51 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
if (hresult)
{
- b = hresult->id;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufid = hresult->id;
+ bufHdr = GetLocalBufferDescriptor(bufid);
Assert(BufferTagsEqual(&bufHdr->tag, &newTag));
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum, -b - 1);
-#endif
*foundPtr = PinLocalBuffer(bufHdr, true);
- return bufHdr;
+ }
+ else
+ {
+ uint32 buf_state;
+
+ victim_buffer = GetLocalVictimBuffer();
+ bufid = -victim_buffer - 1;
+ bufHdr = GetLocalBufferDescriptor(bufid);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
+ if (found) /* shouldn't happen */
+ elog(ERROR, "local buffer hash table corrupted");
+ hresult->id = bufid;
+
+ /*
+ * it's all ours now.
+ */
+ bufHdr->tag = newTag;
+
+ buf_state = pg_atomic_read_u32(&bufHdr->state);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+
+ *foundPtr = false;
}
-#ifdef LBDEBUG
- fprintf(stderr, "LB ALLOC (%u,%d,%d) %d\n",
- smgr->smgr_rlocator.locator.relNumber, forkNum, blockNum,
- -nextFreeLocalBuf - 1);
-#endif
+ return bufHdr;
+}
+
+static Buffer
+GetLocalVictimBuffer(void)
+{
+ int victim_bufid;
+ int trycounter;
+ uint32 buf_state;
+ BufferDesc *bufHdr;
+
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
/*
* Need to get a new buffer. We use a clock sweep algorithm (essentially
@@ -155,14 +183,14 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
trycounter = NLocBuffer;
for (;;)
{
- b = nextFreeLocalBuf;
+ victim_bufid = nextFreeLocalBufId;
- if (++nextFreeLocalBuf >= NLocBuffer)
- nextFreeLocalBuf = 0;
+ if (++nextFreeLocalBufId >= NLocBuffer)
+ nextFreeLocalBufId = 0;
- bufHdr = GetLocalBufferDescriptor(b);
+ bufHdr = GetLocalBufferDescriptor(victim_bufid);
- if (LocalRefCount[b] == 0)
+ if (LocalRefCount[victim_bufid] == 0)
{
buf_state = pg_atomic_read_u32(&bufHdr->state);
@@ -185,6 +213,15 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
errmsg("no empty local buffer available")));
}
+ /*
+ * lazy memory allocation: allocate space on first use of a buffer.
+ */
+ if (LocalBufHdrGetBlock(bufHdr) == NULL)
+ {
+ /* Set pointer for use by BufferGetBlock() macro */
+ LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
+ }
+
/*
* this buffer is not referenced but it might still be dirty. if that's
* the case, write it out before reusing it!
@@ -216,48 +253,24 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
}
/*
- * lazy memory allocation: allocate space on first use of a buffer.
- */
- if (LocalBufHdrGetBlock(bufHdr) == NULL)
- {
- /* Set pointer for use by BufferGetBlock() macro */
- LocalBufHdrGetBlock(bufHdr) = GetLocalBufferStorage();
- }
-
- /*
- * Update the hash table: remove old entry, if any, and make new one.
+ * Remove the victim buffer from the hashtable and mark as invalid.
*/
if (buf_state & BM_TAG_VALID)
{
+ LocalBufferLookupEnt *hresult;
+
hresult = (LocalBufferLookupEnt *)
hash_search(LocalBufHash, &bufHdr->tag, HASH_REMOVE, NULL);
if (!hresult) /* shouldn't happen */
elog(ERROR, "local buffer hash table corrupted");
/* mark buffer invalid just in case hash insert fails */
ClearBufferTag(&bufHdr->tag);
- buf_state &= ~(BM_VALID | BM_TAG_VALID);
+ buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EVICT);
}
- hresult = (LocalBufferLookupEnt *)
- hash_search(LocalBufHash, &newTag, HASH_ENTER, &found);
- if (found) /* shouldn't happen */
- elog(ERROR, "local buffer hash table corrupted");
- hresult->id = b;
-
- /*
- * it's all ours now.
- */
- bufHdr->tag = newTag;
- buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_IO_ERROR);
- buf_state |= BM_TAG_VALID;
- buf_state &= ~BUF_USAGECOUNT_MASK;
- buf_state += BUF_USAGECOUNT_ONE;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
-
- *foundPtr = false;
- return bufHdr;
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
@@ -424,7 +437,7 @@ InitLocalBuffers(void)
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
- nextFreeLocalBuf = 0;
+ nextFreeLocalBufId = 0;
/* initialize fields that need to start off nonzero */
for (i = 0; i < nbufs; i++)
--
2.38.0
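
Condensed to its essence, the BufferAlloc() flow after this patch is: first
acquire a pinned, clean, invalid victim buffer via GetVictimBuffer(), then
insert the new tag while holding only the new partition's lock; on a collision
the victim is simply unpinned again and the already-inserted buffer is used
instead. A standalone toy model of just that control flow (made-up names, no
locking, a one-entry "mapping table"; not PostgreSQL code):

#include <stdio.h>
#include <stdbool.h>

#define NBUFFERS 4

static int  refcount[NBUFFERS];
static int  mapping_for_tag = -1;   /* toy one-entry buffer mapping table */

static int
toy_buffer_alloc(int victim, bool *found)
{
    refcount[victim]++;             /* GetVictimBuffer(): pinned, clean, invalid */

    if (mapping_for_tag >= 0)       /* BufTableInsert() reported a collision */
    {
        refcount[victim]--;         /* just give the victim back ... */
        refcount[mapping_for_tag]++;    /* ... and pin the existing buffer */
        *found = true;
        return mapping_for_tag;
    }

    mapping_for_tag = victim;       /* we won the race; caller reads the page */
    *found = false;
    return victim;
}

int
main(void)
{
    bool        found1,
                found2;
    int         b1 = toy_buffer_alloc(0, &found1);
    int         b2 = toy_buffer_alloc(1, &found2);  /* loses the race */

    printf("winner got buf %d (found=%d), loser got buf %d (found=%d)\n",
           b1, (int) found1, b2, (int) found2);
    printf("pins after both calls: buf0=%d buf1=%d\n",
           refcount[0], refcount[1]);
    return 0;
}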
Attachment: v7-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch (text/x-diff)
From 57c7c01d0a9f7d5b87b8b7bb93b41c0cdbd5c426 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 16:44:16 -0700
Subject: [PATCH v7 06/14] bufmgr: Support multiple in-progress IOs by using
resowner
A future patch will add support for extending relations by multiple blocks at
once. To be concurrency safe, the buffers for those blocks need to be marked
as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from
an IO error for a single buffer. This commit extends that infrastructure to
multiple buffers by using the resource owner infrastructure.
This commit increases the size of the ResourceOwnerData struct, which appears
to have a barely measurable overhead in very extreme workloads. Medium term we
are planning to substantially shrink the size of ResourceOwnerData. Short term
the increase is small enough not to worry about.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de
---
src/include/storage/bufmgr.h | 2 +-
src/include/utils/resowner_private.h | 5 ++
src/backend/access/transam/xact.c | 4 +-
src/backend/postmaster/autovacuum.c | 1 -
src/backend/postmaster/bgwriter.c | 1 -
src/backend/postmaster/checkpointer.c | 1 -
src/backend/postmaster/walwriter.c | 1 -
src/backend/storage/buffer/bufmgr.c | 86 ++++++++++++---------------
src/backend/utils/resowner/resowner.c | 60 +++++++++++++++++++
9 files changed, 105 insertions(+), 56 deletions(-)
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f96dc080211..537b89e8774 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -181,7 +181,7 @@ extern bool ConditionalLockBufferForCleanup(Buffer buffer);
extern bool IsBufferCleanupOK(Buffer buffer);
extern bool HoldingBufferPinThatDelaysRecovery(void);
-extern void AbortBufferIO(void);
+extern void AbortBufferIO(Buffer buffer);
extern bool BgBufferSync(struct WritebackContext *wb_context);
diff --git a/src/include/utils/resowner_private.h b/src/include/utils/resowner_private.h
index 1b1f3181b54..ae58438ec76 100644
--- a/src/include/utils/resowner_private.h
+++ b/src/include/utils/resowner_private.h
@@ -30,6 +30,11 @@ extern void ResourceOwnerEnlargeBuffers(ResourceOwner owner);
extern void ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer);
extern void ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer);
+/* support for IO-in-progress management */
+extern void ResourceOwnerEnlargeBufferIOs(ResourceOwner owner);
+extern void ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer);
+extern void ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer);
+
/* support for local lock management */
extern void ResourceOwnerRememberLock(ResourceOwner owner, LOCALLOCK *locallock);
extern void ResourceOwnerForgetLock(ResourceOwner owner, LOCALLOCK *locallock);
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 231af52cc92..6a837e1539d 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2719,8 +2719,7 @@ AbortTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- /* Clean up buffer I/O and buffer context locks, too */
- AbortBufferIO();
+ /* Clean up buffer context locks, too */
UnlockBuffers();
/* Reset WAL record construction state */
@@ -5080,7 +5079,6 @@ AbortSubTransaction(void)
pgstat_report_wait_end();
pgstat_progress_end_command();
- AbortBufferIO();
UnlockBuffers();
/* Reset WAL record construction state */
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 585d28148ca..e9ba0dc17cd 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -526,7 +526,6 @@ AutoVacLauncherMain(int argc, char *argv[])
*/
LWLockReleaseAll();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
/* this is probably dead code, but let's be safe: */
if (AuxProcessResourceOwner)
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 9bb47da404d..caad642ec93 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -167,7 +167,6 @@ BackgroundWriterMain(void)
*/
LWLockReleaseAll();
ConditionVariableCancelSleep();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index aaad5c52281..ace9893d957 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -271,7 +271,6 @@ CheckpointerMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/postmaster/walwriter.c b/src/backend/postmaster/walwriter.c
index 513e580c513..65e84be39b9 100644
--- a/src/backend/postmaster/walwriter.c
+++ b/src/backend/postmaster/walwriter.c
@@ -163,7 +163,6 @@ WalWriterMain(void)
LWLockReleaseAll();
ConditionVariableCancelSleep();
pgstat_report_wait_end();
- AbortBufferIO();
UnlockBuffers();
ReleaseAuxProcessResources(false);
AtEOXact_Buffers(false);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 497050c83d4..8de02c44aba 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -159,10 +159,6 @@ int checkpoint_flush_after = DEFAULT_CHECKPOINT_FLUSH_AFTER;
int bgwriter_flush_after = DEFAULT_BGWRITER_FLUSH_AFTER;
int backend_flush_after = DEFAULT_BACKEND_FLUSH_AFTER;
-/* local state for StartBufferIO and related functions */
-static BufferDesc *InProgressBuf = NULL;
-static bool IsForInput;
-
/* local state for LockBufferForCleanup */
static BufferDesc *PinCountWaitBuf = NULL;
@@ -2702,7 +2698,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();
@@ -4644,7 +4639,7 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
uint32 buf_state;
- Assert(!InProgressBuf);
+ ResourceOwnerEnlargeBufferIOs(CurrentResourceOwner);
for (;;)
{
@@ -4668,8 +4663,8 @@ StartBufferIO(BufferDesc *buf, bool forInput)
buf_state |= BM_IO_IN_PROGRESS;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = buf;
- IsForInput = forInput;
+ ResourceOwnerRememberBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
return true;
}
@@ -4695,8 +4690,6 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
{
uint32 buf_state;
- Assert(buf == InProgressBuf);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
@@ -4708,13 +4701,14 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
buf_state |= set_flag_bits;
UnlockBufHdr(buf, buf_state);
- InProgressBuf = NULL;
+ ResourceOwnerForgetBufferIO(CurrentResourceOwner,
+ BufferDescriptorGetBuffer(buf));
ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
- * AbortBufferIO: Clean up any active buffer I/O after an error.
+ * AbortBufferIO: Clean up active buffer I/O after an error.
*
* All LWLocks we might have held have been released,
* but we haven't yet released buffer pins, so the buffer is still pinned.
@@ -4723,46 +4717,42 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
* possible the error condition wasn't related to the I/O.
*/
void
-AbortBufferIO(void)
+AbortBufferIO(Buffer buf)
{
- BufferDesc *buf = InProgressBuf;
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ uint32 buf_state;
- if (buf)
+ buf_state = LockBufHdr(buf_hdr);
+ Assert(buf_state & (BM_IO_IN_PROGRESS | BM_TAG_VALID));
+
+ if (!(buf_state & BM_VALID))
{
- uint32 buf_state;
-
- buf_state = LockBufHdr(buf);
- Assert(buf_state & BM_IO_IN_PROGRESS);
- if (IsForInput)
- {
- Assert(!(buf_state & BM_DIRTY));
-
- /* We'd better not think buffer is valid yet */
- Assert(!(buf_state & BM_VALID));
- UnlockBufHdr(buf, buf_state);
- }
- else
- {
- Assert(buf_state & BM_DIRTY);
- UnlockBufHdr(buf, buf_state);
- /* Issue notice if this is not the first failure... */
- if (buf_state & BM_IO_ERROR)
- {
- /* Buffer is pinned, so we can read tag without spinlock */
- char *path;
-
- path = relpathperm(BufTagGetRelFileLocator(&buf->tag),
- BufTagGetForkNum(&buf->tag));
- ereport(WARNING,
- (errcode(ERRCODE_IO_ERROR),
- errmsg("could not write block %u of %s",
- buf->tag.blockNum, path),
- errdetail("Multiple failures --- write error might be permanent.")));
- pfree(path);
- }
- }
- TerminateBufferIO(buf, false, BM_IO_ERROR);
+ Assert(!(buf_state & BM_DIRTY));
+ UnlockBufHdr(buf_hdr, buf_state);
}
+ else
+ {
+ Assert(buf_state & BM_DIRTY);
+ UnlockBufHdr(buf_hdr, buf_state);
+
+ /* Issue notice if this is not the first failure... */
+ if (buf_state & BM_IO_ERROR)
+ {
+ /* Buffer is pinned, so we can read tag without spinlock */
+ char *path;
+
+ path = relpathperm(BufTagGetRelFileLocator(&buf_hdr->tag),
+ BufTagGetForkNum(&buf_hdr->tag));
+ ereport(WARNING,
+ (errcode(ERRCODE_IO_ERROR),
+ errmsg("could not write block %u of %s",
+ buf_hdr->tag.blockNum, path),
+ errdetail("Multiple failures --- write error might be permanent.")));
+ pfree(path);
+ }
+ }
+
+ TerminateBufferIO(buf_hdr, false, BM_IO_ERROR);
}
/*
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 19b6241e45d..fccc59b39dd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -121,6 +121,7 @@ typedef struct ResourceOwnerData
/* We have built-in support for remembering: */
ResourceArray bufferarr; /* owned buffers */
+ ResourceArray bufferioarr; /* in-progress buffer IO */
ResourceArray catrefarr; /* catcache references */
ResourceArray catlistrefarr; /* catcache-list pins */
ResourceArray relrefarr; /* relcache references */
@@ -441,6 +442,7 @@ ResourceOwnerCreate(ResourceOwner parent, const char *name)
}
ResourceArrayInit(&(owner->bufferarr), BufferGetDatum(InvalidBuffer));
+ ResourceArrayInit(&(owner->bufferioarr), BufferGetDatum(InvalidBuffer));
ResourceArrayInit(&(owner->catrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->catlistrefarr), PointerGetDatum(NULL));
ResourceArrayInit(&(owner->relrefarr), PointerGetDatum(NULL));
@@ -517,6 +519,24 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
if (phase == RESOURCE_RELEASE_BEFORE_LOCKS)
{
+ /*
+ * Abort failed buffer IO. AbortBufferIO()->TerminateBufferIO() calls
+ * ResourceOwnerForgetBufferIO(), so we just have to iterate till
+ * there are none.
+ *
+ * Needs to be before we release buffer pins.
+ *
+ * During a commit, there shouldn't be any in-progress IO.
+ */
+ while (ResourceArrayGetAny(&(owner->bufferioarr), &foundres))
+ {
+ Buffer res = DatumGetBuffer(foundres);
+
+ if (isCommit)
+ elog(PANIC, "lost track of buffer IO on buffer %u", res);
+ AbortBufferIO(res);
+ }
+
/*
* Release buffer pins. Note that ReleaseBuffer will remove the
* buffer entry from our array, so we just have to iterate till there
@@ -746,6 +766,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And it better not own any resources, either */
Assert(owner->bufferarr.nitems == 0);
+ Assert(owner->bufferioarr.nitems == 0);
Assert(owner->catrefarr.nitems == 0);
Assert(owner->catlistrefarr.nitems == 0);
Assert(owner->relrefarr.nitems == 0);
@@ -775,6 +796,7 @@ ResourceOwnerDelete(ResourceOwner owner)
/* And free the object. */
ResourceArrayFree(&(owner->bufferarr));
+ ResourceArrayFree(&(owner->bufferioarr));
ResourceArrayFree(&(owner->catrefarr));
ResourceArrayFree(&(owner->catlistrefarr));
ResourceArrayFree(&(owner->relrefarr));
@@ -976,6 +998,44 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
buffer, owner->name);
}
+
+/*
+ * Make sure there is room for at least one more entry in a ResourceOwner's
+ * buffer IO array.
+ *
+ * This is separate from actually inserting an entry because if we run out
+ * of memory, it's critical to do so *before* acquiring the resource.
+ */
+void
+ResourceOwnerEnlargeBufferIOs(ResourceOwner owner)
+{
+ /* We used to allow pinning buffers without a resowner, but no more */
+ Assert(owner != NULL);
+ ResourceArrayEnlarge(&(owner->bufferioarr));
+}
+
+/*
+ * Remember that a buffer IO is owned by a ResourceOwner
+ *
+ * Caller must have previously done ResourceOwnerEnlargeBufferIOs()
+ */
+void
+ResourceOwnerRememberBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ ResourceArrayAdd(&(owner->bufferioarr), BufferGetDatum(buffer));
+}
+
+/*
+ * Forget that a buffer IO is owned by a ResourceOwner
+ */
+void
+ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
+{
+ if (!ResourceArrayRemove(&(owner->bufferioarr), BufferGetDatum(buffer)))
+ elog(PANIC, "buffer IO %d is not owned by resource owner %s",
+ buffer, owner->name);
+}
+
/*
* Remember that a Local Lock is owned by a ResourceOwner
*
--
2.38.0
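
The core of the change is simple bookkeeping: StartBufferIO() now remembers the
buffer in the current resource owner, TerminateBufferIO() forgets it, and the
resource owner release path aborts whatever is still remembered, instead of
relying on the single InProgressBuf variable. A standalone sketch (made-up
names, not PostgreSQL code) of that bookkeeping:

#include <stdio.h>

#define MAX_IO 8

static int  inprogress_io[MAX_IO];  /* toy ResourceOwner bufferioarr */
static int  n_inprogress = 0;

static void
remember_io(int buf)                /* ResourceOwnerRememberBufferIO() analogue */
{
    inprogress_io[n_inprogress++] = buf;
}

static void
forget_io(int buf)                  /* ResourceOwnerForgetBufferIO() analogue */
{
    int         i;

    for (i = 0; i < n_inprogress; i++)
    {
        if (inprogress_io[i] == buf)
        {
            inprogress_io[i] = inprogress_io[--n_inprogress];
            return;
        }
    }
}

static void
release_on_error(void)              /* the loop added to ResourceOwnerReleaseInternal() */
{
    while (n_inprogress > 0)
    {
        int         buf = inprogress_io[n_inprogress - 1];

        printf("aborting IO on buffer %d\n", buf);
        forget_io(buf);             /* AbortBufferIO() -> TerminateBufferIO() */
    }
}

int
main(void)
{
    remember_io(3);                 /* StartBufferIO() on buffer 3 */
    remember_io(5);                 /* a second in-progress IO is now possible */
    forget_io(3);                   /* TerminateBufferIO() on buffer 3 */
    release_on_error();             /* error path cleans up buffer 5 */
    return 0;
}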
Attachment: v7-0007-bufmgr-Introduce-infrastructure-for-faster-relati.patch (text/x-diff)
From 5d4840b361fe65699563e208effc60c542e2f5e4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 2 Apr 2023 14:48:52 -0700
Subject: [PATCH v7 07/14] bufmgr: Introduce infrastructure for faster relation
extension
The primary bottlenecks for relation extension are:
1) The extension lock is held while acquiring a victim buffer for the new
page. Acquiring a victim buffer can require writing out the old page
contents including possibly needing to flush WAL.
2) When extending via ReadBuffer() et al, we write a zero page during the
extension, and then later write out the actual page contents. This can
nearly double the write rate.
3) The existing bulk relation extension infrastructure in hio.c just amortized
the cost of acquiring the relation extension lock, but none of the other
costs.
Unfortunately 1) cannot currently be addressed in a central manner as the
callers of ReadBuffer() need to acquire the extension lock. To address that,
this commit moves the responsibility for acquiring the extension lock into
bufmgr.c functions. That allows the relation extension lock to be held for
just the required time. This will also allow us to improve relation extension
further, without changing callers.
The reason we write all-zeroes pages during relation extension is that we hope
to get ENOSPC errors earlier that way (this largely works, except on CoW
filesystems). It is easier to handle out-of-space errors gracefully if the
page doesn't yet contain actual tuples. This commit addresses 2) by using the
recently introduced smgrzeroextend(), which extends the relation without
dirtying the kernel page cache for all the extended pages.
To address 3), this commit introduces a function to extend a relation by
multiple blocks at a time.
There are three new exposed functions: ExtendBufferedRel() for extending the
relation by a single block, ExtendBufferedRelBy() to extend a relation by
multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up
to a certain size.
To avoid duplicating code between ReadBuffer(P_NEW) and the new functions,
ReadBuffer(P_NEW) now implements relation extension with
ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the
relation lock is already held.
Note that this commit does not lead to a meaningful performance or scalability
improvement - ReadBuffer(P_NEW) will need to be converted to ExtendBuffered*()
for that, which is done in subsequent commits.
Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi>
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
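
To illustrate the intended shape of the new API, here is a hypothetical caller
sketch; it is not part of the patch, and the helper name and batch size are
made up. It relies only on the declarations added to bufmgr.h below.

#include "postgres.h"

#include "storage/bufmgr.h"
#include "utils/rel.h"

/* made-up batch size, purely for illustration */
#define SKETCH_EXTEND_BY 8

static BlockNumber
sketch_extend(Relation rel, BufferAccessStrategy strategy)
{
    Buffer      buffers[SKETCH_EXTEND_BY];
    uint32      extended_by = 0;
    BlockNumber first_block;
    uint32      i;

    /*
     * ExtendBufferedRelBy() acquires and releases the relation extension
     * lock itself (we do not pass EB_SKIP_EXTENSION_LOCK) and may extend by
     * fewer than the requested number of blocks, but always by at least one.
     */
    first_block = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM, strategy,
                                      0 /* flags */ , SKETCH_EXTEND_BY,
                                      buffers, &extended_by);

    /* the new buffers come back pinned; just release them again here */
    for (i = 0; i < extended_by; i++)
        ReleaseBuffer(buffers[i]);

    return first_block;
}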
---
src/include/pgstat.h | 1 +
src/include/storage/buf_internals.h | 7 +
src/include/storage/bufmgr.h | 65 ++
src/backend/storage/buffer/bufmgr.c | 799 +++++++++++++++++++------
src/backend/storage/buffer/localbuf.c | 156 ++++-
src/backend/utils/activity/pgstat_io.c | 8 +-
src/backend/utils/probes.d | 6 +-
doc/src/sgml/monitoring.sgml | 43 +-
8 files changed, 900 insertions(+), 185 deletions(-)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 75d258d9215..e79b8a34ebc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -516,6 +516,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
BackendType bktype);
extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt);
extern PgStat_IO *pgstat_fetch_stat_io(void);
extern const char *pgstat_get_io_context_name(IOContext io_context);
extern const char *pgstat_get_io_object_name(IOObject io_object);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 970d0090615..34feaea9945 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -422,6 +422,13 @@ extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
BlockNumber blockNum);
extern BufferDesc *LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum, bool *foundPtr);
+extern BlockNumber ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
extern void MarkLocalBufferDirty(Buffer buffer);
extern void DropRelationLocalBuffers(RelFileLocator rlocator,
ForkNumber forkNum,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 537b89e8774..ecdf53d1e56 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -60,6 +60,53 @@ typedef struct PrefetchBufferResult
bool initiated_io; /* If true, a miss resulting in async I/O */
} PrefetchBufferResult;
+/*
+ * Flags influencing the behaviour of ExtendBufferedRel*
+ */
+typedef enum ExtendBufferedFlags
+{
+ /*
+ * Don't acquire extension lock. This is safe only if the relation isn't
+ * shared, an access exclusive lock is held or if this is the startup
+ * process.
+ */
+ EB_SKIP_EXTENSION_LOCK = (1 << 0),
+
+ /* Is this extension part of recovery? */
+ EB_PERFORMING_RECOVERY = (1 << 1),
+
+ /*
+ * Should the fork be created if it does not currently exist? This likely
+ * only ever makes sense for relation forks.
+ */
+ EB_CREATE_FORK_IF_NEEDED = (1 << 2),
+
+ /* Should the first (possibly only) return buffer be returned locked? */
+ EB_LOCK_FIRST = (1 << 3),
+
+ /* Should the smgr size cache be cleared? */
+ EB_CLEAR_SIZE_CACHE = (1 << 4),
+
+ /* internal flags follow */
+ EB_LOCK_TARGET = (1 << 5),
+} ExtendBufferedFlags;
+
+/*
+ * To identify the relation - either relation or smgr + relpersistence has to
+ * be specified. Used via the EB_REL()/EB_SMGR() macros below. This allows us
+ * to use the same function for both crash recovery and normal operation.
+ */
+typedef struct ExtendBufferedWhat
+{
+ Relation rel;
+ struct SMgrRelationData *smgr;
+ char relpersistence;
+} ExtendBufferedWhat;
+
+#define EB_REL(p_rel) ((ExtendBufferedWhat){.rel = p_rel})
+#define EB_SMGR(p_smgr, p_relpersistence) ((ExtendBufferedWhat){.smgr = p_smgr, .relpersistence = p_relpersistence})
+
+
/* forward declared, to avoid having to expose buf_internals.h here */
struct WritebackContext;
@@ -138,6 +185,24 @@ extern void CheckBufferIsPinnedOnce(Buffer buffer);
extern Buffer ReleaseAndReadBuffer(Buffer buffer, Relation relation,
BlockNumber blockNum);
+extern Buffer ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags);
+extern BlockNumber ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by);
+extern Buffer ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode);
+
extern void InitBufferPoolAccess(void);
extern void AtEOXact_Buffers(bool isCommit);
extern void PrintBufferLeakWarning(Buffer buffer);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8de02c44aba..327412bb165 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -48,6 +48,7 @@
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/lmgr.h"
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
@@ -450,6 +451,22 @@ static Buffer ReadBuffer_common(SMgrRelation smgr, char relpersistence,
ForkNumber forkNum, BlockNumber blockNum,
ReadBufferMode mode, BufferAccessStrategy strategy,
bool *hit);
+static BlockNumber ExtendBufferedRelCommon(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
+static BlockNumber ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by);
static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
static void PinBuffer_Locked(BufferDesc *buf);
static void UnpinBuffer(BufferDesc *buf);
@@ -785,6 +802,179 @@ ReadBufferWithoutRelcache(RelFileLocator rlocator, ForkNumber forkNum,
mode, strategy, &hit);
}
+/*
+ * Convenience wrapper around ExtendBufferedRelBy() extending by one block.
+ */
+Buffer
+ExtendBufferedRel(ExtendBufferedWhat eb,
+ ForkNumber forkNum,
+ BufferAccessStrategy strategy,
+ uint32 flags)
+{
+ Buffer buf;
+ uint32 extend_by = 1;
+
+ ExtendBufferedRelBy(eb, forkNum, strategy, flags, extend_by,
+ &buf, &extend_by);
+
+ return buf;
+}
+
+/*
+ * Extend relation by multiple blocks.
+ *
+ * Tries to extend the relation by extend_by blocks. Depending on the
+ * availability of resources the relation may end up being extended by a
+ * smaller number of pages (unless an error is thrown, it is always extended
+ * by at least one page). *extended_by is updated to the number of pages the
+ * relation has been extended by.
+ *
+ * buffers needs to be an array that is at least extend_by long. Upon
+ * completion, the first *extended_by array elements will point to a pinned
+ * buffer.
+ *
+ * If EB_LOCK_FIRST is part of flags, the first returned buffer is
+ * locked. This is useful for callers that want a buffer that is guaranteed to
+ * be empty.
+ */
+BlockNumber
+ExtendBufferedRelBy(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ Assert((eb.rel != NULL) != (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_by > 0);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ return ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ extend_by, InvalidBlockNumber,
+ buffers, extended_by);
+}
+
+/*
+ * Extend the relation so it is at least extend_to blocks large, read buffer
+ * (extend_to - 1).
+ *
+ * This is useful for callers that want to write a specific page, regardless
+ * of the current size of the relation (e.g. useful for visibilitymap and for
+ * crash recovery).
+ */
+Buffer
+ExtendBufferedRelTo(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ BlockNumber extend_to,
+ ReadBufferMode mode)
+{
+ BlockNumber current_size;
+ uint32 extended_by = 0;
+ Buffer buffer = InvalidBuffer;
+ Buffer buffers[64];
+
+ Assert((eb.rel != NULL) != (eb.smgr != NULL));
+ Assert(eb.smgr == NULL || eb.relpersistence != 0);
+ Assert(extend_to != InvalidBlockNumber && extend_to > 0);
+ Assert(mode == RBM_NORMAL || mode == RBM_ZERO_ON_ERROR ||
+ mode == RBM_ZERO_AND_LOCK);
+
+ if (eb.smgr == NULL)
+ {
+ eb.smgr = RelationGetSmgr(eb.rel);
+ eb.relpersistence = eb.rel->rd_rel->relpersistence;
+ }
+
+ /*
+ * If desired, create the file if it doesn't exist. If
+ * smgr_cached_nblocks[fork] is positive then it must exist; no need for
+ * an smgrexists call.
+ */
+ if ((flags & EB_CREATE_FORK_IF_NEEDED) &&
+ (eb.smgr->smgr_cached_nblocks[fork] == 0 ||
+ eb.smgr->smgr_cached_nblocks[fork] == InvalidBlockNumber) &&
+ !smgrexists(eb.smgr, fork))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* could have been closed while waiting for lock */
+ eb.smgr = RelationGetSmgr(eb.rel);
+
+ /* recheck, fork might have been created concurrently */
+ if (!smgrexists(eb.smgr, fork))
+ smgrcreate(eb.smgr, fork, flags & EB_PERFORMING_RECOVERY);
+
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ /*
+ * Estimate how many pages we'll need to extend by. This avoids acquiring
+ * unnecessarily many victim buffers.
+ */
+ current_size = smgrnblocks(eb.smgr, fork);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_TARGET;
+
+ while (current_size < extend_to)
+ {
+ uint32 num_pages = lengthof(buffers);
+ BlockNumber first_block;
+
+ if ((uint64) current_size + num_pages > extend_to)
+ num_pages = extend_to - current_size;
+
+ first_block = ExtendBufferedRelCommon(eb, fork, strategy, flags,
+ num_pages, extend_to,
+ buffers, &extended_by);
+
+ current_size = first_block + extended_by;
+ Assert(current_size <= extend_to);
+ Assert(num_pages != 0 || current_size >= extend_to);
+
+ for (int i = 0; i < extended_by; i++)
+ {
+ if (first_block + i != extend_to - 1)
+ ReleaseBuffer(buffers[i]);
+ else
+ buffer = buffers[i];
+ }
+ }
+
+ /*
+ * It's possible that another backend concurrently extended the
+ * relation. In that case read the buffer.
+ *
+ * XXX: Should we control this via a flag?
+ */
+ if (buffer == InvalidBuffer)
+ {
+ bool hit;
+
+ Assert(extended_by == 0);
+ buffer = ReadBuffer_common(eb.smgr, eb.relpersistence,
+ fork, extend_to - 1, mode, strategy,
+ &hit);
+ }
+
+ return buffer;
+}
/*
* ReadBuffer_common -- common logic for all ReadBuffer variants
@@ -801,35 +991,38 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool found;
IOContext io_context;
IOObject io_object;
- bool isExtend;
bool isLocalBuf = SmgrIsTemp(smgr);
*hit = false;
+ /*
+ * Backward compatibility path; most code should use ExtendBufferedRel()
+ * instead, as acquiring the extension lock inside ExtendBufferedRel()
+ * scales a lot better.
+ */
+ if (unlikely(blockNum == P_NEW))
+ {
+ uint32 flags = EB_SKIP_EXTENSION_LOCK;
+
+ Assert(mode == RBM_NORMAL ||
+ mode == RBM_ZERO_AND_LOCK ||
+ mode == RBM_ZERO_ON_ERROR);
+
+ if (mode == RBM_ZERO_AND_LOCK)
+ flags |= EB_LOCK_FIRST;
+
+ return ExtendBufferedRel(EB_SMGR(smgr, relpersistence),
+ forkNum, strategy, flags);
+ }
+
/* Make sure we will have room to remember the buffer pin */
ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
- isExtend = (blockNum == P_NEW);
-
TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend);
-
- /* Substitute proper block number if caller asked for P_NEW */
- if (isExtend)
- {
- blockNum = smgrnblocks(smgr, forkNum);
- /* Fail if relation is already at maximum possible length */
- if (blockNum == P_NEW)
- ereport(ERROR,
- (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
- errmsg("cannot extend relation %s beyond %u blocks",
- relpath(smgr->smgr_rlocator, forkNum),
- P_NEW)));
- }
+ smgr->smgr_rlocator.backend);
if (isLocalBuf)
{
@@ -844,8 +1037,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
if (found)
pgBufferUsage.local_blks_hit++;
- else if (isExtend)
- pgBufferUsage.local_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.local_blks_read++;
@@ -862,8 +1053,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
strategy, &found, io_context);
if (found)
pgBufferUsage.shared_blks_hit++;
- else if (isExtend)
- pgBufferUsage.shared_blks_written++;
else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
mode == RBM_ZERO_ON_ERROR)
pgBufferUsage.shared_blks_read++;
@@ -874,175 +1063,91 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* if it was already in the buffer pool, we're done */
if (found)
{
- if (!isExtend)
- {
- /* Just need to update stats before we exit */
- *hit = true;
- VacuumPageHit++;
- pgstat_count_io_op(io_object, io_context, IOOP_HIT);
+ /* Just need to update stats before we exit */
+ *hit = true;
+ VacuumPageHit++;
+ pgstat_count_io_op(io_object, io_context, IOOP_HIT);
- if (VacuumCostActive)
- VacuumCostBalance += VacuumCostPageHit;
+ if (VacuumCostActive)
+ VacuumCostBalance += VacuumCostPageHit;
- TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
- smgr->smgr_rlocator.locator.spcOid,
- smgr->smgr_rlocator.locator.dbOid,
- smgr->smgr_rlocator.locator.relNumber,
- smgr->smgr_rlocator.backend,
- isExtend,
- found);
-
- /*
- * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
- * locked on return.
- */
- if (!isLocalBuf)
- {
- if (mode == RBM_ZERO_AND_LOCK)
- LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
- LW_EXCLUSIVE);
- else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
- }
-
- return BufferDescriptorGetBuffer(bufHdr);
- }
+ TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
+ smgr->smgr_rlocator.locator.spcOid,
+ smgr->smgr_rlocator.locator.dbOid,
+ smgr->smgr_rlocator.locator.relNumber,
+ smgr->smgr_rlocator.backend,
+ found);
/*
- * We get here only in the corner case where we are trying to extend
- * the relation but we found a pre-existing buffer marked BM_VALID.
- * This can happen because mdread doesn't complain about reads beyond
- * EOF (when zero_damaged_pages is ON) and so a previous attempt to
- * read a block beyond EOF could have left a "valid" zero-filled
- * buffer. Unfortunately, we have also seen this case occurring
- * because of buggy Linux kernels that sometimes return an
- * lseek(SEEK_END) result that doesn't account for a recent write. In
- * that situation, the pre-existing buffer would contain valid data
- * that we don't want to overwrite. Since the legitimate case should
- * always have left a zero-filled buffer, complain if not PageIsNew.
+ * In RBM_ZERO_AND_LOCK mode the caller expects the page to be locked
+ * on return.
*/
- bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (!PageIsNew((Page) bufBlock))
- ereport(ERROR,
- (errmsg("unexpected data beyond EOF in block %u of relation %s",
- blockNum, relpath(smgr->smgr_rlocator, forkNum)),
- errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
-
- /*
- * We *must* do smgrextend before succeeding, else the page will not
- * be reserved by the kernel, and the next P_NEW call will decide to
- * return the same page. Clear the BM_VALID bit, do the StartBufferIO
- * call that BufferAlloc didn't, and proceed.
- */
- if (isLocalBuf)
+ if (!isLocalBuf)
{
- /* Only need to adjust flags */
- uint32 buf_state = pg_atomic_read_u32(&bufHdr->state);
-
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
+ if (mode == RBM_ZERO_AND_LOCK)
+ LWLockAcquire(BufferDescriptorGetContentLock(bufHdr),
+ LW_EXCLUSIVE);
+ else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
+ LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
}
- else
- {
- /*
- * Loop to handle the very small possibility that someone re-sets
- * BM_VALID between our clearing it and StartBufferIO inspecting
- * it.
- */
- do
- {
- uint32 buf_state = LockBufHdr(bufHdr);
- Assert(buf_state & BM_VALID);
- buf_state &= ~BM_VALID;
- UnlockBufHdr(bufHdr, buf_state);
- } while (!StartBufferIO(bufHdr, true));
- }
+ return BufferDescriptorGetBuffer(bufHdr);
}
/*
* if we have gotten to this point, we have allocated a buffer for the
* page but its contents are not yet valid. IO_IN_PROGRESS is set for it,
* if it's a shared buffer.
- *
- * Note: if smgrextend fails, we will end up with a buffer that is
- * allocated but not marked BM_VALID. P_NEW will still select the same
- * block number (because the relation didn't get any longer on disk) and
- * so future attempts to extend the relation will find the same buffer (if
- * it's not been recycled) but come right back here to try smgrextend
- * again.
*/
Assert(!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID)); /* spinlock not needed */
bufBlock = isLocalBuf ? LocalBufHdrGetBlock(bufHdr) : BufHdrGetBlock(bufHdr);
- if (isExtend)
- {
- /* new buffers are zero-filled */
+ /*
+ * Read in the page, unless the caller intends to overwrite it and just
+ * wants us to allocate a buffer.
+ */
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
- /* don't set checksum for all-zero page */
- smgrextend(smgr, forkNum, blockNum, bufBlock, false);
-
- pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
-
- /*
- * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
- * although we're essentially performing a write. At least on linux
- * doing so defeats the 'delayed allocation' mechanism, leading to
- * increased file fragmentation.
- */
- }
else
{
- /*
- * Read in the page, unless the caller intends to overwrite it and
- * just wants us to allocate a buffer.
- */
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- MemSet((char *) bufBlock, 0, BLCKSZ);
- else
+ instr_time io_start,
+ io_time;
+
+ if (track_io_timing)
+ INSTR_TIME_SET_CURRENT(io_start);
+
+ smgrread(smgr, forkNum, blockNum, bufBlock);
+
+ if (track_io_timing)
{
- instr_time io_start,
- io_time;
+ INSTR_TIME_SET_CURRENT(io_time);
+ INSTR_TIME_SUBTRACT(io_time, io_start);
+ pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+ INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+ }
- if (track_io_timing)
- INSTR_TIME_SET_CURRENT(io_start);
+ pgstat_count_io_op(io_object, io_context, IOOP_READ);
+
+ /* check for garbage data */
+ if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
+ PIV_LOG_WARNING | PIV_REPORT_STAT))
+ {
+ if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
+ {
+ ereport(WARNING,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s; zeroing out page",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
+ MemSet((char *) bufBlock, 0, BLCKSZ);
+ }
else
- INSTR_TIME_SET_ZERO(io_start);
-
- smgrread(smgr, forkNum, blockNum, bufBlock);
-
- pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
- if (track_io_timing)
- {
- INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_SUBTRACT(io_time, io_start);
- pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
- INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
- }
-
- /* check for garbage data */
- if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
- PIV_LOG_WARNING | PIV_REPORT_STAT))
- {
- if (mode == RBM_ZERO_ON_ERROR || zero_damaged_pages)
- {
- ereport(WARNING,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s; zeroing out page",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- MemSet((char *) bufBlock, 0, BLCKSZ);
- }
- else
- ereport(ERROR,
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("invalid page in block %u of relation %s",
- blockNum,
- relpath(smgr->smgr_rlocator, forkNum))));
- }
+ ereport(ERROR,
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("invalid page in block %u of relation %s",
+ blockNum,
+ relpath(smgr->smgr_rlocator, forkNum))));
}
}
@@ -1085,7 +1190,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber,
smgr->smgr_rlocator.backend,
- isExtend,
found);
return BufferDescriptorGetBuffer(bufHdr);
@@ -1455,8 +1559,8 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
* Clear out the buffer's tag and flags and usagecount. This is not
* strictly required, as BM_TAG_VALID/BM_VALID needs to be checked before
* doing anything with the buffer. But currently it's beneficial as the
- * pre-check for several linear scans of shared buffers just checks the
- * tag.
+ * cheaper pre-checks for several linear scans of shared buffers use the
+ * tag (see e.g. FlushDatabaseBuffers()).
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
@@ -1630,6 +1734,363 @@ again:
return buf;
}
+/*
+ * Limit the number of pins a batch operation may additionally acquire, to
+ * avoid running out of pinnable buffers.
+ *
+ * One additional pin is always allowed, as otherwise the operation likely
+ * cannot be performed at all.
+ *
+ * The number of allowed pins for a backend is computed based on
+ * shared_buffers and the maximum number of connections possible. That's very
+ * pessimistic, but outside of toy-sized shared_buffers it should allow
+ * sufficient pins.
+ */
+static void
+LimitAdditionalPins(uint32 *additional_pins)
+{
+ uint32 max_backends;
+ int max_proportional_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ max_backends = MaxBackends + NUM_AUXILIARY_PROCS;
+ max_proportional_pins = NBuffers / max_backends;
+
+ /*
+ * Subtract the approximate number of buffers already pinned by this
+ * backend. We get the number of "overflowed" pins for free, but don't
+ * know the number of pins in PrivateRefCountArray. The cost of
+ * calculating that exactly doesn't seem worth it, so just assume the max.
+ */
+ max_proportional_pins -= PrivateRefCountOverflowed + REFCOUNT_ARRAY_ENTRIES;
+
+ if (max_proportional_pins < 0)
+ max_proportional_pins = 1;
+
+ if (*additional_pins > max_proportional_pins)
+ *additional_pins = max_proportional_pins;
+}
+
+/*
+ * Logic shared between ExtendBufferedRelBy() and ExtendBufferedRelTo(), just
+ * to avoid duplicating the tracing and relpersistence-related logic.
+ */
+static BlockNumber
+ExtendBufferedRelCommon(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_START(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ extend_by);
+
+ if (eb.relpersistence == RELPERSISTENCE_TEMP)
+ first_block = ExtendBufferedRelLocal(eb, fork, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ else
+ first_block = ExtendBufferedRelShared(eb, fork, strategy, flags,
+ extend_by, extend_upto,
+ buffers, &extend_by);
+ *extended_by = extend_by;
+
+ TRACE_POSTGRESQL_BUFFER_EXTEND_DONE(fork,
+ eb.smgr->smgr_rlocator.locator.spcOid,
+ eb.smgr->smgr_rlocator.locator.dbOid,
+ eb.smgr->smgr_rlocator.locator.relNumber,
+ eb.smgr->smgr_rlocator.backend,
+ *extended_by,
+ first_block);
+
+ return first_block;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * shared buffers.
+ */
+static BlockNumber
+ExtendBufferedRelShared(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ BufferAccessStrategy strategy,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+ IOContext io_context = IOContextForStrategy(strategy);
+
+ LimitAdditionalPins(&extend_by);
+
+ /*
+ * Acquire victim buffers for extension without holding extension lock.
+ * Writing out victim buffers is the most expensive part of extending the
+ * relation, particularly when doing so requires WAL flushes. Zeroing out
+ * the buffers is also quite expensive, so do that before holding the
+ * extension lock as well.
+ *
+ * These pages are pinned by us and not valid. While we hold the pin they
+ * can't be acquired as victim buffers by another backend.
+ */
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ Block buf_block;
+
+ buffers[i] = GetVictimBuffer(strategy, io_context);
+ buf_block = BufHdrGetBlock(GetBufferDescriptor(buffers[i] - 1));
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ /* in case we need to pin an existing buffer below */
+ ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ /*
+ * Lock relation against concurrent extensions, unless requested not to.
+ *
+ * We use the same extension lock for all forks. That's unnecessarily
+ * restrictive, but currently extensions for forks don't happen often
+ * enough to make it worth locking more granularly.
+ *
+ * Note that another backend might have extended the relation by the time
+ * we get the lock.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ {
+ LockRelationForExtension(eb.rel, ExclusiveLock);
+ eb.smgr = RelationGetSmgr(eb.rel);
+ }
+
+ /*
+ * If requested, invalidate size cache, so that smgrnblocks asks the
+ * kernel.
+ */
+ if (flags & EB_CLEAR_SIZE_CACHE)
+ eb.smgr->smgr_cached_nblocks[fork] = InvalidBlockNumber;
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ /*
+ * Now that we have the accurate relation size, check if the caller wants
+ * us to extend only up to a specific size. If there were concurrent
+ * extensions, we might have acquired too many buffers and need to release
+ * them.
+ */
+ if (extend_upto != InvalidBlockNumber)
+ {
+ uint32 orig_extend_by = extend_by;
+
+ if (first_block > extend_upto)
+ extend_by = 0;
+ else if ((uint64) first_block + extend_by > extend_upto)
+ extend_by = extend_upto - first_block;
+
+ for (uint32 i = extend_by; i < orig_extend_by; i++)
+ {
+ BufferDesc *buf_hdr = GetBufferDescriptor(buffers[i] - 1);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused;
+ * let it be found again quickly.
+ */
+ StrategyFreeBuffer(buf_hdr);
+ UnpinBuffer(buf_hdr);
+ }
+
+ if (extend_by == 0)
+ {
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+ *extended_by = extend_by;
+ return first_block;
+ }
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ /*
+ * Insert buffers into buffer table, mark as IO_IN_PROGRESS.
+ *
+ * This needs to happen before we extend the relation, because as soon as
+ * we do, other backends can start to read in those pages.
+ */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer victim_buf = buffers[i];
+ BufferDesc *victim_buf_hdr = GetBufferDescriptor(victim_buf - 1);
+ BufferTag tag;
+ uint32 hash;
+ LWLock *partition_lock;
+ int existing_id;
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+ hash = BufTableHashCode(&tag);
+ partition_lock = BufMappingPartitionLock(hash);
+
+ LWLockAcquire(partition_lock, LW_EXCLUSIVE);
+
+ existing_id = BufTableInsert(&tag, hash, victim_buf_hdr->buf_id);
+
+ /*
+ * We get here only in the corner case where we are trying to extend
+ * the relation but we found a pre-existing buffer. This can happen
+ * because a prior attempt at extending the relation failed, and
+ * because mdread doesn't complain about reads beyond EOF (when
+ * zero_damaged_pages is ON) and so a previous attempt to read a block
+ * beyond EOF could have left a "valid" zero-filled buffer.
+ * Unfortunately, we have also seen this case occurring because of
+ * buggy Linux kernels that sometimes return an lseek(SEEK_END) result
+ * that doesn't account for a recent write. In that situation, the
+ * pre-existing buffer would contain valid data that we don't want to
+ * overwrite. Since the legitimate cases should always have left a
+ * zero-filled buffer, complain if not PageIsNew.
+ */
+ if (existing_id >= 0)
+ {
+ BufferDesc *existing_hdr = GetBufferDescriptor(existing_id);
+ Block buf_block;
+ bool valid;
+
+ /*
+ * Pin the existing buffer before releasing the partition lock,
+ * preventing it from being evicted.
+ */
+ valid = PinBuffer(existing_hdr, strategy);
+
+ LWLockRelease(partition_lock);
+
+ /*
+ * The victim buffer we acquired previously is clean and unused;
+ * let it be found again quickly.
+ */
+ StrategyFreeBuffer(victim_buf_hdr);
+ UnpinBuffer(victim_buf_hdr);
+
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+ buf_block = BufHdrGetBlock(existing_hdr);
+
+ if (valid && !PageIsNew((Page) buf_block))
+ ereport(ERROR,
+ (errmsg("unexpected data beyond EOF in block %u of relation %s",
+ existing_hdr->tag.blockNum, relpath(eb.smgr->smgr_rlocator, fork)),
+ errhint("This has been seen to occur with buggy kernels; consider updating your system.")));
+
+ /*
+ * We *must* do smgr[zero]extend before succeeding, else the page
+ * will not be reserved by the kernel, and the next P_NEW call
+ * will decide to return the same page. Clear the BM_VALID bit,
+ * do StartBufferIO() and proceed.
+ *
+ * Loop to handle the very small possibility that someone re-sets
+ * BM_VALID between our clearing it and StartBufferIO inspecting
+ * it.
+ */
+ do
+ {
+ uint32 buf_state = LockBufHdr(existing_hdr);
+
+ buf_state &= ~BM_VALID;
+ UnlockBufHdr(existing_hdr, buf_state);
+ } while (!StartBufferIO(existing_hdr, true));
+ }
+ else
+ {
+ uint32 buf_state;
+
+ buf_state = LockBufHdr(victim_buf_hdr);
+
+ /* some sanity checks while we hold the buffer header lock */
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+ Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 1);
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+ if (eb.relpersistence == RELPERSISTENCE_PERMANENT || fork == INIT_FORKNUM)
+ buf_state |= BM_PERMANENT;
+
+ UnlockBufHdr(victim_buf_hdr, buf_state);
+
+ LWLockRelease(partition_lock);
+
+ /* XXX: could combine the locked operations in it with the above */
+ StartBufferIO(victim_buf_hdr, true);
+ }
+ }
+
+ /*
+ * Note: if smgrzeroextend fails, we will end up with buffers that are
+ * allocated but not marked BM_VALID. The next relation extension will
+ * still select the same block number (because the relation didn't get any
+ * longer on disk) and so future attempts to extend the relation will find
+ * the same buffers (if they have not been recycled) but come right back
+ * here to try smgrzeroextend again.
+ *
+ * We don't need to set checksum for all-zero pages.
+ */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ /*
+ * Release the file-extension lock; it's now OK for someone else to extend
+ * the relation some more.
+ *
+ * We remove IO_IN_PROGRESS after this, as waking up waiting backends can
+ * take noticeable time.
+ */
+ if (!(flags & EB_SKIP_EXTENSION_LOCK))
+ UnlockRelationForExtension(eb.rel, ExclusiveLock);
+
+ /* Set BM_VALID, terminate IO, and wake up any waiters */
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr = GetBufferDescriptor(buf - 1);
+ bool lock = false;
+
+ if (flags & EB_LOCK_FIRST && i == 0)
+ lock = true;
+ else if (flags & EB_LOCK_TARGET)
+ {
+ Assert(extend_upto != InvalidBlockNumber);
+ if (first_block + i + 1 == extend_upto)
+ lock = true;
+ }
+
+ if (lock)
+ LWLockAcquire(BufferDescriptorGetContentLock(buf_hdr), LW_EXCLUSIVE);
+
+ TerminateBufferIO(buf_hdr, false, BM_VALID);
+ }
+
+ pgBufferUsage.shared_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_RELATION, io_context, IOOP_EXTEND,
+ extend_by);
+
+ *extended_by = extend_by;
+
+ return first_block;
+}
+
/*
* MarkBufferDirty
*
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index c9ba5ee00ff..818286675ee 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -49,6 +49,9 @@ static int nextFreeLocalBufId = 0;
static HTAB *LocalBufHash = NULL;
+/* number of local buffers with a non-zero pin count */
+static int NLocalPinnedBuffers = 0;
+
static void InitLocalBuffers(void);
static Block GetLocalBufferStorage(void);
@@ -273,6 +276,154 @@ GetLocalVictimBuffer(void)
return BufferDescriptorGetBuffer(bufHdr);
}
+/* see LimitAdditionalPins() */
+static void
+LimitAdditionalLocalPins(uint32 *additional_pins)
+{
+ uint32 max_pins;
+
+ if (*additional_pins <= 1)
+ return;
+
+ /*
+ * In contrast to LimitAdditionalPins(), other backends don't play a role
+ * here. We can allow up to NLocBuffer pins in total.
+ */
+ max_pins = (NLocBuffer - NLocalPinnedBuffers);
+
+ if (*additional_pins >= max_pins)
+ *additional_pins = max_pins;
+}
+
+/*
+ * Implementation of ExtendBufferedRelBy() and ExtendBufferedRelTo() for
+ * temporary buffers.
+ */
+BlockNumber
+ExtendBufferedRelLocal(ExtendBufferedWhat eb,
+ ForkNumber fork,
+ uint32 flags,
+ uint32 extend_by,
+ BlockNumber extend_upto,
+ Buffer *buffers,
+ uint32 *extended_by)
+{
+ BlockNumber first_block;
+
+ /* Initialize local buffers if first request in this session */
+ if (LocalBufHash == NULL)
+ InitLocalBuffers();
+
+ LimitAdditionalLocalPins(&extend_by);
+
+ for (uint32 i = 0; i < extend_by; i++)
+ {
+ BufferDesc *buf_hdr;
+ Block buf_block;
+
+ buffers[i] = GetLocalVictimBuffer();
+ buf_hdr = GetLocalBufferDescriptor(-buffers[i] - 1);
+ buf_block = LocalBufHdrGetBlock(buf_hdr);
+
+ /* new buffers are zero-filled */
+ MemSet((char *) buf_block, 0, BLCKSZ);
+ }
+
+ first_block = smgrnblocks(eb.smgr, fork);
+
+ if (extend_upto != InvalidBlockNumber)
+ {
+ /*
+ * In contrast to shared relations, nothing could change the relation
+ * size concurrently. Thus we shouldn't end up finding that we don't
+ * need to do anything.
+ */
+ Assert(first_block <= extend_upto);
+
+ Assert((uint64) first_block + extend_by <= extend_upto);
+ }
+
+ /* Fail if relation is already at maximum possible length */
+ if ((uint64) first_block + extend_by >= MaxBlockNumber)
+ ereport(ERROR,
+ (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+ errmsg("cannot extend relation %s beyond %u blocks",
+ relpath(eb.smgr->smgr_rlocator, fork),
+ MaxBlockNumber)));
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ int victim_buf_id;
+ BufferDesc *victim_buf_hdr;
+ BufferTag tag;
+ LocalBufferLookupEnt *hresult;
+ bool found;
+
+ victim_buf_id = -buffers[i] - 1;
+ victim_buf_hdr = GetLocalBufferDescriptor(victim_buf_id);
+
+ InitBufferTag(&tag, &eb.smgr->smgr_rlocator.locator, fork, first_block + i);
+
+ hresult = (LocalBufferLookupEnt *)
+ hash_search(LocalBufHash, (void *) &tag, HASH_ENTER, &found);
+ if (found)
+ {
+ BufferDesc *existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ uint32 buf_state;
+
+ UnpinLocalBuffer(BufferDescriptorGetBuffer(victim_buf_hdr));
+
+ existing_hdr = GetLocalBufferDescriptor(hresult->id);
+ PinLocalBuffer(existing_hdr, false);
+ buffers[i] = BufferDescriptorGetBuffer(existing_hdr);
+
+ buf_state = pg_atomic_read_u32(&existing_hdr->state);
+ Assert(buf_state & BM_TAG_VALID);
+ Assert(!(buf_state & BM_DIRTY));
+ buf_state &= ~BM_VALID;
+ pg_atomic_unlocked_write_u32(&existing_hdr->state, buf_state);
+ }
+ else
+ {
+ uint32 buf_state = pg_atomic_read_u32(&victim_buf_hdr->state);
+
+ Assert(!(buf_state & (BM_VALID | BM_TAG_VALID | BM_DIRTY | BM_JUST_DIRTIED)));
+
+ victim_buf_hdr->tag = tag;
+
+ buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
+
+ pg_atomic_unlocked_write_u32(&victim_buf_hdr->state, buf_state);
+
+ hresult->id = victim_buf_id;
+ }
+ }
+
+ /* actually extend relation */
+ smgrzeroextend(eb.smgr, fork, first_block, extend_by, false);
+
+ for (int i = 0; i < extend_by; i++)
+ {
+ Buffer buf = buffers[i];
+ BufferDesc *buf_hdr;
+ uint32 buf_state;
+
+ buf_hdr = GetLocalBufferDescriptor(-buf - 1);
+
+ buf_state = pg_atomic_read_u32(&buf_hdr->state);
+ buf_state |= BM_VALID;
+ pg_atomic_unlocked_write_u32(&buf_hdr->state, buf_state);
+ }
+
+ *extended_by = extend_by;
+
+ pgBufferUsage.temp_blks_written += extend_by;
+ pgstat_count_io_op_n(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_EXTEND,
+ extend_by);
+
+ return first_block;
+}
+
/*
* MarkLocalBufferDirty -
* mark a local buffer dirty
@@ -492,6 +643,7 @@ PinLocalBuffer(BufferDesc *buf_hdr, bool adjust_usagecount)
if (LocalRefCount[bufid] == 0)
{
+ NLocalPinnedBuffers++;
if (adjust_usagecount &&
BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
{
@@ -513,9 +665,11 @@ UnpinLocalBuffer(Buffer buffer)
Assert(BufferIsLocal(buffer));
Assert(LocalRefCount[buffid] > 0);
+ Assert(NLocalPinnedBuffers > 0);
ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
- LocalRefCount[buffid]--;
+ if (--LocalRefCount[buffid] == 0)
+ NLocalPinnedBuffers--;
}
/*
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index ae8bb34f78b..0e4f26427ca 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -64,13 +64,19 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
void
pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
+{
+ pgstat_count_io_op_n(io_object, io_context, io_op, 1);
+}
+
+void
+pgstat_count_io_op_n(IOObject io_object, IOContext io_context, IOOp io_op, uint32 cnt)
{
Assert((unsigned int) io_object < IOOBJECT_NUM_TYPES);
Assert((unsigned int) io_context < IOCONTEXT_NUM_TYPES);
Assert((unsigned int) io_op < IOOP_NUM_TYPES);
Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
- PendingIOStats.data[io_object][io_context][io_op]++;
+ PendingIOStats.data[io_object][io_context][io_op] += cnt;
have_iostats = true;
}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 204a2649b09..0af275587b7 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -55,10 +55,12 @@ provider postgresql {
probe sort__start(int, bool, int, int, bool, int);
probe sort__done(bool, long);
- probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
- probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool);
+ probe buffer__read__start(ForkNumber, BlockNumber, Oid, Oid, Oid, int);
+ probe buffer__read__done(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool);
probe buffer__flush__start(ForkNumber, BlockNumber, Oid, Oid, Oid);
probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
+ probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+ probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
probe buffer__checkpoint__start(int);
probe buffer__checkpoint__sync__start();
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index fd0ffbb1e08..bce9ae46615 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -7776,33 +7776,52 @@ FROM pg_stat_get_backend_idset() AS backendid;
<entry>Probe that fires when the two-phase portion of a checkpoint is
complete.</entry>
</row>
+ <row>
+ <entry><literal>buffer-extend-start</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int)</literal></entry>
+ <entry>Probe that fires when a relation extension starts.
+ arg0 contains the fork to be extended. arg1, arg2, and arg3 contain the
+ tablespace, database, and relation OIDs identifying the relation. arg4
+ is the ID of the backend which created the temporary relation for a
+ local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the caller would like to extend
+ by.</entry>
+ </row>
+ <row>
+ <entry><literal>buffer-extend-done</literal></entry>
+ <entry><literal>(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber)</literal></entry>
+ <entry>Probe that fires when a relation extension is complete.
+ arg0 contains the fork to be extended. arg1, arg2, and arg3 contain the
+ tablespace, database, and relation OIDs identifying the relation. arg4
+ is the ID of the backend which created the temporary relation for a
+ local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared
+ buffer. arg5 is the number of blocks the relation was extended by; this
+ can be less than the number requested in the
+ <literal>buffer-extend-start</literal> probe due to resource
+ constraints. arg6 contains the BlockNumber of the first new
+ block.</entry>
+ </row>
<row>
<entry><literal>buffer-read-start</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int)</literal></entry>
<entry>Probe that fires when a buffer read is started.
- arg0 and arg1 contain the fork and block numbers of the page (but
- arg1 will be -1 if this is a relation extension request).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.</entry>
+ </entry>
</row>
<row>
<entry><literal>buffer-read-done</literal></entry>
- <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool, bool)</literal></entry>
+ <entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid, int, bool)</literal></entry>
<entry>Probe that fires when a buffer read is complete.
- arg0 and arg1 contain the fork and block numbers of the page (if this
- is a relation extension request, arg1 now contains the block number
- of the newly added block).
+ arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.
arg5 is the ID of the backend which created the temporary relation for a
local buffer, or <symbol>InvalidBackendId</symbol> (-1) for a shared buffer.
- arg6 is true for a relation extension request, false for normal
- read.
- arg7 is true if the buffer was found in the pool, false if not.</entry>
+ arg6 is true if the buffer was found in the pool, false if not.</entry>
</row>
<row>
<entry><literal>buffer-flush-start</literal></entry>
--
2.38.0
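As a usage note for the multi-block API in the patch above, here is a hypothetical caller (names are placeholders, not code from the patch) that extends a relation by up to 8 blocks and keeps only the first one:

    Buffer      bufs[8];
    uint32      got = 0;
    BlockNumber first_new;

    /* may return fewer than 8 buffers, but always at least one */
    first_new = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM, NULL,
                                    EB_LOCK_FIRST, 8, bufs, &got);

    /*
     * bufs[0] covers block first_new, is pinned and, due to EB_LOCK_FIRST,
     * exclusively locked; bufs[1 .. got - 1] are only pinned
     */
    for (uint32 i = 1; i < got; i++)
        ReleaseBuffer(bufs[i]);

Because ExtendBufferedRelShared() caps the number of additional pins via LimitAdditionalPins(), got can be smaller than the requested 8 on systems with small shared_buffers.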
v7-0008-Convert-many-uses-of-ReadBuffer-Extended-P_NEW-wi.patch
From 11121f8ce57c9225c8bd3d4cf14cb89bead04af1 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 24 Oct 2022 12:18:18 -0700
Subject: [PATCH v7 08/14] Convert many uses of ReadBuffer[Extended](P_NEW)
with ExtendBufferedRel()
A few places are not converted. Some because they are tackled in later
commits (e.g. hio.c, xlogutils.c), some because they are more
complicated (e.g. brin_pageops.c). Having a few users of ReadBuffer(P_NEW) is
good anyway, to ensure the backward compat path stays working.
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
---
src/backend/access/brin/brin.c | 9 +++----
src/backend/access/brin/brin_pageops.c | 4 +++
src/backend/access/brin/brin_revmap.c | 15 +++---------
src/backend/access/gin/gininsert.c | 10 +++-----
src/backend/access/gin/ginutil.c | 13 ++--------
src/backend/access/gin/ginvacuum.c | 8 ++++++
src/backend/access/gist/gist.c | 4 +--
src/backend/access/gist/gistutil.c | 14 ++---------
src/backend/access/gist/gistvacuum.c | 3 +++
src/backend/access/hash/hashpage.c | 6 ++---
src/backend/access/nbtree/nbtpage.c | 34 +++++++-------------------
src/backend/access/nbtree/nbtree.c | 3 +++
src/backend/access/spgist/spgutils.c | 13 ++--------
src/backend/commands/sequence.c | 3 ++-
contrib/bloom/blutils.c | 12 ++-------
15 files changed, 53 insertions(+), 98 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 53e4721a54e..41bf950a4af 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -837,9 +837,9 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
- meta = ReadBuffer(index, P_NEW);
+ meta = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
- LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);
@@ -904,9 +904,8 @@ brinbuildempty(Relation index)
Buffer metabuf;
/* An empty BRIN index has a metapage only. */
- metabuf =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
+ metabuf = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index ad5a89bd051..b578d259545 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -730,6 +730,10 @@ brin_getinsertbuffer(Relation irel, Buffer oldbuf, Size itemsz,
* There's not enough free space in any existing index page,
* according to the FSM: extend the relation to obtain a shiny new
* page.
+ *
+ * XXX: It's likely possible to use RBM_ZERO_AND_LOCK here,
+ * which'd avoid the need to hold the extension lock during buffer
+ * reclaim.
*/
if (!RELATION_IS_LOCAL(irel))
{
diff --git a/src/backend/access/brin/brin_revmap.c b/src/backend/access/brin/brin_revmap.c
index 7fc5226bf74..f4271ba31c9 100644
--- a/src/backend/access/brin/brin_revmap.c
+++ b/src/backend/access/brin/brin_revmap.c
@@ -538,7 +538,6 @@ revmap_physical_extend(BrinRevmap *revmap)
BlockNumber mapBlk;
BlockNumber nblocks;
Relation irel = revmap->rm_irel;
- bool needLock = !RELATION_IS_LOCAL(irel);
/*
* Lock the metapage. This locks out concurrent extensions of the revmap,
@@ -570,10 +569,8 @@ revmap_physical_extend(BrinRevmap *revmap)
}
else
{
- if (needLock)
- LockRelationForExtension(irel, ExclusiveLock);
-
- buf = ReadBuffer(irel, P_NEW);
+ buf = ExtendBufferedRel(EB_REL(irel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
if (BufferGetBlockNumber(buf) != mapBlk)
{
/*
@@ -582,17 +579,11 @@ revmap_physical_extend(BrinRevmap *revmap)
* up and have caller start over. We will have to evacuate that
* page from under whoever is using it.
*/
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buf);
+ UnlockReleaseBuffer(buf);
return;
}
- LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
page = BufferGetPage(buf);
-
- if (needLock)
- UnlockRelationForExtension(irel, ExclusiveLock);
}
/* Check that it's a regular block (or an empty page) */
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index d5d748009ea..be1841de7bf 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -440,12 +440,10 @@ ginbuildempty(Relation index)
MetaBuffer;
/* An empty GIN index has two pages. */
- MetaBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(MetaBuffer, BUFFER_LOCK_EXCLUSIVE);
- RootBuffer =
- ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
+ MetaBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
+ RootBuffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
/* Initialize and xlog metabuffer and root buffer. */
START_CRIT_SECTION();
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 03fec1704e9..3c6f87236c5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -299,7 +299,6 @@ Buffer
GinNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -327,16 +326,8 @@ GinNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, GIN_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index e5d310d8362..13251d7e07d 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -736,6 +736,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
*/
needLock = !RELATION_IS_LOCAL(index);
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
npages = RelationGetNumberOfBlocks(index);
@@ -786,6 +790,10 @@ ginvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
stats->pages_free = totFreePages;
+ /*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this is still required?
+ */
if (needLock)
LockRelationForExtension(index, ExclusiveLock);
stats->num_pages = RelationGetNumberOfBlocks(index);
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index c3a3d49bca0..b5c1754e788 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -134,8 +134,8 @@ gistbuildempty(Relation index)
Buffer buffer;
/* Initialize the root page */
- buffer = ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ buffer = ExtendBufferedRel(EB_REL(index), INIT_FORKNUM, NULL,
+ EB_SKIP_EXTENSION_LOCK | EB_LOCK_FIRST);
/* Initialize and xlog buffer */
START_CRIT_SECTION();
diff --git a/src/backend/access/gist/gistutil.c b/src/backend/access/gist/gistutil.c
index a607464b979..70861884113 100644
--- a/src/backend/access/gist/gistutil.c
+++ b/src/backend/access/gist/gistutil.c
@@ -824,7 +824,6 @@ Buffer
gistNewBuffer(Relation r, Relation heaprel)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -877,17 +876,8 @@ gistNewBuffer(Relation r, Relation heaprel)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(r);
-
- if (needLock)
- LockRelationForExtension(r, ExclusiveLock);
-
- buffer = ReadBuffer(r, P_NEW);
- LockBuffer(buffer, GIST_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(r, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(r), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index 3f60d3274d2..cc711b04986 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -203,6 +203,9 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* we must already have processed any tuples due to be moved into such a
* page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 2d8fdec98ed..6d8af422609 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -206,14 +206,14 @@ _hash_getnewbuf(Relation rel, BlockNumber blkno, ForkNumber forkNum)
elog(ERROR, "access to noncontiguous page in hash index \"%s\"",
RelationGetRelationName(rel));
- /* smgr insists we use P_NEW to extend the relation */
+ /* smgr insists we explicitly extend the relation */
if (blkno == nblocks)
{
- buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_NORMAL, NULL);
+ buf = ExtendBufferedRel(EB_REL(rel), forkNum, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
if (BufferGetBlockNumber(buf) != blkno)
elog(ERROR, "unexpected hash relation size: %u, should be %u",
BufferGetBlockNumber(buf), blkno);
- LockBuffer(buf, HASH_WRITE);
}
else
{
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 0144c3ab571..41aa1c4ccd1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -882,7 +882,6 @@ _bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno, int access)
}
else
{
- bool needLock;
Page page;
Assert(access == BT_WRITE);
@@ -963,31 +962,16 @@ _bt_getbuf(Relation rel, Relation heaprel, BlockNumber blkno, int access)
}
/*
- * Extend the relation by one page.
- *
- * We have to use a lock to ensure no one else is extending the rel at
- * the same time, else we will both try to initialize the same new
- * page. We can skip locking for new or temp relations, however,
- * since no one else could be accessing them.
+ * Extend the relation by one page. Need to use RBM_ZERO_AND_LOCK or
+ * we risk a race condition against btvacuumscan --- see comments
+ * therein. This forces us to repeat the valgrind request that
+ * _bt_lockbuf() otherwise would make, as we can't use _bt_lockbuf()
+ * without introducing a race.
*/
- needLock = !RELATION_IS_LOCAL(rel);
-
- if (needLock)
- LockRelationForExtension(rel, ExclusiveLock);
-
- buf = ReadBuffer(rel, P_NEW);
-
- /* Acquire buffer lock on new page */
- _bt_lockbuf(rel, buf, BT_WRITE);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to
- * extend the relation some more. Note that we cannot release this
- * lock before we have buffer lock on the new page, or we risk a race
- * condition against btvacuumscan --- see comments therein.
- */
- if (needLock)
- UnlockRelationForExtension(rel, ExclusiveLock);
+ buf = ExtendBufferedRel(EB_REL(rel), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
+ if (!RelationUsesLocalBuffers(rel))
+ VALGRIND_MAKE_MEM_DEFINED(BufferGetPage(buf), BLCKSZ);
/* Initialize the new page before returning it */
page = BufferGetPage(buf);
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 409a2c12100..75b4a2cc364 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -970,6 +970,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* write-lock on the left page before it adds a right page, so we must
* already have processed any tuples due to be moved into such a page.
*
+ * FIXME: Now that new pages are locked with RBM_ZERO_AND_LOCK, I don't
+ * think this issue still exists?
+ *
* We can skip locking for new or temp relations, however, since no one
* else could be accessing them.
*/
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4e7ff1d1603..190e4f76a9e 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -366,7 +366,6 @@ Buffer
SpGistNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -406,16 +405,8 @@ SpGistNewBuffer(Relation index)
ReleaseBuffer(buffer);
}
- /* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
diff --git a/src/backend/commands/sequence.c b/src/backend/commands/sequence.c
index f3d1779655b..ef014496782 100644
--- a/src/backend/commands/sequence.c
+++ b/src/backend/commands/sequence.c
@@ -377,7 +377,8 @@ fill_seq_fork_with_data(Relation rel, HeapTuple tuple, ForkNumber forkNum)
/* Initialize first page of relation with special magic number */
- buf = ReadBufferExtended(rel, forkNum, P_NEW, RBM_ZERO_AND_LOCK, NULL);
+ buf = ExtendBufferedRel(EB_REL(rel), forkNum, NULL,
+ EB_LOCK_FIRST | EB_SKIP_EXTENSION_LOCK);
Assert(BufferGetBlockNumber(buf) == 0);
page = BufferGetPage(buf);
diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index a6d9f09f315..d935ed8fbdf 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -353,7 +353,6 @@ Buffer
BloomNewBuffer(Relation index)
{
Buffer buffer;
- bool needLock;
/* First, try to get a page from FSM */
for (;;)
@@ -387,15 +386,8 @@ BloomNewBuffer(Relation index)
}
/* Must extend the file */
- needLock = !RELATION_IS_LOCAL(index);
- if (needLock)
- LockRelationForExtension(index, ExclusiveLock);
-
- buffer = ReadBuffer(index, P_NEW);
- LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
- if (needLock)
- UnlockRelationForExtension(index, ExclusiveLock);
+ buffer = ExtendBufferedRel(EB_REL(index), MAIN_FORKNUM, NULL,
+ EB_LOCK_FIRST);
return buffer;
}
--
2.38.0
v7-0009-heapam-Pass-number-of-required-pages-to-RelationG.patch
From cb9dd0f14adaaea3153214290d825d5d9d360396 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sun, 23 Oct 2022 14:44:43 -0700
Subject: [PATCH v7 09/14] heapam: Pass number of required pages to
RelationGetBufferForTuple()
A future commit will use this information to determine how aggressively to
extend the relation.
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
---
src/include/access/hio.h | 14 +++++++-
src/backend/access/heap/heapam.c | 56 +++++++++++++++++++++++++++++---
src/backend/access/heap/hio.c | 8 ++++-
3 files changed, 72 insertions(+), 6 deletions(-)
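To make the worst-case estimate in heap_multi_insert_pages() (added below) concrete, a rough worked example; the 8 kB block size, 24-byte page header, and zero reserved free space (fillfactor 100) are assumptions for the illustration, not values taken from the patch:

    /*
     * 100 tuples of 200 bytes each (200 is already MAXALIGNed):
     *   per-tuple cost    = sizeof(ItemIdData) + 200 = 4 + 200 = 204 bytes
     *   usable page space = BLCKSZ - SizeOfPageHeaderData - 0  = 8168 bytes
     *   tuples per page   = 8168 / 204                         = 40
     *   worst-case pages  = ceil(100 / 40)                     = 3
     *
     * heap_multi_insert() would thus pass num_pages = 3 to
     * RelationGetBufferForTuple() for the first page it fills.
     */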
diff --git a/src/include/access/hio.h b/src/include/access/hio.h
index 3f20b585326..a3c49b3025d 100644
--- a/src/include/access/hio.h
+++ b/src/include/access/hio.h
@@ -30,6 +30,17 @@ typedef struct BulkInsertStateData
{
BufferAccessStrategy strategy; /* our BULKWRITE strategy object */
Buffer current_buf; /* current insertion target page */
+
+ /*
+ * State for bulk extensions. Further pages that were unused at the time
+ * of the extension. They might be in use by the time we use them though,
+ * so rechecks are needed.
+ *
+ * XXX: It's possible these should live in RelationData instead, alongside
+ * targetblock.
+ */
+ BlockNumber next_free;
+ BlockNumber last_free;
} BulkInsertStateData;
@@ -38,6 +49,7 @@ extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertStateData *bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other);
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages);
#endif /* HIO_H */
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f7d9ce59a47..516704a9791 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1774,6 +1774,8 @@ GetBulkInsertState(void)
bistate = (BulkInsertState) palloc(sizeof(BulkInsertStateData));
bistate->strategy = GetAccessStrategy(BAS_BULKWRITE);
bistate->current_buf = InvalidBuffer;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
return bistate;
}
@@ -1847,7 +1849,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
*/
buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ 0);
/*
* We're about to do the actual insert -- but check for conflict first, to
@@ -2050,6 +2053,30 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
return tup;
}
+/*
+ * Helper for heap_multi_insert() that computes the number of full pages
+ */
+static int
+heap_multi_insert_pages(HeapTuple *heaptuples, int done, int ntuples, Size saveFreeSpace)
+{
+ size_t page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ int npages = 1;
+
+ for (int i = done; i < ntuples; i++)
+ {
+ size_t tup_sz = sizeof(ItemIdData) + MAXALIGN(heaptuples[i]->t_len);
+
+ if (page_avail < tup_sz)
+ {
+ npages++;
+ page_avail = BLCKSZ - SizeOfPageHeaderData - saveFreeSpace;
+ }
+ page_avail -= tup_sz;
+ }
+
+ return npages;
+}
+
/*
* heap_multi_insert - insert multiple tuples into a heap
*
@@ -2076,6 +2103,9 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Size saveFreeSpace;
bool need_tuple_data = RelationIsLogicallyLogged(relation);
bool need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+ bool starting_with_empty_page = false;
+ int npages = 0;
+ int npages_used = 0;
/* currently not needed (thus unsupported) for heap_multi_insert() */
Assert(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2126,13 +2156,29 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
while (ndone < ntuples)
{
Buffer buffer;
- bool starting_with_empty_page;
bool all_visible_cleared = false;
bool all_frozen_set = false;
int nthispage;
CHECK_FOR_INTERRUPTS();
+ /*
+ * Compute number of pages needed to insert tuples in the worst case.
+ * This will be used to determine how much to extend the relation by
+ * in RelationGetBufferForTuple(), if needed. If we filled a prior
+ * page from scratch, we can just update our last computation, but if
+ * we started with a partially filled page recompute from scratch, the
+ * number of potentially required pages can vary due to tuples needing
+ * to fit onto the page, page headers etc.
+ */
+ if (ndone == 0 || !starting_with_empty_page)
+ {
+ npages = heap_multi_insert_pages(heaptuples, ndone, ntuples, saveFreeSpace);
+ npages_used = 0;
+ }
+ else
+ npages_used++;
+
/*
* Find buffer where at least the next tuple will fit. If the page is
* all-visible, this will also pin the requisite visibility map page.
@@ -2142,7 +2188,8 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
*/
buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
InvalidBuffer, options, bistate,
- &vmbuffer, NULL);
+ &vmbuffer, NULL,
+ npages - npages_used);
page = BufferGetPage(buffer);
starting_with_empty_page = PageGetMaxOffsetNumber(page) == 0;
@@ -3576,7 +3623,8 @@ l2:
/* It doesn't fit, must use RelationGetBufferForTuple. */
newbuf = RelationGetBufferForTuple(relation, heaptup->t_len,
buffer, 0, NULL,
- &vmbuffer_new, &vmbuffer);
+ &vmbuffer_new, &vmbuffer,
+ 0);
/* We're all done. */
break;
}
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 7479212d4e0..65886839e70 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -275,6 +275,11 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
* Returns pinned and exclusive-locked buffer of a page in given relation
* with free space >= given len.
*
+ * If num_pages is > 1, the relation will be extended by at least that many
+ * pages when we decide to extend the relation. This is more efficient for
+ * callers that know they will need multiple pages
+ * (e.g. heap_multi_insert()).
+ *
* If otherBuffer is not InvalidBuffer, then it references a previously
* pinned buffer of another page in the same relation; on return, this
* buffer will also be exclusive-locked. (This case is used by heap_update;
@@ -333,7 +338,8 @@ Buffer
RelationGetBufferForTuple(Relation relation, Size len,
Buffer otherBuffer, int options,
BulkInsertState bistate,
- Buffer *vmbuffer, Buffer *vmbuffer_other)
+ Buffer *vmbuffer, Buffer *vmbuffer_other,
+ int num_pages)
{
bool use_fsm = !(options & HEAP_INSERT_SKIP_FSM);
Buffer buffer = InvalidBuffer;
--
2.38.0
v7-0010-hio-Relax-rules-for-calling-GetVisibilityMapPins.patch (text/x-diff)
From 80f005a7be1ec9b71c339b480ddaa71434ba4478 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 3 Apr 2023 11:18:10 -0700
Subject: [PATCH v7 10/14] hio: Relax rules for calling GetVisibilityMapPins()
Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230403190030.fk2frxv6faklrseb@awork3.anarazel.de
Backpatch:
---
src/backend/access/heap/hio.c | 37 ++++++++++++++++++++++++-----------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 65886839e70..e77437fcac1 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -131,9 +131,9 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* For each heap page which is all-visible, acquire a pin on the appropriate
* visibility map page, if we haven't already got one.
*
- * buffer2 may be InvalidBuffer, if only one buffer is involved. buffer1
- * must not be InvalidBuffer. If both buffers are specified, block1 must
- * be less than block2.
+ * To avoid complexity in the callers, either buffer1 or buffer2 may be
+ * InvalidBuffer if only one buffer is involved. For the same reason, block2
+ * may be smaller than block1.
*/
static void
GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
@@ -143,6 +143,26 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
bool need_to_pin_buffer1;
bool need_to_pin_buffer2;
+ /*
+ * Swap buffers around to handle case of a single block/buffer, and to
+ * handle if lock ordering rules require to lock block2 first.
+ */
+ if (!BufferIsValid(buffer1) ||
+ (BufferIsValid(buffer2) && block1 > block2))
+ {
+ Buffer tmpbuf = buffer1;
+ Buffer *tmpvmbuf = vmbuffer1;
+ BlockNumber tmpblock = block1;
+
+ buffer1 = buffer2;
+ vmbuffer1 = vmbuffer2;
+ block1 = block2;
+
+ buffer2 = tmpbuf;
+ vmbuffer2 = tmpvmbuf;
+ block2 = tmpblock;
+ }
+
Assert(BufferIsValid(buffer1));
Assert(buffer2 == InvalidBuffer || block1 <= block2);
@@ -508,14 +528,9 @@ loop:
* done a bit of extra work for no gain, but there's no real harm
* done.
*/
- if (otherBuffer == InvalidBuffer || targetBlock <= otherBlock)
- GetVisibilityMapPins(relation, buffer, otherBuffer,
- targetBlock, otherBlock, vmbuffer,
- vmbuffer_other);
- else
- GetVisibilityMapPins(relation, otherBuffer, buffer,
- otherBlock, targetBlock, vmbuffer_other,
- vmbuffer);
+ GetVisibilityMapPins(relation, buffer, otherBuffer,
+ targetBlock, otherBlock, vmbuffer,
+ vmbuffer_other);
/*
* Now we can check to see if there's enough free space here. If so,
--
2.38.0
v7-0011-hio-Don-t-pin-the-VM-while-holding-buffer-lock-wh.patch (text/x-diff)
From 67c5922c496087c348ac90dd924ba78309892cea Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 3 Apr 2023 11:26:16 -0700
Subject: [PATCH v7 11/14] hio: Don't pin the VM while holding buffer lock
while extending
Discussion: http://postgr.es/m/20230325025740.wzvchp2kromw4zqz@awork3.anarazel.de
---
src/backend/access/heap/hio.c | 126 +++++++++++++++++++++++-----------
1 file changed, 85 insertions(+), 41 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index e77437fcac1..561d7329058 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -134,14 +134,17 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
* To avoid complexity in the callers, either buffer1 or buffer2 may be
* InvalidBuffer if only one buffer is involved. For the same reason, block2
* may be smaller than block1.
+ *
+ * Returns whether buffer locks were temporarily released.
*/
-static void
+static bool
GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
BlockNumber block1, BlockNumber block2,
Buffer *vmbuffer1, Buffer *vmbuffer2)
{
bool need_to_pin_buffer1;
bool need_to_pin_buffer2;
+ bool released_locks = false;
/*
* Swap buffers around to handle case of a single block/buffer, and to
@@ -175,9 +178,10 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
&& PageIsAllVisible(BufferGetPage(buffer2))
&& !visibilitymap_pin_ok(block2, *vmbuffer2);
if (!need_to_pin_buffer1 && !need_to_pin_buffer2)
- return;
+ break;
/* We must unlock both buffers before doing any I/O. */
+ released_locks = true;
LockBuffer(buffer1, BUFFER_LOCK_UNLOCK);
if (buffer2 != InvalidBuffer && buffer2 != buffer1)
LockBuffer(buffer2, BUFFER_LOCK_UNLOCK);
@@ -203,6 +207,8 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
|| (need_to_pin_buffer1 && need_to_pin_buffer2))
break;
}
+
+ return released_locks;
}
/*
@@ -371,6 +377,8 @@ RelationGetBufferForTuple(Relation relation, Size len,
BlockNumber targetBlock,
otherBlock;
bool needLock;
+ bool unlockedTargetBuffer;
+ bool recheckVmPins;
len = MAXALIGN(len); /* be conservative */
@@ -651,6 +659,9 @@ loop:
if (needLock)
UnlockRelationForExtension(relation, ExclusiveLock);
+ unlockedTargetBuffer = false;
+ targetBlock = BufferGetBlockNumber(buffer);
+
/*
* We need to initialize the empty new page. Double-check that it really
* is empty (this should never happen, but if it does we don't want to
@@ -660,75 +671,108 @@ loop:
if (!PageIsNew(page))
elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- BufferGetBlockNumber(buffer),
+ targetBlock,
RelationGetRelationName(relation));
PageInit(page, BufferGetPageSize(buffer), 0);
MarkBufferDirty(buffer);
/*
- * The page is empty, pin vmbuffer to set all_frozen bit.
+ * The page is empty, pin vmbuffer to set all_frozen bit. We don't want to
+ * do IO while the buffer is locked, so we unlock the page first if IO is
+ * needed (necessitating checks below).
*/
if (options & HEAP_INSERT_FROZEN)
{
- Assert(PageGetMaxOffsetNumber(BufferGetPage(buffer)) == 0);
- visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
+ Assert(PageGetMaxOffsetNumber(page) == 0);
+
+ if (!visibilitymap_pin_ok(targetBlock, *vmbuffer))
+ {
+ unlockedTargetBuffer = true;
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ visibilitymap_pin(relation, targetBlock, vmbuffer);
+ }
}
/*
- * Lock the other buffer. It's guaranteed to be of a lower page number
- * than the new page. To conform with the deadlock prevent rules, we ought
- * to lock otherBuffer first, but that would give other backends a chance
- * to put tuples on our page. To reduce the likelihood of that, attempt to
- * lock the other buffer conditionally, that's very likely to work.
- * Otherwise we need to lock buffers in the correct order, and retry if
- * the space has been used in the mean time.
+ * Reacquire locks if necessary.
*
- * Alternatively, we could acquire the lock on otherBuffer before
- * extending the relation, but that'd require holding the lock while
- * performing IO, which seems worse than an unlikely retry.
+ * If the target buffer was unlocked above, or is unlocked while
+ * reacquiring the lock on otherBuffer below, it's unlikely, but possible,
+ * that another backend used space on this page. We check for that below,
+ * and retry if necessary.
*/
- if (otherBuffer != InvalidBuffer)
+ recheckVmPins = false;
+ if (unlockedTargetBuffer)
{
+ /* released lock on target buffer above */
+ if (otherBuffer != InvalidBuffer)
+ LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+ LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+ recheckVmPins = true;
+ }
+ else if (otherBuffer != InvalidBuffer)
+ {
+ /*
+ * We did not release the target buffer, and otherBuffer is valid,
+ * need to lock the other buffer. It's guaranteed to be of a lower
+ * page number than the new page. To conform with the deadlock
+ * prevent rules, we ought to lock otherBuffer first, but that would
+ * give other backends a chance to put tuples on our page. To reduce
+ * the likelihood of that, attempt to lock the other buffer
+ * conditionally, that's very likely to work.
+ *
+ * Alternatively, we could acquire the lock on otherBuffer before
+ * extending the relation, but that'd require holding the lock while
+ * performing IO, which seems worse than an unlikely retry.
+ */
Assert(otherBuffer != buffer);
- targetBlock = BufferGetBlockNumber(buffer);
Assert(targetBlock > otherBlock);
if (unlikely(!ConditionalLockBuffer(otherBuffer)))
{
+ unlockedTargetBuffer = true;
LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
}
+ recheckVmPins = true;
+ }
- /*
- * Because the buffers were unlocked for a while, it's possible,
- * although unlikely, that an all-visible flag became set or that
- * somebody used up the available space in the new page. We can use
- * GetVisibilityMapPins to deal with the first case. In the second
- * case, just retry from start.
- */
- GetVisibilityMapPins(relation, otherBuffer, buffer,
- otherBlock, targetBlock, vmbuffer_other,
- vmbuffer);
+ /*
+ * If one of the buffers was unlocked (always the case if otherBuffer is
+ * valid), it's possible, although unlikely, that an all-visible flag
+ * became set. We can use GetVisibilityMapPins to deal with that. It's
+ * possible that GetVisibilityMapPins() might need to temporarily release
+ * buffer locks, in which case we'll need to check if there's still enough
+ * space on the page below.
+ */
+ if (recheckVmPins)
+ {
+ if (GetVisibilityMapPins(relation, otherBuffer, buffer,
+ otherBlock, targetBlock, vmbuffer_other,
+ vmbuffer))
+ unlockedTargetBuffer = true;
+ }
- /*
- * Note that we have to check the available space even if our
- * conditional lock succeeded, because GetVisibilityMapPins might've
- * transiently released lock on the target buffer to acquire a VM pin
- * for the otherBuffer.
- */
- if (len > PageGetHeapFreeSpace(page))
+ /*
+ * If the target buffer was temporarily unlocked since the relation
+ * extension, it's possible, although unlikely, that all the space on the
+ * page was already used. If so, we just retry from the start. If we
+ * didn't unlock, something has gone wrong if there's not enough space -
+ * the test at the top should have prevented reaching this case.
+ */
+ pageFreeSpace = PageGetHeapFreeSpace(page);
+ if (len > pageFreeSpace)
+ {
+ if (unlockedTargetBuffer)
{
- LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+ if (otherBuffer != InvalidBuffer)
+ LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
UnlockReleaseBuffer(buffer);
goto loop;
}
- }
- else if (len > PageGetHeapFreeSpace(page))
- {
- /* We should not get here given the test at the top */
elog(PANIC, "tuple is too big: size %zu", len);
}
@@ -741,7 +785,7 @@ loop:
* current backend to make more insertions or not, which is probably a
* good bet most of the time. So for now, don't add it to FSM yet.
*/
- RelationSetTargetBlock(relation, BufferGetBlockNumber(buffer));
+ RelationSetTargetBlock(relation, targetBlock);
return buffer;
}
--
2.38.0
v7-0012-hio-Use-ExtendBufferedRelBy.patch (text/x-diff)
From 6cfe721881267f735865d34304d9725b85fff35b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 28 Mar 2023 18:39:10 -0700
Subject: [PATCH v7 12/14] hio: Use ExtendBufferedRelBy()
Reviewed-by: Melanie Plageman <melanieplageman@gmail.com>
Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
---
src/backend/access/heap/hio.c | 372 ++++++++++++++++++++--------------
1 file changed, 218 insertions(+), 154 deletions(-)
diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 561d7329058..574c56cfb5f 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -212,87 +212,197 @@ GetVisibilityMapPins(Relation relation, Buffer buffer1, Buffer buffer2,
}
/*
- * Extend a relation by multiple blocks to avoid future contention on the
- * relation extension lock. Our goal is to pre-extend the relation by an
- * amount which ramps up as the degree of contention ramps up, but limiting
- * the result to some sane overall value.
+ * Extend the relation. By multiple pages, if beneficial.
+ *
+ * If the caller needs multiple pages (num_pages > 1), we always try to extend
+ * by at least that much.
+ *
+ * If there is contention on the extension lock, we don't just extend "for
+ * ourselves", but we try to help others. We can do so by adding empty pages
+ * into the FSM. Typically there is no contention when we can't use the FSM.
+ *
+ * We do have to limit the number of pages to extend by to some value, as the
+ * buffers for all the extended pages need to, temporarily, be pinned. For now
+ * we define MAX_BUFFERS_TO_EXTEND_BY to be 64 buffers, it's hard to see
+ * benefits with higher numbers. This partially is because copyfrom.c's
+ * MAX_BUFFERED_TUPLES / MAX_BUFFERED_BYTES prevents larger multi_inserts.
*/
-static void
-RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
+static Buffer
+RelationAddBlocks(Relation relation, BulkInsertState bistate,
+ int num_pages, bool use_fsm, bool *did_unlock)
{
- BlockNumber blockNum,
- firstBlock = InvalidBlockNumber;
- int extraBlocks;
- int lockWaiters;
-
- /* Use the length of the lock wait queue to judge how much to extend. */
- lockWaiters = RelationExtensionLockWaiterCount(relation);
- if (lockWaiters <= 0)
- return;
+#define MAX_BUFFERS_TO_EXTEND_BY 64
+ Buffer victim_buffers[MAX_BUFFERS_TO_EXTEND_BY];
+ BlockNumber first_block = InvalidBlockNumber;
+ BlockNumber last_block = InvalidBlockNumber;
+ uint32 extend_by_pages;
+ uint32 not_in_fsm_pages;
+ Buffer buffer;
+ Page page;
/*
- * It might seem like multiplying the number of lock waiters by as much as
- * 20 is too aggressive, but benchmarking revealed that smaller numbers
- * were insufficient. 512 is just an arbitrary cap to prevent
- * pathological results.
+ * Determine how many pages to try to extend by.
*/
- extraBlocks = Min(512, lockWaiters * 20);
-
- do
+ if (bistate == NULL && !use_fsm)
{
- Buffer buffer;
- Page page;
- Size freespace;
-
/*
- * Extend by one page. This should generally match the main-line
- * extension code in RelationGetBufferForTuple, except that we hold
- * the relation extension lock throughout, and we don't immediately
- * initialize the page (see below).
+ * If we have neither bistate, nor can use the FSM, we can't bulk
+ * extend - there'd be no way to find the additional pages.
*/
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
- page = BufferGetPage(buffer);
-
- if (!PageIsNew(page))
- elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- BufferGetBlockNumber(buffer),
- RelationGetRelationName(relation));
-
- /*
- * Add the page to the FSM without initializing. If we were to
- * initialize here, the page would potentially get flushed out to disk
- * before we add any useful content. There's no guarantee that that'd
- * happen before a potential crash, so we need to deal with
- * uninitialized pages anyway, thus avoid the potential for
- * unnecessary writes.
- */
-
- /* we'll need this info below */
- blockNum = BufferGetBlockNumber(buffer);
- freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
-
- UnlockReleaseBuffer(buffer);
-
- /* Remember first block number thus added. */
- if (firstBlock == InvalidBlockNumber)
- firstBlock = blockNum;
-
- /*
- * Immediately update the bottom level of the FSM. This has a good
- * chance of making this page visible to other concurrently inserting
- * backends, and we want that to happen without delay.
- */
- RecordPageWithFreeSpace(relation, blockNum, freespace);
+ extend_by_pages = 1;
+ }
+ else
+ {
+ uint32 waitcount;
+
+ /*
+ * Try to extend at least by the number of pages the caller needs. We
+ * can remember the additional pages (either via FSM or bistate).
+ */
+ extend_by_pages = num_pages;
+
+ if (!RELATION_IS_LOCAL(relation))
+ waitcount = RelationExtensionLockWaiterCount(relation);
+ else
+ waitcount = 0;
+
+ /*
+ * Multiply the number of pages to extend by the number of waiters. Do
+ * this even if we're not using the FSM, as it still relieves
+ * contention, by deferring the next time this backend needs to
+ * extend. In that case the extended pages will be found via
+ * bistate->next_free.
+ */
+ extend_by_pages += extend_by_pages * waitcount;
+
+ /*
+ * Can't extend by more than MAX_BUFFERS_TO_EXTEND_BY, we need to pin
+ * them all concurrently.
+ */
+ extend_by_pages = Min(extend_by_pages, MAX_BUFFERS_TO_EXTEND_BY);
}
- while (--extraBlocks > 0);
/*
- * Updating the upper levels of the free space map is too expensive to do
- * for every block, but it's worth doing once at the end to make sure that
- * subsequent insertion activity sees all of those nifty free pages we
- * just inserted.
+ * How many of the extended pages should be entered into the FSM?
+ *
+ * If we have a bistate, only enter pages that we don't need ourselves
+ * into the FSM. Otherwise every other backend will immediately try to
+ * use the pages this backend needs for itself, causing unnecessary
+ * contention. If we don't have a bistate, we can't avoid the FSM.
+ *
+ * Never enter the page returned into the FSM, we'll immediately use it.
*/
- FreeSpaceMapVacuumRange(relation, firstBlock, blockNum + 1);
+ if (num_pages > 1 && bistate == NULL)
+ not_in_fsm_pages = 1;
+ else
+ not_in_fsm_pages = num_pages;
+
+ /* prepare to put another buffer into the bistate */
+ if (bistate && bistate->current_buf != InvalidBuffer)
+ {
+ ReleaseBuffer(bistate->current_buf);
+ bistate->current_buf = InvalidBuffer;
+ }
+
+ /*
+ * Extend the relation. We ask for the first returned page to be locked,
+ * so that we are sure that nobody has inserted into the page
+ * concurrently.
+ *
+ * With the current MAX_BUFFERS_TO_EXTEND_BY there's no danger of
+ * [auto]vacuum trying to truncate later pages as REL_TRUNCATE_MINIMUM is
+ * way larger.
+ */
+ first_block = ExtendBufferedRelBy(EB_REL(relation), MAIN_FORKNUM,
+ bistate ? bistate->strategy : NULL,
+ EB_LOCK_FIRST,
+ extend_by_pages,
+ victim_buffers,
+ &extend_by_pages);
+ buffer = victim_buffers[0]; /* the buffer the function will return */
+ last_block = first_block + (extend_by_pages - 1);
+ Assert(first_block == BufferGetBlockNumber(buffer));
+
+ /*
+ * Relation is now extended. Initialize the page. We do this here, before
+ * potentially releasing the lock on the page, because it allows us to
+ * double check that the page contents are empty (this should never
+ * happen, but if it does we don't want to risk wiping out valid data).
+ */
+ page = BufferGetPage(buffer);
+ if (!PageIsNew(page))
+ elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
+ first_block,
+ RelationGetRelationName(relation));
+
+ PageInit(page, BufferGetPageSize(buffer), 0);
+ MarkBufferDirty(buffer);
+
+ /*
+ * If we decided to put pages into the FSM, release the buffer lock (but
+ * not pin), we don't want to do IO while holding a buffer lock. This will
+ * necessitate a bit more extensive checking in our caller.
+ */
+ if (use_fsm && not_in_fsm_pages < extend_by_pages)
+ {
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+ *did_unlock = true;
+ }
+
+ /*
+ * Relation is now extended. Release pins on all buffers, except for the
+ * first (which we'll return). If we decided to put pages into the FSM,
+ * we can do that as part of the same loop.
+ */
+ for (uint32 i = 1; i < extend_by_pages; i++)
+ {
+ BlockNumber curBlock = first_block + i;
+
+ Assert(curBlock == BufferGetBlockNumber(victim_buffers[i]));
+ Assert(BlockNumberIsValid(curBlock));
+
+ ReleaseBuffer(victim_buffers[i]);
+
+ if (use_fsm && i >= not_in_fsm_pages)
+ {
+ Size freespace = BufferGetPageSize(victim_buffers[i]) -
+ SizeOfPageHeaderData;
+
+ RecordPageWithFreeSpace(relation, curBlock, freespace);
+ }
+ }
+
+ if (use_fsm && not_in_fsm_pages < extend_by_pages)
+ {
+ BlockNumber first_fsm_block = first_block + not_in_fsm_pages;
+
+ FreeSpaceMapVacuumRange(relation, first_fsm_block, last_block);
+ }
+
+ if (bistate)
+ {
+ /*
+ * Remember the additional pages we extended by, so we later can use
+ * them without looking into the FSM.
+ */
+ if (extend_by_pages > 1)
+ {
+ bistate->next_free = first_block + 1;
+ bistate->last_free = last_block;
+ }
+ else
+ {
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
+ }
+
+ /* maintain bistate->current_buf */
+ IncrBufferRefCount(buffer);
+ bistate->current_buf = buffer;
+ }
+
+ return buffer;
+#undef MAX_BUFFERS_TO_EXTEND_BY
}
/*
@@ -376,12 +486,14 @@ RelationGetBufferForTuple(Relation relation, Size len,
targetFreeSpace = 0;
BlockNumber targetBlock,
otherBlock;
- bool needLock;
bool unlockedTargetBuffer;
bool recheckVmPins;
len = MAXALIGN(len); /* be conservative */
+ if (num_pages <= 0)
+ num_pages = 1;
+
/* Bulk insert is not supported for updates, only inserts. */
Assert(otherBuffer == InvalidBuffer || !bistate);
@@ -581,102 +693,54 @@ loop:
ReleaseBuffer(buffer);
}
- /* Without FSM, always fall out of the loop and extend */
- if (!use_fsm)
- break;
-
- /*
- * Update FSM as to condition of this page, and ask for another page
- * to try.
- */
- targetBlock = RecordAndGetPageWithFreeSpace(relation,
- targetBlock,
- pageFreeSpace,
- targetFreeSpace);
- }
-
- /*
- * Have to extend the relation.
- *
- * We have to use a lock to ensure no one else is extending the rel at the
- * same time, else we will both try to initialize the same new page. We
- * can skip locking for new or temp relations, however, since no one else
- * could be accessing them.
- */
- needLock = !RELATION_IS_LOCAL(relation);
-
- /*
- * If we need the lock but are not able to acquire it immediately, we'll
- * consider extending the relation by multiple blocks at a time to manage
- * contention on the relation extension lock. However, this only makes
- * sense if we're using the FSM; otherwise, there's no point.
- */
- if (needLock)
- {
- if (!use_fsm)
- LockRelationForExtension(relation, ExclusiveLock);
- else if (!ConditionalLockRelationForExtension(relation, ExclusiveLock))
+ if (bistate && bistate->next_free != InvalidBlockNumber)
{
- /* Couldn't get the lock immediately; wait for it. */
- LockRelationForExtension(relation, ExclusiveLock);
+ Assert(bistate->next_free <= bistate->last_free);
/*
- * Check if some other backend has extended a block for us while
- * we were waiting on the lock.
+ * We bulk extended the relation before, and there are still some
+ * unused pages from that extension, so we don't need to look in
+ * the FSM for a new page. But do record the free space from the
+ * last page, somebody might insert narrower tuples later.
*/
- targetBlock = GetPageWithFreeSpace(relation, targetFreeSpace);
+ if (use_fsm)
+ RecordPageWithFreeSpace(relation, targetBlock, pageFreeSpace);
- /*
- * If some other waiter has already extended the relation, we
- * don't need to do so; just use the existing freespace.
- */
- if (targetBlock != InvalidBlockNumber)
+ targetBlock = bistate->next_free;
+ if (bistate->next_free >= bistate->last_free)
{
- UnlockRelationForExtension(relation, ExclusiveLock);
- goto loop;
+ bistate->next_free = InvalidBlockNumber;
+ bistate->last_free = InvalidBlockNumber;
}
-
- /* Time to bulk-extend. */
- RelationAddExtraBlocks(relation, bistate);
+ else
+ bistate->next_free++;
+ }
+ else if (!use_fsm)
+ {
+ /* Without FSM, always fall out of the loop and extend */
+ break;
+ }
+ else
+ {
+ /*
+ * Update FSM as to condition of this page, and ask for another
+ * page to try.
+ */
+ targetBlock = RecordAndGetPageWithFreeSpace(relation,
+ targetBlock,
+ pageFreeSpace,
+ targetFreeSpace);
}
}
- /*
- * In addition to whatever extension we performed above, we always add at
- * least one block to satisfy our own request.
- *
- * XXX This does an lseek - rather expensive - but at the moment it is the
- * only way to accurately determine how many blocks are in a relation. Is
- * it worth keeping an accurate file length in shared memory someplace,
- * rather than relying on the kernel to do it for us?
- */
- buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
-
- /*
- * Release the file-extension lock; it's now OK for someone else to extend
- * the relation some more.
- */
- if (needLock)
- UnlockRelationForExtension(relation, ExclusiveLock);
-
+ /* Have to extend the relation */
unlockedTargetBuffer = false;
+ buffer = RelationAddBlocks(relation, bistate, num_pages, use_fsm, &unlockedTargetBuffer);
+ recheckVmPins = unlockedTargetBuffer;
+
targetBlock = BufferGetBlockNumber(buffer);
-
- /*
- * We need to initialize the empty new page. Double-check that it really
- * is empty (this should never happen, but if it does we don't want to
- * risk wiping out valid data).
- */
page = BufferGetPage(buffer);
- if (!PageIsNew(page))
- elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
- targetBlock,
- RelationGetRelationName(relation));
-
- PageInit(page, BufferGetPageSize(buffer), 0);
- MarkBufferDirty(buffer);
-
/*
* The page is empty, pin vmbuffer to set all_frozen bit. We don't want to
* do IO while the buffer is locked, so we unlock the page first if IO is
@@ -688,8 +752,9 @@ loop:
if (!visibilitymap_pin_ok(targetBlock, *vmbuffer))
{
+ if (!unlockedTargetBuffer)
+ LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
unlockedTargetBuffer = true;
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
visibilitymap_pin(relation, targetBlock, vmbuffer);
}
}
@@ -702,7 +767,6 @@ loop:
* that another backend used space on this page. We check for that below,
* and retry if necessary.
*/
- recheckVmPins = false;
if (unlockedTargetBuffer)
{
/* released lock on target buffer above */
--
2.38.0
v7-0013-Convert-a-few-places-to-ExtendBufferedRelTo.patch (text/x-diff)
From 436a1d541480cf07e911455e0d5cb149a7437097 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 1 Mar 2023 13:08:17 -0800
Subject: [PATCH v7 13/14] Convert a few places to ExtendBufferedRelTo
---
src/backend/access/heap/visibilitymap.c | 80 +++++++---------------
src/backend/access/transam/xlogutils.c | 29 ++------
src/backend/storage/freespace/freespace.c | 83 +++++++----------------
3 files changed, 55 insertions(+), 137 deletions(-)
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index 114d1b42b3e..ac91d1a14da 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -126,7 +126,7 @@
/* prototypes for internal routines */
static Buffer vm_readbuf(Relation rel, BlockNumber blkno, bool extend);
-static void vm_extend(Relation rel, BlockNumber vm_nblocks);
+static Buffer vm_extend(Relation rel, BlockNumber vm_nblocks);
/*
@@ -574,22 +574,29 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. It's always safe to clear bits, so it's better to clear
+ * corrupt pages than error out.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with vm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM])
{
if (extend)
- vm_extend(rel, blkno + 1);
+ buf = vm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
+ RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. It's
- * always safe to clear bits, so it's better to clear corrupt pages than
- * error out.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -602,8 +609,6 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, VISIBILITYMAP_FORKNUM, blkno,
- RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -618,51 +623,16 @@ vm_readbuf(Relation rel, BlockNumber blkno, bool extend)
* Ensure that the visibility map fork is at least vm_nblocks long, extending
* it if necessary with zeroed pages.
*/
-static void
+static Buffer
vm_extend(Relation rel, BlockNumber vm_nblocks)
{
- BlockNumber vm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
+ Buffer buf;
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the visibility map at the same time. It also locks out extension
- * of the main fork, unnecessarily, but extending the visibility map
- * happens seldom enough that it doesn't seem worthwhile to have a
- * separate lock tag type for it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the file first if it doesn't exist. If smgr_vm_nblocks is
- * positive then it must exist, no need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, VISIBILITYMAP_FORKNUM))
- smgrcreate(reln, VISIBILITYMAP_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[VISIBILITYMAP_FORKNUM] = InvalidBlockNumber;
- vm_nblocks_now = smgrnblocks(reln, VISIBILITYMAP_FORKNUM);
-
- /* Now extend the file */
- while (vm_nblocks_now < vm_nblocks)
- {
- smgrextend(reln, VISIBILITYMAP_FORKNUM, vm_nblocks_now, pg.data, false);
- vm_nblocks_now++;
- }
+ buf = ExtendBufferedRelTo(EB_REL(rel), VISIBILITYMAP_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ vm_nblocks,
+ RBM_ZERO_ON_ERROR);
/*
* Send a shared-inval message to force other backends to close any smgr
@@ -671,7 +641,7 @@ vm_extend(Relation rel, BlockNumber vm_nblocks)
* to keep checking for creation or extension of the file, which happens
* infrequently.
*/
- CacheInvalidateSmgr(reln->smgr_rlocator);
+ CacheInvalidateSmgr(RelationGetSmgr(rel)->smgr_rlocator);
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return buf;
}
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 2c28956b1aa..e174a2a8919 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -524,28 +524,13 @@ XLogReadBufferExtended(RelFileLocator rlocator, ForkNumber forknum,
/* OK to extend the file */
/* we do this in recovery only - no rel-extension lock needed */
Assert(InRecovery);
- buffer = InvalidBuffer;
- do
- {
- if (buffer != InvalidBuffer)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- }
- buffer = ReadBufferWithoutRelcache(rlocator, forknum,
- P_NEW, mode, NULL, true);
- }
- while (BufferGetBlockNumber(buffer) < blkno);
- /* Handle the corner case that P_NEW returns non-consecutive pages */
- if (BufferGetBlockNumber(buffer) != blkno)
- {
- if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
- LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
- ReleaseBuffer(buffer);
- buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno,
- mode, NULL, true);
- }
+ buffer = ExtendBufferedRelTo(EB_SMGR(smgr, RELPERSISTENCE_PERMANENT),
+ forknum,
+ NULL,
+ EB_PERFORMING_RECOVERY |
+ EB_SKIP_EXTENSION_LOCK,
+ blkno + 1,
+ mode);
}
recent_buffer_fast_path:
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 90c529958e7..2face615d07 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -98,7 +98,7 @@ static BlockNumber fsm_get_heap_blk(FSMAddress addr, uint16 slot);
static BlockNumber fsm_logical_to_physical(FSMAddress addr);
static Buffer fsm_readbuf(Relation rel, FSMAddress addr, bool extend);
-static void fsm_extend(Relation rel, BlockNumber fsm_nblocks);
+static Buffer fsm_extend(Relation rel, BlockNumber fsm_nblocks);
/* functions to convert amount of free space to a FSM category */
static uint8 fsm_space_avail_to_cat(Size avail);
@@ -558,24 +558,30 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
reln->smgr_cached_nblocks[FSM_FORKNUM] = 0;
}
- /* Handle requests beyond EOF */
+ /*
+ * For reading we use ZERO_ON_ERROR mode, and initialize the page if
+ * necessary. The FSM information is not accurate anyway, so it's better
+ * to clear corrupt pages than error out. Since the FSM changes are not
+ * WAL-logged, the so-called torn page problem on crash can lead to pages
+ * with corrupt headers, for example.
+ *
+ * We use the same path below to initialize pages when extending the
+ * relation, as a concurrent extension can end up with fsm_extend()
+ * returning an already-initialized page.
+ */
if (blkno >= reln->smgr_cached_nblocks[FSM_FORKNUM])
{
if (extend)
- fsm_extend(rel, blkno + 1);
+ buf = fsm_extend(rel, blkno + 1);
else
return InvalidBuffer;
}
+ else
+ buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
/*
- * Use ZERO_ON_ERROR mode, and initialize the page if necessary. The FSM
- * information is not accurate anyway, so it's better to clear corrupt
- * pages than error out. Since the FSM changes are not WAL-logged, the
- * so-called torn page problem on crash can lead to pages with corrupt
- * headers, for example.
- *
- * The initialize-the-page part is trickier than it looks, because of the
- * possibility of multiple backends doing this concurrently, and our
+ * Initializing the page when needed is trickier than it looks, because of
+ * the possibility of multiple backends doing this concurrently, and our
* desire to not uselessly take the buffer lock in the normal path where
* the page is OK. We must take the lock to initialize the page, so
* recheck page newness after we have the lock, in case someone else
@@ -588,7 +594,6 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* long as it doesn't depend on the page header having correct contents.
* Current usage is safe because PageGetContents() does not require that.
*/
- buf = ReadBufferExtended(rel, FSM_FORKNUM, blkno, RBM_ZERO_ON_ERROR, NULL);
if (PageIsNew(BufferGetPage(buf)))
{
LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
@@ -604,56 +609,14 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* it if necessary with empty pages. And by empty, I mean pages filled
* with zeros, meaning there's no free space.
*/
-static void
+static Buffer
fsm_extend(Relation rel, BlockNumber fsm_nblocks)
{
- BlockNumber fsm_nblocks_now;
- PGAlignedBlock pg = {0};
- SMgrRelation reln;
-
-
- /*
- * We use the relation extension lock to lock out other backends trying to
- * extend the FSM at the same time. It also locks out extension of the
- * main fork, unnecessarily, but extending the FSM happens seldom enough
- * that it doesn't seem worthwhile to have a separate lock tag type for
- * it.
- *
- * Note that another backend might have extended or created the relation
- * by the time we get the lock.
- */
- LockRelationForExtension(rel, ExclusiveLock);
-
- /*
- * Caution: re-using this smgr pointer could fail if the relcache entry
- * gets closed. It's safe as long as we only do smgr-level operations
- * between here and the last use of the pointer.
- */
- reln = RelationGetSmgr(rel);
-
- /*
- * Create the FSM file first if it doesn't exist. If
- * smgr_cached_nblocks[FSM_FORKNUM] is positive then it must exist, no
- * need for an smgrexists call.
- */
- if ((reln->smgr_cached_nblocks[FSM_FORKNUM] == 0 ||
- reln->smgr_cached_nblocks[FSM_FORKNUM] == InvalidBlockNumber) &&
- !smgrexists(reln, FSM_FORKNUM))
- smgrcreate(reln, FSM_FORKNUM, false);
-
- /* Invalidate cache so that smgrnblocks() asks the kernel. */
- reln->smgr_cached_nblocks[FSM_FORKNUM] = InvalidBlockNumber;
- fsm_nblocks_now = smgrnblocks(reln, FSM_FORKNUM);
-
- /* Extend as needed. */
- while (fsm_nblocks_now < fsm_nblocks)
- {
- smgrextend(reln, FSM_FORKNUM, fsm_nblocks_now,
- pg.data, false);
- fsm_nblocks_now++;
- }
-
- UnlockRelationForExtension(rel, ExclusiveLock);
+ return ExtendBufferedRelTo(EB_REL(rel), FSM_FORKNUM, NULL,
+ EB_CREATE_FORK_IF_NEEDED |
+ EB_CLEAR_SIZE_CACHE,
+ fsm_nblocks,
+ RBM_ZERO_ON_ERROR);
}
/*
--
2.38.0
On Wed, Apr 5, 2023 at 7:32 AM Andres Freund <andres@anarazel.de> wrote:
Portability nit: mdzeroextend() doesn't know whether posix_fallocate() was
used in FileFallocate().
Fair point. I would however like to see a different error message for the two
ways of extending, at least initially. What about referencing FileFallocate()?
Seems logical.
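For illustration only, a rough sketch of how the two extension paths in mdzeroextend() could report errors differently. The structure and the names use_fallocate, v, seekpos and numblocks are assumptions made for this sketch, and FileZero() merely stands in for the non-fallocate write path; this is not the code that was actually committed:
/* sketch, not committed code: distinguish the two ways of extending */
int			ret;

if (use_fallocate)
{
	ret = FileFallocate(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks,
						WAIT_EVENT_DATA_FILE_EXTEND);
	if (ret != 0)
		ereport(ERROR,
				errcode_for_file_access(),
				errmsg("could not extend file \"%s\" with FileFallocate(): %m",
					   FilePathName(v->mdfd_vfd)),
				errhint("Check free disk space."));
}
else
{
	ret = FileZero(v->mdfd_vfd, seekpos, (off_t) BLCKSZ * numblocks,
				   WAIT_EVENT_DATA_FILE_EXTEND);
	if (ret < 0)
		ereport(ERROR,
				errcode_for_file_access(),
				errmsg("could not extend file \"%s\": %m",
					   FilePathName(v->mdfd_vfd)),
				errhint("Check free disk space."));
}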
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,
On 2023-04-04 17:39:45 -0700, Andres Freund wrote:
I'm planning to push the patches up to the hio.c changes soon, unless somebody
would like me to hold off.
Done that.
After that I'm planning to wait for a buildfarm cycle, and push the changes
necessary to use bulk extension in hio.c (the main win).
Working on that. Might end up being tomorrow.
I might split the patch to use ExtendBufferedRelTo() into two, one for
vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
complicated and has more of a complicated history (see [1]).
I've pushed the vm_extend() and fsm_extend() piece, and did split out the
xlogutils.c case.
Greetings,
Andres Freund
Hi,
On 2023-04-05 18:46:16 -0700, Andres Freund wrote:
On 2023-04-04 17:39:45 -0700, Andres Freund wrote:
After that I'm planning to wait for a buildfarm cycle, and push the changes
necessary to use bulk extension in hio.c (the main win).
Working on that. Might end up being tomorrow.
It did. So far no complaints from the buildfarm. Although I pushed the last
piece just now...
Besides editing the commit message, not a lot of changes between what I posted
last and what I pushed. A few typos and awkward sentences, code formatting,
etc. I did change the API of RelationAddBlocks() to set *did_unlock = false,
if it didn't unlock (and thus removed setting it in the caller).
I might split the patch to use ExtendBufferedRelTo() into two, one for
vm_extend() and fsm_extend(), and one for xlogutils.c. The latter is more
complicated and has more of a complicated history (see [1]).
I've pushed the vm_extend() and fsm_extend() piece, and did split out the
xlogutils.c case.
Which I pushed, just now. I did perform manual testing with creating
disconnected segments on the standby, and checking that everything behaves
well in that case.
I think it might be worth having a C test for some of the bufmgr.c API. Things
like testing that retrying a failed relation extension works the second time
round.
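Not an existing test, just a sketch of roughly what a C-level exercise of the new API could look like; the helper name and harness are made up, and the interesting part, injecting an extension failure so that the retry path is actually hit, is left out here:
/* hypothetical test helper, sketch only */
static void
check_bulk_extend(Relation rel)
{
	Buffer		bufs[4];
	uint32		extended_by = 0;
	BlockNumber first;

	first = ExtendBufferedRelBy(EB_REL(rel), MAIN_FORKNUM, NULL,
								EB_LOCK_FIRST,
								lengthof(bufs), bufs, &extended_by);

	/* pages are consecutive, starting at the returned block number */
	Assert(extended_by > 0 && extended_by <= lengthof(bufs));
	Assert(first == BufferGetBlockNumber(bufs[0]));

	/* only the first returned buffer comes back locked */
	LockBuffer(bufs[0], BUFFER_LOCK_UNLOCK);

	for (uint32 i = 0; i < extended_by; i++)
	{
		Assert(BufferGetBlockNumber(bufs[i]) == first + i);
		ReleaseBuffer(bufs[i]);
	}
}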
Greetings,
Andres Freund
Hi,
On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
I think it might be worth having a C test for some of the bufmgr.c API. Things
like testing that retrying a failed relation extension works the second time
round.
A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
hopefully have been uncovered by such a test...
I guess we could even test this specific instance without a more complicated
framework. Create table with some data, rename the file, checkpoint - should
fail, rename back, checkpoint - should succeed.
It's much harder to exercise the error paths inside the backend extending the
relation unfortunately, because we require the file to be opened rw before
doing much. And once the FD is open, removing the permissions doesn't help.
The least complicated approach I can see is creating directory quotas, but
that's quite file system specific...
Greetings,
Andres Freund
Hi Andres,
07.04.2023 11:39, Andres Freund wrote:
Hi,
On 2023-04-06 18:15:14 -0700, Andres Freund wrote:
I think it might be worth having a C test for some of the bufmgr.c API. Things
like testing that retrying a failed relation extension works the second time
round.
A few hours after this I hit a stupid copy-pasto (21d7c05a5cf) that would
hopefully have been uncovered by such a test...
A few days later I've found a new defect introduced with 31966b151.
The following script:
echo "
CREATE TABLE tbl(id int);
INSERT INTO tbl(id) SELECT i FROM generate_series(1, 1000) i;
DELETE FROM tbl;
CHECKPOINT;
" | psql -q
sleep 2
grep -C2 'automatic vacuum of table ".*.tbl"' server.log
tf=$(psql -Aqt -c "SELECT format('%s/%s', pg_database.oid, relfilenode) FROM pg_database, pg_class
    WHERE datname = current_database() AND relname = 'tbl'")
ls -l "$PGDB/base/$tf"
pg_ctl -D $PGDB stop -m immediate
pg_ctl -D $PGDB -l server.log start
with the autovacuum enabled as follows:
autovacuum = on
log_autovacuum_min_duration = 0
autovacuum_naptime = 1
gives:
2023-04-11 20:56:56.261 MSK [675708] LOG: checkpoint starting: immediate force wait
2023-04-11 20:56:56.324 MSK [675708] LOG: checkpoint complete: wrote 900 buffers (5.5%); 0 WAL file(s) added, 0
removed, 0 recycled; write=0.016 s, sync=0.034 s, total=0.063 s; sync files=252, longest=0.017 s, average=0.001 s;
distance=4162 kB, estimate=4162 kB; lsn=0/1898588, redo lsn=0/1898550
2023-04-11 20:56:57.558 MSK [676060] LOG: automatic vacuum of table "testdb.public.tbl": index scans: 0
pages: 5 removed, 0 remain, 5 scanned (100.00% of total)
tuples: 1000 removed, 0 remain, 0 are dead but not yet removable
-rw------- 1 law law 0 апр 11 20:56 .../tmpdb/base/16384/16385
waiting for server to shut down.... done
server stopped
waiting for server to start.... stopped waiting
pg_ctl: could not start server
Examine the log output.
server stops with the following stack trace:
Core was generated by `postgres: startup recovering 000000010000000000000001 '.
Program terminated with signal SIGABRT, Aborted.
warning: Section `.reg-xstate/790626' in core file too small.
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
44 ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140209454906240) at ./nptl/pthread_kill.c:78
#2 __GI___pthread_kill (threadid=140209454906240, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3 0x00007f850ec53476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007f850ec397f3 in __GI_abort () at ./stdlib/abort.c:79
#5 0x0000557950889c0b in ExceptionalCondition (
conditionName=0x557950a67680 "mode == RBM_NORMAL || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_ON_ERROR",
fileName=0x557950a673e8 "bufmgr.c", lineNumber=1008) at assert.c:66
#6 0x000055795064f2d0 in ReadBuffer_common (smgr=0x557952739f38, relpersistence=112 'p', forkNum=MAIN_FORKNUM,
blockNum=4294967295, mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, hit=0x7fff22dd648f) at bufmgr.c:1008
#7 0x000055795064ebe7 in ReadBufferWithoutRelcache (rlocator=..., forkNum=MAIN_FORKNUM, blockNum=4294967295,
mode=RBM_ZERO_AND_CLEANUP_LOCK, strategy=0x0, permanent=true) at bufmgr.c:800
#8 0x000055795021c0fa in XLogReadBufferExtended (rlocator=..., forknum=MAIN_FORKNUM, blkno=0,
mode=RBM_ZERO_AND_CLEANUP_LOCK, recent_buffer=0) at xlogutils.c:536
#9 0x000055795021bd92 in XLogReadBufferForRedoExtended (record=0x5579526c4998, block_id=0 '\000', mode=RBM_NORMAL,
get_cleanup_lock=true, buf=0x7fff22dd6598) at xlogutils.c:391
#10 0x00005579501783b1 in heap_xlog_prune (record=0x5579526c4998) at heapam.c:8726
#11 0x000055795017b7db in heap2_redo (record=0x5579526c4998) at heapam.c:9960
#12 0x0000557950215b34 in ApplyWalRecord (xlogreader=0x5579526c4998, record=0x7f85053d0120, replayTLI=0x7fff22dd6720)
at xlogrecovery.c:1915
#13 0x0000557950215611 in PerformWalRecovery () at xlogrecovery.c:1746
#14 0x0000557950201ce3 in StartupXLOG () at xlog.c:5433
#15 0x00005579505cb6d2 in StartupProcessMain () at startup.c:267
#16 0x00005579505be9f7 in AuxiliaryProcessMain (auxtype=StartupProcess) at auxprocess.c:141
#17 0x00005579505ca2b5 in StartChildProcess (type=StartupProcess) at postmaster.c:5369
#18 0x00005579505c5224 in PostmasterMain (argc=3, argv=0x5579526c3e70) at postmaster.c:1455
#19 0x000055795047a97d in main (argc=3, argv=0x5579526c3e70) at main.c:200
As I can see, autovacuum removes pages from the table file, and this causes
the crash while replaying the record:
rmgr: Heap2 len (rec/tot): 60/ 988, tx: 0, lsn: 0/01898600, prev 0/01898588, desc: PRUNE
snapshotConflictHorizon 732 nredirected 0 ndead 226, blkref #0: rel 1663/16384/16385 blk 0 FPW
Best regards,
Alexander
Hi,
On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
A few days later I've found a new defect introduced with 31966b151.
That's the same issue that Tom also just reported, at
/messages/by-id/392271.1681238924@sss.pgh.pa.us
Attached is my WIP fix, including a test.
Greetings,
Andres Freund
Attachments:
v1-0001-WIP-Fix-ExtendBufferedRelTo-assert-failure-in-rec.patchtext/x-diff; charset=us-asciiDownload
From fbe84381fa39a48f906358ade0a64e93b81c05c9 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 11 Apr 2023 16:04:44 -0700
Subject: [PATCH v1] WIP: Fix ExtendBufferedRelTo() assert failure in recovery
+ tests
Reported-by: Alexander Lakhin <exclusion@gmail.com>
Reported-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/392271.1681238924%40sss.pgh.pa.us
Discussion: https://postgr.es/m/0b5eb82b-cb99-e0a4-b932-3dc60e2e3926@gmail.com
---
src/backend/storage/buffer/bufmgr.c | 4 +-
src/test/recovery/meson.build | 1 +
src/test/recovery/t/036_truncated_dropped.pl | 99 ++++++++++++++++++++
3 files changed, 102 insertions(+), 2 deletions(-)
create mode 100644 src/test/recovery/t/036_truncated_dropped.pl
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7778dde3e57..4f2c46a4339 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -889,7 +889,7 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb,
Assert(eb.smgr == NULL || eb.relpersistence != 0);
Assert(extend_to != InvalidBlockNumber && extend_to > 0);
Assert(mode == RBM_NORMAL || mode == RBM_ZERO_ON_ERROR ||
- mode == RBM_ZERO_AND_LOCK);
+ mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
if (eb.smgr == NULL)
{
@@ -933,7 +933,7 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb,
*/
current_size = smgrnblocks(eb.smgr, fork);
- if (mode == RBM_ZERO_AND_LOCK)
+ if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
flags |= EB_LOCK_TARGET;
while (current_size < extend_to)
diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index e834ad5e0dc..20089580100 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -41,6 +41,7 @@ tests += {
't/033_replay_tsp_drops.pl',
't/034_create_database.pl',
't/035_standby_logical_decoding.pl',
+ 't/036_truncated_dropped.pl',
],
},
}
diff --git a/src/test/recovery/t/036_truncated_dropped.pl b/src/test/recovery/t/036_truncated_dropped.pl
new file mode 100644
index 00000000000..571cff9639a
--- /dev/null
+++ b/src/test/recovery/t/036_truncated_dropped.pl
@@ -0,0 +1,99 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+# Tests recovery scenarios where the files are shorter than in the common
+# cases, e.g. due to replaying WAL records of a relation that was subsequently
+# truncated.
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('n1');
+
+$node->init();
+
+# Disable autovacuum to guarantee VACUUM can remove rows / truncate relations
+$node->append_conf('postgresql.conf', qq[
+wal_level = 'replica'
+autovacuum = off
+]);
+
+$node->start();
+
+
+# Test replay of PRUNE records for blocks that are later truncated. With FPIs
+# used for PRUNE.
+
+$node->safe_psql(
+ 'postgres', qq[
+CREATE TABLE truncme(i int) WITH (fillfactor = 50);
+INSERT INTO truncme SELECT generate_series(1, 1000);
+UPDATE truncme SET i = 1;
+CHECKPOINT; -- generate FPIs
+VACUUM truncme; -- generate prune records
+TRUNCATE truncme; -- make blocks non-existing
+INSERT INTO truncme SELECT generate_series(1, 10);
+]);
+
+$node->stop('immediate');
+
+ok($node->start(), 'replay of PRUNE records affecting truncated block (FPIs)');
+
+is($node->safe_psql('postgres', 'select count(*), sum(i) FROM truncme'),
+ '10|55',
+ 'table contents as expected after recovery');
+$node->safe_psql('postgres', 'DROP TABLE truncme');
+
+
+# Test replay of PRUNE records for blocks that are later truncated. Without
+# FPIs used for PRUNE.
+
+$node->safe_psql(
+ 'postgres', qq[
+CREATE TABLE truncme(i int) WITH (fillfactor = 50);
+INSERT INTO truncme SELECT generate_series(1, 1000);
+UPDATE truncme SET i = 1;
+VACUUM truncme; -- generate prune records
+TRUNCATE truncme; -- make blocks non-existing
+INSERT INTO truncme SELECT generate_series(1, 10);
+]);
+
+$node->stop('immediate');
+
+ok($node->start(), 'replay of PRUNE records affecting truncated block (no FPIs)');
+
+is($node->safe_psql('postgres', 'select count(*), sum(i) FROM truncme'),
+ '10|55',
+ 'table contents as expected after recovery');
+$node->safe_psql('postgres', 'DROP TABLE truncme');
+
+
+# Test replay of partial relation truncation via VACUUM
+
+$node->safe_psql(
+ 'postgres', qq[
+CREATE TABLE truncme(i int) WITH (fillfactor = 50);
+INSERT INTO truncme SELECT generate_series(1, 1000);
+UPDATE truncme SET i = i + 1;
+-- ensure a mix of pre/post truncation rows
+DELETE FROM truncme WHERE i > 500;
+
+VACUUM truncme; -- should truncate relation
+
+-- rows at TIDs that previously existed
+INSERT INTO truncme SELECT generate_series(1000, 1010);
+]);
+
+$node->stop('immediate');
+
+ok($node->start(), 'replay of partial truncation via VACUUM');
+
+is($node->safe_psql('postgres', 'select count(*), sum(i), min(i), max(i) FROM truncme'),
+ '510|136304|2|1010',
+ 'table contents as expected after recovery');
+$node->safe_psql('postgres', 'DROP TABLE truncme');
+
+
+done_testing();
--
2.38.0
12.04.2023 02:21, Andres Freund wrote:
Hi,
On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
A few days later I've found a new defect introduced with 31966b151.
That's the same issue that Tom also just reported, at
/messages/by-id/392271.1681238924@sss.pgh.pa.us
Attached is my WIP fix, including a test.
Thanks for the fix. I can confirm that the issue is gone.
ReadBuffer_common() contains an Assert(), that is similar to the fixed one,
but it looks unreachable for the WAL replay case after 26158b852.
Best regards,
Alexander
Hi,
On 2023-04-12 08:00:00 +0300, Alexander Lakhin wrote:
12.04.2023 02:21, Andres Freund wrote:
Hi,
On 2023-04-11 22:00:00 +0300, Alexander Lakhin wrote:
A few days later I've found a new defect introduced with 31966b151.
That's the same issue that Tom also just reported, at
/messages/by-id/392271.1681238924@sss.pgh.pa.us
Attached is my WIP fix, including a test.
Thanks for the fix. I can confirm that the issue is gone.
ReadBuffer_common() contains an Assert(), that is similar to the fixed one,
but it looks unreachable for the WAL replay case after 26158b852.
Good catch. I implemented it there too. As now all of the modes are supported,
I removed the assertion.
I also extended the test slightly to cover the case of dropped relations.
Greetings,
Andres Freund
Hi,
Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.
Thanks,
Muhammad
________________________________
From: Andres Freund <andres@anarazel.de>
Sent: Friday, October 28, 2022 7:54 PM
To: pgsql-hackers@postgresql.org <pgsql-hackers@postgresql.org>; Thomas Munro <thomas.munro@gmail.com>; Melanie Plageman <melanieplageman@gmail.com>
Cc: Yura Sokolov <y.sokolov@postgrespro.ru>; Robert Haas <robertmhaas@gmail.com>
Subject: refactoring relation extension and BufferAlloc(), faster COPY
Hi,
I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.
In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.
The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer, if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents. Call smgrextend().
Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:
1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS
1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.
My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.
To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffer are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
This is similar to Yuri's patch at [0]/messages/by-id/3b108afd19fa52ed20c464a69f64d545e4a14772.camel@postgrespro.ru, but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty gains, with no issues around the number of buffer mapping entries that
can exist etc.
Other interesting bits I found:
a) For workloads that [mostly] fit into s_b, the smgwrite() that BufferAlloc()
does, nearly doubles the amount of writes. First the kernel ends up writing
out all the zeroed out buffers after a while, then when we write out the
actual buffer contents.
The best fix for that seems to be to optionally use posix_fallocate() to
reserve space, without dirtying pages in the kernel page cache. However, it
looks like that's only beneficial when extending by multiple pages at once,
because it ends up causing one filesystem-journal entry for each extension
on at least some filesystems.
I added 'smgrzeroextend()' that can extend by multiple blocks, without the
caller providing a buffer to write out. When extending by 8 or more blocks,
posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
used to extend the file.
b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily
the aforementioned dirty buffers that our current extension method causes.
One bit that stumped me for quite a while was knowing how much to extend the
relation by. RelationGetBufferForTuple() bases the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.
After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().
For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.
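To make that concrete, here is a minimal, self-contained sketch of how the
number of additionally required pages could be derived from the batch's tuple
sizes. The constants and the helper name are illustrative stand-ins, not
heapam's actual code:

#include <stddef.h>

/*
 * Illustrative constants only; the real values come from PostgreSQL's
 * headers (BLCKSZ, SizeOfPageHeaderData, sizeof(ItemIdData), MAXALIGN).
 */
#define PAGE_SIZE           8192
#define PAGE_HEADER_SIZE    24
#define LINE_POINTER_SIZE   4
#define ALIGN8(len)         (((len) + 7) & ~((size_t) 7))

/*
 * Hypothetical helper: given the sizes of the tuples a heap_multi_insert()
 * call still has to place, and the free space left on the page currently
 * being filled, estimate how many additional pages are needed.  Assumes
 * each tuple fits on an empty page (true for heap tuples).
 */
static int
pages_needed_for_batch(const size_t *tuple_sizes, int ntuples,
                       size_t free_on_current_page)
{
    size_t  usable = PAGE_SIZE - PAGE_HEADER_SIZE;
    size_t  free_space = free_on_current_page;
    int     new_pages = 0;

    for (int i = 0; i < ntuples; i++)
    {
        size_t  need = ALIGN8(tuple_sizes[i]) + LINE_POINTER_SIZE;

        if (need > free_space)
        {
            new_pages++;        /* this tuple starts a fresh page */
            free_space = usable;
        }
        free_space -= need;
    }
    return new_pages;
}

Recomputing the estimate when an already partially filled page is used, as
described above, amounts to calling this again with an updated
free_on_current_page.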
c) Contention on the FSM and the pages returned by it is a serious bottleneck
after a) and b).
The biggest issue is that the current bulk insertion logic in hio.c enters
all but one of the new pages into the freespacemap. That immediately causes
all the other backends to contend on the first few pages returned by the
FSM, and causes contention on the FSM pages themselves.
I've partially addressed that by using the information about the required
number of pages from b). Whether we bulk insert or not, the pages we know
we're going to need for one heap_multi_insert() don't need to be added to
the FSM - we're going to use them anyway.
I've stashed the number of free blocks in the BulkInsertState for now, but
I'm not convinced that that's the right place.
If I revert just this part, the "concurrent COPY into unlogged table"
benchmark goes from ~240 tps to ~190 tps.
Even after that change the FSM is a major bottleneck. Below I included
benchmarks showing this by just removing the use of the FSM, but I haven't
done anything further about it. The contention seems to come both from
updating the FSM and from thundering-herd-like symptoms when accessing it.
The update part could likely be addressed, to some degree, with a batch
operation that updates the state for multiple pages at once (a hypothetical
interface is sketched below).
The access part could perhaps be addressed by adding an operation that gets
a page and immediately marks it as fully used, so other backends won't also
try to access it.
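Neither of these operations exists today; purely to illustrate the two ideas
above, hypothetical declarations could look like this (all names below are
made up):

#include <stddef.h>

typedef unsigned int BlockNumber;
typedef struct Rel Rel;         /* stand-in for Relation */

/*
 * Hypothetical: record the free space of many pages in a single call, so
 * that a bulk extension updates the FSM once instead of once per page.
 */
extern void fsm_record_free_space_batch(Rel *rel,
                                        const BlockNumber *blocks,
                                        const size_t *avail,
                                        int nblocks);

/*
 * Hypothetical: return a page with at least 'needed' bytes free and
 * atomically mark it as full in the FSM, so that concurrent backends are
 * not all directed to the same few blocks.
 */
extern BlockNumber fsm_claim_page_with_free_space(Rel *rel, size_t needed);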
d) doing
/* new buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
under the extension lock is surprisingly expensive on my two-socket
workstation (but much less noticeable on my laptop).
If I move the MemSet back under the extension lock, the "concurrent COPY
into unlogged table" benchmark goes from ~240 tps to ~200 tps.
e) When running a few benchmarks for this email, I noticed that there was a
sharp performance dropoff for the patched code for a pgbench -S -s100 on a
database with 1GB s_b, starting between 512 and 1024 clients. This started with
the patch only acquiring one buffer partition lock at a time. Lots of
debugging ensued, resulting in [3].
The problem isn't actually related to the change, it just makes it more
visible, because the "lock chains" between two partitions reduce the
average length of the wait queues substantially, by distributing them
across more partitions. [3] has a reproducer that's entirely independent
of this patchset.
Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend() and releases the extension lock. After that,
the buffer[s] are locked (depending on mode and an argument indicating the number
of blocks to be locked), and TerminateBufferIO() is called.
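In outline - with every name below a placeholder rather than the patch's real
API, since only the ordering relative to the extension lock matters here - the
flow is roughly:

typedef unsigned int BlockNumber;
typedef int Buffer;
typedef struct Rel Rel;         /* stand-in for Relation/SMgrRelation */

/* placeholder declarations for the bufmgr/smgr/lmgr steps named above */
extern Buffer acquire_pinned_clean_victim_buffer(void);
extern void lock_relation_for_extension(Rel *rel);
extern void unlock_relation_for_extension(Rel *rel);
extern BlockNumber relation_nblocks(Rel *rel);
extern void insert_buffer_mapping_entry(Rel *rel, BlockNumber blkno, Buffer buf);
extern void mark_io_in_progress(Buffer buf);
extern void extend_relation_on_disk(Rel *rel, BlockNumber first, int nblocks);
extern void lock_buffer_exclusive(Buffer buf);
extern void terminate_buffer_io(Buffer buf);

static BlockNumber
bulk_extend_sketch(Rel *rel, int nblocks, Buffer *buffers, int nlocked)
{
    BlockNumber first;

    /* 1. Acquire victim buffers; the extension lock is not held yet. */
    for (int i = 0; i < nblocks; i++)
        buffers[i] = acquire_pinned_clean_victim_buffer();

    /* 2. The short critical section under the extension lock. */
    lock_relation_for_extension(rel);
    first = relation_nblocks(rel);
    for (int i = 0; i < nblocks; i++)
    {
        insert_buffer_mapping_entry(rel, first + i, buffers[i]);
        mark_io_in_progress(buffers[i]);
    }
    extend_relation_on_disk(rel, first, nblocks);   /* smgrextend()/smgrzeroextend() */
    unlock_relation_for_extension(rel);

    /* 3. Lock as many buffers as the caller asked for, finish the IO. */
    for (int i = 0; i < nblocks; i++)
    {
        if (i < nlocked)
            lock_buffer_exclusive(buffers[i]);
        terminate_buffer_io(buffers[i]);
    }
    return first;
}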
This requires two new pieces of infrastructure:
First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows each
backend to pin a proportional share of buffers (although always allowing one,
as we do today).
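A minimal sketch of the proportional-share clamp (NBuffers and MaxBackends
are PostgreSQL's existing globals, declared here only to keep the sketch
self-contained; the body is illustrative, not the actual implementation):

#include <stdint.h>

extern int NBuffers;        /* size of shared_buffers, in buffers */
extern int MaxBackends;     /* maximum number of backends */

/*
 * Clamp the number of additional buffers this backend wants to pin to its
 * proportional share of shared_buffers, but always allow at least one,
 * matching today's behaviour of pinning a single victim buffer.
 */
static void
limit_additional_pins_sketch(uint32_t *additional_pins)
{
    uint32_t    max_pins = (uint32_t) (NBuffers / MaxBackends);

    if (max_pins < 1)
        max_pins = 1;
    if (*additional_pins > max_pins)
        *additional_pins = max_pins;
}

The real function presumably also needs to account for pins the backend
already holds; the sketch only shows the clamp itself.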
Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).
In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible, not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.
The new [Bulk]ExtendSharedRelationBuffered() currently have both a Relation
and a SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock). That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.
I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.
server "base" configuration:
max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB
benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to ringbuffer).
To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.
s_b=24GB
test: unlogged_small_files, format: text, files: 1024, 9015MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 58.63 207 50.22 242 54.35 224
2 32.67 372 25.82 472 27.30 446
4 22.53 540 13.30 916 14.33 851
8 15.14 804 7.43 1640 7.48 1632
16 14.69 829 4.79 2544 4.50 2718
32 15.28 797 4.41 2763 3.32 3710
64 15.34 794 5.22 2334 3.06 4061
128 15.49 786 4.97 2452 3.13 3926
256 15.85 768 5.02 2427 3.26 3769
512 16.02 760 5.29 2303 3.54 3471
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 68.18 178 59.41 205 63.43 192
2 39.71 306 33.10 368 34.99 348
4 27.26 446 19.75 617 20.09 607
8 18.84 646 12.86 947 12.68 962
16 15.96 763 9.62 1266 8.51 1436
32 15.43 789 8.20 1486 7.77 1579
64 16.11 756 8.91 1367 8.90 1383
128 16.41 742 10.00 1218 9.74 1269
256 17.33 702 11.91 1023 10.89 1136
512 18.46 659 14.07 866 11.82 1049
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 63.27 192 56.14 217 59.25 205
2 40.17 303 29.88 407 31.50 386
4 27.57 442 16.16 754 17.18 709
8 21.26 573 11.89 1025 11.09 1099
16 21.25 573 10.68 1141 10.22 1192
32 21.00 580 10.72 1136 10.35 1178
64 20.64 590 10.15 1200 9.76 1249
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 71.89 169 65.57 217 69.09 69.09
2 47.36 257 36.22 407 38.71 38.71
4 33.10 368 21.76 754 22.78 22.78
8 26.62 457 15.89 1025 15.30 15.30
16 24.89 489 17.08 1141 15.20 15.20
32 25.15 484 17.41 1136 16.14 16.14
64 26.11 466 17.89 1200 16.76 16.76
128 skipped
256 skipped
512 skipped
Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.
test: unlogged_small_files, format: binary, files: 1024, 9508MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 34.14 357 28.04 434 29.46 413
2 22.67 537 14.42 845 14.75 826
4 16.63 732 7.62 1599 7.69 1587
8 13.48 904 4.36 2795 4.13 2959
16 14.37 848 3.78 3224 2.74 4493
32 14.79 823 4.20 2902 2.07 5974
64 14.76 825 5.03 2423 2.21 5561
128 14.95 815 4.36 2796 2.30 5343
256 15.18 802 4.31 2828 2.49 4935
512 15.41 790 4.59 2656 2.84 4327
s_b=4GB
test: unlogged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.55 194 54.22 224
2 37.11 328 28.94 421
4 25.97 469 16.42 742
8 20.01 609 11.92 1022
16 19.55 623 11.05 1102
32 19.34 630 11.27 1081
64 19.07 639 12.04 1012
128 19.22 634 12.27 993
256 19.34 630 12.28 992
512 19.60 621 11.74 1038
test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.71 169 63.63 191
2 46.93 259 36.31 335
4 30.37 401 22.41 543
8 22.86 533 16.90 721
16 20.18 604 14.07 866
32 19.64 620 13.06 933
64 19.71 618 15.08 808
128 19.95 610 15.47 787
256 20.48 595 16.53 737
512 21.56 565 16.86 722
test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.65 194 55.74 218
2 40.25 302 29.45 413
4 27.37 445 16.26 749
8 22.07 552 11.75 1037
16 21.29 572 10.64 1145
32 20.98 580 10.70 1139
64 20.65 590 10.21 1193
128 skipped
256 skipped
512 skipped
test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.72 169 65.12 187
2 46.46 262 35.74 341
4 32.61 373 21.60 564
8 26.69 456 16.30 747
16 25.31 481 17.00 716
32 24.96 488 17.47 697
64 26.05 467 17.90 680
128 skipped
256 skipped
512 skipped
test: unlogged_small_files, format: binary, files: 1024, 9505MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 37.62 323 32.77 371
2 28.35 429 18.89 645
4 20.87 583 12.18 1000
8 19.37 629 10.38 1173
16 19.41 627 10.36 1176
32 18.62 654 11.04 1103
64 18.33 664 11.89 1024
128 18.41 661 11.91 1023
256 18.52 658 12.10 1007
512 18.78 648 11.49 1060
benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.
The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...
One would need to run pgbench for an impractically long time to make that
effect vanish.
My not-great solution for this was to run with -T21 -P5 and use the best 5s
as the tps.
s_b=1GB
tps tps
clients master patched
1 49541 48805
2 85342 90010
4 167340 168918
8 308194 303222
16 524294 523678
32 649516 649100
64 932547 937702
128 908249 906281
256 856496 903979
512 764254 934702
1024 653886 925113
2048 569695 917262
4096 526782 903258
s_b=128MB:
tps tps
clients master patched
1 40407 39854
2 73180 72252
4 143334 140860
8 240982 245331
16 429265 420810
32 544593 540127
64 706408 726678
128 713142 718087
256 611030 695582
512 552751 686290
1024 508248 666370
2048 474108 656735
4096 448582 633040
As there might be a small regression at the smallest end, I ran a more extreme
version of the above: a pipelined pgbench -S, with a single client, for longer,
with s_b=8MB.
To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.
master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4
Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.
My guess however is that it's the resource owner for in-progress IO that I
added - that adds an additional allocation inside the resowner machinery. I
commented those out (that's obviously incorrect!) just to see whether that
changes anything:
no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2
So it looks like it is indeed the resowner. I am a bit surprised, because we
already use that mechanism for pins, which obviously is more frequent.
I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:
- We could preallocate an initial element inside the ResourceArray struct, so
that a newly created resowner won't need to allocate immediately (a sketch of
this idea follows after this list)
- We could only use resowners if there's more than one IO in progress at the
same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin'
resowner entry - on 64-bit systems there's plenty of space for that. But on
32-bit systems...
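As a generic illustration of the first idea in the list above (plain C, not
PostgreSQL's actual ResourceArray), an inline first slot avoids calling the
allocator until a second element has to be remembered:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct SmallArray
{
    uintptr_t   inline_slot;    /* storage for the first element */
    uintptr_t  *items;          /* &inline_slot, or a heap allocation */
    int         nitems;
    int         capacity;
} SmallArray;

static void
small_array_init(SmallArray *a)
{
    a->items = &a->inline_slot;
    a->nitems = 0;
    a->capacity = 1;
}

static int
small_array_add(SmallArray *a, uintptr_t value)
{
    if (a->nitems == a->capacity)
    {
        int         newcap = a->capacity * 8;
        uintptr_t  *newitems = malloc(newcap * sizeof(uintptr_t));

        if (newitems == NULL)
            return -1;          /* PostgreSQL would ereport(ERROR) instead */
        memcpy(newitems, a->items, a->nitems * sizeof(uintptr_t));
        if (a->items != &a->inline_slot)
            free(a->items);
        a->items = newitems;
        a->capacity = newcap;
    }
    a->items[a->nitems++] = value;
    return 0;
}

Two caveats: the struct cannot be relocated while the inline slot is in use
(items points into the struct itself), and PostgreSQL's ResourceArray
additionally switches to a hashed representation once it grows large - the
sketch only shows the "first element is free" part.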
The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.
Greetings,
Andres Freund
[0]: /messages/by-id/3b108afd19fa52ed20c464a69f64d545e4a14772.camel@postgrespro.ru
[1]: COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2]: COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3]: /messages/by-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
Hi,
On 2023-05-03 18:25:51 +0000, Muhammad Malik wrote:
Could you please share repro steps for running these benchmarks? I am doing performance testing in this area and want to use the same benchmarks.
The email should contain all the necessary information. What are you missing?
I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads, more below.
server "base" configuration:
max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB
benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to ringbuffer).
I use a script like:
c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
Greetings,
Andres Freund
I use a script like:
c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
When I ran this script it did not insert anything into the copytest_0 table. It only generated a single copytest_data_text.copy file of size 9.236MB.
Please help me understand how this is 'pgbench running COPY into a single table'. Also, what are the 'seconds' and 'tbl-MBs' metrics that were reported.
Thank you,
Muhammad
Hi,
On 2023-05-03 19:29:46 +0000, Muhammad Malik wrote:
I use a script like:
c=16;psql -c 'DROP TABLE IF EXISTS copytest_0; CREATE TABLE copytest_0(data text not null);' && time /srv/dev/build/m-opt/src/bin/pgbench/pgbench -n -P1 -c$c -j$c -t$((1024/$c)) -f ~/tmp/copy.sql && psql -c 'TRUNCATE copytest_0'
[1] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2] COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
When I ran this script it did not insert anything into the copytest_0 table. It only generated a single copytest_data_text.copy file of size 9.236MB.
Please help me understand how this is 'pgbench running COPY into a single table'.
That's the data generation for the file to be COPYed in. The script passed to
pgbench is just something like
COPY copytest_0 FROM '/tmp/copytest_data_text.copy';
or
COPY copytest_0 FROM '/tmp/copytest_data_binary.copy';
Also what are the 'seconds' and 'tbl-MBs' metrics that were reported.
"seconds" is the total time for inserting the N files (1024 for the small
files, 64 for the larger ones). "tbl-MBs" is the size of the resulting table,
divided by that time, i.e. a measure of throughput.
Greetings,
Andres Freund