refactoring relation extension and BufferAlloc(), faster COPY

Started by Andres Freund · 57 messages · pgsql-hackers
#1 Andres Freund <andres@anarazel.de>

Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer; if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents and call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS

1) obviously has to happen with the relation extension lock held because
otherwise we might miss another relation extension. 2+3) need to happen with
the lock held, because otherwise another backend not doing an extension could
read the block before we're done extending, dirty it, write it out, and then
have it overwritten by the extending backend.
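In pseudocode (the names are illustrative, not necessarily what the patch uses), the minimal critical section would look roughly like:

```
buf = GetVictimBuffer();                   /* pin + clean, before taking the lock */
LockRelationForExtension(rel, ExclusiveLock);
block = smgrnblocks(smgr, MAIN_FORKNUM);   /* 1) find the extension point */
BufTableInsert(buf, rel, block);           /* 2) new buffer mapping entry */
StartBufferIO(buf);                        /* 3) IO_IN_PROGRESS, so readers wait */
smgrextend(smgr, MAIN_FORKNUM, block, ...);
UnlockRelationForExtension(rel, ExclusiveLock);
/* zeroing, PageInit(), TerminateBufferIO() all happen unlocked */
```

This is just a sketch of the direction, not the patch's actual code.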

The reason we currently do so much work while holding the relation extension
lock is that bufmgr.c does not know about the relation lock and that relation
extension happens entirely within ReadBuffer* - there's no way to use a
narrower scope for the lock.

My fix for that is to add a dedicated function for extending relations, that
can acquire the extension lock if necessary (callers can tell it to skip that,
e.g., when initially creating an init fork). This routine is called by
ReadBuffer_common() when P_NEW is passed in, to provide backward
compatibility.

To be able to acquire victim buffers outside of the extension lock, victim
buffers are now acquired separately from inserting the new buffer mapping
entry. Victim buffers are pinned, cleaned, removed from the buffer mapping
table and marked invalid. Because they are pinned, clock sweeps in other
backends won't return them. This is done in a new function,
[Local]BufferAlloc().
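In pseudocode (again, illustrative names), acquiring a victim buffer ahead of time amounts to:

```
buf = StrategyGetBuffer();      /* clock sweep / ring */
PinBuffer(buf);                 /* pinned buffers are skipped by other sweeps */
if (BufferIsDirty(buf))
    FlushBuffer(buf);           /* may first have to flush WAL up to the page LSN */
BufTableDelete(&buf->tag);      /* drop the old mapping entry */
ClearBufferTag(buf);            /* now invalid, ready to be re-used */
```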

This is similar to Yuri's patch at [0], but not that similar to earlier or
later approaches in that thread. I don't really understand why that thread
went on to ever more complicated approaches, when the basic approach shows
plenty of gains, with no issues around the number of buffer mapping entries
that can exist etc.

Other interesting bits I found:

a) For workloads that [mostly] fit into s_b, the smgrwrite() that BufferAlloc()
does nearly doubles the amount of writes. First the kernel ends up writing
out all the zeroed-out buffers after a while, then again when we write out
the actual buffer contents.

The best fix for that seems to be to optionally use posix_fallocate() to
reserve space, without dirtying pages in the kernel page cache. However, it
looks like that's only beneficial when extending by multiple pages at once,
because it ends up causing one filesystem-journal entry for each extension
on at least some filesystems.

I added 'smgrzeroextend()' that can extend by multiple blocks, without the
caller providing a buffer to write out. When extending by 8 or more blocks,
posix_fallocate() is used if available, otherwise pg_pwritev_with_retry() is
used to extend the file.

b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily
the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while was knowing how much to extend the
relation by. RelationGetBufferForTuple() drives the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().

For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.

c) Contention on the FSM and the pages returned by it is a serious bottleneck
after a) and b).

The biggest issue is that the current bulk insertion logic in hio.c enters
all but one of the new pages into the freespacemap. That will immediately
cause all the other backends to contend on the first few pages returned by
the FSM, and cause contention on the FSM pages themselves.

I've partially addressed that by using the information about the required
number of pages from b). Whether we bulk insert or not, the pages we know
we're going to need for one heap_multi_insert() don't need to be added to
the FSM - we're going to use them anyway.

I've stashed the number of free blocks in the BulkInsertState for now, but
I'm not convinced that that's the right place.

If I revert just this part, the "concurrent COPY into unlogged table"
benchmark goes from ~240 tps to ~190 tps.

Even after that change the FSM is a major bottleneck. Below I included
benchmarks showing this by just removing the use of the FSM, but I haven't
done anything further about it. The contention seems to be both from
updating the FSM, as well as thundering-herd like symptoms from accessing
the FSM.

The update part could likely be addressed to some degree with a batch
update operation updating the state for multiple pages.

The access part could perhaps be addressed by adding an operation that gets
a page and immediately marks it as fully used, so other backends won't also
try to access it.
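Hypothetically, those two FSM operations could look like this (pure pseudocode; neither function exists today):

```
/* batch update: touch each FSM page once for a whole bulk extension */
fsm_update_batch(rel, first_new_block, nblocks, initial_free_space);

/* search-and-claim: return a block and mark it fully used in one step,
 * so concurrent backends don't pile onto the same page */
block = fsm_search_and_claim(rel, space_needed);
```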

d) doing

       /* new buffers are zero-filled */
       MemSet((char *) bufBlock, 0, BLCKSZ);

under the extension lock is surprisingly expensive on my two socket
workstation (but much less noticeable on my laptop).

If I move the MemSet back under the extension lock, the "concurrent COPY
into unlogged table" benchmark goes from ~240 tps to ~200 tps.

e) When running a few benchmarks for this email, I noticed that there was a
sharp performance dropoff for the patched code for a pgbench -S -s100 on a
database with 1GB s_b, starting between 512 and 1024 clients. This started
with the patch only acquiring one buffer partition lock at a time. Lots of
debugging ensued, resulting in [3].

The problem isn't actually related to the change, it just makes it more
visible, because the "lock chains" between two partitions reduce the
average length of the wait queues substantially, by distributing them
between more partitions. [3] has a reproducer that's entirely independent
of this patchset.

Bulk extension acquires a number of victim buffers, acquires the extension
lock, inserts the buffers into the buffer mapping table and marks them as
io-in-progress, calls smgrextend and releases the extension lock. After that,
the buffer[s] are locked (depending on mode and an argument indicating the
number of blocks to be locked), and TerminateBufferIO() is called.

This requires two new pieces of infrastructure:

First, pinning multiple buffers opens up the obvious danger that we might run
out of non-pinned buffers. I added LimitAdditional[Local]Pins() that allows
each backend to pin a proportional share of buffers (although always allowing
one, as we do today).

Second, having multiple IOs in progress at the same time isn't possible with
the InProgressBuf mechanism. I added ResourceOwnerRememberBufferIO() etc. to
deal with that instead. I like that this ends up removing a lot of
AbortBufferIO() calls from the loops of various aux processes (now released
inside ReleaseAuxProcessResources()).

In very extreme workloads (single backend doing a pgbench -S -s 100 against a
s_b=64MB database) the memory allocations triggered by StartBufferIO() are
*just about* visible, not sure if that's worth worrying about - we do such
allocations for the much more common pinning of buffers as well.

The new [Bulk]ExtendSharedRelationBuffered() functions currently have both a
Relation and an SMgrRelation argument, requiring at least one of them to be set. The
reason for that is on the one hand that LockRelationForExtension() requires a
relation and on the other hand, redo routines typically don't have a Relation
around (recovery doesn't require an extension lock). That's not pretty, but
seems a tad better than the ReadBufferExtended() vs
ReadBufferWithoutRelcache() mess.

I've done a fair bit of benchmarking of this patchset. For COPY it comes out
ahead everywhere. It's possible that there's a very small regression for
extremely IO-miss-heavy workloads; more below.

server "base" configuration:

max_wal_size=150GB
shared_buffers=24GB
huge_pages=on
autovacuum=0
backend_flush_after=2MB
max_connections=5000
wal_buffers=128MB
wal_segment_size=1GB

benchmark: pgbench running COPY into a single table. pgbench -t is set
according to the client count, so that the same amount of data is inserted.
This is done both using small files ([1], ringbuffer not effective, no dirty
data to write out within the benchmark window) and a bit larger files ([2],
lots of data to write out due to ringbuffer).

To make it a fair comparison HEAD includes the lwlock-waitqueue fix as well.

s_b=24GB

test: unlogged_small_files, format: text, files: 1024, 9015MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 58.63 207 50.22 242 54.35 224
2 32.67 372 25.82 472 27.30 446
4 22.53 540 13.30 916 14.33 851
8 15.14 804 7.43 1640 7.48 1632
16 14.69 829 4.79 2544 4.50 2718
32 15.28 797 4.41 2763 3.32 3710
64 15.34 794 5.22 2334 3.06 4061
128 15.49 786 4.97 2452 3.13 3926
256 15.85 768 5.02 2427 3.26 3769
512 16.02 760 5.29 2303 3.54 3471

test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 68.18 178 59.41 205 63.43 192
2 39.71 306 33.10 368 34.99 348
4 27.26 446 19.75 617 20.09 607
8 18.84 646 12.86 947 12.68 962
16 15.96 763 9.62 1266 8.51 1436
32 15.43 789 8.20 1486 7.77 1579
64 16.11 756 8.91 1367 8.90 1383
128 16.41 742 10.00 1218 9.74 1269
256 17.33 702 11.91 1023 10.89 1136
512 18.46 659 14.07 866 11.82 1049

test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 63.27 192 56.14 217 59.25 205
2 40.17 303 29.88 407 31.50 386
4 27.57 442 16.16 754 17.18 709
8 21.26 573 11.89 1025 11.09 1099
16 21.25 573 10.68 1141 10.22 1192
32 21.00 580 10.72 1136 10.35 1178
64 20.64 590 10.15 1200 9.76 1249
128 skipped
256 skipped
512 skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 71.89 169 65.57 185 69.09 176
2 47.36 257 36.22 335 38.71 314
4 33.10 368 21.76 558 22.78 533
8 26.62 457 15.89 764 15.30 793
16 24.89 489 17.08 711 15.20 799
32 25.15 484 17.41 697 16.14 752
64 26.11 466 17.89 679 16.76 724
128 skipped
256 skipped
512 skipped

Just to see how far it can be pushed, with binary format we can now get to
nearly 6GB/s into a table when disabling the FSM - note the 2x difference
between patch and patch+no-fsm at 32 clients.

test: unlogged_small_files, format: binary, files: 1024, 9508MB total
seconds tbl-MBs seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch no_fsm no_fsm
1 34.14 357 28.04 434 29.46 413
2 22.67 537 14.42 845 14.75 826
4 16.63 732 7.62 1599 7.69 1587
8 13.48 904 4.36 2795 4.13 2959
16 14.37 848 3.78 3224 2.74 4493
32 14.79 823 4.20 2902 2.07 5974
64 14.76 825 5.03 2423 2.21 5561
128 14.95 815 4.36 2796 2.30 5343
256 15.18 802 4.31 2828 2.49 4935
512 15.41 790 4.59 2656 2.84 4327

s_b=4GB

test: unlogged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.55 194 54.22 224
2 37.11 328 28.94 421
4 25.97 469 16.42 742
8 20.01 609 11.92 1022
16 19.55 623 11.05 1102
32 19.34 630 11.27 1081
64 19.07 639 12.04 1012
128 19.22 634 12.27 993
256 19.34 630 12.28 992
512 19.60 621 11.74 1038

test: logged_small_files, format: text, files: 1024, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.71 169 63.63 191
2 46.93 259 36.31 335
4 30.37 401 22.41 543
8 22.86 533 16.90 721
16 20.18 604 14.07 866
32 19.64 620 13.06 933
64 19.71 618 15.08 808
128 19.95 610 15.47 787
256 20.48 595 16.53 737
512 21.56 565 16.86 722

test: unlogged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 62.65 194 55.74 218
2 40.25 302 29.45 413
4 27.37 445 16.26 749
8 22.07 552 11.75 1037
16 21.29 572 10.64 1145
32 20.98 580 10.70 1139
64 20.65 590 10.21 1193
128 skipped
256 skipped
512 skipped

test: logged_medium_files, format: text, files: 64, 9018MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 71.72 169 65.12 187
2 46.46 262 35.74 341
4 32.61 373 21.60 564
8 26.69 456 16.30 747
16 25.31 481 17.00 716
32 24.96 488 17.47 697
64 26.05 467 17.90 680
128 skipped
256 skipped
512 skipped

test: unlogged_small_files, format: binary, files: 1024, 9505MB total
seconds tbl-MBs seconds tbl-MBs
clients HEAD HEAD patch patch
1 37.62 323 32.77 371
2 28.35 429 18.89 645
4 20.87 583 12.18 1000
8 19.37 629 10.38 1173
16 19.41 627 10.36 1176
32 18.62 654 11.04 1103
64 18.33 664 11.89 1024
128 18.41 661 11.91 1023
256 18.52 658 12.10 1007
512 18.78 648 11.49 1060

benchmark: Run a pgbench -S workload with scale 100, so it doesn't fit into
s_b, thereby exercising BufferAlloc()'s buffer replacement path heavily.

The run-to-run variance on my workstation is high for this workload (both
before/after my changes). I also found that the ramp-up time at higher client
counts is very significant:
progress: 2.1 s, 5816.8 tps, lat 1.835 ms stddev 4.450, 0 failed
progress: 3.0 s, 666729.4 tps, lat 5.755 ms stddev 16.753, 0 failed
progress: 4.0 s, 899260.1 tps, lat 3.619 ms stddev 41.108, 0 failed
...

One would need to run pgbench for impractically long to make that effect
vanish.

My not-great solution for this was to run with -T21 -P5 and use the best 5s
as the tps.

s_b=1GB
tps tps
clients master patched
1 49541 48805
2 85342 90010
4 167340 168918
8 308194 303222
16 524294 523678
32 649516 649100
64 932547 937702
128 908249 906281
256 856496 903979
512 764254 934702
1024 653886 925113
2048 569695 917262
4096 526782 903258

s_b=128MB:
tps tps
clients master patched
1 40407 39854
2 73180 72252
4 143334 140860
8 240982 245331
16 429265 420810
32 544593 540127
64 706408 726678
128 713142 718087
256 611030 695582
512 552751 686290
1024 508248 666370
2048 474108 656735
4096 448582 633040

As there might be a small regression at the smallest end, I ran a more extreme
version of the above: a pipelined pgbench -S with a single client, for longer,
with s_b=8MB.

To further reduce noise I pinned the server to one cpu, the client to another
and disabled turbo mode on the CPU.

master "total" tps: 61.52
master "best 5s" tps: 61.8
patch "total" tps: 61.20
patch "best 5s" tps: 61.4

Hardly conclusive, but it does look like there's a small effect. It could be
code layout or such.

My guess however is that it's the resource owner for in-progress IO that I
added - that adds an additional allocation inside the resowner machinery. I
commented those out (that's obviously incorrect!) just to see whether that
changes anything:

no-resowner "total" tps: 62.03
no-resowner "best 5s" tps: 62.2

So it looks like indeed, it's the resowner. I am a bit surprised, because
we already use that mechanism for pins, which obviously are more
frequent.

I'm not sure it's worth worrying about - this is a pretty absurd workload. But
if we decide it is, I can think of a few ways to address this. E.g.:

- We could preallocate an initial element inside the ResourceArray struct, so
that a newly created resowner won't need to allocate immediately
- We could only use resowners if there's more than one IO in progress at the
same time - but I don't like that idea much
- We could try to store the "in-progress"-ness of a buffer inside the 'bufferpin'
resowner entry - on 64-bit systems there's plenty of space for that. But on
32-bit systems...

The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

Greetings,

Andres Freund

[0]: /messages/by-id/3b108afd19fa52ed20c464a69f64d545e4a14772.camel@postgrespro.ru
[1]: COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[2]: COPY (SELECT repeat(random()::text, 5) FROM generate_series(1, 6*100000)) TO '/tmp/copytest_data_text.copy' WITH (FORMAT text);
[3]: /messages/by-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de

Attachments:

v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch (+17 -16)
v1-0002-aio-Add-some-error-checking-around-pinning.patch (+30 -14)
v1-0003-hio-Release-extension-lock-before-initializing-pa.patch (+7 -8)
v1-0004-Add-smgrzeroextend-FileZero-FileFallocate.patch (+236 -1)
v1-0005-bufmgr-Add-Pin-UnpinLocalBuffer.patch (+46 -49)
v1-0006-bufmgr-Acquire-and-clean-victim-buffer-separately.patch (+381 -305)
v1-0007-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch (+103 -57)
v1-0008-bufmgr-Move-relation-extension-handling-into-Bulk.patch (+546 -170)
v1-0009-Convert-a-few-places-to-ExtendRelationBuffered.patch (+70 -95)
v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch (+76 -7)
v1-0011-hio-Use-BulkExtendRelationBuffered.patch (+170 -12)
v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch (+52 -1)
#2 Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
In reply to: Andres Freund (#1)
Re: refactoring relation extension and BufferAlloc(), faster COPY

On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

Thanks a lot for this great work. There are 12 patches in this thread;
I believe each of these patches is trying to solve a separate problem
and can be reviewed and committed separately, am I correct?

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

Yes, I too have observed this in the past for parallel inserts in CTAS
work - /messages/by-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk+KD2H7ZmRg@mail.gmail.com.
Tackling bulk relation extension problems will unblock the parallel
inserts (in CTAS, COPY) work I believe.

The fundamental issue, in my opinion, is that we do *way* too much while
holding the relation extension lock. We acquire a victim buffer, if that
buffer is dirty, we potentially flush the WAL, then write out that
buffer. Then we zero out the buffer contents. Call smgrextend().

Most of that work does not actually need to happen while holding the relation
extension lock. As far as I can tell, the minimum that needs to be covered by
the extension lock is the following:

1) call smgrnblocks()
2) insert buffer[s] into the buffer mapping table at the location returned by
smgrnblocks
3) mark buffer[s] as IO_IN_PROGRESS

Makes sense.

I will try to understand and review each patch separately.

Firstly, 0001 avoids an extra loop over waiters and looks like a reasonable
change; some comments on the patch:

1)
+    int            lwWaiting;        /* 0 if not waiting, 1 if on waitlist,
+                                     * 2 if waiting to be woken */
Use macros instead of hard-coded values for better readability?

#define PROC_LW_LOCK_NOT_WAITING 0
#define PROC_LW_LOCK_ON_WAITLIST 1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

2) Missing initialization of lwWaiting to 0 or the macro in twophase.c
and proc.c.
proc->lwWaiting = false;
MyProc->lwWaiting = false;

3)
+        proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+        found = true;
I guess 'found' is a bit meaningless here as we are doing away with
the proclist_foreach_modify loop. We can directly use
MyProc->lwWaiting == 1 and remove 'found'.

4)
if (!MyProc->lwWaiting)
if (!proc->lwWaiting)
Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

5) Is there any specific test case where I can see the benefit of this
patch? If yes, can you please share it here?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#3 Andres Freund <andres@anarazel.de>
In reply to: Bharath Rupireddy (#2)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2022-10-29 18:33:53 +0530, Bharath Rupireddy wrote:

On Sat, Oct 29, 2022 at 8:24 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

I'm working to extract independently useful bits from my AIO work, to reduce
the size of that patchset. This is one of those pieces.

Thanks a lot for this great work. There are 12 patches in this thread;
I believe each of these patches is trying to solve a separate problem
and can be reviewed and committed separately, am I correct?

Mostly, yes.

For 0001 I already started
/messages/by-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de
to discuss the specific issue.

We don't strictly need v1-0002-aio-Add-some-error-checking-around-pinning.patch
but I did find it useful.

v1-0012-bufmgr-debug-Add-PrintBuffer-Desc.patch is not used in the patch
series, but I found it quite useful when debugging issues with the patch. A
heck of a lot easier to interpret page flags when they can be printed.

I also think there's some architectural questions that'll influence the number
of patches. E.g. I'm not convinced
v1-0010-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch is quite the
right spot to track which additional pages should be used. It could very well
instead be alongside ->smgr_targblock. Possibly the best path would instead
be to return the additional pages explicitly to callers of
RelationGetBufferForTuple, but RelationGetBufferForTuple does a bunch of work
around pinning that potentially would need to be repeated in heap_multi_insert().

In workloads that extend relations a lot, we end up being extremely contended
on the relation extension lock. We've attempted to address that to some degree
by using batching, which helps, but only so much.

Yes, I too have observed this in the past for parallel inserts in CTAS
work - /messages/by-id/CALj2ACW9BUoFqWkmTSeHjFD-W7_00s3orqSvtvUk+KD2H7ZmRg@mail.gmail.com.
Tackling bulk relation extension problems will unblock the parallel
inserts (in CTAS, COPY) work I believe.

Yea. There's a lot of places the current approach ended up being a bottleneck.

Firstly, 0001 avoids an extra loop over waiters and looks like a reasonable
change; some comments on the patch:

1)
+    int            lwWaiting;        /* 0 if not waiting, 1 if on waitlist,
+                                     * 2 if waiting to be woken */
Use macros instead of hard-coded values for better readability?

#define PROC_LW_LOCK_NOT_WAITING 0
#define PROC_LW_LOCK_ON_WAITLIST 1
#define PROC_LW_LOCK_WAITING_TO_BE_WOKEN 2

Yea - this was really more of a prototype patch - I noted that we'd want to
use defines for this in
/messages/by-id/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de

3)
+        proclist_delete(&lock->waiters, MyProc->pgprocno, lwWaitLink);
+        found = true;
I guess 'found' is a bit meaningless here as we are doing away with
the proclist_foreach_modify loop. We can directly use
MyProc->lwWaiting == 1 and remove 'found'.

We can rename it, but I think we still do need it; it's easier to analyze the
logic if the relevant check happens on a value read while we held the wait
list lock. Probably should do the reset inside the locked section as well.

4)
if (!MyProc->lwWaiting)
if (!proc->lwWaiting)
Can we modify the above conditions in lwlock.c to MyProc->lwWaiting !=
1 or PROC_LW_LOCK_ON_WAITLIST or the macro?

I think it's better to check it's 0, rather than just != 1.

5) Is there any specific test case where I can see the benefit of this
patch? If yes, can you please share it here?

Yep, see the other thread, there's a pretty easy case there. You can also see
it at extreme client counts with a pgbench -S against a cluster with a smaller
shared_buffers. But the difference is not huge before something like 2048-4096
clients, and then it only occurs occasionally (because you need to end up with
most connections waiting for one of the partitions). So the test case from the
other thread is a lot better.

Greetings,

Andres Freund

#4 vignesh C <vignesh21@gmail.com>
In reply to: Andres Freund (#1)
Re: refactoring relation extension and BufferAlloc(), faster COPY

On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:

The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
f2857af485a00ab5dbfa2c83af9d83afe4378239 ===
=== applying patch
./v1-0001-wip-lwlock-fix-quadratic-behaviour-with-very-long.patch
patching file src/include/storage/proc.h
Hunk #1 FAILED at 217.
1 out of 1 hunk FAILED -- saving rejects to file src/include/storage/proc.h.rej
patching file src/backend/storage/lmgr/lwlock.c
Hunk #1 succeeded at 988 with fuzz 2 (offset 1 line).
Hunk #2 FAILED at 1047.
Hunk #3 FAILED at 1076.
Hunk #4 FAILED at 1104.
Hunk #5 FAILED at 1117.
Hunk #6 FAILED at 1141.
Hunk #7 FAILED at 1775.
Hunk #8 FAILED at 1790.
7 out of 8 hunks FAILED -- saving rejects to file
src/backend/storage/lmgr/lwlock.c.rej

[1]: http://cfbot.cputube.org/patch_41_3993.log

Regards,
Vignesh

#5 Andres Freund <andres@anarazel.de>
In reply to: vignesh C (#4)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2023-01-06 11:52:04 +0530, vignesh C wrote:

On Sat, 29 Oct 2022 at 08:24, Andres Freund <andres@anarazel.de> wrote:

The patches here aren't fully polished (as will be evident). But they should
be more than good enough to discuss whether this is a sane direction.

The patch does not apply on top of HEAD as in [1], please post a rebased
patch.

Thanks for letting me know. Updated version attached.

Greetings,

Andres Freund

Attachments:

v2-0001-aio-Add-some-error-checking-around-pinning.patch (+30 -14)
v2-0002-hio-Release-extension-lock-before-initializing-pa.patch (+7 -8)
v2-0003-Add-smgrzeroextend-FileZero-FileFallocate.patch (+236 -1)
v2-0004-bufmgr-Add-Pin-UnpinLocalBuffer.patch (+46 -49)
v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch (+381 -305)
v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch (+105 -57)
v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch (+546 -170)
v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch (+70 -95)
v2-0009-heapam-Add-num_pages-to-RelationGetBufferForTuple.patch (+76 -7)
v2-0010-hio-Use-BulkExtendRelationBuffered.patch (+170 -12)
v2-0011-bufmgr-debug-Add-PrintBuffer-Desc.patch (+52 -1)
#6 David Rowley <dgrowleyml@gmail.com>
In reply to: Andres Freund (#5)
Re: refactoring relation extension and BufferAlloc(), faster COPY

On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:

Thanks for letting me know. Updated version attached.

I'm not too sure I've qualified for giving a meaningful design review
here, but I have started looking at the patches and so far only made
it as far as 0006.

I noted down the following while reading:

v2-0001:

1. BufferCheckOneLocalPin needs a header comment

v2-0002:

2. The following comment and corresponding code to release the
extension lock has been moved now.

/*
* Release the file-extension lock; it's now OK for someone else to extend
* the relation some more.
*/

I think it's worth detailing out why it's fine to release the
extension lock in the new location. You've added detail to the commit
message but I think you need to do the same in the comments too.

v2-0003

3. FileFallocate() and FileZero() should likely document what they
return, i.e zero on success and non-zero on failure.

4. I'm not quite clear on why you've modified FileGetRawDesc() to call
FileAccess() twice.

v2-0004:

5. Is it worth having two versions of PinLocalBuffer(), one that adjusts
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?

v2-0005
v2-0006

David

#7 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
In reply to: David Rowley (#6)
Re: refactoring relation extension and BufferAlloc(), faster COPY

I'll continue reviewing this, but here's some feedback on the first two
patches:

v2-0001-aio-Add-some-error-checking-around-pinning.patch:

I wonder if the extra assertion in LockBufHdr() is worth the overhead.
It won't add anything without assertions, of course, but still. No
objections if you think it's worth it.

v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:

Looks as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
which zeroes the page, and then we call PageInit to zero the page again.
RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
previously, but with P_NEW, that is always true.

- Heikki

#8 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>
In reply to: Heikki Linnakangas (#7)
Re: refactoring relation extension and BufferAlloc(), faster COPY

v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch

This can be applied separately from the rest of the patches, which is
nice. Some small comments on it:

* Needs a rebase, it conflicted slightly with commit f30d62c2fc.

* GetVictimBuffer needs a comment to explain what it does. In
particular, mention that it returns a buffer that's pinned and known
!BM_TAG_VALID.

* I suggest renaming 'cur_buf' and other such local variables in
GetVictimBufffer to just 'buf'. 'cur' prefix suggests that there is some
other buffer involved too, but there is no 'prev' or 'next' or 'other'
buffer. The old code called it just 'buf' too, and before this patch it
actually was a bit confusing because there were two buffers involved.
But with this patch, GetVictimBuffer only deals with one buffer at a time.

* This FIXME:

/* OK, do the I/O */
/* FIXME: These used the wrong smgr before afaict? */
{
SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
InvalidBackendId);

TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
buf_hdr->tag.blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);

FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
LWLockRelease(content_lock);

ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf_hdr->tag);

TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
buf_hdr->tag.blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
}

I believe that was intentional. The probes previously reported the block
and relation whose read *caused* the eviction. It was not just the smgr
but also the blockNum and forkNum that referred to the block that was
being read. There's another pair of probe points,
TRACE_POSTGRESQL_BUFFER_FLUSH_START/DONE, inside FlushBuffer that
indicate the page that is being flushed.

I see that reporting the evicted page is more convenient with this
patch, otherwise you'd need to pass the smgr and blocknum of the page
that's being read to InvalidateVictimBuffer(). IMHO you can just remove
these probe points. We don't need to bend over backwards to maintain
specific probe points.

* InvalidateVictimBuffer reads the buffer header with an atomic read op,
just to check if BM_TAG_VALID is set. If it's not, it does nothing
(except for a few Asserts). But the caller has already read the buffer
header. Consider refactoring it so that the caller checks BM_TAG_VALID,
and only calls InvalidateVictimBuffer if it's set, saving one atomic
read in InvalidateVictimBuffer. I think it would be just as readable, so
no loss there. I doubt the atomic read makes any measurable performance
difference, but it looks redundant.

* I don't understand this comment:

/*
* Clear out the buffer's tag and flags and usagecount. We must do
* this to ensure that linear scans of the buffer array don't think
* the buffer is valid.
*
* XXX: This is a pre-existing comment I just moved, but isn't it
* entirely bogus with regard to the tag? We can't do anything with
* the buffer without taking BM_VALID / BM_TAG_VALID into
* account. Likely doesn't matter because we're already dirtying the
* cacheline, but still.
*
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the
cacheline have to do with the correctness here?

* pgindent

- Heikki

#9Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#8)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:

v2-0005-bufmgr-Acquire-and-clean-victim-buffer-separately.patch

This can be applied separately from the rest of the patches, which is nice.
Some small comments on it:

Thanks for looking at these!

* Needs a rebase, it conflicted slightly with commit f30d62c2fc.

Will work on that.

* GetVictimBuffer needs a comment to explain what it does. In particular,
mention that it returns a buffer that's pinned and known !BM_TAG_VALID.

Will add.

* I suggest renaming 'cur_buf' and other such local variables in
GetVictimBuffer to just 'buf'. 'cur' prefix suggests that there is some
other buffer involved too, but there is no 'prev' or 'next' or 'other'
buffer. The old code called it just 'buf' too, and before this patch it
actually was a bit confusing because there were two buffers involved. But
with this patch, GetVictimBuffer only deals with one buffer at a time.

Hm. Yea. I probably ended up with these names because initially
GetVictimBuffer() wasn't a separate function, and I indeed constantly got
confused by which buffer was referenced.

* This FIXME:

/* OK, do the I/O */
/* FIXME: These used the wrong smgr before afaict? */
{
SMgrRelation smgr = smgropen(BufTagGetRelFileLocator(&buf_hdr->tag),
InvalidBackendId);

TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(buf_hdr->tag.forkNum,
buf_hdr->tag.blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);

FlushBuffer(buf_hdr, smgr, IOOBJECT_RELATION, io_context);
LWLockRelease(content_lock);

ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf_hdr->tag);

TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(buf_hdr->tag.forkNum,
buf_hdr->tag.blockNum,
smgr->smgr_rlocator.locator.spcOid,
smgr->smgr_rlocator.locator.dbOid,
smgr->smgr_rlocator.locator.relNumber);
}

I believe that was intentional. The probes previously reported the block and
relation whose read *caused* the eviction. It was not just the smgr but also
the blockNum and forkNum that referred to the block that was being read.

You're probably right. It's certainly not understandable from our docs
though:

<row>
<entry><literal>buffer-write-dirty-start</literal></entry>
<entry><literal>(ForkNumber, BlockNumber, Oid, Oid, Oid)</literal></entry>
<entry>Probe that fires when a server process begins to write a dirty
buffer. (If this happens often, it implies that
<xref linkend="guc-shared-buffers"/> is too
small or the background writer control parameters need adjustment.)
arg0 and arg1 contain the fork and block numbers of the page.
arg2, arg3, and arg4 contain the tablespace, database, and relation OIDs
identifying the relation.</entry>
</row>

I see that reporting the evicted page is more convenient with this patch,
otherwise you'd need to pass the smgr and blocknum of the page that's being
read to InvalidateVictimBuffer(). IMHO you can just remove these probe
points. We don't need to bend over backwards to maintain specific probe
points.

Agreed.

* InvalidateVictimBuffer reads the buffer header with an atomic read op,
just to check if BM_TAG_VALID is set.

It's not a real atomic op, in the sense of being a special instruction. It does
force the compiler to actually read from memory, but that's it.

But you're right, even that is unnecessary. I think it ended up that way
because I also wanted the full buf_hdr, and it seemed somewhat error prone to
pass in both.

* I don't understand this comment:

/*
* Clear out the buffer's tag and flags and usagecount. We must do
* this to ensure that linear scans of the buffer array don't think
* the buffer is valid.
*
* XXX: This is a pre-existing comment I just moved, but isn't it
* entirely bogus with regard to the tag? We can't do anything with
* the buffer without taking BM_VALID / BM_TAG_VALID into
* account. Likely doesn't matter because we're already dirtying the
* cacheline, but still.
*
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the
cacheline have to do with the correctness here?

There's nothing wrong with clearing out the tag, but I don't think it's a hard
requirement today, and certainly not for the reason stated above.

Validity of the buffer isn't determined by the tag, it's determined by
BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).

Without either having pinned the buffer, or holding the buffer header
spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
knows that, and re-checks the tag after locking the buffer header
spinlock.

Afaict, there'd be no correctness issue with removing the
ClearBufferTag(). There would be an efficiency issue though, because when
encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
which'd find that BM_[TAG_]VALID isn't set, and not do anything.

Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further. Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.

Greetings,

Andres Freund

#10Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#9)
Re: refactoring relation extension and BufferAlloc(), faster COPY

On 2023-02-11 13:36:51 -0800, Andres Freund wrote:

Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further. Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.

There's a small regression for a single relation, but after that it's a clear
benefit.

32GB shared buffers, empty. The test creates N new relations and then rolls
back.

num relations    HEAD (tps)    precheck (tps)
            1         46.11             45.22
            2         43.24             44.87
            4         35.14             44.20
            8         28.72             42.79

I don't understand the regression at 1, TBH. I think it must be a random code
layout issue, because the same pre-check in DropRelationBuffers() (exercised
via TRUNCATE of a newly created relation), shows a tiny speedup.

#11Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#7)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2023-02-10 18:38:50 +0200, Heikki Linnakangas wrote:

I'll continue reviewing this, but here's some feedback on the first two
patches:

v2-0001-aio-Add-some-error-checking-around-pinning.patch:

I wonder if the extra assertion in LockBufHdr() is worth the overhead. It
won't add anything without assertions, of course, but still. No objections
if you think it's worth it.

It's so easy to get confused about local/non-local buffers that I think it is
useful. I think we really need to consider cleaning up the separation
further. Having half the code for local buffers in bufmgr.c and the other half
in localbuf.c, without any scheme that I can recognize, is not good.

It bothers me somewhat that ConditionalLockBufferForCleanup() silently accepts
multiple pins by the current backend. That's the right thing for
e.g. heap_page_prune_opt(), but for something like lazy_scan_heap() it's
not. And yes, I did encounter a bug hidden by that when making vacuumlazy use
AIO as part of that patchset. That's why I made BufferCheckOneLocalPin()
externally visible.

v2-0002-hio-Release-extension-lock-before-initializing-pa.patch:

Looks good as far as it goes. It's a bit silly that we use RBM_ZERO_AND_LOCK,
which zeroes the page, and then we call PageInit to zero the page again.
RBM_ZERO_AND_LOCK only zeroes the page if it wasn't in the buffer cache
previously, but with P_NEW, that is always true.

It is quite silly, and it shows up noticeably in profiles. The zeroing is
definitely needed in other places calling PageInit(), though. I suspect we
should have a PageInitZeroed() or such, that asserts the page is zero, but
otherwise skips it.

Seems independent enough from this series, that I'd probably tackle it
separately? If you prefer, I'm ok with adding a patch to this series instead,
though.

Greetings,

Andres Freund

#12Andres Freund
andres@anarazel.de
In reply to: David Rowley (#6)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2023-01-20 13:40:55 +1300, David Rowley wrote:

On Tue, 10 Jan 2023 at 15:08, Andres Freund <andres@anarazel.de> wrote:

Thanks for letting me now. Updated version attached.

I'm not too sure I'm qualified to give a meaningful design review
here, but I have started looking at the patches and so far only made
it as far as 0006.

Thanks!

I noted down the following while reading:

v2-0001:

1. BufferCheckOneLocalPin needs a header comment

v2-0002:

2. The following comment and corresponding code to release the
extension lock has been moved now.

/*
* Release the file-extension lock; it's now OK for someone else to extend
* the relation some more.
*/

I think it's worth detailing out why it's fine to release the
extension lock in the new location. You've added detail to the commit
message but I think you need to do the same in the comments too.

Will do.

v2-0003

3. FileFallocate() and FileZero() should likely document what they
return, i.e. zero on success and non-zero on failure.

I guess I just tried to fit in with the rest of the file :)

4. I'm not quite clear on why you've modified FileGetRawDesc() to call
FileAccess() twice.

I do not have the faintest idea what happened there... Will fix.

v2-0004:

5. Is it worth having two versions of PinLocalBuffer() one to adjust
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?

I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.

Thanks,

Andres Freund

#13Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#12)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2023-02-11 14:25:06 -0800, Andres Freund wrote:

On 2023-01-20 13:40:55 +1300, David Rowley wrote:

v2-0004:

5. Is it worth having two versions of PinLocalBuffer() one to adjust
the usage count and one that does not? Couldn't the version that does
not adjust the count skip doing pg_atomic_read_u32()?

I think it'd be nicer to just move the read inside the if
(adjust_usagecount). That way the rest of the function doesn't have to be
duplicated.

Ah, no, we need it for the return value. No current users of
PinLocalBuffer(adjust_usagecount = false)
need the return value, but I don't think that will necessarily remain the case.

I'm somewhat inclined to not duplicate it, but if you think it's worth it,
I'll do that.

Greetings,

Andres Freund

#14Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#9)
Re: refactoring relation extension and BufferAlloc(), faster COPY

On 11/02/2023 23:36, Andres Freund wrote:

On 2023-02-11 23:03:56 +0200, Heikki Linnakangas wrote:

* I don't understand this comment:

/*
* Clear out the buffer's tag and flags and usagecount. We must do
* this to ensure that linear scans of the buffer array don't think
* the buffer is valid.
*
* XXX: This is a pre-existing comment I just moved, but isn't it
* entirely bogus with regard to the tag? We can't do anything with
* the buffer without taking BM_VALID / BM_TAG_VALID into
* account. Likely doesn't matter because we're already dirtying the
* cacheline, but still.
*
*/
ClearBufferTag(&buf_hdr->tag);
buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
UnlockBufHdr(buf_hdr, buf_state);

What exactly is wrong with clearing the tag? What does dirtying the
cacheline have to do with the correctness here?

There's nothing wrong with clearing out the tag, but I don't think it's a hard
requirement today, and certainly not for the reason stated above.

Validity of the buffer isn't determined by the tag, it's determined by
BM_VALID (or, if you interpret valid more widely, BM_TAG_VALID).

Without either having pinned the buffer, or holding the buffer header
spinlock, the tag can change at any time. And code like DropDatabaseBuffers()
knows that, and re-checks the tag after locking the buffer header
spinlock.

Afaict, there'd be no correctness issue with removing the
ClearBufferTag(). There would be an efficiency issue though, because when
encountering an invalid buffer, we'd unnecessarily enter InvalidateBuffer(),
which'd find that BM_[TAG_]VALID isn't set, and not do anything.

Okay, gotcha.

Even though it's not a correctness issue, it seems to me that
DropRelationsAllBuffers() etc ought to check if the buffer is BM_TAG_VALID,
before doing anything further. Particularly in DropRelationsAllBuffers(), the
check we do for each buffer isn't cheap. Doing it for buffers that don't even
have a tag seems .. not smart.

Depends on what percentage of buffers are valid, I guess. If all buffers
are valid, checking BM_TAG_VALID first would lose. In practice, I doubt
it makes any measurable difference either way.

Since we're micro-optimizing, I noticed that
BufTagMatchesRelFileLocator() compares the fields in order "spcOid,
dbOid, relNumber". Before commit 82ac34db20, we used
RelFileLocatorEqual(), which has this comment:

/*
 * Note: RelFileLocatorEquals and RelFileLocatorBackendEquals compare
 * relNumber first since that is most likely to be different in two unequal
 * RelFileLocators. It is probably redundant to compare spcOid if the other
 * fields are found equal, but do it anyway to be sure. Likewise for checking
 * the backend ID in RelFileLocatorBackendEquals.
 */

So we lost that micro-optimization. Should we reorder the checks in
BufTagMatchesRelFileLocator()?

- Heikki

#15Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#5)
Re: refactoring relation extension and BufferAlloc(), faster COPY

v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch

This looks straightforward. My only concern is that it changes the order
that things happen at abort. Currently, AbortBufferIO() is called very
early in AbortTransaction(), and this patch moves it much later. I don't
see any immediate problems from that, but it feels scary.

@@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();

CheckForBufferLeaks();

Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
elog(FATAL)? Do we need to worry about that?

- Heikki

#16Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#5)
Re: refactoring relation extension and BufferAlloc(), faster COPY

v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch

+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+								 SMgrRelation smgr,
+								 bool skip_extension_lock,
+								 char relpersistence,
+								 ForkNumber fork, ReadBufferMode mode,
+								 BufferAccessStrategy strategy,
+								 uint32 *num_pages,
+								 uint32 num_locked_pages,
+								 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.

v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index de1427a1e0e..1810f7ebfef 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
-	meta = ReadBuffer(index, P_NEW);
+	meta = ExtendRelationBuffered(index, NULL, true,
+								  index->rd_rel->relpersistence,
+								  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+								  NULL);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
-	LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);

brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);

Since we're changing the API anyway, how about introducing a new
function for this case where we extend the relation but we know what block
number we're going to get? This pattern of using P_NEW and asserting the
result has always felt awkward to me.

-		buf = ReadBuffer(irel, P_NEW);
+		buf = ExtendRelationBuffered(irel, NULL, false,
+									 irel->rd_rel->relpersistence,
+									 MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+									 NULL);

These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW).
I'd suggest something like:

buf = ExtendBuffer(rel);

Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
ExtendRelationBuffered?

Is it ever possible to call this without a relcache entry? WAL redo
functions do that with ReadBuffer, but they only extend a relation
implicitly, by replaying a record for a particular block.

All of the above comments are around the BulkExtendRelationBuffered()
function's API. That needs a closer look and a more thought-out design
to make it nice. Aside from that, this approach seems valid.

- Heikki

#17Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Heikki Linnakangas (#16)
Re: refactoring relation extension and BufferAlloc(), faster COPY

On 2023-Feb-21, Heikki Linnakangas wrote:

+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+								 SMgrRelation smgr,
+								 bool skip_extension_lock,
+								 char relpersistence,
+								 ForkNumber fork, ReadBufferMode mode,
+								 BufferAccessStrategy strategy,
+								 uint32 *num_pages,
+								 uint32 num_locked_pages,
+								 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.

Yeah, I noticed this too. I think it would be easy enough to add a new
struct that can be passed as a pointer, which can be stack-allocated
by the caller, and which holds the input arguments that are common to
both functions, as is sensible.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Update: super-fast reaction on the Postgres bugs mailing list. The report
was acknowledged [...], and a fix is under discussion.
The wonders of open-source !"
https://twitter.com/gunnarmorling/status/1596080409259003906

#18Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#15)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2023-02-21 17:40:31 +0200, Heikki Linnakangas wrote:

v2-0006-bufmgr-Support-multiple-in-progress-IOs-by-using-.patch

This looks straightforward. My only concern is that it changes the order
that things happen at abort. Currently, AbortBufferIO() is called very early
in AbortTransaction(), and this patch moves it much later. I don't see any
immediate problems from that, but it feels scary.

Yea, it does feel a bit awkward. But I suspect it's actually the right
thing. We've not even adjusted the transaction state at the point we're
calling AbortBufferIO(). And AbortBufferIO() will sometimes allocate memory
for a WARNING, which conceivably could fail - although I don't think that's a
particularly realistic scenario due to TransactionAbortContext (I guess you
could have a large error context stack or such).

Medium term I think we need to move a lot more of the error handling into
resowners. Having a dozen+ places with their own choreographed sigsetjmp()
recovery blocks is error prone as hell. Not to mention tedious.

@@ -2689,7 +2685,6 @@ InitBufferPoolAccess(void)
static void
AtProcExit_Buffers(int code, Datum arg)
{
- AbortBufferIO();
UnlockBuffers();
CheckForBufferLeaks();

Hmm, do we call AbortTransaction() and ResourceOwnerRelease() on
elog(FATAL)? Do we need to worry about that?

We have before_shmem_exit() callbacks that should protect against
that. InitPostgres() registers ShutdownPostgres(), and
CreateAuxProcessResourceOwner() registers
ReleaseAuxProcessResourcesCallback().

I think we'd already be in trouble if we didn't reliably end up doing resowner
cleanup during process exit.

Perhaps ResourceOwnerCreate()/ResourceOwnerDelete() should maintain a list of
"active" resource owners and have a before-exit callback that ensures the list
is empty and PANICs if not? Better a crash restart than hanging because we
didn't release some shared resource.

Greetings,

Andres Freund

#19Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#16)
Re: refactoring relation extension and BufferAlloc(), faster COPY

Hi,

On 2023-02-21 18:18:02 +0200, Heikki Linnakangas wrote:

v2-0007-bufmgr-Move-relation-extension-handling-into-Bulk.patch

+static BlockNumber
+BulkExtendSharedRelationBuffered(Relation rel,
+								 SMgrRelation smgr,
+								 bool skip_extension_lock,
+								 char relpersistence,
+								 ForkNumber fork, ReadBufferMode mode,
+								 BufferAccessStrategy strategy,
+								 uint32 *num_pages,
+								 uint32 num_locked_pages,
+								 Buffer *buffers)

Ugh, that's a lot of arguments, some are inputs and some are outputs. I
don't have any concrete suggestions, but could we simplify this somehow?
Needs a comment at least.

Yea. I think this is the part of the patchset I like the least.

The ugliest bit is accepting both rel and smgr. The background to that is that
we need the relation oid to acquire the extension lock. But during crash
recovery we don't have that - which is fine, because we don't need the
extension lock.

We could have two different functions, but that ends up a mess as well, as
we've seen in other cases.

v2-0008-Convert-a-few-places-to-ExtendRelationBuffered.patch

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index de1427a1e0e..1810f7ebfef 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -829,9 +829,11 @@ brinbuild(Relation heap, Relation index, IndexInfo *indexInfo)
* whole relation will be rolled back.
*/
-	meta = ReadBuffer(index, P_NEW);
+	meta = ExtendRelationBuffered(index, NULL, true,
+								  index->rd_rel->relpersistence,
+								  MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+								  NULL);
Assert(BufferGetBlockNumber(meta) == BRIN_METAPAGE_BLKNO);
-	LockBuffer(meta, BUFFER_LOCK_EXCLUSIVE);
brin_metapage_init(BufferGetPage(meta), BrinGetPagesPerRange(index),
BRIN_CURRENT_VERSION);

Since we're changing the API anyway, how about introducing a new function
for this case where we extend the relation but we know what block number we're
going to get? This pattern of using P_NEW and asserting the result has
always felt awkward to me.

To me it always felt like a code smell that some code insists on getting
specific block numbers with P_NEW. I guess it's ok for things like
building a new index, but outside of that it feels wrong.

The first case I found just now is revmap_physical_extend(). Which seems to
extend the relation while holding an lwlock. Ugh.

Maybe ExtendRelationBufferedTo() or something like that? With a big comment
saying that users of it are likely bad ;)

-		buf = ReadBuffer(irel, P_NEW);
+		buf = ExtendRelationBuffered(irel, NULL, false,
+									 irel->rd_rel->relpersistence,
+									 MAIN_FORKNUM, RBM_ZERO_AND_LOCK,
+									 NULL);

These new calls are pretty verbose, compared to ReadBuffer(rel, P_NEW). I'd
suggest something like:

I guess. Not sure if it's worth optimizing for brevity all that much here -
there's not that many places extending relations. Several places end up with
less code, actually, because they don't need to care about the extension lock
themselves anymore. I think an ExtendBuffer() that doesn't mention the fork,
etc, ends up being more confusing than helpful.

buf = ExtendBuffer(rel);

Without the relation in the name it just seems confusing to me - the extension
isn't "restricted" to shared_buffers. ReadBuffer() isn't great as a name
either, but it makes a bit more sense at least, it reads into a buffer. And
it's a vastly more frequent operation, so optimizing for density is worth it.

Do other ReadBufferModes than RBM_ZERO_AND_LOCK make sense with
ExtendRelationBuffered?

Hm. That's a good point. Probably not. Perhaps it could be useful to support
RBM_NORMAL as well? But even then, it'd just be a lock release away if we always
used RBM_ZERO_AND_LOCK.

Is it ever possible to call this without a relcache entry? WAL redo
functions do that with ReadBuffer, but they only extend a relation
implicitly, by replaying a record for a particular block.

I think we should use it for crash recovery as well, but the patch doesn't
yet. We have some gnarly code there, see the loop using P_NEW in
XLogReadBufferExtended(). Extending the file one-by-one is a lot more
expensive than doing it in bulk.

All of the above comments are around the BulkExtendRelationBuffered()
function's API. That needs a closer look and a more thought-out design to
make it nice. Aside from that, this approach seems valid.

Thanks for looking! I agree that it can stand a fair bit of polishing...

Greetings,

Andres Freund

#20Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#1)
Re: refactoring relation extension and BufferAlloc(), faster COPY

On 10/28/22 9:54 PM, Andres Freund wrote:

b) I found that it is quite beneficial to bulk-extend the relation with
smgrextend() even without concurrency. The reason for that is primarily
the aforementioned dirty buffers that our current extension method causes.

One bit that stumped me for quite a while is to know how much to extend the
relation by. RelationGetBufferForTuple() drives the decision whether / how
much to bulk extend purely on the contention on the extension lock, which
obviously does not work for non-concurrent workloads.

After quite a while I figured out that we actually have good information on
how much to extend by, at least for COPY /
heap_multi_insert(). heap_multi_insert() can compute how much space is
needed to store all tuples, and pass that on to
RelationGetBufferForTuple().

For that to be accurate we need to recompute that number whenever we use an
already partially filled page. That's not great, but doesn't appear to be a
measurable overhead.

Some food for thought: I think it's also completely fine to extend any
relation over a certain size by multiple blocks, regardless of
concurrency. E.g. 10 extra blocks on an 80MB relation is 0.1%. I don't
have a good feel for what algorithm would make sense here; maybe
something along the lines of extend = min(relpages / 2048, 128); if
extend < 8 extend = 1; (presumably extending by just a couple extra
pages doesn't help much without concurrency).

#21Andres Freund
andres@anarazel.de
In reply to: Jim Nasby (#20)
#22Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#21)
#23Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#19)
#24Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#23)
#25Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#19)
#26Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Alvaro Herrera (#17)
#27Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#17)
#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#27)
#30Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#29)
#31Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#30)
#32Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Andres Freund (#28)
#33Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#32)
#34Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
#35Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#33)
#36Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#35)
#37Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#35)
#38Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#36)
#39Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#37)
#40Melanie Plageman
melanieplageman@gmail.com
In reply to: Andres Freund (#38)
#41Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#40)
#42Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#41)
#43John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#42)
#44Andres Freund
andres@anarazel.de
In reply to: John Naylor (#43)
#45Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#42)
#46John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#44)
#47Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#45)
#48Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#47)
#49Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#48)
#50Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#49)
#51Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#50)
#52Alexander Lakhin
exclusion@gmail.com
In reply to: Andres Freund (#51)
#53Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#52)
#54Muhammad Malik
muhammad.malik1@hotmail.com
In reply to: Andres Freund (#1)
#55Andres Freund
andres@anarazel.de
In reply to: Muhammad Malik (#54)
#56Muhammad Malik
muhammad.malik1@hotmail.com
In reply to: Andres Freund (#55)
#57Andres Freund
andres@anarazel.de
In reply to: Muhammad Malik (#56)