strange perf regression with data checksums
Hi,
While running some benchmarks comparing 17 and 18, I ran into a simple
workload where 18 throughput drops by ~80%. After pulling my hair for a
couple hours I realized the change that triggered this is 04bec894a04c,
which set checksums on by default. Which is very bizarre, because the
workload is read-only and fits into shared buffers.
I've only observed this on large machines with 96+ cores, on azure, both
with Intel (Xeon 8370C) and AMD (EPYC 9V33X). I've not been successful
in reproducing it on the puny machines I have at home.
Let me demonstrate the issue.
1) Create a cluster, with an increased connection limit:
pg_ctl -D data init
echo 'max_connections = 1000' >> data/postgresql.conf
pg_ctl -D data -l pg.log start
Now the benchmark itself - it's fairly trivial, regular pgbench on scale
1, with an extra index on "pgbench_accounts.bid" column:
createdb test
pgbench -i -s 1 test
psql test -c "create index on pgbench_accounts (bid)"
and a script, select.sql, with a simple query using the index:
select count(*) from pgbench_accounts where bid = 0
Cool, now let's get some numbers for 32-160 clients:
for c in 32 64 96 128 160; do
  pgbench -n -f select.sql -T 10 -M prepared -c $c -j $c test | grep 'tps';
done;
Which produces this:
tps = 752464.727387 (without initial connection time)
tps = 1062150.015297 (without initial connection time)
tps = 572205.386159 (without initial connection time)
tps = 568579.663980 (without initial connection time)
tps = 561360.493639 (without initial connection time)
Clearly, at 96 clients the throughput just tanks. Now let's disable
checksums on the cluster:
pg_ctl -D data -l pg.log stop
pg_checksums --disable data
pg_ctl -D data -l pg.log start
and run the script again
tps = 753484.468817 (without initial connection time)
tps = 1083752.631856 (without initial connection time)
tps = 1862008.466802 (without initial connection time)
tps = 1826484.489433 (without initial connection time)
tps = 1818400.279069 (without initial connection time)
Clearly, the throughput does not drop, and it's ~3.5x higher. This is
from the Xeon machine, but I've seen the same thing on the EPYC boxes.
The impact varies, but in general it's 70-80%.
I'm not suggesting this is caused by 04bec894a04c, or even specific to
PG 18. I see the same issue on 17, except that 17 does not enable
checksums by default. For example, on EPYC 9V74, PG 17 does this:
32 762187.724859
64 1284731.279766
96 2978828.264373
128 2991561.835178
160 2971689.161136
and with checksums
32 874266.549870
64 1286382.426281
96 569647.384735
128 562128.010301
160 561826.908181
So, similar regression ...
I find this quite bizarre / puzzling, because this is a read-only
workload, with absolutely no writes, and tiny data set (~15MB), i.e.
everything fits into shared buffers. Why would that be affected by
checksums at all?
I spent some time profiling this, without much success. This is what I
get from perf top:
Samples: 6M of event 'cycles:P', 4000 Hz, Event count (approx.):
5270683795302 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
50.94% postgres [.] pg_atomic_read_u32_impl
17.32% postgres [.] pg_atomic_compare_exchange_u32_impl
10.17% postgres [.] spin_delay
5.83% postgres [.] pg_atomic_fetch_or_u32_impl
1.64% postgres [.] pg_atomic_compare_exchange_u32_impl
1.20% postgres [.] BufferDescriptorGetBuffer
0.92% postgres [.] perform_spin_delay
and the report with backtraces says most of the time is spent here:
--97.00%--btgettuple
|
--96.98%--_bt_first
|
|--48.82%--_bt_readfirstpage
| |
| |--44.57%--_bt_steppage
| | |
| | --44.45%--ReleaseBuffer
| | |
| | --44.43%--UnpinBuffer
| | |
| | ...
| |...
|
--48.11%--_bt_search
|
--47.89%--_bt_relandgetbuf
The atomic ops come from pinning/unpinning buffers. I realize it's
possible it gets much more expensive under concurrency (the clients
simply have to compete when updating the same counter, and with enough
clients there'll be more conflicts and retries). Kinda unfortunate, and
maybe we should do something about it, not sure.
But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?
regards
--
Tomas Vondra
Hi Tomas,
While running some benchmarks comparing 17 and 18, I ran into a simple
workload where 18 throughput drops by ~80%. After pulling my hair for a
couple hours I realized the change that triggered this is 04bec894a04c,
which set checksums on by default. Which is very bizarre, because the
workload is read-only and fits into shared buffers.[...]
But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?
These are interesting results.
Just wanted to clarify: did you make sure that all the hint bits were
set before executing the benchmark?
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
--
Best regards,
Aleksander Alekseev
On 5/9/25 14:53, Aleksander Alekseev wrote:
Hi Tomas,
While running some benchmarks comparing 17 and 18, I ran into a simple
workload where 18 throughput drops by ~80%. After pulling my hair for a
couple hours I realized the change that triggered this is 04bec894a04c,
which set checksums on by default. Which is very bizarre, because the
workload is read-only and fits into shared buffers.[...]
But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?
These are interesting results.
Just wanted to clarify: did you make sure that all the hint bits were
set before executing the benchmark?
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
regards
--
Tomas Vondra
Hi,
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
There is one more thing I would check. As I recall perf shows only
on-CPU time while actually the backends may be sleeping on the locks
most of the time. If this is the case perf will not show you the
accurate picture.
In order to check this personally I create gdb.script with a single GDB command:
```
bt
```
And execute:
```
gdb --batch --command=gdb.script -p (backend_pid_here)
```
... 10+ times or so. If what you are observing is actually a lock
contention and the backend sleeps on a lock most of the time, 8/10 or
so stacktraces will show you this.
I assume of course that the benchmark is done on release builds with
disabled Asserts, etc.
BTW, do you believe this is a problem related exclusively to NUMA CPUs
with 90+ cores, or can I reproduce it on SMT as well?
--
Best regards,
Aleksander Alekseev
On 5/9/25 15:19, Aleksander Alekseev wrote:
Hi,
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
There is one more thing I would check. As I recall perf shows only
on-CPU time while actually the backends may be sleeping on the locks
most of the time. If this is the case perf will not show you the
accurate picture.
Right, perf only shows on-CPU time (at least by default). But the
backends really are consuming 100% CPU (or close to it; pgbench needs
some CPU too), and there are no lock waits that I can see.
I forgot to share flamegraphs I collected on the EPYC machine, for cases
with 96 clients. So here they are.
In order to check this personally I create gdb.script with a single GDB command:
```
bt
```
And execute:
```
gdb --batch --command=gdb.script -p (backend_pid_here)
```
... 10+ times or so. If what you are observing is actually a lock
contention and the backend sleeps on a lock most of the time, 8/10 or
so stacktraces will show you this.
I assume of course that the benchmark is done on release builds with
disabled Asserts, etc.
Yes, these are regular builds with just --enable-debug --enable-depend,
nothing else (and certainly no asserts).
BTW, do you believe this is a problem related exclusively to NUMA CPUs
with 90+ cores, or can I reproduce it on SMT as well?
No idea. I couldn't reproduce this on my Ryzen machine at all, but that
only has 12 cores. The Xeon (with 2x22 cores) seems to reproduce it, but
the difference is much smaller (1.2M vs. 1.5M tps), and pin/unpin take
only ~5% of CPU, not 50%. Assuming it's the same issue, of course.
regards
--
Tomas Vondra
Hi,
I looked at this again, and I think the reason is mostly obvious - both
why it's thrashing, and why it happens with checksums=on ...
The reason why it happens is that PinBuffer does this:
    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);

        buf_state = old_buf_state;

        ... modify state ...

        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           buf_state))
        {
            ...
            break;
        }
    }
So, we read the buffer state (which is where pins are tracked), possibly
waiting for it to get unlocked. Then we modify the state, and update it,
but only if it didn't change. If it did change, we retry.
Of course, as the number of sessions grows, the probability of something
updating the state in between increases. Another session might have
pinned the buffer, for example. This causes retries.
I added a couple counters to track how many loops are needed, and with
96 clients this needs about 800k retries per 100k calls, so about 8
retries per call. With 32 clients, this needs only about 25k retries, so
0.25 retry / call. That's a huge difference.
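To see the effect in isolation, here's a small standalone demo (plain C11
atomics and pthreads, not PostgreSQL code; all names are made up) where N
threads update a single shared state word with the same kind of CAS loop and
report the retries per successful update - the retry rate climbs steeply with
the thread count. Build with e.g. gcc -O2 -pthread and compare 32 vs. 96
threads:

```
/*
 * Standalone sketch, not PostgreSQL code: nthreads workers hammer a shared
 * "buffer state" word with a compare-and-swap loop and count how many
 * retries each successful update needs.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define CALLS_PER_THREAD 200000
#define MAX_THREADS 256

static atomic_uint buf_state;          /* stand-in for BufferDesc->state */
static atomic_ullong total_retries;

static void *
worker(void *arg)
{
    unsigned long long retries = 0;

    for (int i = 0; i < CALLS_PER_THREAD; i++)
    {
        unsigned    old = atomic_load(&buf_state);

        for (;;)
        {
            unsigned    newval = old + 1;   /* "modify state", e.g. bump a refcount */

            /* on failure, 'old' is reloaded with the current value */
            if (atomic_compare_exchange_strong(&buf_state, &old, newval))
                break;
            retries++;                      /* another thread got there first */
        }
    }
    atomic_fetch_add(&total_retries, retries);
    return NULL;
}

int
main(int argc, char **argv)
{
    int         nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t   threads[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    printf("%d threads: %.2f retries per call\n", nthreads,
           (double) total_retries / ((double) nthreads * CALLS_PER_THREAD));
    return 0;
}
```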
I believe enabling data checksums simply makes it more severe, because
BufferGetLSNAtomic() then has to obtain the buffer header lock, which uses
the same "state" field, with exactly the same retry logic. It can probably
happen even without checksums, but as the lock is exclusive, it also
"serializes" the access, making the conflicts more likely.
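For reference, LockBufHdr() is roughly this (paraphrased from bufmgr.c, so
treat it as a sketch rather than exact code) - it spins on the very same
state word, which is also why pg_atomic_fetch_or_u32_impl, spin_delay and
perform_spin_delay show up in the profile above:

```
uint32
LockBufHdr(BufferDesc *desc)
{
    SpinDelayStatus delayStatus;
    uint32      old_buf_state;

    init_local_spin_delay(&delayStatus);

    while (true)
    {
        /* set BM_LOCKED flag on the shared state word */
        old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
        /* if it wasn't set before, we now hold the header lock */
        if (!(old_buf_state & BM_LOCKED))
            break;
        perform_spin_delay(&delayStatus);
    }
    finish_spin_delay(&delayStatus);

    return old_buf_state | BM_LOCKED;
}
```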
BufferGetLSNAtomic does this:
bufHdr = GetBufferDescriptor(buffer - 1);
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
AFAICS the lock is needed simply to read a consistent value from the
page header, but maybe we could have an atomic variable with a copy of
the LSN in the buffer descriptor?
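Something along these lines, perhaps (a sketch only - the struct, field and
function names are hypothetical, and the write side that refreshes the copy
whenever the page LSN changes is not shown):

```
/*
 * Sketch only, names are hypothetical: keep a mirror of the page LSN in a
 * 64-bit atomic in the buffer descriptor, so readers can get it with a
 * single atomic read instead of LockBufHdr()/UnlockBufHdr().
 */
typedef struct BufferDescSketch
{
    pg_atomic_uint32 state;     /* existing state word (pins, flags, lock) */
    pg_atomic_uint64 lsn_copy;  /* hypothetical mirror of the page LSN */
} BufferDescSketch;

static inline XLogRecPtr
BufferGetLSNCopy(BufferDescSketch *bufHdr)
{
    /* one 64-bit atomic read, no buffer header lock */
    return (XLogRecPtr) pg_atomic_read_u64(&bufHdr->lsn_copy);
}
```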
regards
--
Tomas Vondra
Attachments:
On Fri, May 9, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote:
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.
In general, we call BufferGetLSNAtomic() to stash the page LSN within
so->currPos.lsn, for later. so->currPos.lsn provides us with a way to
detect whether the page was modified during the period in which we
dropped our pin on the leaf page -- plain index scans cannot set
LP_DEAD bits on dead index tuples within _bt_killitems() if the page
has changed. But, index-only scans never drop the pin on the leaf page
to begin with, and so don't even use so->currPos.lsn (bitmap index
scans *never* call _bt_killitems(), and so obviously have no possible
use for so->currPos.lsn, either).
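For reference, the consumer of so->currPos.lsn is _bt_killitems(), roughly
like this (a paraphrased sketch, not a verbatim copy of nbtutils.c):

```
Buffer      buf;
Page        page;

if (BTScanPosIsPinned(so->currPos))
{
    /* Pin held since _bt_readpage: no concurrent TID recycling possible */
    _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
    page = BufferGetPage(so->currPos.buf);
}
else
{
    /* Pin was dropped: re-read the page and check whether it changed */
    buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
    page = BufferGetPage(buf);

    if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
        so->currPos.buf = buf;
    else
    {
        /* Page changed while unpinned: setting LP_DEAD bits isn't safe */
        _bt_relbuf(scan->indexRelation, buf);
        return;
    }
}
/* ... proceed to set LP_DEAD bits on 'page' ... */
```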
--
Peter Geoghegan
On Mon, May 19, 2025 at 12:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.
Attached quick-and-dirty prototype patch shows how this could work. It
fully avoids calling BufferGetLSNAtomic() from _bt_readpage() during
index-only scans. I find that "meson test" passes with the patch
applied (I'm reasonably confident that this general approach is
correct).
Does this patch of mine actually fix the regressions that you're
concerned about?
--
Peter Geoghegan
Attachments:
v1-0001-Avoid-BufferGetLSNAtomic-calls-during-index-only-.patch
From dc023aa6ece08eba610e5b7a0a69f82c62113be0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 May 2025 13:05:28 -0400
Subject: [PATCH v1] Avoid BufferGetLSNAtomic() calls during index-only scans.
---
src/include/access/nbtree.h | 1 +
src/backend/access/nbtree/nbtpreprocesskeys.c | 8 +++++++
src/backend/access/nbtree/nbtsearch.c | 22 ++++++++++---------
src/backend/access/nbtree/nbtutils.c | 11 +++++-----
4 files changed, 26 insertions(+), 16 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588..933507ec3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1054,6 +1054,7 @@ typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
bool qual_ok; /* false if qual can never be satisfied */
+ bool drop_pin; /* true if it's safe to drop leaf page pins */
int numberOfKeys; /* number of preprocessed scan keys */
ScanKey keyData; /* array of preprocessed scan keys */
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..b38fe013a 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -205,6 +205,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* initialize result variables */
so->qual_ok = true;
+
+ /*
+ * Can only safely drop leaf page pins during plain index scans of logged
+ * relations (XXX What about bitmap index scans?)
+ */
+ so->drop_pin = (IsMVCCSnapshot(scan->xs_snapshot) &&
+ RelationNeedsWAL(scan->indexRelation) &&
+ !scan->xs_want_itup);
so->numberOfKeys = 0;
if (numberOfKeys < 1)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fe9a38869..7d6361949 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so,
+ BTScanPos sp);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -63,14 +64,12 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
*
* See nbtree/README section on making concurrent TID recycling safe.
*/
-static void
-_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+static inline void
+_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so, BTScanPos sp)
{
- _bt_unlockbuf(scan->indexRelation, sp->buf);
+ _bt_unlockbuf(rel, sp->buf);
- if (IsMVCCSnapshot(scan->xs_snapshot) &&
- RelationNeedsWAL(scan->indexRelation) &&
- !scan->xs_want_itup)
+ if (so->drop_pin)
{
ReleaseBuffer(sp->buf);
sp->buf = InvalidBuffer;
@@ -1627,7 +1626,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
}
/* initialize remaining currPos fields related to current page */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+ if (so->drop_pin)
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
so->currPos.dir = dir;
so->currPos.nextTupleOffset = 0;
/* either moreLeft or moreRight should be set now (may be unset later) */
@@ -2251,12 +2251,14 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
*/
if (_bt_readpage(scan, dir, offnum, true))
{
+ Relation rel = scan->indexRelation;
+
/*
* _bt_readpage succeeded. Drop the lock (and maybe the pin) on
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
@@ -2413,7 +2415,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7..7f72a1a77 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3364,7 +3364,6 @@ _bt_killitems(IndexScanDesc scan)
int i;
int numKilled = so->numKilled;
bool killedsomething = false;
- bool droppedpin PG_USED_FOR_ASSERTS_ONLY;
Assert(BTScanPosIsValid(so->currPos));
@@ -3374,7 +3373,7 @@ _bt_killitems(IndexScanDesc scan)
*/
so->numKilled = 0;
- if (BTScanPosIsPinned(so->currPos))
+ if (!so->drop_pin)
{
/*
* We have held the pin on this page since we read the index tuples,
@@ -3382,7 +3381,7 @@ _bt_killitems(IndexScanDesc scan)
* re-use of any TID on the page, so there is no need to check the
* LSN.
*/
- droppedpin = false;
+ Assert(BTScanPosIsPinned(so->currPos));
_bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
page = BufferGetPage(so->currPos.buf);
@@ -3391,8 +3390,8 @@ _bt_killitems(IndexScanDesc scan)
{
Buffer buf;
- droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
+ Assert(!BTScanPosIsPinned(so->currPos));
buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
@@ -3442,7 +3441,7 @@ _bt_killitems(IndexScanDesc scan)
* correctness.
*
* Note that the page may have been modified in almost any way
- * since we first read it (in the !droppedpin case), so it's
+ * since we first read it (in the !so->drop_pin case), so it's
* possible that this posting list tuple wasn't a posting list
* tuple when we first encountered its heap TIDs.
*/
@@ -3458,7 +3457,7 @@ _bt_killitems(IndexScanDesc scan)
* though only in the common case where the page can't
* have been concurrently modified
*/
- Assert(kitem->indexOffset == offnum || !droppedpin);
+ Assert(kitem->indexOffset == offnum || !so->drop_pin);
/*
* Read-ahead to later kitems here.
--
2.49.0
On 5/19/25 19:20, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 12:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.
Attached quick-and-dirty prototype patch shows how this could work. It
fully avoids calling BufferGetLSNAtomic() from _bt_readpage() during
index-only scans. I find that "meson test" passes with the patch
applied (I'm reasonably confident that this general approach is
correct).
Does this patch of mine actually fix the regressions that you're
concerned about?
For index-only scans, yes. The numbers I used to see were
64 clients: 1.2M tps
96 clients: 0.6M tps
and with the patch it's
64 clients: 1.2M tps
96 clients: 2.9M tps
I'm aware it more than doubles, even though the client count grew only 1.5x.
I believe this is due to a VM config issue I saw earlier, but it wasn't
fixed on this particular VM (unfortunate, but out of my control).
The regular index scans, however, still have this issue, although it's not
as visible as for IOS. If I disable IOS, the throughput is
64 clients: 0.7M tps
96 clients: 0.6M tps
And the profile looks like this:
Overhead Shared Object Symbol
28.85% postgres [.] PinBuffer
22.61% postgres [.] UnpinBufferNoOwner
9.67% postgres [.] BufferGetLSNAtomic
4.92% [kernel] [k] finish_task_switch.isra.0
3.78% [kernel] [k] _raw_spin_unlock_irqrestore
...
In an earlier message I mentioned that maybe we could add an atomic variable
tracking the page LSN, so that we don't have to obtain the buffer header
lock. I haven't had time to try that yet.
regards
--
Tomas Vondra
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
For index-only scans, yes.
Great.
The regular index scans, however, still have this issue, although it's not
as visible as for IOS.
We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.
We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface). In other words, we only
need to call BufferGetLSNAtomic for pages that _bt_readpage returns
"true" for when called. There are plenty of leaf pages that
_bt_readpage will return "false" for, especially during large range
scans, and skip scans.
It's easy to see why this extension to my v1 POC is correct: the whole
point of dropping the leaf page pin is that we don't block VACUUM when
btgettuple returns -- but btgettuple isn't going to return until the
next call to _bt_readpage that returns "true" actually takes place (or
until the whole scan ends).
--
Peter Geoghegan
On Mon, May 19, 2025 at 2:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
The regular index scans, however, still have this issue, although it's not
as visible as for IOS.
We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.
We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface).
Attached is v2, which does things this way. What do you think?
v2 also manages to avoid calling BufferGetLSNAtomic during all bitmap
index scans. You didn't complain about any regressions in bitmap index
scans, but I see no reason to take any chances (it's easy to just not
call BufferGetLSNAtomic there).
--
Peter Geoghegan
Attachments:
v2-0001-Avoid-BufferGetLSNAtomic-calls-during-nbtree-scan.patch
From 957019074b42bcf0332a353634e8d37b52bd974a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 May 2025 13:05:28 -0400
Subject: [PATCH v2] Avoid BufferGetLSNAtomic() calls during nbtree scans.
Avoid BufferGetLSNAtomic() calls during nbtree index-only scans, bitmap
index scans, and page reads of pages that won't need to return items to
the scan via the btgettuple interface.
---
src/include/access/nbtree.h | 3 +-
src/backend/access/nbtree/nbtpreprocesskeys.c | 13 +++++++
src/backend/access/nbtree/nbtree.c | 4 ++
src/backend/access/nbtree/nbtsearch.c | 39 ++++++++++++-------
src/backend/access/nbtree/nbtutils.c | 12 +++---
5 files changed, 50 insertions(+), 21 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588..1b0f143cf 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -967,7 +967,7 @@ typedef struct BTScanPosData
BlockNumber currPage; /* page referenced by items array */
BlockNumber prevPage; /* currPage's left link */
BlockNumber nextPage; /* currPage's right link */
- XLogRecPtr lsn; /* currPage's LSN */
+ XLogRecPtr lsn; /* currPage's LSN (iff so.drop_pin) */
/* scan direction for the saved position's call to _bt_readpage */
ScanDirection dir;
@@ -1054,6 +1054,7 @@ typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
bool qual_ok; /* false if qual can never be satisfied */
+ bool drop_pin; /* true if it's safe to drop leaf page pins */
int numberOfKeys; /* number of preprocessed scan keys */
ScanKey keyData; /* array of preprocessed scan keys */
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..ae48f49d7 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -205,6 +205,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* initialize result variables */
so->qual_ok = true;
+
+ /*
+ * We prefer to eagerly drop a leaf page pin to avoid blocking concurrent
+ * heap TID recycling by VACUUM. But we cannot safely drop leaf page pins
+ * during index-only scans, nor scans of logged relations, where checking
+ * if the page's LSN changed while the pin was dropped isn't sufficient.
+ * (Setting so->drop_pin=true doesn't meaningfully affect the behavior of
+ * bitmap index scans, so they always set it 'false' to avoid needlessly
+ * calling BufferGetLSNAtomic from _bt_readpage.)
+ */
+ so->drop_pin = (IsMVCCSnapshot(scan->xs_snapshot) &&
+ RelationNeedsWAL(scan->indexRelation) &&
+ !scan->xs_want_itup && scan->heapRelation);
so->numberOfKeys = 0;
if (numberOfKeys < 1)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 765659887..aad896327 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -228,6 +228,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
+ Assert(scan->heapRelation != NULL);
+
/* btree indexes are never lossy */
scan->xs_recheck = false;
@@ -289,6 +291,8 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
int64 ntids = 0;
ItemPointer heapTid;
+ Assert(scan->heapRelation == NULL);
+
/* Each loop iteration performs another primitive index scan */
do
{
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fe9a38869..851825a64 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so,
+ BTScanPos sp);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -63,18 +64,26 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
*
* See nbtree/README section on making concurrent TID recycling safe.
*/
-static void
-_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+static inline void
+_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so, BTScanPos sp)
{
- _bt_unlockbuf(scan->indexRelation, sp->buf);
-
- if (IsMVCCSnapshot(scan->xs_snapshot) &&
- RelationNeedsWAL(scan->indexRelation) &&
- !scan->xs_want_itup)
+ if (!so->drop_pin)
{
- ReleaseBuffer(sp->buf);
- sp->buf = InvalidBuffer;
+ /* Just drop the lock (not the pin) */
+ _bt_unlockbuf(rel, sp->buf);
+ return;
}
+
+ /*
+ * Drop the lock and the pin.
+ *
+ * Have to establish so->currPos.lsn to provide _bt_killitems with an
+ * alternative mechanism to ensure that no unsafe concurrent heap TID
+ * recycling has taken place.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+ _bt_relbuf(rel, so->currPos.buf);
+ sp->buf = InvalidBuffer;
}
/*
@@ -1610,6 +1619,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
so->currPos.prevPage = opaque->btpo_prev;
so->currPos.nextPage = opaque->btpo_next;
+ /* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */
Assert(!P_IGNORE(opaque));
Assert(BTScanPosIsPinned(so->currPos));
@@ -1626,8 +1636,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
so->currPos.currPage);
}
- /* initialize remaining currPos fields related to current page */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+ /* initialize most remaining currPos fields related to current page */
so->currPos.dir = dir;
so->currPos.nextTupleOffset = 0;
/* either moreLeft or moreRight should be set now (may be unset later) */
@@ -2251,12 +2260,14 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
*/
if (_bt_readpage(scan, dir, offnum, true))
{
+ Relation rel = scan->indexRelation;
+
/*
* _bt_readpage succeeded. Drop the lock (and maybe the pin) on
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
@@ -2413,7 +2424,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7..8b02f2c7f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3364,9 +3364,9 @@ _bt_killitems(IndexScanDesc scan)
int i;
int numKilled = so->numKilled;
bool killedsomething = false;
- bool droppedpin PG_USED_FOR_ASSERTS_ONLY;
Assert(BTScanPosIsValid(so->currPos));
+ Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */
/*
* Always reset the scan state, so we don't look for same items on other
@@ -3374,7 +3374,7 @@ _bt_killitems(IndexScanDesc scan)
*/
so->numKilled = 0;
- if (BTScanPosIsPinned(so->currPos))
+ if (!so->drop_pin)
{
/*
* We have held the pin on this page since we read the index tuples,
@@ -3382,7 +3382,7 @@ _bt_killitems(IndexScanDesc scan)
* re-use of any TID on the page, so there is no need to check the
* LSN.
*/
- droppedpin = false;
+ Assert(BTScanPosIsPinned(so->currPos));
_bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
page = BufferGetPage(so->currPos.buf);
@@ -3391,8 +3391,8 @@ _bt_killitems(IndexScanDesc scan)
{
Buffer buf;
- droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
+ Assert(!BTScanPosIsPinned(so->currPos));
buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
@@ -3442,7 +3442,7 @@ _bt_killitems(IndexScanDesc scan)
* correctness.
*
* Note that the page may have been modified in almost any way
- * since we first read it (in the !droppedpin case), so it's
+ * since we first read it (in the !so->drop_pin case), so it's
* possible that this posting list tuple wasn't a posting list
* tuple when we first encountered its heap TIDs.
*/
@@ -3458,7 +3458,7 @@ _bt_killitems(IndexScanDesc scan)
* though only in the common case where the page can't
* have been concurrently modified
*/
- Assert(kitem->indexOffset == offnum || !droppedpin);
+ Assert(kitem->indexOffset == offnum || !so->drop_pin);
/*
* Read-ahead to later kitems here.
--
2.49.0
Hi,
On 2025-05-19 18:13:02 +0200, Tomas Vondra wrote:
I believe enabling data checksums simply makes it more severe, because
BufferGetLSNAtomic() then has to obtain the buffer header lock, which uses
the same "state" field, with exactly the same retry logic. It can probably
happen even without checksums, but as the lock is exclusive, it also
"serializes" the access, making the conflicts more likely.
BufferGetLSNAtomic does this:
bufHdr = GetBufferDescriptor(buffer - 1);
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
AFAICS the lock is needed simply to read a consistent value from the
page header, but maybe we could have an atomic variable with a copy of
the LSN in the buffer descriptor?
I think we can do better - something like
#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
lsn = PageGetLSN(page);
#else
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
#endif
All perf relevant systems support reading 8 bytes without a chance of
tearing...
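In context that would look roughly like this (a paraphrased sketch of
BufferGetLSNAtomic(), not a verbatim patch against bufmgr.c):

```
XLogRecPtr
BufferGetLSNAtomic(Buffer buffer)
{
    char       *page = BufferGetPage(buffer);
    XLogRecPtr  lsn;

#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
    /* 8-byte reads can't tear here, so no buffer header lock is needed */
    lsn = PageGetLSN(page);
#else
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
    uint32      buf_state;

    buf_state = LockBufHdr(bufHdr);
    lsn = PageGetLSN(page);
    UnlockBufHdr(bufHdr, buf_state);
#endif

    return lsn;
}
```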
Greetings,
Andres Freund
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
I think we can do better - something like
#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
lsn = PageGetLSN(page);
#else
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
#endif
All perf relevant systems support reading 8 bytes without a chance of
tearing...
Even assuming that this scheme works perfectly, I'd still like to
pursue the idea I had about fixing this in nbtree.
The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.
--
Peter Geoghegan
On 5/19/25 20:44, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 2:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
The regular index scans, however, still have this issue, although it's not
as visible as for IOS.
We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.
We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface).
Attached is v2, which does things this way. What do you think?
v2 also manages to avoid calling BufferGetLSNAtomic during all bitmap
index scans. You didn't complain about any regressions in bitmap index
scans, but I see no reason to take any chances (it's easy to just not
call BufferGetLSNAtomic there).
Same effect as v1 for IOS, with regular index scans I see this:
64 clients: 0.7M tps
96 clients: 1.5M tps
So very similar improvement as for IOS.
regards
--
Tomas Vondra
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
Same effect as v1 for IOS, with regular index scans I see this:
64 clients: 0.7M tps
96 clients: 1.5M tps
So very similar improvement as for IOS.
Just to be clear, you're saying my v2 appears to fix the problem
fully, for both plain index scans and index-only scans? (You haven't
mentioned bitmap index scans at all, I think, even though they are
relevant to v2.)
I'd be surprised if my v2 *fully* fixed the issue with plain index
scans. It's only going to avoid calls to BufferGetLSNAtomic()
following _bt_readpage calls that return false, but I doubt that your
particular pgbench-variant test queries really look like that.
*Looks again* Oh, wait. This is another one of those benchmarks with
an index scan that returns no rows whatsoever (just like on the other
recent thread involving regressions tied to the introduction of a new
skip support support function to nbtree). Fully makes sense that my v2
would fix that particular problem, even with plain index scans.
But...what about slightly different queries, that actually do return
some rows? Those will look different.
--
Peter Geoghegan
On 5/19/25 22:29, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
Same effect as v1 for IOS, with regular index scans I see this:
64 clients: 0.7M tps
96 clients: 1.5M tps
So very similar improvement as for IOS.
Just to be clear, you're saying my v2 appears to fix the problem
fully, for both plain index scans and index-only scans? (You haven't
mentioned bitmap index scans at all, I think, even though they are
relevant to v2.)
With v2 the regression disappears, both for index-only scans and regular
index scans. I haven't tried anything with bitmap scans - I hit the
regression mostly by accident, on a workload that does IOS only. I may
try constructing something with bitmap scans, but I didn't have time for
that right now.
I'd be surprised if my v2 *fully* fixed the issue with plain index
scans. It's only going to avoid calls to BufferGetLSNAtomic()
following _bt_readpage calls that return false, but I doubt that your
particular pgbench-variant test queries really look like that.
Not sure. My workload is pretty simple, and similar to what I used in
the other nbtree regression (with malloc overhead), except that it's not
using partitioning.
pgbench -i -s 1 test
psql test <<EOF
update pgbench_accounts set bid = aid;
create index on pgbench_accounts(bid);
vacuum full;
analyze;
EOF
# select.sql
set enable_indexonlyscan = off;
select count(*) from pgbench_accounts where bid = 1
I don't know what "fully fix" means in this context. I see a massive
improvement with v2, I have no idea if that's the best we could do.
*Looks again* Oh, wait. This is another one of those benchmarks with
an index scan that returns no rows whatsoever (just like on the other
recent thread involving regressions tied to the introduction of a new
skip support support function to nbtree). Fully makes sense that my v2
would fix that particular problem, even with plain index scans.
But...what about slightly different queries, that actually do return
some rows? Those will look different.
Actually, this particular query does return rows (one, to be precise).
But you're right - it seems sensitive to how many rows are returned, and
at some point the contention goes away and there's no regression.
I need to do proper automated testing, to get reliable results. I've
been doing manual testing, but it's easy to make mistakes that way.
Do you have any suggestions what cases you'd like me to test?
regards
--
Tomas Vondra
On Mon, May 19, 2025 at 8:21 PM Tomas Vondra <tomas@vondra.me> wrote:
With v2 the regression disappears, both for index-only scans and regular
index scans. I haven't tried anything with bitmap scans - I hit the
regression mostly by accident, on a workload that does IOS only. I may
try constructing something with bitmap scans, but I didn't have time for
that right now.
I imagine bitmap index scans will be similar to plain index scans.
I don't know what "fully fix" means in this context. I see a massive
improvement with v2, I have no idea if that's the best we could do.
You expected there to be *zero* performance impact from enabling
checksums for this workload, since it is a pure read-only workload.
That's what I meant by "fully fix".
But you're right - it seems sensitive to how many rows are returned, and
at some point the contention goes away and there's no regression.
I need to do proper automated testing, to get reliable results. I've
been doing manual testing, but it's easy to make mistakes that way.
Do you have any suggestions what cases you'd like me to test?
Nothing comes to mind. Again, just be aware that completely avoiding
the BufferGetLSNAtomic call is only possible when:
* Scan is an index-only scan
OR
* Scan is a bitmap index scan
OR
* Scan is a plain index scan, reading a page that _bt_readpage()
returned "false" for when called.
In other words, plain index scans that read a lot of tuples might
receive no benefit whatsoever. It's possible that it already matters
less there anyway. It's also possible that there is some unforeseen
benefit from merely *delaying* the call to BufferGetLSNAtomic. But in
all likelihood these "unfixed" plain index scans are just as fast with
v2 as they are when run on master/baseline.
--
Peter Geoghegan
On Tue, May 20, 2025 at 8:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
I imagine bitmap index scans will be similar to plain index scans.
To be clear, I meant that the magnitude of the problem will be
similar. I do not mean that they aren't fixed by my v2. They should be
fully fixed by v2.
--
Peter Geoghegan
Hi,
On 2025-05-19 15:45:01 -0400, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
I think we can do better - something like
#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
lsn = PageGetLSN(page);
#else
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
#endif
All perf relevant systems support reading 8 bytes without a chance of
tearing...
Even assuming that this scheme works perfectly, I'd still like to
pursue the idea I had about fixing this in nbtree.
The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.
I'm not against that - ISTM we should do both.
Greetings,
Andres Freund
On Wed, May 21, 2025 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:
The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.
I'm not against that - ISTM we should do both.
Agreed.
Long term, we should move all of this stuff out of index AMs, which
don't have any good reason for doing their own thing. In a sense, the
known bugs in GiST and SP-GiST index-only scans (the "must not drop
the index page pin during index-only scans to avoid unsafe concurrent
TID recycling" bugs) exist because those index AMs couldn't just opt
in to some generic scheme, used by all index AMs that support
amgettuple-based scans.
Technically, the LSN is only needed to make kill_prior_tuple LP_DEAD
bit setting safe when the page pin is dropped. But it still seems
related to those GiST + SP-GiST IOS bugs, since we also need to
consider when dropping the pin is categorically unsafe (only nbtree
gets that aspect right currently).
--
Peter Geoghegan
Hi,
I finally had time to do more rigorous testing on the v1/v2 patches.
Attached is a .tgz with a test script that initializes a pgbench scale 1 database,
and then:
* Modifies the data to have different patterns / number of matching
rows, etc. This is done by scripts in the init/ directory.
* Runs queries that either match or do not match any rows. This is
done by scripts in the select/ directory.
* 32, 64 and 96 clients (the system has ~96 cores)
The scripts also force a particular scan type (bitmap/index/index-only),
and may also pin the processes to CPUs in different ways:
* default = no pinning, it's up to scheduler
* colocated = pgbench/backend always on the same core
* random = pgbench/backend always on a different random core
This is done by a custom pgbench patch (I can share it, if needed). I found
the pinning can have a *massive* impact in some cases.
There's also a CSV with raw results, and two PDF files with a summary of
the results:
* results-relative-speedup-vs-master.pdf - Shows throughput relative
to master (for the same client count), 100% means no difference.
* results-relative-speedup-vs-32.pdf - Slightly different view on the
data, showing "scalability" for a given build. It compares throughput
to the "expected" multiple of the result we got for 32 clients. 100%
means linear scalability.
As usual, green=good, red=bad. My observation is that v2 performs better
than v1 (more green, darker green). v2 helps even in cases where v1 did
not make any difference (e.g. some of the "nomatch" cases).
It's also interesting how much impact the pinning has - the "colocated"
results are much better. It's also interesting that in a couple of cases
we scale superlinearly, i.e. 96 clients get better throughput than 3x the
32-client result.
I've seen this before, and I believe it's due to behavior of the
hardware, and some kernel optimizations. Perhaps there's something we
could learn from this, not sure.
Anyway, as a comparison of v1 and v2 I think this is enough.
regards
--
Tomas Vondra
Attachments:
results-relative-speedup-vs-32.pdf