strange perf regression with data checksums
Hi,
While running some benchmarks comparing 17 and 18, I ran into a simple
workload where 18 throughput drops by ~80%. After pulling my hair for a
couple hours I realized the change that triggered this is 04bec894a04c,
which set checksums on by default. Which is very bizarre, because the
workload is read-only and fits into shared buffers.
I've only observed this on large machines with 96+ cores, on azure, both
with Intel (Xeon 8370C) and AMD (EPYC 9V33X). I've not been successful
in reproducing it on the puny machines I have at home.
Let me demonstrate the issue.
1) Create a cluster, with an increased connection limit:
pg_ctl -D data init
echo 'max_connections = 1000' >> data/postgresql.conf
pg_ctl -D data -l pg.log start
Now the benchmark itself - it's fairly trivial, regular pgbench on scale
1, with an extra index on "pgbench_accounts.bid" column:
createdb test
pgbench -i -s 1 test
psql test -c "create index on pgbench_accounts (bid)"
and a script, select.sql, with a simple query using the index:
select count(*) from pgbench_accounts where bid = 0
Cool, now let's get some numbers for 32-160 clients:
for c in 32 64 96 128 160; do
  pgbench -n -f select.sql -T 10 -M prepared -c $c -j $c test | grep 'tps';
done;
Which produces this:
tps = 752464.727387 (without initial connection time)
tps = 1062150.015297 (without initial connection time)
tps = 572205.386159 (without initial connection time)
tps = 568579.663980 (without initial connection time)
tps = 561360.493639 (without initial connection time)
Clearly, at 96 clients the throughput just tanks. Now let's disable
checksums on the cluster:
pg_ctl -D data -l pg.log stop
pg_checksums --disable data
pg_ctl -D data -l pg.log start
and run the script again
tps = 753484.468817 (without initial connection time)
tps = 1083752.631856 (without initial connection time)
tps = 1862008.466802 (without initial connection time)
tps = 1826484.489433 (without initial connection time)
tps = 1818400.279069 (without initial connection time)
Clearly, the throughput does not drop, and it's ~3.5x higher. This is
from the Xeon machine, but I've seen the same thing on the EPYC boxes.
The impact varies, but in general it's 70-80%.
I'm not suggesting this is caused by 04bec894a04c, or even specific to
PG 18. I see the same issue on 17, except that 17 does not enable
checksums by default. For example, on EPYC 9V74, PG 17 does this:
32 762187.724859
64 1284731.279766
96 2978828.264373
128 2991561.835178
160 2971689.161136
and with checksums
32 874266.549870
64 1286382.426281
96 569647.384735
128 562128.010301
160 561826.908181
So, similar regression ...
I find this quite bizarre / puzzling, because this is a read-only
workload, with absolutely no writes, and tiny data set (~15MB), i.e.
everything fits into shared buffers. Why would that be affected by
checksums at all?
I spent some time profiling this, without much success. This is what I
get from perf top:
Samples: 6M of event 'cycles:P', 4000 Hz, Event count (approx.):
5270683795302 lost: 0/0 drop: 0/0
Overhead Shared Object Symbol
50.94% postgres [.] pg_atomic_read_u32_impl
17.32% postgres [.] pg_atomic_compare_exchange_u32_impl
10.17% postgres [.] spin_delay
5.83% postgres [.] pg_atomic_fetch_or_u32_impl
1.64% postgres [.] pg_atomic_compare_exchange_u32_impl
1.20% postgres [.] BufferDescriptorGetBuffer
0.92% postgres [.] perform_spin_delay
and the report with backtraces says most of the time is spent here:
--97.00%--btgettuple
|
--96.98%--_bt_first
|
|--48.82%--_bt_readfirstpage
| |
| |--44.57%--_bt_steppage
| | |
| | --44.45%--ReleaseBuffer
| | |
| | --44.43%--UnpinBuffer
| | |
| | ...
| |...
|
--48.11%--_bt_search
|
--47.89%--_bt_relandgetbuf
The atomic ops come from pinning/unpinning buffers. I realize it's
possible it gets much more expensive under concurrency (the clients
simply have to compete when updating the same counter, and with enough
clients there'll be more conflicts and retries). Kinda unfortunate, and
maybe we should do something about it, not sure.
But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?
regards
--
Tomas Vondra
Hi Tomas,
While running some benchmarks comparing 17 and 18, I ran into a simple
workload where 18 throughput drops by ~80%. After pulling my hair for a
couple hours I realized the change that triggered this is 04bec894a04c,
which set checksums on by default. Which is very bizarre, because the
workload is read-only and fits into shared buffers.[...]
But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?
These are interesting results.
Just wanted to clarify: did you make sure that all the hint bits were
set before executing the benchmark?
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
--
Best regards,
Aleksander Alekseev
On 5/9/25 14:53, Aleksander Alekseev wrote:
Hi Tomas,
While running some benchmarks comparing 17 and 18, I ran into a simple
workload where 18 throughput drops by ~80%. After pulling my hair for a
couple hours I realized the change that triggered this is 04bec894a04c,
which set checksums on by default. Which is very bizarre, because the
workload is read-only and fits into shared buffers.[...]
But why would it depend on checksums at all? This read-only test should
be entirely in-memory, so how come it's affected?
These are interesting results.
Just wanted to clarify: did you make sure that all the hint bits were
set before executing the benchmark?
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
regards
--
Tomas Vondra
Hi,
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
There is one more thing I would check. As I recall perf shows only
on-CPU time while actually the backends may be sleeping on the locks
most of the time. If this is the case perf will not show you the
accurate picture.
In order to check this personally I create gdb.script with a single GDB command:
```
bt
```
And execute:
```
gdb --batch --command=gdb.script -p (backend_pid_here)
```
... 10+ times or so. If what you are observing is actually a lock
contention and the backend sleeps on a lock most of the time, 8/10 or
so stacktraces will show you this.
I assume of course that the benchmark is done on release builds with
disabled Asserts, etc.
BTW, do you believe this is a problem related exclusively to NUMA CPUs
with 90+ cores, or can I reproduce it on SMT as well?
--
Best regards,
Aleksander Alekseev
On 5/9/25 15:19, Aleksander Alekseev wrote:
Hi,
I'm not claiming that hint bits are necessarily the reason for the
observed behavior but when something is off with presumably read-only
queries this is the first reason that comes to mind. At least we
should make sure hint bits are excluded from the equation. If memory
serves, VACUUM FULL and CHECKPOINT after filling the table and
creating the index should do the trick.
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
There is one more thing I would check. As I recall perf shows only
on-CPU time while actually the backends may be sleeping on the locks
most of the time. If this is the case perf will not show you the
accurate picture.
Right, perf only shows on-CPU time (at least by default). But the
backends really are consuming 100% CPU (or close to it; pgbench needs
some CPU too), and there are no lock waits that I can see.
I forgot to share flamegraphs I collected on the EPYC machine, for cases
with 96 clients. So here they are.
In order to check this personally I create gdb.script with a single GDB command:
```
bt
```
And execute:
```
gdb --batch --command=gdb.script -p (backend_pid_here)
```
... 10+ times or so. If what you are observing is actually a lock
contention and the backend sleeps on a lock most of the time, 8/10 or
so stacktraces will show you this.
I assume of course that the benchmark is done on release builds with
disabled Asserts, etc.
Yes, these are regular builds with just --enable-debug --enable-depend,
nothing else (and certainly no asserts).
BTW, do you believe this is a problem related exclusively to NUMA CPUs
with 90+ cores, or can I reproduce it on SMT as well?
No idea. I couldn't reproduce this on my Ryzen machine at all, but that
only has 12 cores. The Xeon (with 2x22 cores) seems to reproduce it, but
the difference is much smaller (1.2M vs. 1.5M tps), and pin/unpin take
only ~5% of CPU, not 50%. Assuming it's the same issue, of course.
regards
--
Tomas Vondra
Hi,
I looked at this again, and I think the reason is mostly obvious - both
why it's thrashing, and why it happens with checksums=on ...
The reason why it happens is that PinBuffer does this:
    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);

        buf_state = old_buf_state;

        ... modify state ...

        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state,
                                           buf_state))
        {
            ...
            break;
        }
    }
So, we read the buffer state (which is where pins are tracked), possibly
waiting for it to get unlocked. Then we modify the state, and update it,
but only if it didn't change. If it did change, we retry.
Of course, as the number of sessions grows, the probability of something
updating the state in between increases. Another session might have
pinned the buffer, for example. This causes retries.
I added a couple counters to track how many loops are needed, and with
96 clients this needs about 800k retries per 100k calls, so about 8
retries per call. With 32 clients, this needs only about 25k retries, so
0.25 retry / call. That's a huge difference.
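To see the effect in isolation, here's a small standalone demo (plain C11
atomics and pthreads, not PostgreSQL code; all names are made up) where N
threads update a single shared state word with the same kind of CAS loop and
report the retries per successful update - the retry rate climbs steeply with
the thread count. Build with e.g. gcc -O2 -pthread and compare 32 vs. 96
threads:

```
/*
 * Standalone sketch, not PostgreSQL code: nthreads workers hammer a shared
 * "buffer state" word with a compare-and-swap loop and count how many
 * retries each successful update needs.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define CALLS_PER_THREAD 200000
#define MAX_THREADS 256

static atomic_uint buf_state;          /* stand-in for BufferDesc->state */
static atomic_ullong total_retries;

static void *
worker(void *arg)
{
    unsigned long long retries = 0;

    for (int i = 0; i < CALLS_PER_THREAD; i++)
    {
        unsigned    old = atomic_load(&buf_state);

        for (;;)
        {
            unsigned    newval = old + 1;   /* "modify state", e.g. bump a refcount */

            /* on failure, 'old' is reloaded with the current value */
            if (atomic_compare_exchange_strong(&buf_state, &old, newval))
                break;
            retries++;                      /* another thread got there first */
        }
    }
    atomic_fetch_add(&total_retries, retries);
    return NULL;
}

int
main(int argc, char **argv)
{
    int         nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t   threads[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;

    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    printf("%d threads: %.2f retries per call\n", nthreads,
           (double) total_retries / ((double) nthreads * CALLS_PER_THREAD));
    return 0;
}
```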
I believe enabling data checksums simply makes it more severe, because
BufferGetLSNAtomic() then has to obtain the buffer header lock, which uses
the same "state" field, with exactly the same retry logic. It can probably
happen even without checksums, but as the lock is exclusive, it also
"serializes" the access, making the conflicts more likely.
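For reference, LockBufHdr() is roughly this (paraphrased from bufmgr.c, so
treat it as a sketch rather than exact code) - it spins on the very same
state word, which is also why pg_atomic_fetch_or_u32_impl, spin_delay and
perform_spin_delay show up in the profile above:

```
uint32
LockBufHdr(BufferDesc *desc)
{
    SpinDelayStatus delayStatus;
    uint32      old_buf_state;

    init_local_spin_delay(&delayStatus);

    while (true)
    {
        /* set BM_LOCKED flag on the shared state word */
        old_buf_state = pg_atomic_fetch_or_u32(&desc->state, BM_LOCKED);
        /* if it wasn't set before, we now hold the header lock */
        if (!(old_buf_state & BM_LOCKED))
            break;
        perform_spin_delay(&delayStatus);
    }
    finish_spin_delay(&delayStatus);

    return old_buf_state | BM_LOCKED;
}
```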
BufferGetLSNAtomic does this:
bufHdr = GetBufferDescriptor(buffer - 1);
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
AFAICS the lock is needed simply to read a consistent value from the
page header, but maybe we could have an atomic variable with a copy of
the LSN in the buffer descriptor?
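Something along these lines, perhaps (a sketch only - the struct, field and
function names are hypothetical, and the write side that refreshes the copy
whenever the page LSN changes is not shown):

```
/*
 * Sketch only, names are hypothetical: keep a mirror of the page LSN in a
 * 64-bit atomic in the buffer descriptor, so readers can get it with a
 * single atomic read instead of LockBufHdr()/UnlockBufHdr().
 */
typedef struct BufferDescSketch
{
    pg_atomic_uint32 state;     /* existing state word (pins, flags, lock) */
    pg_atomic_uint64 lsn_copy;  /* hypothetical mirror of the page LSN */
} BufferDescSketch;

static inline XLogRecPtr
BufferGetLSNCopy(BufferDescSketch *bufHdr)
{
    /* one 64-bit atomic read, no buffer header lock */
    return (XLogRecPtr) pg_atomic_read_u64(&bufHdr->lsn_copy);
}
```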
regards
--
Tomas Vondra
Attachments:
On Fri, May 9, 2025 at 9:06 AM Tomas Vondra <tomas@vondra.me> wrote:
Good question. I haven't checked that explicitly, but it's a tiny data
set (15MB) and I observed this even on long benchmarks with tens of
millions of queries. So the hint bits should have been set.
Also, I should have mentioned the query does an index-only scan, and the
pin/unpin calls are on index pages, not on the heap.
We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.
In general, we call BufferGetLSNAtomic() to stash the page LSN within
so->currPos.lsn, for later. so->currPos.lsn provides us with a way to
detect whether the page was modified during the period in which we
dropped our pin on the leaf page -- plain index scans cannot set
LP_DEAD bits on dead index tuples within _bt_killitems() if the page
has changed. But, index-only scans never drop the pin on the leaf page
to begin with, and so don't even use so->currPos.lsn (bitmap index
scans *never* call _bt_killitems(), and so obviously have no possible
use for so->currPos.lsn, either).
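For reference, the consumer of so->currPos.lsn is _bt_killitems(), roughly
like this (a paraphrased sketch, not a verbatim copy of nbtutils.c):

```
Buffer      buf;
Page        page;

if (BTScanPosIsPinned(so->currPos))
{
    /* Pin held since _bt_readpage: no concurrent TID recycling possible */
    _bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
    page = BufferGetPage(so->currPos.buf);
}
else
{
    /* Pin was dropped: re-read the page and check whether it changed */
    buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
    page = BufferGetPage(buf);

    if (BufferGetLSNAtomic(buf) == so->currPos.lsn)
        so->currPos.buf = buf;
    else
    {
        /* Page changed while unpinned: setting LP_DEAD bits isn't safe */
        _bt_relbuf(scan->indexRelation, buf);
        return;
    }
}
/* ... proceed to set LP_DEAD bits on 'page' ... */
```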
--
Peter Geoghegan
On Mon, May 19, 2025 at 12:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.
Attached quick-and-dirty prototype patch shows how this could work. It
fully avoids calling BufferGetLSNAtomic() from _bt_readpage() during
index-only scans. I find that "meson test" passes with the patch
applied (I'm reasonably confident that this general approach is
correct).
Does this patch of mine actually fix the regressions that you're
concerned about?
--
Peter Geoghegan
Attachments:
v1-0001-Avoid-BufferGetLSNAtomic-calls-during-index-only-.patch
From dc023aa6ece08eba610e5b7a0a69f82c62113be0 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 May 2025 13:05:28 -0400
Subject: [PATCH v1] Avoid BufferGetLSNAtomic() calls during index-only scans.
---
src/include/access/nbtree.h | 1 +
src/backend/access/nbtree/nbtpreprocesskeys.c | 8 +++++++
src/backend/access/nbtree/nbtsearch.c | 22 ++++++++++---------
src/backend/access/nbtree/nbtutils.c | 11 +++++-----
4 files changed, 26 insertions(+), 16 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588..933507ec3 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1054,6 +1054,7 @@ typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
bool qual_ok; /* false if qual can never be satisfied */
+ bool drop_pin; /* true if it's safe to drop leaf page pins */
int numberOfKeys; /* number of preprocessed scan keys */
ScanKey keyData; /* array of preprocessed scan keys */
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..b38fe013a 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -205,6 +205,14 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* initialize result variables */
so->qual_ok = true;
+
+ /*
+ * Can only safely drop leaf page pins during plain index scans of logged
+ * relations (XXX What about bitmap index scans?)
+ */
+ so->drop_pin = (IsMVCCSnapshot(scan->xs_snapshot) &&
+ RelationNeedsWAL(scan->indexRelation) &&
+ !scan->xs_want_itup);
so->numberOfKeys = 0;
if (numberOfKeys < 1)
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fe9a38869..7d6361949 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so,
+ BTScanPos sp);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -63,14 +64,12 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
*
* See nbtree/README section on making concurrent TID recycling safe.
*/
-static void
-_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+static inline void
+_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so, BTScanPos sp)
{
- _bt_unlockbuf(scan->indexRelation, sp->buf);
+ _bt_unlockbuf(rel, sp->buf);
- if (IsMVCCSnapshot(scan->xs_snapshot) &&
- RelationNeedsWAL(scan->indexRelation) &&
- !scan->xs_want_itup)
+ if (so->drop_pin)
{
ReleaseBuffer(sp->buf);
sp->buf = InvalidBuffer;
@@ -1627,7 +1626,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
}
/* initialize remaining currPos fields related to current page */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+ if (so->drop_pin)
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
so->currPos.dir = dir;
so->currPos.nextTupleOffset = 0;
/* either moreLeft or moreRight should be set now (may be unset later) */
@@ -2251,12 +2251,14 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
*/
if (_bt_readpage(scan, dir, offnum, true))
{
+ Relation rel = scan->indexRelation;
+
/*
* _bt_readpage succeeded. Drop the lock (and maybe the pin) on
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
@@ -2413,7 +2415,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7..7f72a1a77 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3364,7 +3364,6 @@ _bt_killitems(IndexScanDesc scan)
int i;
int numKilled = so->numKilled;
bool killedsomething = false;
- bool droppedpin PG_USED_FOR_ASSERTS_ONLY;
Assert(BTScanPosIsValid(so->currPos));
@@ -3374,7 +3373,7 @@ _bt_killitems(IndexScanDesc scan)
*/
so->numKilled = 0;
- if (BTScanPosIsPinned(so->currPos))
+ if (!so->drop_pin)
{
/*
* We have held the pin on this page since we read the index tuples,
@@ -3382,7 +3381,7 @@ _bt_killitems(IndexScanDesc scan)
* re-use of any TID on the page, so there is no need to check the
* LSN.
*/
- droppedpin = false;
+ Assert(BTScanPosIsPinned(so->currPos));
_bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
page = BufferGetPage(so->currPos.buf);
@@ -3391,8 +3390,8 @@ _bt_killitems(IndexScanDesc scan)
{
Buffer buf;
- droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
+ Assert(!BTScanPosIsPinned(so->currPos));
buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
@@ -3442,7 +3441,7 @@ _bt_killitems(IndexScanDesc scan)
* correctness.
*
* Note that the page may have been modified in almost any way
- * since we first read it (in the !droppedpin case), so it's
+ * since we first read it (in the !so->drop_pin case), so it's
* possible that this posting list tuple wasn't a posting list
* tuple when we first encountered its heap TIDs.
*/
@@ -3458,7 +3457,7 @@ _bt_killitems(IndexScanDesc scan)
* though only in the common case where the page can't
* have been concurrently modified
*/
- Assert(kitem->indexOffset == offnum || !droppedpin);
+ Assert(kitem->indexOffset == offnum || !so->drop_pin);
/*
* Read-ahead to later kitems here.
--
2.49.0
On 5/19/25 19:20, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 12:42 PM Peter Geoghegan <pg@bowt.ie> wrote:
We don't actually need to call BufferGetLSNAtomic() from _bt_readpage
during index-only scans (nor during bitmap index scans). We can just
not call BufferGetLSNAtomic() at all (except during plain index
scans), with no possible downside.
Attached quick-and-dirty prototype patch shows how this could work. It
fully avoids calling BufferGetLSNAtomic() from _bt_readpage() during
index-only scans. I find that "meson test" passes with the patch
applied (I'm reasonably confident that this general approach is
correct).
Does this patch of mine actually fix the regressions that you're
concerned about?
For index-only scans, yes. The numbers I used to see were
64 clients: 1.2M tps
96 clients: 0.6M tps
and with the patch it's
64 clients: 1.2M tps
96 clients: 2.9M tps
I'm aware it more than doubles, even though the client count grew only 1.5x.
I believe this is due to a VM config issue I saw earlier, but it wasn't
fixed on this particular VM (unfortunate, but out of my control).
The regular index scans, however, still have this issue, although it's not
as visible as for IOS. If I disable IOS, the throughput is
64 clients: 0.7M tps
96 clients: 0.6M tps
And the profile looks like this:
Overhead Shared Object Symbol
28.85% postgres [.] PinBuffer
22.61% postgres [.] UnpinBufferNoOwner
9.67% postgres [.] BufferGetLSNAtomic
4.92% [kernel] [k] finish_task_switch.isra.0
3.78% [kernel] [k] _raw_spin_unlock_irqrestore
...
In an earlier message I mentioned that maybe we could add an atomic variable
tracking the page LSN, so that we don't have to obtain the buffer header
lock. I haven't had time to try that yet.
regards
--
Tomas Vondra
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
For index-only scans, yes.
Great.
The regular index scans, however, still have this issue, although it's not
as visible as for IOS.
We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.
We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface). In other words, we only
need to call BufferGetLSNAtomic for pages that _bt_readpage returns
"true" for when called. There are plenty of leaf pages that
_bt_readpage will return "false" for, especially during large range
scans, and skip scans.
It's easy to see why this extension to my v1 POC is correct: the whole
point of dropping the leaf page pin is that we don't block VACUUM when
btgettuple returns -- but btgettuple isn't going to return until the
next call to _bt_readpage that returns "true" actually takes place (or
until the whole scan ends).
--
Peter Geoghegan
On Mon, May 19, 2025 at 2:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
The regular index scans, however, still have this issue, although it's not
as visible as for IOS.
We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.
We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface).
Attached is v2, which does things this way. What do you think?
v2 also manages to avoid calling BufferGetLSNAtomic during all bitmap
index scans. You didn't complain about any regressions in bitmap index
scans, but I see no reason to take any chances (it's easy to just not
call BufferGetLSNAtomic there).
--
Peter Geoghegan
Attachments:
v2-0001-Avoid-BufferGetLSNAtomic-calls-during-nbtree-scan.patch
From 957019074b42bcf0332a353634e8d37b52bd974a Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 19 May 2025 13:05:28 -0400
Subject: [PATCH v2] Avoid BufferGetLSNAtomic() calls during nbtree scans.
Avoid BufferGetLSNAtomic() calls during nbtree index-only scans, bitmap
index scans, and page reads of pages that won't need to return items to
the scan via the btgettuple interface.
---
src/include/access/nbtree.h | 3 +-
src/backend/access/nbtree/nbtpreprocesskeys.c | 13 +++++++
src/backend/access/nbtree/nbtree.c | 4 ++
src/backend/access/nbtree/nbtsearch.c | 39 ++++++++++++-------
src/backend/access/nbtree/nbtutils.c | 12 +++---
5 files changed, 50 insertions(+), 21 deletions(-)
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index ebca02588..1b0f143cf 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -967,7 +967,7 @@ typedef struct BTScanPosData
BlockNumber currPage; /* page referenced by items array */
BlockNumber prevPage; /* currPage's left link */
BlockNumber nextPage; /* currPage's right link */
- XLogRecPtr lsn; /* currPage's LSN */
+ XLogRecPtr lsn; /* currPage's LSN (iff so.drop_pin) */
/* scan direction for the saved position's call to _bt_readpage */
ScanDirection dir;
@@ -1054,6 +1054,7 @@ typedef struct BTScanOpaqueData
{
/* these fields are set by _bt_preprocess_keys(): */
bool qual_ok; /* false if qual can never be satisfied */
+ bool drop_pin; /* true if it's safe to drop leaf page pins */
int numberOfKeys; /* number of preprocessed scan keys */
ScanKey keyData; /* array of preprocessed scan keys */
diff --git a/src/backend/access/nbtree/nbtpreprocesskeys.c b/src/backend/access/nbtree/nbtpreprocesskeys.c
index a136e4bbf..ae48f49d7 100644
--- a/src/backend/access/nbtree/nbtpreprocesskeys.c
+++ b/src/backend/access/nbtree/nbtpreprocesskeys.c
@@ -205,6 +205,19 @@ _bt_preprocess_keys(IndexScanDesc scan)
/* initialize result variables */
so->qual_ok = true;
+
+ /*
+ * We prefer to eagerly drop a leaf page pin to avoid blocking concurrent
+ * heap TID recycling by VACUUM. But we cannot safely drop leaf page pins
+ * during index-only scans, nor scans of logged relations, where checking
+ * if the page's LSN changed while the pin was dropped isn't sufficient.
+ * (Setting so->drop_pin=true doesn't meaningfully affect the behavior of
+ * bitmap index scans, so they always set it 'false' to avoid needlessly
+ * calling BufferGetLSNAtomic from _bt_readpage.)
+ */
+ so->drop_pin = (IsMVCCSnapshot(scan->xs_snapshot) &&
+ RelationNeedsWAL(scan->indexRelation) &&
+ !scan->xs_want_itup && scan->heapRelation);
so->numberOfKeys = 0;
if (numberOfKeys < 1)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 765659887..aad896327 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -228,6 +228,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
BTScanOpaque so = (BTScanOpaque) scan->opaque;
bool res;
+ Assert(scan->heapRelation != NULL);
+
/* btree indexes are never lossy */
scan->xs_recheck = false;
@@ -289,6 +291,8 @@ btgetbitmap(IndexScanDesc scan, TIDBitmap *tbm)
int64 ntids = 0;
ItemPointer heapTid;
+ Assert(scan->heapRelation == NULL);
+
/* Each loop iteration performs another primitive index scan */
do
{
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index fe9a38869..851825a64 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -25,7 +25,8 @@
#include "utils/rel.h"
-static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
+static inline void _bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so,
+ BTScanPos sp);
static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
Buffer buf, bool forupdate, BTStack stack,
int access);
@@ -63,18 +64,26 @@ static bool _bt_endpoint(IndexScanDesc scan, ScanDirection dir);
*
* See nbtree/README section on making concurrent TID recycling safe.
*/
-static void
-_bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
+static inline void
+_bt_drop_lock_and_maybe_pin(Relation rel, BTScanOpaque so, BTScanPos sp)
{
- _bt_unlockbuf(scan->indexRelation, sp->buf);
-
- if (IsMVCCSnapshot(scan->xs_snapshot) &&
- RelationNeedsWAL(scan->indexRelation) &&
- !scan->xs_want_itup)
+ if (!so->drop_pin)
{
- ReleaseBuffer(sp->buf);
- sp->buf = InvalidBuffer;
+ /* Just drop the lock (not the pin) */
+ _bt_unlockbuf(rel, sp->buf);
+ return;
}
+
+ /*
+ * Drop the lock and the pin.
+ *
+ * Have to establish so->currPos.lsn to provide _bt_killitems with an
+ * alternative mechanism to ensure that no unsafe concurrent heap TID
+ * recycling has taken place.
+ */
+ so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+ _bt_relbuf(rel, so->currPos.buf);
+ sp->buf = InvalidBuffer;
}
/*
@@ -1610,6 +1619,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
so->currPos.currPage = BufferGetBlockNumber(so->currPos.buf);
so->currPos.prevPage = opaque->btpo_prev;
so->currPos.nextPage = opaque->btpo_next;
+ /* delay setting so->currPos.lsn until _bt_drop_lock_and_maybe_pin */
Assert(!P_IGNORE(opaque));
Assert(BTScanPosIsPinned(so->currPos));
@@ -1626,8 +1636,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
so->currPos.currPage);
}
- /* initialize remaining currPos fields related to current page */
- so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf);
+ /* initialize most remaining currPos fields related to current page */
so->currPos.dir = dir;
so->currPos.nextTupleOffset = 0;
/* either moreLeft or moreRight should be set now (may be unset later) */
@@ -2251,12 +2260,14 @@ _bt_readfirstpage(IndexScanDesc scan, OffsetNumber offnum, ScanDirection dir)
*/
if (_bt_readpage(scan, dir, offnum, true))
{
+ Relation rel = scan->indexRelation;
+
/*
* _bt_readpage succeeded. Drop the lock (and maybe the pin) on
* so->currPos.buf in preparation for btgettuple returning tuples.
*/
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
@@ -2413,7 +2424,7 @@ _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno,
*/
Assert(so->currPos.currPage == blkno);
Assert(BTScanPosIsPinned(so->currPos));
- _bt_drop_lock_and_maybe_pin(scan, &so->currPos);
+ _bt_drop_lock_and_maybe_pin(rel, so, &so->currPos);
return true;
}
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 1a15dfcb7..8b02f2c7f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -3364,9 +3364,9 @@ _bt_killitems(IndexScanDesc scan)
int i;
int numKilled = so->numKilled;
bool killedsomething = false;
- bool droppedpin PG_USED_FOR_ASSERTS_ONLY;
Assert(BTScanPosIsValid(so->currPos));
+ Assert(scan->heapRelation != NULL); /* can't be a bitmap index scan */
/*
* Always reset the scan state, so we don't look for same items on other
@@ -3374,7 +3374,7 @@ _bt_killitems(IndexScanDesc scan)
*/
so->numKilled = 0;
- if (BTScanPosIsPinned(so->currPos))
+ if (!so->drop_pin)
{
/*
* We have held the pin on this page since we read the index tuples,
@@ -3382,7 +3382,7 @@ _bt_killitems(IndexScanDesc scan)
* re-use of any TID on the page, so there is no need to check the
* LSN.
*/
- droppedpin = false;
+ Assert(BTScanPosIsPinned(so->currPos));
_bt_lockbuf(scan->indexRelation, so->currPos.buf, BT_READ);
page = BufferGetPage(so->currPos.buf);
@@ -3391,8 +3391,8 @@ _bt_killitems(IndexScanDesc scan)
{
Buffer buf;
- droppedpin = true;
/* Attempt to re-read the buffer, getting pin and lock. */
+ Assert(!BTScanPosIsPinned(so->currPos));
buf = _bt_getbuf(scan->indexRelation, so->currPos.currPage, BT_READ);
page = BufferGetPage(buf);
@@ -3442,7 +3442,7 @@ _bt_killitems(IndexScanDesc scan)
* correctness.
*
* Note that the page may have been modified in almost any way
- * since we first read it (in the !droppedpin case), so it's
+ * since we first read it (in the !so->drop_pin case), so it's
* possible that this posting list tuple wasn't a posting list
* tuple when we first encountered its heap TIDs.
*/
@@ -3458,7 +3458,7 @@ _bt_killitems(IndexScanDesc scan)
* though only in the common case where the page can't
* have been concurrently modified
*/
- Assert(kitem->indexOffset == offnum || !droppedpin);
+ Assert(kitem->indexOffset == offnum || !so->drop_pin);
/*
* Read-ahead to later kitems here.
--
2.49.0
Hi,
On 2025-05-19 18:13:02 +0200, Tomas Vondra wrote:
I believe enabling data checksums simply makes it more severe, because
BufferGetLSNAtomic() then has to obtain the buffer header lock, which uses
the same "state" field, with exactly the same retry logic. It can probably
happen even without checksums, but as the lock is exclusive, it also
"serializes" the access, making the conflicts more likely.
BufferGetLSNAtomic does this:
bufHdr = GetBufferDescriptor(buffer - 1);
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
AFAICS the lock is needed simply to read a consistent value from the
page header, but maybe we could have an atomic variable with a copy of
the LSN in the buffer descriptor?
I think we can do better - something like
#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
lsn = PageGetLSN(page);
#else
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
#endif
All perf relevant systems support reading 8 bytes without a chance of
tearing...
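In context that would look roughly like this (a paraphrased sketch of
BufferGetLSNAtomic(), not a verbatim patch against bufmgr.c):

```
XLogRecPtr
BufferGetLSNAtomic(Buffer buffer)
{
    char       *page = BufferGetPage(buffer);
    XLogRecPtr  lsn;

#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
    /* 8-byte reads can't tear here, so no buffer header lock is needed */
    lsn = PageGetLSN(page);
#else
    BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
    uint32      buf_state;

    buf_state = LockBufHdr(bufHdr);
    lsn = PageGetLSN(page);
    UnlockBufHdr(bufHdr, buf_state);
#endif

    return lsn;
}
```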
Greetings,
Andres Freund
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
I think we can do better - something like
#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
lsn = PageGetLSN(page);
#else
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
#endif
All perf relevant systems support reading 8 bytes without a chance of
tearing...
Even assuming that this scheme works perfectly, I'd still like to
pursue the idea I had about fixing this in nbtree.
The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.
--
Peter Geoghegan
On 5/19/25 20:44, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 2:19 PM Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, May 19, 2025 at 2:01 PM Tomas Vondra <tomas@vondra.me> wrote:
The regular index scans, however, still have this issue, although it's not
as visible as for IOS.
We can do somewhat better with plain index scans than my initial v1
prototype, without any major difficulties. There's more low-hanging
fruit.
We could also move the call to BufferGetLSNAtomic (that currently
takes place inside _bt_readpage) over to _bt_drop_lock_and_maybe_pin.
That way we'd only need to call BufferGetLSNAtomic for those leaf
pages that will actually need to have some index tuples returned to
the scan (through the btgettuple interface).
Attached is v2, which does things this way. What do you think?
v2 also manages to avoid calling BufferGetLSNAtomic during all bitmap
index scans. You didn't complain about any regressions in bitmap index
scans, but I see no reason to take any chances (it's easy to just not
call BufferGetLSNAtomic there).
Same effect as v1 for IOS, with regular index scans I see this:
64 clients: 0.7M tps
96 clients: 1.5M tps
So very similar improvement as for IOS.
regards
--
Tomas Vondra
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
Same effect as v1 for IOS, with regular index scans I see this:
64 clients: 0.7M tps
96 clients: 1.5M tps
So very similar improvement as for IOS.
Just to be clear, you're saying my v2 appears to fix the problem
fully, for both plain index scans and index-only scans? (You haven't
mentioned bitmap index scans at all, I think, even though they are
relevant to v2.)
I'd be surprised if my v2 *fully* fixed the issue with plain index
scans. It's only going to avoid calls to BufferGetLSNAtomic()
following _bt_readpage calls that return false, but I doubt that your
particular pgbench-variant test queries really look like that.
*Looks again* Oh, wait. This is another one of those benchmarks with
an index scan that returns no rows whatsoever (just like on the other
recent thread involving regressions tied to the introduction of a new
skip support support function to nbtree). Fully makes sense that my v2
would fix that particular problem, even with plain index scans.
But...what about slightly different queries, that actually do return
some rows? Those will look different.
--
Peter Geoghegan
On 5/19/25 22:29, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 4:17 PM Tomas Vondra <tomas@vondra.me> wrote:
Same effect as v1 for IOS, with regular index scans I see this:
64 clients: 0.7M tps
96 clients: 1.5M tps
So very similar improvement as for IOS.
Just to be clear, you're saying my v2 appears to fix the problem
fully, for both plain index scans and index-only scans? (You haven't
mentioned bitmap index scans at all, I think, even though they are
relevant to v2.)
With v2 the regression disappears, both for index-only scans and regular
index scans. I haven't tried anything with bitmap scans - I hit the
regression mostly by accident, on a workload that does IOS only. I may
try constructing something with bitmap scans, but I didn't have time for
that right now.
I'd be surprised if my v2 *fully* fixed the issue with plain index
scans. It's only going to avoid calls to BufferGetLSNAtomic()
following _bt_readpage calls that return false, but I doubt that your
particular pgbench-variant test queries really look like that.
Not sure. My workload is pretty simple, and similar to what I used in
the other nbtree regression (with malloc overhead), except that it's not
using partitioning.
pgbench -i -s 1 test
psql test <<EOF
update pgbench_accounts set bid = aid;
create index on pgbench_accounts(bid);
vacuum full;
analyze;
EOF
# select.sql
set enable_indexonlyscan = off;
select count(*) from pgbench_accounts where bid = 1
I don't know what "fully fix" means in this context. I see a massive
improvement with v2, I have no idea if that's the best we could do.
*Looks again* Oh, wait. This is another one of those benchmarks with
an index scan that returns no rows whatsoever (just like on the other
recent thread involving regressions tied to the introduction of a new
skip support support function to nbtree). Fully makes sense that my v2
would fix that particular problem, even with plain index scans.
But...what about slightly different queries, that actually do return
some rows? Those will look different.
Actually, this particular query does return rows (one, to be precise).
But you're right - it seems sensitive to how many rows are returned, and
at some point the contention goes away and there's no regression.
I need to do proper automated testing, to get reliable results. I've
been doing manual testing, but it's easy to make mistakes that way.
Do you have any suggestions what cases you'd like me to test?
regards
--
Tomas Vondra
On Mon, May 19, 2025 at 8:21 PM Tomas Vondra <tomas@vondra.me> wrote:
With v2 the regression disappears, both for index-only scans and regular
index scans. I haven't tried anything with bitmap scans - I hit the
regression mostly by accident, on a workload that does IOS only. I may
try constructing something with bitmap scans, but I didn't have time for
that right now.
I imagine bitmap index scans will be similar to plain index scans.
I don't know what "fully fix" means in this context. I see a massive
improvement with v2, I have no idea if that's the best we could do.
You expected there to be *zero* performance impact from enabling
checksums for this workload, since it is a pure read-only workload.
That's what I meant by "fully fix".
But you're right - it seems sensitive to how many rows are returned, and
at some point the contention goes away and there's no regression.
I need to do proper automated testing, to get reliable results. I've
been doing manual testing, but it's easy to make mistakes that way.
Do you have any suggestions what cases you'd like me to test?
Nothing comes to mind. Again, just be aware that completely avoiding
the BufferGetLSNAtomic call is only possible when:
* Scan is an index-only scan
OR
* Scan is a bitmap index scan
OR
* Scan is a plain index scan, reading a page that _bt_readpage()
returned "false" for when called.
In other words, plain index scans that read a lot of tuples might
receive no benefit whatsoever. It's possible that it already matters
less there anyway. It's also possible that there is some unforeseen
benefit from merely *delaying* the call to BufferGetLSNAtomic. But in
all likelihood these "unfixed" plain index scans are just as fast with
v2 as they are when run on master/baseline.
--
Peter Geoghegan
On Tue, May 20, 2025 at 8:43 AM Peter Geoghegan <pg@bowt.ie> wrote:
I imagine bitmap index scans will be similar to plain index scans.
To be clear, I meant that the magnitude of the problem will be
similar. I do not mean that they aren't fixed by my v2. They should be
fully fixed by v2.
--
Peter Geoghegan
Hi,
On 2025-05-19 15:45:01 -0400, Peter Geoghegan wrote:
On Mon, May 19, 2025 at 3:37 PM Andres Freund <andres@anarazel.de> wrote:
I think we can do better - something like
#ifdef PG_HAVE_8BYTE_SINGLE_COPY_ATOMICITY
lsn = PageGetLSN(page);
#else
buf_state = LockBufHdr(bufHdr);
lsn = PageGetLSN(page);
UnlockBufHdr(bufHdr, buf_state);
#endif
All perf relevant systems support reading 8 bytes without a chance of
tearing...
Even assuming that this scheme works perfectly, I'd still like to
pursue the idea I had about fixing this in nbtree.
The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.
I'm not against that - ISTM we should do both.
Greetings,
Andres Freund
On Wed, May 21, 2025 at 12:29 PM Andres Freund <andres@anarazel.de> wrote:
The relevant nbtree code seems more elegant if we avoid calling
BufferGetLSNAtomic() unless and until its return value might actually
be useful. It's quite a lot easier to understand the true purpose of
so->currPos.lsn with this new structure.
I'm not against that - ISTM we should do both.
Agreed.
Long term, we should move all of this stuff out of index AMs, which
don't have any good reason for doing their own thing. In a sense, the
known bugs in GiST and SP-GiST index-only scans (the "must not drop
the index page pin during index-only scans to avoid unsafe concurrent
TID recycling" bugs) exist because those index AMs couldn't just opt
in to some generic scheme, used by all index AMs that support
amgettuple-based scans.
Technically, the LSN is only needed to make kill_prior_tuple LP_DEAD
bit setting safe when the page pin is dropped. But it still seems
related to those GiST + SP-GiST IOS bugs, since we also need to
consider when dropping the pin is categorically unsafe (only nbtree
gets that aspect right currently).
--
Peter Geoghegan
Hi,
I finally had time to do more rigorous testing on the v1/v2 patches.
Attached is a .tgz with a test script that initializes a pgbench scale 1 database,
and then:
* Modifies the data to have different patterns / number of matching
rows, etc. This is done by scripts in the init/ directory.
* Runs queries that either match or do not match any rows. This is
done by scripts in the select/ directory.
* 32, 64 and 96 clients (the system has ~96 cores)
The scripts also force a particular scan type (bitmap/index/index-only),
and may also pin the processes to CPUs in different ways:
* default = no pinning, it's up to scheduler
* colocated = pgbench/backend always on the same core
* random = pgbench/backend always on a different random core
This is done by a custom pgbench patch (I can share it, if needed). I found
the pinning can have a *massive* impact in some cases.
There's also a CSV with raw results, and two PDF files with a summary of
the results:
* results-relative-speedup-vs-master.pdf - Shows throughput relative
to master (for the same client count), 100% means no difference.
* results-relative-speedup-vs-32.pdf - Slightly different view on the
data, showing "scalability" for a given build. It compares throughput
to the "expected" multiple of the result we got for 32 clients. 100%
means linear scalability.
As usual, green=good, red=bad. My observation is that v2 performs better
than v1 (more green, darker green). v2 helps even in cases where v1 did
not make any difference (e.g. some of the "nomatch" cases).
It's also interesting how much impact the pinning has - the "colocated"
results are much better. It's also interesting that in a couple of cases
we scale superlinearly, i.e. 96 clients get better throughput than 3x the
32-client result.
I've seen this before, and I believe it's due to behavior of the
hardware, and some kernel optimizations. Perhaps there's something we
could learn from this, not sure.
Anyway, as a comparison of v1 and v2 I think this is enough.
regards
--
Tomas Vondra
Attachments:
results-relative-speedup-vs-32.pdf