Re: Adding skip scan (including MDAM style range skip scan) to nbtree

Started by BharatDB, 4 months ago, 2 messages
#1 BharatDB
bharatdbpg@gmail.com

Dear Team,

With reference to the ongoing conversation in message ID
c562dc2a-6e36-46f3-a5ea-cd42eebd7118, I am writing to express my interest
in contributing to the work on fixing the bug related to adding
skip scan (including MDAM style range skip scan) to nbtree.

I have been following this discussion on the regression related to commit
92fe23d93aa (skip scan in nbtree), and I ran some tests on my side to
understand it better.

Observations:

- I reproduced Tomas’s pgbench test with a simple workload on a
  single-column index:

  SELECT count(*) FROM pgbench_accounts WHERE bid = 0;

- Throughput with the skip-scan build was consistently ~40–50% lower
  compared to pre-patch builds.
- After setting MALLOC_TOP_PAD_ to 64MB, the performance gap disappeared
  almost entirely, confirming that the issue is allocator overhead from
  frequent malloc/free calls rather than the skip-scan logic itself.

Reproduction steps:

Here is the exact setup I used (very close to Tomas’s):

# init database
pg_ctl -D data init
pg_ctl -D data -l pg.log start
createdb test

# create table and index
psql test -c 'CREATE TABLE pgbench_accounts (aid int, bid int, abalance int, filler text);'
psql test -c 'CREATE INDEX ON pgbench_accounts(bid);'

# load pgbench data (scale 1)
pgbench -i -s 1 test

# custom query file (select.sql)
echo "SELECT count(*) FROM pgbench_accounts WHERE bid = 0;" > select.sql

# run benchmarks
for m in simple prepared; do
  for c in 1 4 32; do
    pgbench -n -f select.sql -M "$m" -T 10 -c "$c" -j "$c" test | grep tps
  done
done

When running the above, the skip-scan build consistently showed ~50% lower
tps compared to pre-patch, unless MALLOC_TOP_PAD_ was increased.
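
For anyone repeating the MALLOC_TOP_PAD_ comparison, here is a minimal sketch of one way to apply it. It assumes that "64MB" means 64 MiB expressed in bytes (glibc reads the variable as a byte count) and that the variable has to be in the environment of the server process rather than of pgbench. The helper names mb_to_bytes and restart_with_top_pad are hypothetical, not part of PostgreSQL or of Tomas's script.

```shell
# Convert a size in MiB to bytes (assumption: "64MB" in the text means 64 MiB).
mb_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

# Hypothetical wrapper: restart the server with MALLOC_TOP_PAD_ set.
# Backends inherit the environment of the process started by pg_ctl,
# so the variable is set for pg_ctl, not for pgbench.
#   $1 = padding in MiB, $2 = data directory (defaults to "data")
restart_with_top_pad() {
    pad_mb=$1
    datadir=${2:-data}
    MALLOC_TOP_PAD_=$(mb_to_bytes "$pad_mb") pg_ctl -D "$datadir" -l pg.log restart
}

# Show the byte value that would be exported for a 64 MiB pad.
mb_to_bytes 64

# Example call, commented out because it needs a running cluster:
# restart_with_top_pad 64 data
```

Running the benchmark loop once after a plain restart and once after restart_with_top_pad makes the two sets of tps lines directly comparable.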

Thoughts on causes:

- The increase in IndexAmRoutine size seems to push the cache structures
  past glibc’s small-heap thresholds, forcing more system allocations.
- As Tomas noted, this is fragile: even if we drop the unused options
  support proc, future extensions to the struct could trigger the same issue
  again.

Suggestions / possible directions:

1. *Short term (PG18)*:
   - If we want a low-risk change, removing the unused options support
     function may be acceptable, but I agree it feels like a temporary
     band-aid.
   - Alternatively, shipping PG18 as-is with a release note warning about
     allocator sensitivity might be the safest option.
2. *Longer term (PG19)*:
   - Explore *static allocation of IndexAmRoutine* instead of per-AM
     dynamic allocation. This should eliminate repeated malloc churn.
   - Add a micro-benchmark or regression test that stresses catalog cache
     growth and malloc behavior (similar to pgbench with many partitions), so
     allocator-driven regressions are detected earlier.
   - Consider documenting allocator tuning (MALLOC_TOP_PAD_) as a
     workaround until the structural fix lands.
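
As a starting point for the micro-benchmark idea, the sketch below shows two small helpers such a script could be built from: one that pulls the tps value out of pgbench output, and one that flags a drop larger than a chosen threshold. The function names, the threshold and the sample numbers are all hypothetical and only illustrate the shape of the check.

```shell
# Read pgbench output on stdin and print the first reported tps value.
extract_tps() {
    awk '$1 == "tps" && $2 == "=" { print $3; exit }'
}

# Exit 0 if "after" is more than limit percent below "before", else exit 1.
#   $1 = tps before, $2 = tps after, $3 = allowed drop in percent (hypothetical knob)
is_regression() {
    awk -v before="$1" -v after="$2" -v limit="$3" \
        'BEGIN { exit !((before - after) / before * 100 > limit) }'
}

# Example with made-up numbers: a drop from 1000 to 900 tps against a 5% limit.
before=$(printf 'tps = 1000.0 (without initial connection time)\n' | extract_tps)
after=$(printf 'tps = 900.0 (without initial connection time)\n' | extract_tps)
if is_regression "$before" "$after" 5; then
    echo "regression: $before -> $after tps"
fi
```

In a real script the two printf lines would be replaced by pgbench runs against the pre-patch and post-patch builds, ideally repeated several times so that run-to-run noise does not trip the threshold.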

Closing:

I don’t have a final patch proposal at this stage, but I would like to help
test any candidate fixes or prototypes. If there’s interest, I can also
contribute a self-contained benchmark script for regression testing.

Regards,
Athiyaman

#2 BharatDB
bharatdbpg@gmail.com
In reply to: BharatDB (#1)

Dear Team,

With reference to the ongoing conversation in message ID
c562dc2a-6e36-46f3-a5ea-cd42eebd7118, I am writing to express my interest
in contributing to the work on fixing the bug related to adding
skip scan (including MDAM style range skip scan) to nbtree.

I tried to replicate the performance regression reported earlier in this
thread, by running pgbench with the same setup (pgbench scale=1, 100
partitions, extra index on bid, single-count query). I built both before
skip scan (commit 3ba2cdaa454) and after skip scan (commit 92fe23d93aa)
versions, and compared the throughput:

--- BEFORE (3ba2cdaa454) ---
Mode=simple    Clients=1    tps = 23890
Mode=simple    Clients=4    tps = 82791
Mode=simple    Clients=32   tps = 129877
Mode=prepared  Clients=1    tps = 26404
Mode=prepared  Clients=4    tps = 87116
Mode=prepared  Clients=32   tps = 140881
--- AFTER (92fe23d93aa) ---
Mode=simple    Clients=1    tps = 22551
Mode=simple    Clients=4    tps = 76844
Mode=simple    Clients=32   tps = 129445
Mode=prepared  Clients=1    tps = 25880
Mode=prepared  Clients=4    tps = 84876
Mode=prepared  Clients=32   tps = 137812

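In my environment the regression is much smaller than Tomas originally
observed (*roughly 0.3–7% vs. ~50%*), but the AFTER build is slower in every
configuration. The drop is largest in simple mode at 1 and 4 clients (about
6–7%) and about 3% or less elsewhere, so it does not grow with concurrency
in these runs.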
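
To make the comparison easy to recheck, here is a small sketch that computes the relative drop for each BEFORE/AFTER pair listed above. The helper name pct_drop is hypothetical; the tps values are copied from the results in this message.

```shell
# Print the relative throughput drop from $1 (before) to $2 (after), in percent.
pct_drop() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) / before * 100 }'
}

# One line per configuration: mode, clients, drop in percent.
echo "simple   1  $(pct_drop 23890 22551)"
echo "simple   4  $(pct_drop 82791 76844)"
echo "simple   32 $(pct_drop 129877 129445)"
echo "prepared 1  $(pct_drop 26404 25880)"
echo "prepared 4  $(pct_drop 87116 84876)"
echo "prepared 32 $(pct_drop 140881 137812)"
```

Each figure comes from a single 10-second run, so small differences may be within run-to-run noise.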

This suggests that the extra malloc/free activity in the skip scan code
path is indeed introducing overhead, though the impact seems to vary
depending on glibc/memory allocator behavior.

*Proposal:*

- For PG18, a safe short-term fix could be to *remove the unused “options”
  support function*, as Peter suggested, or replace it with a lighter path
  that avoids repeated allocations.
- Longer term, we may want to *revisit skip scan memory management* (e.g.,
  static allocation, memory pool, or reducing per-call overhead) so that the
  optimization does not regress performance in micro-benchmarks.

I am currently working on these proposed methods and will continue
experimenting to provide further results and possible patches.

Regards,
Athiyaman