New access method for b-tree.

Started by Alexandre Felipe about 1 month ago · 10 messages
#1 Alexandre Felipe
alexandre.felipe@tpro.io

Hello Hackers,

Please check this out,

It is an access method to scan a table sorted by the second column of an index, filtered on the first.
Queries like

SELECT x, y FROM grid
WHERE x IN (array of Nx elements)
ORDER BY y, x
LIMIT M

can execute by streaming the rows directly from disk instead of loading everything.

Using a btree index on (x, y), a query on an N x N grid will run by fetching only what is necessary.
A skip scan will run with O(N * Nx) I/O, O(N * Nx) space, and O(N * Nx * log(N * Nx)) compute (assuming a generic in-memory sort).

The proposed access method does it in O(M + Nx) I/O, O(Nx) space, and O(M * log(Nx)) compute.
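To make the claimed O(M * log(Nx)) bound concrete, here is a minimal Python sketch of the merge idea (not from the patch; the name `merge_scan` and the lists standing in for partial index scans are illustrative): each of the Nx prefixes contributes a stream already sorted by (y, x), and a heap of Nx scan heads produces the first M rows.

```python
import heapq

def merge_scan(streams, limit):
    """k-way merge of per-prefix sorted streams; yields at most `limit` rows.

    Each stream stands in for a partial b-tree scan `WHERE x = const ORDER BY y`,
    so it yields (y, x) pairs in y order. The heap holds one head per stream:
    O(Nx) space, O(log Nx) comparisons per row produced.
    """
    iters = [iter(s) for s in streams]
    heap = []
    for i, it in enumerate(iters):          # one "index descent" per prefix
        row = next(it, None)
        if row is not None:
            heap.append((row, i))
    heapq.heapify(heap)
    out = []
    while heap and len(out) < limit:
        row, i = heapq.heappop(heap)
        out.append(row)
        nxt = next(iters[i], None)          # advance only the winning scan head
        if nxt is not None:
            heapq.heappush(heap, (nxt, i))
    return out

# Grid example: prefixes x in {1, 3, 5}, suffix y = 1..1000, ORDER BY y, x LIMIT 4.
streams = [[(y, x) for y in range(1, 1001)] for x in (1, 3, 5)]
print(merge_scan(streams, 4))   # [(1, 1), (1, 3), (1, 5), (2, 1)]
```

Only limit + Nx stream elements are ever touched, which is where the O(M + Nx) I/O figure comes from under these assumptions.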

Kind Regards,

Alexandre Felipe

Research & Development Engineer

Attachments:

btree_merge_rebased.patch (application/octet-stream) +1689-0
#2 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Alexandre Felipe (#1)
Re: New access method for b-tree.

Hello Felipe,

On 2/1/26 11:02, Alexandre Felipe wrote:

Hello Hackers,

Please check this out,

It is an access method to scan a table sorting by the second column of
an index, filtered on the first.
Queries like
SELECT x, y FROM grid
WHERE x in (array of Nx elements)
ORDER BY y, x
LIMIT M

Can execute streaming the rows directly from disk instead of loading
everything.

Using  btree index on (x, y)

On a grid with N x N will run by fetching only what is necessary
A skip scan will run with O(N * Nx) I/O, O(N * Nx) space, O(N * Nx *
log(N * Nx)) compute (assuming a generic in-memory sort)

The proposed access method does it O(M + Nx) I/O, O(Nx) space, and O(M *
log(Nx)) compute.

So how does this compare to skip scan in practice? It's hard to compare,
as the patch does not implement an actual access path, but I tried this:

CREATE TABLE merge_test_int (
prefix_col int4,
suffix_col int4
);

INSERT INTO merge_test_int
SELECT p, s
FROM generate_series(1, 10000) AS p,
generate_series(1, 1000) AS s;
CREATE INDEX merge_test_int_idx
ON merge_test_int (prefix_col, suffix_col);

and then

1) master

SELECT * FROM merge_test_int
WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
AND suffix_col >= 900
ORDER BY suffix_col LIMIT 100;

vs.

2) merge scan

SELECT * FROM test_btree_merge_scan_int(
  'merge_test_int',
  'merge_test_int_idx',
  ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
  900,
  100);

And with explain analyze we get this:

1) Buffers: shared hit=26 read=25

2) Buffers: shared hit=143 read=17

So it seems to access many more buffers, even if the number of reads is
lower. Presumably the merge scan is not always better than skip scan,
probably depending on number of prefixes in the query etc. What is the
cost model to decide between those two?

If you had to construct the best case and worst cases (vs. skip scan),
what would that look like?

I'm also wondering how common the targeted query pattern is. How common
is it to have an IN condition on the leading column of an index, and
ORDER BY on the second one?

regards

--
Tomas Vondra

#3 Matthias van de Meent
boekewurm+postgres@gmail.com
In reply to: Tomas Vondra (#2)
Re: New access method for b-tree.

On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas@vondra.me> wrote:

Hello Felipe,

On 2/1/26 11:02, Alexandre Felipe wrote:

Hello Hackers,

Please check this out,

It is an access method to scan a table sorting by the second column of
an index, filtered on the first.
Queries like
SELECT x, y FROM grid
WHERE x in (array of Nx elements)
ORDER BY y, x
LIMIT M

Can execute streaming the rows directly from disk instead of loading
everything.

+1 for the idea, it does sound interesting. I haven't looked in depth
at the patch, so no comments on the execution yet.

So how does this compare to skip scan in practice? It's hard to compare,
as the patch does not implement an actual access path, but I tried this:

[...]

1) master

SELECT * FROM merge_test_int
WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)

[...]

2) merge scan

SELECT * FROM test_btree_merge_scan_int(

[...]

ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],

[...]

And with explain analyze we get this:

1) Buffers: shared hit=26 read=25
2) Buffers: shared hit=143 read=17

(FYI; your first query was missing "2" from its IN list while it was
present in the merge scan input; this makes the difference worse by a
few pages)

So it seems to access many more buffers, even if the number of reads is
lower. Presumably the merge scan is not always better than skip scan,
probably depending on number of prefixes in the query etc. What is the
cost model to decide between those two?

Skip scan always returns data in index order, while this merge scan
would return tuples in suffix order. The cost model would thus weigh
the cost of sorting the result of an index skipscan against the cost
of doing a merge join on n_in_list_items distinct (partial) index
scans.

As for when you would benefit in buffers accessed: The merge scan
would mainly benefit in number of buffers accessed when the selected
prefix values are non-sequential, and the prefixes cover multiple
pages at a time, and when there is a LIMIT clause on the scan. Normal
btree index skip scan infrastructure efficiently prevents new index
descents into the index when the selected SAOP key ranges are directly
adjacent, while merge scan would generally do at least one index
descent for each of its N scan heads (*) - which in the proposed
prototype patch guarantees O(index depth * num scan heads) buffer
accesses.

(*) It is theoretically possible to reuse an earlier index descent if
the SAOP entry's key range of the last descent starts and ends on the
leaf page that the next SAOP entry's key range also starts on
(applying the ideas of 5bf748b86b to this new multi-headed index scan
mode), but that infrastructure doesn't seem to be in place in the
current patch. That commit is also why your buffer access count for
master is so low compared to the merge scan's; if your chosen list of
numbers was multiples of 5 (so that matching tuples are not all
sequential) you'd probably see much more comparable buffer access
counts.

If you had to construct the best case and worst cases (vs. skip scan),
what would that look like?

Presumably the best case would be:

-- mytable.a has very few distinct values (e.g. bool or enum);
mytable.b many distinct values (e.g. uuid)
SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;

which the index's merge scan would turn into an index scan that
behaves similar to the following, possibly with the merge join pushed
down into the index:

SELECT * FROM (
SELECT ... FROM mytable WHERE a = 1
UNION
SELECT ... FROM mytable WHERE a = 2
) ORDER BY b.

The worst case would be the opposite:

-- mytable.a has many distinct values (random uuid); mytable.b few
(e.g. boolean; enum)
SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b

As the merge scan maintains one internal indexscan head per SAOP array
element, it'd have significant in-memory and scan startup overhead,
while few values are produced for each of those scan heads.

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?

I'm not sure, but it seems like it might be common/useful in
queue-like access patterns:

With an index on (state, updated_timestamp) you're probably interested
in all messages in just a subset of states, ordered by recent state
transitions. An index on (updated_timestamp, state) might be
considered more optimal, but won't be able to efficiently serve
queries that only want data on uncommon states: The leaf pages would
mainly contain data on common states, reducing the value of those leaf
pages.

Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query
using UNION, or add an index for each conceivable IN list, but it'd be
great if the user didn't have to rewrite their query or create
n_combinations indexes with their respective space usage to get this
more efficient query execution.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

#4 Ants Aasma
ants.aasma@cybertec.at
In reply to: Tomas Vondra (#2)
Re: New access method for b-tree.

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size this is much slower
than what is proposed here (assuming I understand it correctly) -
effectively equivalent to running N index scans through Merge Append.

The workarounds I have proposed to users have been either to rewrite the
query as a UNION ALL of single-value prefix queries, each wrapped in an
ORDER BY ... LIMIT. This gives exactly the needed Merge Append plan
shape. But repeating the query N times can get unwieldy when the
number of values grows, so the fallback is:

SELECT * FROM unnest(:friends) id, LATERAL (
SELECT * FROM posts
WHERE user_id = id
ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;

The downside of this formulation is that we still have to fetch a
batch worth of items from scans where we otherwise would have only had
to look at one index tuple.

The main problem I can see is that at planning time the cardinality of
the prefix array might not be known, and in theory could be in the
millions. Having millions of index scans open at the same time is not
viable, so the method needs to somehow degrade gracefully. The idea I
had is to pick some limit, based on work_mem and/or benchmarking, and
once the limit is hit, populate the first batch and then run the next
batch of index scans, merging with the first result. Or something like
that, I can imagine a few different ways to handle it with different
tradeoffs.
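A hypothetical Python sketch of that graceful degradation (`batched_merge` and `max_heads` are illustrative names, not anything in the patch): open at most `max_heads` scans at a time, merge each batch, and fold its top rows into the running result. Truncating to the limit between batches is safe because the kept rows are always the global top rows among all streams seen so far.

```python
import heapq

def batched_merge(streams, limit, max_heads):
    """Merge arbitrarily many sorted streams while keeping at most
    `max_heads` of them open at once: merge each batch, keep the top
    `limit` rows, and fold the next batch into that partial result."""
    result = []                            # sorted top-`limit` so far
    for start in range(0, len(streams), max_heads):
        batch = streams[start:start + max_heads]
        # the previous partial result participates as one more sorted input
        merged = heapq.merge(result, *batch)
        result = [row for _, row in zip(range(limit), merged)]
    return result

# 1000 prefixes, but never more than 64 scan heads open at a time.
streams = [[(y, x) for y in range(1, 101)] for x in range(1, 1001)]
top = batched_merge(streams, limit=5, max_heads=64)
print(top)   # [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5)]
```

The tradeoff is exactly the one described above: later batches still pay a per-stream startup cost, but peak memory is bounded by `max_heads` plus the `limit`-sized partial result.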

I can imagine that this would really nicely benefit from ReadStream'ification.

One other connection I see is with block nested loops. In a perfect
future PostgreSQL could run the following as a set of merged index
scans that terminate early:

SELECT posts.*
FROM follows f
JOIN posts p ON f.followed_id = p.user_id
WHERE f.follower_id = :userid
ORDER BY p.tstamp DESC LIMIT 100;

In practice this is not a huge issue - it's not that hard to transform
this to array_agg and = ANY subqueries.

Regards,
Ants Aasma

#5 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Matthias van de Meent (#3)
Re: New access method for b-tree.

On 2/3/26 17:01, Matthias van de Meent wrote:

On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas@vondra.me> wrote:

Hello Felipe,

On 2/1/26 11:02, Alexandre Felipe wrote:

Hello Hackers,

Please check this out,

It is an access method to scan a table sorting by the second column of
an index, filtered on the first.
Queries like
SELECT x, y FROM grid
WHERE x in (array of Nx elements)
ORDER BY y, x
LIMIT M

Can execute streaming the rows directly from disk instead of loading
everything.

+1 for the idea, it does sound interesting. I haven't looked in depth
at the patch, so no comments on the execution yet.

So how does this compare to skip scan in practice? It's hard to compare,
as the patch does not implement an actual access path, but I tried this:

[...]

1) master

SELECT * FROM merge_test_int
WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)

[...]

2) merge scan

SELECT * FROM test_btree_merge_scan_int(

[...]

ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],

[...]

And with explain analyze we get this:

1) Buffers: shared hit=26 read=25
2) Buffers: shared hit=143 read=17

(FYI; your first query was missing "2" from its IN list while it was
present in the merge scan input; this makes the difference worse by a
few pages)

So it seems to access many more buffers, even if the number of reads is
lower. Presumably the merge scan is not always better than skip scan,
probably depending on number of prefixes in the query etc. What is the
cost model to decide between those two?

Skip scan always returns data in index order, while this merge scan
would return tuples in suffix order. The cost model would thus weigh
the cost of sorting the result of an index skipscan against the cost
of doing a merge join on n_in_list_items distinct (partial) index
scans.

Makes sense.

As for when you would benefit in buffers accessed: The merge scan
would mainly benefit in number of buffers accessed when the selected
prefix values are non-sequential, and the prefixes cover multiple
pages at a time, and when there is a LIMIT clause on the scan. Normal
btree index skip scan infrastructure efficiently prevents new index
descents into the index when the selected SAOP key ranges are directly
adjacent, while merge scan would generally do at least one index
descent for each of its N scan heads (*) - which in the proposed
prototype patch guarantees O(index depth * num scan heads) buffer
accesses.

Do we have sufficient information to reliably make the right decision?
Can we actually cost the two cases well enough?

(*) It is theoretically possible to reuse an earlier index descent if
the SAOP entry's key range of the last descent starts and ends on the
leaf page that the next SAOP entry's key range also starts on
(applying the ideas of 5bf748b86b to this new multi-headed index scan
mode), but that infrastructure doesn't seem to be in place in the
current patch. That commit is also why your buffer access count for
master is so low compared to the merge scan's; if your chosen list of
numbers was multiples of 5 (so that matching tuples are not all
sequential) you'd probably see much more comparable buffer access
counts.

If you had to construct the best case and worst cases (vs. skip scan),
what would that look like?

Presumably the best case would be:

-- mytable.a has very few distinct values (e.g. bool or enum);
mytable.b many distinct values (e.g. uuid)
SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;

which the index's merge scan would turn into an index scan that
behaves similar to the following, possibly with the merge join pushed
down into the index:

SELECT * FROM (
SELECT ... FROM mytable WHERE a = 1
UNION
SELECT ... FROM mytable WHERE a = 2
) ORDER BY b.

The worst case would be the opposite:

-- mytable.a has many distinct values (random uuid); mytable.b few
(e.g. boolean; enum)
SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b

As the merge scan maintains one internal indexscan head per SAOP array
element, it'd have significant in-memory and scan startup overhead,
while few values are produced for each of those scan heads.

OK. It'll be interesting to see how this performs in practice for the
whole gamut between the best and worst case.

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?

I'm not sure, but it seems like it might be common/useful in
queue-like access patterns:

With an index on (state, updated_timestamp) you're probably interested
in all messages in just a subset of states, ordered by recent state
transitions. An index on (updated_timestamp, state) might be
considered more optimal, but won't be able to efficiently serve
queries that only want data on uncommon states: The leaf pages would
mainly contain data on common states, reducing the value of those leaf
pages.

Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query
using UNION, or add an index for each conceivable IN list, but it'd be
great if the user didn't have to rewrite their query or create
n_combinations indexes with their respective space usage to get this
more efficient query execution.

I think the examples presented by Ants (with timeline view) are quite
plausible in practice.

regards

--
Tomas Vondra

#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Ants Aasma (#4)
Re: New access method for b-tree.

On 2/3/26 22:42, Ants Aasma wrote:

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size this is much slower
than what is proposed here (assuming I understand it correctly) -
effectively equivalent to running N index scans through Merge Append.

Makes sense. I guess filtering products by category + order by price
could also produce this query pattern.

My workarounds I have proposed users have been either to rewrite the
query as a UNION ALL of a set of single value prefix queries wrapped
in an order by limit. This gives the exact needed merge append plan
shape. But repeating the query N times can get unwieldy when the
number of values grows, so the fallback is:

SELECT * FROM unnest(:friends) id, LATERAL (
SELECT * FROM posts
WHERE user_id = id
ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;

The downside of this formulation is that we still have to fetch a
batch worth of items from scans where we otherwise would have only had
to look at one index tuple.

True. It's useful to think about the query this way, and it may be
better than full select + sort, but it has issues too.

The main problem I can see is that at planning time the cardinality of
the prefix array might not be known, and in theory could be in the
millions. Having millions of index scans open at the same time is not
viable, so the method needs to somehow degrade gracefully. The idea I
had is to pick some limit, based on work_mem and/or benchmarking, and
once the limit is hit, populate the first batch and then run the next
batch of index scans, merging with the first result. Or something like
that, I can imagine a few different ways to handle it with different
tradeoffs.

Doesn't the proposed merge scan have a similar issue? Because that will
also have to keep all the index scans open (even if only internally).
Indeed, it needs to degrade gracefully, in some way. I'm afraid the
proposed batched execution will be rather complex, so I'd say v1 should
simply have a threshold, and do the full scan + sort for more items.

I can imagine that this would really nicely benefit from ReadStream'ification.

Not sure, maybe.

One other connection I see is with block nested loops. In a perfect
future PostgreSQL could run the following as a set of merged index
scans that terminate early:

SELECT posts.*
FROM follows f
JOIN posts p ON f.followed_id = p.user_id
WHERE f.follower_id = :userid
ORDER BY p.tstamp DESC LIMIT 100;

In practice this is not a huge issue - it's not that hard to transform
this to array_agg and = ANY subqueries.

Automating that transformation seems quite non-trivial (to me).

regards

--
Tomas Vondra

#7 Michał Kłeczek
michal@kleczek.org
In reply to: Ants Aasma (#4)
Re: New access method for b-tree.

On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size this is much slower
than what is proposed here (assuming I understand it correctly) -
effectively equivalent to running N index scans through Merge Append.

My workarounds I have proposed users have been either to rewrite the
query as a UNION ALL of a set of single value prefix queries wrapped
in an order by limit. This gives the exact needed merge append plan
shape. But repeating the query N times can get unwieldy when the
number of values grows, so the fallback is:

SELECT * FROM unnest(:friends) id, LATERAL (
SELECT * FROM posts
WHERE user_id = id
ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;

The downside of this formulation is that we still have to fetch a
batch worth of items from scans where we otherwise would have only had
to look at one index tuple.

GiST can be used to handle this kind of query, as it supports multiple sort orders.
The only problem is that GiST does not support ORDER BY column.
One possible workaround is [1], but as described there it does not play well with partitioning.
I’ve started drafting support for ORDER BY column in GiST - see [2].
I think it would be easier to implement and maintain than a new IAM (but I don’t have enough knowledge and experience to implement it myself).

[1]: /messages/by-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F@kleczek.org
[2]: /messages/by-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91@kleczek.org


Kind regards,
Michal

#8 Alexandre Felipe
o.alexandre.felipe@gmail.com
In reply to: Michał Kłeczek (#7)
Re: New access method for b-tree.

Thank you for looking into this.

Now we can execute a (still narrow) family of queries!

Maybe it helps to see this as a *social network feed*. Imagine a social
network: you have a few friends, or follow a few people, and you want to
see their updates ordered by date. For each user we have a different
combination of users whose posts we have to display. But maybe, even having
hundreds of users, we will only show the first 10.

There is low-hanging fruit on the skip scan: if we need M rows in total,
and one group has already produced M rows, we could stop scanning that group.
If Nx is the number of friends, and M is the number of posts to show,
this runs with complexity (Nx * M) rows, followed by an (Nx * M) sort,
instead of (Nx * N) rows followed by an (Nx * N) sort.
Where M = 10 and N is 1000 this is a significant improvement.
But if M ~ N, compare that with the merge scan, which runs with M + Nx
row accesses and (M + Nx) heap operations.
If everything is on the same page the skip scan would win.
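A rough Python sketch of that low-hanging fruit, assuming per-group streams already sorted by suffix (`capped_skip_scan` is an illustrative name, not part of the patch): since rows past the M-th within a group can never make the global top M, each group's fetch can stop at M rows before the final top-M sort.

```python
import heapq
from itertools import islice

def capped_skip_scan(groups, m):
    """For each prefix group (sorted by suffix), fetch at most `m` rows --
    rows past the m-th in a group can never reach the global top m -- then
    do a top-m sort over the Nx * m survivors instead of Nx * N rows."""
    fetched = []
    for g in groups:
        fetched.extend(islice(g, m))       # stop after m rows per group
    return heapq.nsmallest(m, fetched)     # top-m sort

# Nx = 4 friends, N = 1000 posts each, show the m = 3 newest (smallest key).
groups = [[(y, x) for y in range(1, 1001)] for x in (2, 3, 5, 8)]
print(capped_skip_scan(groups, 3))   # [(1, 2), (1, 3), (1, 5)]
```

This keeps the skip scan's access pattern but shrinks the sort input from Nx * N to at most Nx * M rows.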

The cost estimation is probably far off.
I am also not considering the filters applied after this operator, and I
don't know if the planner infrastructure is able to adjust it by itself.
This is where I would like reviewers' feedback. I think that the planner
costs are something to be determined experimentally.

Next I will make it slightly more general handling
* More index columns: Index (a, b, s...) could support WHERE a IN (...)
ORDER BY b LIMIT N (ignoring s...)
* Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
* Non-leading prefix: WHERE b IN (...) AND a = const ORDER BY c on index
(a, b, c)

---
Kind Regards,
Alexandre

On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <michal@kleczek.org> wrote:


On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size this is much slower
than what is proposed here (assuming I understand it correctly) -
effectively equivalent to running N index scans through Merge Append.

My workarounds I have proposed users have been either to rewrite the
query as a UNION ALL of a set of single value prefix queries wrapped
in an order by limit. This gives the exact needed merge append plan
shape. But repeating the query N times can get unwieldy when the
number of values grows, so the fallback is:

SELECT * FROM unnest(:friends) id, LATERAL (
SELECT * FROM posts
WHERE user_id = id
ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;

The downside of this formulation is that we still have to fetch a
batch worth of items from scans where we otherwise would have only had
to look at one index tuple.

GIST can be used to handle this kind of queries as it supports multiple
sort orders.
The only problem is that GIST does not support ORDER BY column.
One possible workaround is [1] but as described there it does not play
well with partitioning.
I’ve started drafting support for ORDER BY column in GIST - see [2].
I think it would be easier to implement and maintain than a new IAM (but I
don’t have enough knowledge and experience to implement it myself)

[1]
/messages/by-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F@kleczek.org
[2]
/messages/by-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91@kleczek.org


Kind regards,
Michal

Attachments:

0002-MERGE-SCAN-Access-method.patch (application/octet-stream) +1476-1
0003-MERGE-SCAN-Planner-integration.patch (application/octet-stream) +437-7
0001-MERGE-SCAN-Test-the-baseline.patch (application/octet-stream) +213-1
#9 Alexandre Felipe
o.alexandre.felipe@gmail.com
In reply to: Alexandre Felipe (#8)
Re: New access method for b-tree.

Hello again hackers!

+pt@bowt.ie: probably the person most familiar with the index scan code
(based on the commits).
+michael@paquier.xyz, +tgl@sss.pgh.pa.us, +peter@eisentraut.org as the
top 3 committers to nbtree over the last ~6 months.

I have made substantial progress on adding a few features. I have
questions, but I will let you go first :)

Motivation:
*In technical terms:* this proposal is to take advantage of a btree index
when the query is filtered by a few distinct prefixes and ordered by a
suffix and has a limit.
*In non-technical terms:* this could help to efficiently render a social network
feed, where each user can select a list of users whose posts they want to
see, and the posts must be ordered from newest to oldest.

*Performance Comparison*
I did a test with a toy table, please find more details below.

With limit 100

| Method | Shared Hit | Shared Read | Exec Time |
|------------|-----------:|------------:|----------:|
| Merge | 13 | 119 | 13 ms |
| IndexScan | 15,308 | 525,310 | 3,409 ms |

With limit 1,000,000

| Method     | Shared Hit | Shared Read | Temp Read | Temp Written | Exec Time |
|------------|-----------:|---------:|-------:|-------:|----------:|
| Merge | 980,318 | 19,721 | 0 | 0 | 2,128 ms |
| Sequential | 15,208 | 525,410 | 20,207 | 35,384 | 3,762 ms |
| Bitmap | 629 | 113,759 | 20,207 | 35,385 | 5,487 ms |
| IndexScan | 7,880,619 | 126,706 | 20,945 | 35,386 | 5,874 ms |

Sequential scans and bitmap scans in this case significantly reduce the
number of accessed buffers, because the table has only four integer
columns and these methods can read all the rows on a given page at a time.

However, that comes at the cost of resorting to an on-disk sort method.
For the query with limit 100 we get no temp files, as we are using a
top-100 sort.

`make check` passes.

*Experiment details*

Consider a 100M-row table formed by (a, b, c, d) in 100 x 100 x 100 x 100:

```sql
CREATE TABLE grid AS (
SELECT a, b, c, d FROM
generate_series(1, 100) AS a,
generate_series(1, 100) AS b,
generate_series(1, 100) AS c,
generate_series(1, 100) AS d
);

CREATE INDEX grid_a_b_c_idx ON grid (a, b, c);
ANALYSE grid;
```

Now let's say that we need to find a certain number of rows filtered by a and
ordered by b:
```sql
PREPARE grid_query(int) AS
SELECT sum(d) FROM (
SELECT * FROM grid
WHERE a IN (2,3,5,8,13,21,34,55) AND b >= 0
ORDER BY b
LIMIT $1) t;
```

---

Now with limit 100, with index merge scan (notice Index Prefixes in the
plan).

```sql
SET enable_indexmergescan = on;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
```

```text
Buffers: shared hit=13 read=119
-> Limit (cost=0.57..87.29 rows=100 width=16) (actual
time=5.528..12.999 rows=100.00 loops=1)
Buffers: shared hit=13 read=119
-> Index Scan using grid_a_b_c_idx on grid (cost=0.57..93.36
rows=107 width=16) (actual time=5.528..12.994 rows=100.00 loops=1)
Index Cond: (b >= 0)
Index Prefixes: (a = ANY
('{2,3,5,8,13,21,34,55}'::integer[]))
Index Searches: 8
Buffers: shared hit=13 read=119
Planning:
Buffers: shared hit=59 read=23
Planning Time: 4.619 ms
Execution Time: 13.055 ms
```

```sql
SET enable_indexmergescan = off;
EXPLAIN (ANALYSE) EXECUTE grid_query(100);
```

```text
Aggregate (cost=1603588.06..1603588.07 rows=1 width=8) (actual
time=3406.624..3408.710 rows=1.00 loops=1)
Buffers: shared hit=15308 read=525310
-> Limit (cost=1603575.17..1603586.81 rows=100 width=16) (actual
time=3406.601..3408.702 rows=100.00 loops=1)
Buffers: shared hit=15308 read=525310
-> Gather Merge (cost=1603575.17..2514342.92 rows=7819999
width=16) (actual time=3406.598..3408.695 rows=100.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15308 read=525310
-> Sort (cost=1602575.14..1610720.98 rows=3258333
width=16) (actual time=3393.782..3393.784 rows=100.00 loops=3)
Sort Key: grid.b
Sort Method: top-N heapsort Memory: 32kB
Buffers: shared hit=15308 read=525310
Worker 0: Sort Method: top-N heapsort Memory: 32kB
Worker 1: Sort Method: top-N heapsort Memory: 32kB
-> Parallel Seq Scan on grid
(cost=0.00..1478044.00 rows=3258333 width=16) (actual time=0.944..3129.896
rows=2666666.67 loops=3)
Filter: ((b >= 0) AND (a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])))
Rows Removed by Filter: 30666667
Buffers: shared hit=15234 read=525310
Planning Time: 0.370 ms
Execution Time: 3409.134 ms
```

Now queries with limit 1,000,000

```sql
SET enable_indexmergescan = on;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```

Query executed with the proposed access method. Notice in the plan Index
Prefixes and Index Cond.
```text
Buffers: shared hit=980318 read=19721
-> Limit (cost=0.57..867259.84 rows=1000000 width=16) (actual
time=2.854..2103.438 rows=1000000.00 loops=1)
Buffers: shared hit=980318 read=19721
-> Index Scan using grid_a_b_c_idx on grid (cost=0.57..867265.91
rows=1000007 width=16) (actual time=2.852..2066.205 rows=1000000.00 loops=1)
Index Cond: (b >= 0)
Index Prefixes: (a = ANY
('{2,3,5,8,13,21,34,55}'::integer[]))
Index Searches: 8
Buffers: shared hit=980318 read=19721
Planning Time: 0.328 ms
Execution Time: 2127.811 ms
```

If we disable enable_indexmergescan, we naturally fall back to a sequential scan.

```sql
SET enable_indexmergescan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
Buffers: shared hit=15208 read=525410, temp read=20207 written=35384
-> Limit (cost=1942895.64..2059362.12 rows=1000000 width=16) (actual
time=3467.012..3712.044 rows=1000000.00 loops=1)
Buffers: shared hit=15208 read=525410, temp read=20207
written=35384
-> Gather Merge (cost=1942895.64..2853663.39 rows=7819999
width=16) (actual time=3467.010..3671.220 rows=1000000.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=15208 read=525410, temp read=20207
written=35384
-> Sort (cost=1941895.62..1950041.45 rows=3258333
width=16) (actual time=3455.852..3476.358 rows=334576.33 loops=3)
Sort Key: grid.b
Sort Method: *external merge Disk: 47016kB*
Buffers: shared hit=15208 read=525410, temp read=20207
written=35384
Worker 0: Sort Method: external merge Disk: 46976kB
Worker 1: Sort Method: external merge Disk: 47000kB
-> *Parallel Seq Scan* on grid
(cost=0.00..1478044.00 rows=3258333 width=16) (actual time=2.789..2779.483
rows=2666666.67 loops=3)
Filter: ((b >= 0) AND (a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])))
Rows Removed by Filter: 30666667
Buffers: shared hit=15134 read=525410
Planning Time: 0.332 ms
Execution Time: 3761.866 ms
```

If we disable sequential scans, then we get a bitmap scan.

```sql
SET enable_seqscan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
Buffers: shared hit=629 read=113759 written=2, temp read=20207
written=35385
-> Limit (cost=1998199.78..2114666.26 rows=1000000 width=16) (actual
time=5170.456..5453.433 rows=1000000.00 loops=1)
Buffers: shared hit=629 read=113759 written=2, temp read=20207
written=35385
-> Gather Merge (cost=1998199.78..2908967.53 rows=7819999
width=16) (actual time=5170.455..5413.254 rows=1000000.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=629 read=113759 written=2, temp
read=20207 written=35385
-> Sort (cost=1997199.75..2005345.59 rows=3258333
width=16) (actual time=5156.929..5177.507 rows=334500.67 loops=3)
Sort Key: grid.b
Sort Method: external merge Disk: 47032kB
Buffers: shared hit=629 read=113759 written=2, temp
read=20207 written=35385
Worker 0: Sort Method: external merge Disk: 47280kB
Worker 1: Sort Method: external merge Disk: 46680kB
-> Parallel Bitmap Heap Scan on grid
(cost=107691.54..1533348.13 rows=3258333 width=16) (actual
time=299.891..4489.787 rows=2666666.67 loops=3)
Recheck Cond: ((a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
Rows *Removed by Index Recheck*: 2410242
Heap Blocks: exact=13100 lossy=22639
Buffers: shared hit=615 read=113759 written=2
Worker 0: Heap Blocks: exact=13077 lossy=22755
Worker 1: Heap Blocks: exact=13036 lossy=22421
-> *Bitmap Index Scan* on grid_a_b_c_idx
(cost=0.00..105736.54 rows=7820000 width=0) (actual time=297.651..297.651
rows=8000000.00 loops=1)
Index Cond: ((a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
Index Searches: 7
Buffers: shared hit=13 read=7293 written=2
Planning Time: 0.165 ms
Execution Time: 5487.213 ms
```

If we disable bitmap scans, we finally get an index scan.

```sql
SET enable_bitmapscan = off;
EXPLAIN ANALYSE EXECUTE grid_query(1000000);
```
```text
Buffers: shared hit=7883221 read=124111, temp read=20699 written=35385
-> Limit (cost=7201203.08..7317669.55 rows=1000000 width=16) (actual
time=4414.478..4674.400 rows=1000000.00 loops=1)
Buffers: shared hit=7883221 read=124111, temp read=20699
written=35385
-> Gather Merge (cost=7201203.08..8111970.83 rows=7819999
width=16) (actual time=4414.476..4633.982 rows=1000000.00 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=7883221 read=124111, temp read=20699
written=35385
-> Sort (cost=7200203.05..7208348.88 rows=3258333
width=16) (actual time=4390.625..4411.896 rows=334567.00 loops=3)
Sort Key: grid.b
Sort Method: *external merge Disk: 47304kB*
Buffers: shared hit=7883221 read=124111, temp
read=20699 written=35385
Worker 0: Sort Method: external merge Disk: 47304kB
Worker 1: Sort Method: external merge Disk: 46384kB
-> *Parallel Index Scan* using grid_a_b_c_idx on grid
(cost=0.57..6736351.43 rows=3258333 width=16) (actual
time=46.925..3796.915 rows=2666666.67 loops=3)
Index Cond: ((a = ANY
('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >= 0))
Index Searches: 7
Buffers: shared hit=7883208 read=124110
Planning Time: 0.385 ms
Execution Time: 4713.325 ms
```

On Thu, Feb 5, 2026 at 6:59 AM Alexandre Felipe <
o.alexandre.felipe@gmail.com> wrote:

Thank you for looking into this.

Now we can execute a, still narrow, family of queries!

Maybe it helps to see this as a *social network feed*. Imagine a social
network: you have a few friends, or follow a few people, and you want to
see their updates ordered by date. For each user we have a different
combination of users whose posts we have to display. But maybe, even
following hundreds of users, we will only show the first 10 posts.

There is a low-hanging fruit in the skip scan: if we need M rows in
total, then once any single group has already produced M rows we can stop
scanning that group. Let Nx be the number of friends, M the number of
posts to show, and N the number of posts per friend. With that shortcut
the skip scan reads O(Nx * M) rows followed by an O(Nx * M) sort, instead
of O(Nx * N) rows followed by an O(Nx * N) sort.
Where M = 10 and N = 1000 this is a significant improvement.
But if M ~ N, the merge scan still wins: it runs with O(M + Nx) row
accesses and O(M + Nx) heap operations.
Only if everything is on the same page would the skip scan win.
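To make the complexity argument concrete, here is a toy sketch (plain
Python, not taken from the patch) of the streaming k-way merge the
proposed access method performs: one sorted cursor per prefix value, a
heap of size Nx, and early termination after M rows. The function name
merge_scan and the example prefixes are illustrative only.

```python
import heapq

def merge_scan(streams, limit):
    """Merge pre-sorted per-prefix streams, stopping after `limit` rows.

    `streams` maps each prefix value to an iterator yielding suffix
    values in ascending order (standing in for one B-tree descent per
    prefix). Space is O(Nx); compute is O(limit * log Nx).
    """
    heap = []
    for prefix, it in streams.items():
        first = next(it, None)          # one "index search" per prefix
        if first is not None:
            heapq.heappush(heap, (first, prefix, it))
    out = []
    while heap and len(out) < limit:
        suffix, prefix, it = heapq.heappop(heap)
        out.append((prefix, suffix))    # rows emitted in (suffix, prefix) order
        nxt = next(it, None)            # advance only the winning cursor
        if nxt is not None:
            heapq.heappush(heap, (nxt, prefix, it))
    return out

# Grid-like example: prefixes 2, 3, 5, each with suffixes 1..100.
rows = merge_scan({p: iter(range(1, 101)) for p in (2, 3, 5)}, limit=4)
# rows == [(2, 1), (3, 1), (5, 1), (2, 2)]
```

Only the cursor that produced the smallest suffix is advanced on each
step, which is where the O(M log Nx) compute and O(Nx) space bounds come
from.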

The cost estimation is probably far off.
I am also not considering the filters applied after this operator, and I
don't know if the planner infrastructure is able to adjust for that by
itself. This is where I would like reviewers' feedback; I think the
planner costs are something to be determined experimentally.

Next I will make it slightly more general, handling:
* More index columns: an index on (a, b, s...) could support WHERE a IN (...)
ORDER BY b LIMIT N (ignoring s...)
* Multi-column prefixes: WHERE (a, b) IN (...) ORDER BY c
* Non-leading prefixes: WHERE b IN (...) AND a = const ORDER BY c on an
index (a, b, c)

---
Kind Regards,
Alexandre

On Wed, Feb 4, 2026 at 7:13 AM Michał Kłeczek <michal@kleczek.org> wrote:

On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:

On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:

I'm also wondering how common is the targeted query pattern? How common
it is to have an IN condition on the leading column in an index, and
ORDER BY on the second one?

I have seen this pattern multiple times. My nickname for it is the
timeline view. Think of the social media timeline, showing posts from
all followed accounts in timestamp order, returned in reasonably sized
batches. The naive SQL query will have to scan all posts from all
followed accounts and pass them through a top-N sort. When the total
number of posts is much larger than the batch size this is much slower
than what is proposed here (assuming I understand it correctly) -
effectively equivalent to running N index scans through Merge Append.

The workarounds I have proposed to users have been either to rewrite the
query as a UNION ALL of single-value prefix queries, each wrapped in an
ORDER BY plus LIMIT, which gives exactly the merge append plan shape
needed. But repeating the query N times can get unwieldy when the number
of values grows, so the fallback is:

SELECT * FROM unnest(:friends) id, LATERAL (
SELECT * FROM posts
WHERE user_id = id
ORDER BY tstamp DESC LIMIT 100)
ORDER BY tstamp DESC LIMIT 100;

The downside of this formulation is that we still have to fetch a
batch worth of items from scans where we otherwise would have only had
to look at one index tuple.

GiST can be used to handle this kind of query, as it supports multiple
sort orders.
The only problem is that GiST does not support ORDER BY column.
One possible workaround is [1], but as described there it does not play
well with partitioning.
I have started drafting support for ORDER BY column in GiST; see [2].
I think it would be easier to implement and maintain than a new IAM (but
I don't have enough knowledge and experience to implement it myself).

[1]
/messages/by-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F@kleczek.org
[2]
/messages/by-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91@kleczek.org


Kind regards,
Michal

Attachments:

0003-MERGE-SCAN-Planner-integration.patchapplication/octet-stream; name=0003-MERGE-SCAN-Planner-integration.patchDownload+437-7
0004-MERGE-SCAN-Multi-column.patchapplication/octet-stream; name=0004-MERGE-SCAN-Multi-column.patchDownload+1058-341
0001-MERGE-SCAN-Test-the-baseline.patchapplication/octet-stream; name=0001-MERGE-SCAN-Test-the-baseline.patchDownload+213-1
0002-MERGE-SCAN-Access-method.patchapplication/octet-stream; name=0002-MERGE-SCAN-Access-method.patchDownload+1476-1
#10Alexandre Felipe
o.alexandre.felipe@gmail.com
In reply to: Alexandre Felipe (#9)
Re: New access method for b-tree.

Hi Hackers,

Do you think that MERGE-SCAN was a terrible name? I wanted a name that
wouldn't require much explanation. I named it like this because it relies
on a k-way merge to combine several segments of an index into one result.
But we already have a MERGE statement, and even in the example plan above
we can see an external merge sort that has nothing to do with the new
feature. Now that I am working on joins, I started doing it in the
NestedLoop, trying to follow the same conditions that lead to a Memoize.
But I added so many fields to the NestedLoop state that I think it is
better to have a separate structure, and maybe a separate node, and
MergeScan of course is taken, hehe. I was thinking of IndexPrefixMerge.
We could use Ants' nickname, TimelineScan, but of course it is not
limited to timelines (even though realistically that will probably be
the most common use of this). Another name I considered was
TransposedIndexScan, because it orders output on (suffix, prefix)
instead of (prefix, suffix).

On Fri, Feb 6, 2026 at 10:52 AM Alexandre Felipe <
o.alexandre.felipe@gmail.com> wrote:
