Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

ashutosh.bapat@enterprisedb.com

6 months ago

In reply to: Vivek Gadge (#1)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Mon, Sep 8, 2025 at 4:01 AM Vivek Gadge <vvkgadge56@gmail.com> wrote:

Hi Team,

We are currently experiencing performance issues related to partition scanning on a heavily used table in our PostgreSQL v17.6 database.

The table is partitioned monthly (e.g., transactions_jan25, transactions_feb25, …, transactions_sept25). We’ve observed that PostgreSQL scans these partitions in the order they were attached (January through September).

This behavior is leading to inefficient query performance when accessing recent data (e.g., August or September), as older partitions are being scanned first — adding unnecessary overhead. Since PostgreSQL does not provide a built-in setting to prioritize scanning partitions in descending order, we’re exploring ways to address this.

Could you please advise on:

How to optimize partition scanning so that recent partitions are scanned first, and

Any recommended best practices or workarounds to improve performance in such scenarios.

Looking forward to your guidance.

Thank you

Can you please describe how the query performance is affected because
of the order in which partitions are scanned?

--
Best Wishes,
Ashutosh Bapat

lepihov@gmail.com

6 months ago

In reply to: Ashutosh Bapat (#2)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On 8/9/2025 11:47, Ashutosh Bapat wrote:

On Mon, Sep 8, 2025 at 4:01 AM Vivek Gadge <vvkgadge56@gmail.com> wrote:

Looking forward to your guidance.

Thank you

Can you please describe how the query performance is affected because
of the order in which partitions are scanned?

I guess they mentioned that the Postgres optimiser doesn't care about
the order of Append's subplans. It is a little sad in some cases. The
most critical case is when we have a limitation on the number of tuples
returned. In this case, the optimiser could consider the following
strategies:
1. Prefer scanning local partitions to foreign ones.
2. Pick first partitions with less startup costs and 'high probability'
to obtain all necessary tuples from a minimum set of partitions.

Postgres arranges clauses inside a long expression according to
evaluation cost (see order_qual_clauses). So, why not do similar stuff
for subplans?

Also, I wonder if it would make sense to shuffle partitions a little and
let backends scan partitions one-by-one in different orders just to
reduce any sort of contention in case the queries don't fit the
partitioning expression.

--
regards, Andrei Lepikhov

Adrien Nayrat

adrien.nayrat@anayrat.info

6 months ago

In reply to: Andrei Lepikhov (#3)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On 9/8/25 12:05 PM, Andrei Lepikhov wrote:

On 8/9/2025 11:47, Ashutosh Bapat wrote:

On Mon, Sep 8, 2025 at 4:01 AM Vivek Gadge <vvkgadge56@gmail.com> wrote:

Looking forward to your guidance.

Thank you

Can you please describe how the query performance is affected because
of the order in which partitions are scanned?

I guess they mentioned that the Postgres optimiser doesn't care about
the order of Append's subplans. It is a little sad in some cases. The
most critical case is when we have a limitation on the number of tuples
returned. In this case, the optimiser could consider the following
strategies:
1. Prefer scanning local partitions to foreign ones.
2. Pick first partitions with less startup costs and 'high probability'
to obtain all necessary tuples from a minimum set of partitions.

Postgres arranges clauses inside a long expression according to
evaluation cost (see order_qual_clauses). So, why not do similar stuff
for subplans?

Also, I wonder if it would make sense to shuffle partitions a little and
let backends scan partitions one-by-one in different orders just to
reduce any sort of contention in case the queries don't fit the
partitioning expression.

It reminds me of these threads :

Make the optimiser aware of partitions ordering :
/messages/by-id/2401607.SfZhPQhbS4@ronan_laptop

Allow ordered partition scans in more cases :
/messages/by-id/CAApHDvojKdBR3MR59JXmaCYbyHB6Q_5qPRU+dy93En8wm+XiDA@mail.gmail.com

Ordered Partitioned Table Scans :
/messages/by-id/CAKJS1f-hAqhPLRk_RaSFTgYxd=Tz5hA7kQ2h4-DhJufQk8TGuw@mail.gmail.com

Regards

--
Adrien NAYRAT

Vivek Gadge

vvkgadge56@gmail.com

6 months ago

In reply to: Ashutosh Bapat (#2)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

For example, when a query runs on a partitioned table, PostgreSQL scans
partitions in the order they were created or attached to the parent table.
In our case (monthly partitions from January through September), this means
that queries looking for recent data (e.g., September) may experience
additional overhead. PostgreSQL evaluates the older partitions first,
checking their constraints and in some cases probing their indexes, before
reaching the later partitions that actually contain the needed data.

As a result, while the query results are correct, the execution time
increases due to unnecessary work on irrelevant partitions. This
performance impact is more noticeable when the target partition is at the
end of the scan order and pruning cannot fully eliminate the earlier
partitions.

Thanks.

On Mon, 8 Sept, 2025, 3:17 pm Ashutosh Bapat, <ashutosh.bapat.oss@gmail.com>
wrote:

Show quoted text

On Mon, Sep 8, 2025 at 4:01 AM Vivek Gadge <vvkgadge56@gmail.com> wrote:

Hi Team,

We are currently experiencing performance issues related to partition

scanning on a heavily used table in our PostgreSQL v17.6 database.

The table is partitioned monthly (e.g., transactions_jan25,

transactions_feb25, …, transactions_sept25). We’ve observed that PostgreSQL
scans these partitions in the order they were attached (January through
September).

This behavior is leading to inefficient query performance when accessing

recent data (e.g., August or September), as older partitions are being
scanned first — adding unnecessary overhead. Since PostgreSQL does not
provide a built-in setting to prioritize scanning partitions in descending
order, we’re exploring ways to address this.

Could you please advise on:

How to optimize partition scanning so that recent partitions are scanned

first, and

Any recommended best practices or workarounds to improve performance in

such scenarios.

Looking forward to your guidance.

Thank you

Can you please describe how the query performance is affected because
of the order in which partitions are scanned?

--
Best Wishes,
Ashutosh Bapat

lepihov@gmail.com

6 months ago

In reply to: Vivek Gadge (#5)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On 8/9/2025 13:39, Vivek Gadge wrote:

For example, when a query runs on a partitioned table, PostgreSQL scans
partitions in the order they were created or attached to the parent
table. In our case (monthly partitions from January through September),
this means that queries looking for recent data (e.g., September) may
experience additional overhead. PostgreSQL evaluates the older
partitions first, checking their constraints and in some cases probing
their indexes, before reaching the later partitions that actually
contain the needed data.

As a result, while the query results are correct, the execution time
increases due to unnecessary work on irrelevant partitions. This
performance impact is more noticeable when the target partition is at
the end of the scan order and pruning cannot fully eliminate the earlier
partitions.

The case looks straightforward. But just to be sure that we are on the
same page, could you provide a synthetic DB example and a query
representing the exact problem you are going to resolve?

--
regards, Andrei Lepikhov

ashutosh.bapat@enterprisedb.com

6 months ago

In reply to: Vivek Gadge (#5)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Mon, Sep 8, 2025 at 5:09 PM Vivek Gadge <vvkgadge56@gmail.com> wrote:

For example, when a query runs on a partitioned table, PostgreSQL scans partitions in the order they were created or attached to the parent table. In our case (monthly partitions from January through September), this means that queries looking for recent data (e.g., September) may experience additional overhead. PostgreSQL evaluates the older partitions first, checking their constraints and in some cases probing their indexes, before reaching the later partitions that actually contain the needed data.

As a result, while the query results are correct, the execution time increases due to unnecessary work on irrelevant partitions. This performance impact is more noticeable when the target partition is at the end of the scan order and pruning cannot fully eliminate the earlier partitions.

If you don't want data from certain partitions maybe you should add a
clause that will help partition pruning. If you need data from all
partitions, the order in which they are scanned doesn't matter, those
will be scanned either way.

If partitioning pruning isn't working for you, please report the exact
query. Please provide example queries anyway.

--
Best Wishes,
Ashutosh Bapat

lepihov@gmail.com

6 months ago

In reply to: Adrien Nayrat (#4)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On 8/9/2025 12:32, Adrien Nayrat wrote:

On 9/8/25 12:05 PM, Andrei Lepikhov wrote:

On 8/9/2025 11:47, Ashutosh Bapat wrote:

On Mon, Sep 8, 2025 at 4:01 AM Vivek Gadge <vvkgadge56@gmail.com> wrote:

It reminds me of these threads :

Yes, partially. Actively using foreign tables and touching cheaper
partitions first is crucial in queries with limits. Also, some cases may
provide an implicit limit, like a merge join.

--
regards, Andrei Lepikhov

Vivek Gadge

vvkgadge56@gmail.com

6 months ago

In reply to: Ashutosh Bapat (#7)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

Hi Ashutosh,

Thank you for your feedback regarding this matter.

To provide more context, here is the exact query I am running:

EXPLAIN ANALYZE VERBOSE
SELECT m.txn_date, d.bank_ref
FROM app.main m
JOIN app.detail d ON m.txn_id = d.main_txn_id
WHERE m.txn_id = 9999999999999999999;

Both app.main and app.detail are range partitioned by month (e.g.,
main_202502, detail_202502, etc.) on a date column,

Attached below is the EXPLAIN ANALYZE VERBOSE output for the query. It
shows that PostgreSQL is scanning all partitions in ascending order, even
though the matching record is present in the main_202509 partition.

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------
Nested Loop (cost=1.00..46.19 rows=156 width=346) (actual
time=17.656..18.173 rows=1 loops=1)
Output: m.txn_date, d.bank_ref, m.processor_id, m.txn_code, m.pos_data,
d.issuer_txn_id, m.trans_ref, d.processor_uid, m.sub_txn_type,
d.processor_token
-> Append (cost=0.43..22.05 rows=13 width=238) (actual
time=8.749..9.027 rows=1 loops=1)
-> Index Scan using detail_202502_main_txn_id_key on
app.detail_202502 d_1 (cost=0.43..2.65 rows=1 width=74) (actual
time=1.634..1.634 rows=0 loops=1)
Output: d_1.bank_ref, d_1.issuer_txn_id, d_1.processor_uid,
d_1.processor_token, d_1.main_txn_id
Index Cond: (d_1.main_txn_id = '9999999999999999999'::bigint)
-> Index Scan using detail_202503_main_txn_id_key on
app.detail_202503 d_2 (cost=0.56..2.78 rows=1 width=74) (actual
time=1.158..1.158 rows=0 loops=1)
Output: d_2.bank_ref, d_2.issuer_txn_id, d_2.processor_uid,
d_2.processor_token, d_2.main_txn_id
Index Cond: (d_2.main_txn_id = '9999999999999999999'::bigint)
-> Index Scan using detail_202504_main_txn_id_key on
app.detail_202504 d_3 (cost=0.56..2.78 rows=1 width=74) (actual
time=1.105..1.106 rows=0 loops=1)
Output: d_3.bank_ref, d_3.issuer_txn_id, d_3.processor_uid,
d_3.processor_token, d_3.main_txn_id
Index Cond: (d_3.main_txn_id = '9999999999999999999'::bigint)
-> Index Scan using detail_202505_main_txn_id_key on
app.detail_202505 d_4 (cost=0.56..2.78 rows=1 width=73) (actual
time=1.097..1.097 rows=0 loops=1)
Output: d_4.bank_ref, d_4.issuer_txn_id, d_4.processor_uid,
d_4.processor_token, d_4.main_txn_id
Index Cond: (d_4.main_txn_id = '9999999999999999999'::bigint)
-> Index Scan using detail_202506_main_txn_id_key on
app.detail_202506 d_5 (cost=0.56..2.78 rows=1 width=74) (actual
time=1.002..1.002 rows=0 loops=1)
Output: d_5.bank_ref, d_5.issuer_txn_id, d_5.processor_uid,
d_5.processor_token, d_5.main_txn_id
Index Cond: (d_5.main_txn_id = '9999999999999999999'::bigint)
-> Index Scan using detail_202507_main_txn_id_key on
app.detail_202507 d_6 (cost=0.56..2.78 rows=1 width=74) (actual
time=0.913..0.913 rows=0 loops=1)
Output: d_6.bank_ref, d_6.issuer_txn_id, d_6.processor_uid,
d_6.processor_token, d_6.main_txn_id
Index Cond: (d_6.main_txn_id = '9999999999999999999'::bigint)
-> Index Scan using detail_202508_main_txn_id_key on
app.detail_202508 d_7 (cost=0.56..2.78 rows=1 width=73) (actual
time=0.833..0.833 rows=0 loops=1)
Output: d_7.bank_ref, d_7.issuer_txn_id, d_7.processor_uid,
d_7.processor_token, d_7.main_txn_id
Index Cond: (d_7.main_txn_id = '9999999999999999999'::bigint)
-> Index Scan using detail_202509_main_txn_id_key on
app.detail_202509 d_8 (cost=0.43..2.65 rows=1 width=74) (actual
time=1.001..1.004 rows=1 loops=1)
Output: d_8.bank_ref, d_8.issuer_txn_id, d_8.processor_uid,
d_8.processor_token, d_8.main_txn_id
Index Cond: (d_8.main_txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.detail_202510 d_9 (cost=0.00..0.00 rows=1
width=500) (actual time=0.041..0.042 rows=0 loops=1)
Output: d_9.bank_ref, d_9.issuer_txn_id, d_9.processor_uid,
d_9.processor_token, d_9.main_txn_id
Filter: (d_9.main_txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.detail_202511 d_10 (cost=0.00..0.00 rows=1
width=500) (actual time=0.065..0.066 rows=0 loops=1)
Output: d_10.bank_ref, d_10.issuer_txn_id,
d_10.processor_uid, d_10.processor_token, d_10.main_txn_id
Filter: (d_10.main_txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.detail_202512 d_11 (cost=0.00..0.00 rows=1
width=500) (actual time=0.085..0.085 rows=0 loops=1)
Output: d_11.bank_ref, d_11.issuer_txn_id,
d_11.processor_uid, d_11.processor_token, d_11.main_txn_id
Filter: (d_11.main_txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.detail_202601 d_12 (cost=0.00..0.00 rows=1
width=500) (actual time=0.039..0.039 rows=0 loops=1)
Output: d_12.bank_ref, d_12.issuer_txn_id,
d_12.processor_uid, d_12.processor_token, d_12.main_txn_id
Filter: (d_12.main_txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.detail_default d_13 (cost=0.00..0.00 rows=1
width=500) (actual time=0.037..0.037 rows=0 loops=1)
Output: d_13.bank_ref, d_13.issuer_txn_id,
d_13.processor_uid, d_13.processor_token, d_13.main_txn_id
Filter: (d_13.main_txn_id = '9999999999999999999'::bigint)
-> Materialize (cost=0.56..22.23 rows=12 width=125) (actual
time=8.901..9.137 rows=1 loops=1)
Output: m.txn_date, m.processor_id, m.txn_code, m.pos_data,
m.trans_ref, m.sub_txn_type, m.txn_id
-> Append (cost=0.56..22.17 rows=12 width=125) (actual
time=8.892..9.127 rows=1 loops=1)
-> Index Scan using main_202502_pkey on app.main_202502
m_1 (cost=0.56..2.78 rows=1 width=36) (actual time=0.897..0.897 rows=0
loops=1)
Output: m_1.txn_date, m_1.processor_id, m_1.txn_code,
m_1.pos_data, m_1.trans_ref, m_1.sub_txn_type, m_1.txn_id
Index Cond: (m_1.txn_id =
'9999999999999999999'::bigint)
-> Index Scan using main_202503_pkey on app.main_202503
m_2 (cost=0.56..2.78 rows=1 width=37) (actual time=1.105..1.105 rows=0
loops=1)
Output: m_2.txn_date, m_2.processor_id, m_2.txn_code,
m_2.pos_data, m_2.trans_ref, m_2.sub_txn_type, m_2.txn_id
Index Cond: (m_2.txn_id =
'9999999999999999999'::bigint)
-> Index Scan using main_202504_pkey on app.main_202504
m_3 (cost=0.56..2.78 rows=1 width=37) (actual time=1.114..1.114 rows=0
loops=1)
Output: m_3.txn_date, m_3.processor_id, m_3.txn_code,
m_3.pos_data, m_3.trans_ref, m_3.sub_txn_type, m_3.txn_id
Index Cond: (m_3.txn_id =
'9999999999999999999'::bigint)
-> Index Scan using main_202505_pkey on app.main_202505
m_4 (cost=0.56..2.78 rows=1 width=37) (actual time=1.135..1.135 rows=0
loops=1)
Output: m_4.txn_date, m_4.processor_id, m_4.txn_code,
m_4.pos_data, m_4.trans_ref, m_4.sub_txn_type, m_4.txn_id
Index Cond: (m_4.txn_id =
'9999999999999999999'::bigint)
-> Index Scan using main_202506_pkey on app.main_202506
m_5 (cost=0.56..2.78 rows=1 width=38) (actual time=1.096..1.096 rows=0
loops=1)
Output: m_5.txn_date, m_5.processor_id, m_5.txn_code,
m_5.pos_data, m_5.trans_ref, m_5.sub_txn_type, m_5.txn_id
Index Cond: (m_5.txn_id =
'9999999999999999999'::bigint)
-> Index Scan using main_202507_pkey on app.main_202507
m_6 (cost=0.56..2.78 rows=1 width=38) (actual time=1.285..1.285 rows=0
loops=1)
Output: m_6.txn_date, m_6.processor_id, m_6.txn_code,
m_6.pos_data, m_6.trans_ref, m_6.sub_txn_type, m_6.txn_id
Index Cond: (m_6.txn_id =
'9999999999999999999'::bigint)
-> Index Scan using main_202508_pkey on app.main_202508
m_7 (cost=0.56..2.78 rows=1 width=38) (actual time=1.010..1.010 rows=0
loops=1)
Output: m_7.txn_date, m_7.processor_id, m_7.txn_code,
m_7.pos_data, m_7.trans_ref, m_7.sub_txn_type, m_7.txn_id
Index Cond: (m_7.txn_id =
'9999999999999999999'::bigint)
-> Index Scan using main_202509_pkey on app.main_202509
m_8 (cost=0.43..2.65 rows=1 width=37) (actual time=1.243..1.245 rows=1
loops=1)
Output: m_8.txn_date, m_8.processor_id, m_8.txn_code,
m_8.pos_data, m_8.trans_ref, m_8.sub_txn_type, m_8.txn_id
Index Cond: (m_8.txn_id =
'9999999999999999999'::bigint)
-> Seq Scan on app.main_202510 m_9 (cost=0.00..0.00 rows=1
width=300) (actual time=0.057..0.057 rows=0 loops=1)
Output: m_9.txn_date, m_9.processor_id, m_9.txn_code,
m_9.pos_data, m_9.trans_ref, m_9.sub_txn_type, m_9.txn_id
Filter: (m_9.txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.main_202511 m_10 (cost=0.00..0.00
rows=1 width=300) (actual time=0.045..0.045 rows=0 loops=1)
Output: m_10.txn_date, m_10.processor_id,
m_10.txn_code, m_10.pos_data, m_10.trans_ref, m_10.sub_txn_type, m_10.txn_id
Filter: (m_10.txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.main_202512 m_11 (cost=0.00..0.00
rows=1 width=300) (actual time=0.076..0.076 rows=0 loops=1)
Output: m_11.txn_date, m_11.processor_id,
m_11.txn_code, m_11.pos_data, m_11.trans_ref, m_11.sub_txn_type, m_11.txn_id
Filter: (m_11.txn_id = '9999999999999999999'::bigint)
-> Seq Scan on app.main_default m_12 (cost=0.00..0.00
rows=1 width=300) (actual time=0.047..0.047 rows=0 loops=1)
Output: m_12.txn_date, m_12.processor_id,
m_12.txn_code, m_12.pos_data, m_12.trans_ref, m_12.sub_txn_type, m_12.txn_id
Filter: (m_12.txn_id = '9999999999999999999'::bigint)

On Tue, 9 Sept, 2025, 8:42 am Ashutosh Bapat, <ashutosh.bapat.oss@gmail.com>
wrote:

Show quoted text

On Mon, Sep 8, 2025 at 5:09 PM Vivek Gadge <vvkgadge56@gmail.com> wrote:

For example, when a query runs on a partitioned table, PostgreSQL scans

partitions in the order they were created or attached to the parent table.
In our case (monthly partitions from January through September), this means
that queries looking for recent data (e.g., September) may experience
additional overhead. PostgreSQL evaluates the older partitions first,
checking their constraints and in some cases probing their indexes, before
reaching the later partitions that actually contain the needed data.

As a result, while the query results are correct, the execution time

increases due to unnecessary work on irrelevant partitions. This
performance impact is more noticeable when the target partition is at the
end of the scan order and pruning cannot fully eliminate the earlier
partitions.

If you don't want data from certain partitions maybe you should add a
clause that will help partition pruning. If you need data from all
partitions, the order in which they are scanned doesn't matter, those
will be scanned either way.

If partitioning pruning isn't working for you, please report the exact
query. Please provide example queries anyway.

--
Best Wishes,
Ashutosh Bapat

#10

dgrowleyml@gmail.com

6 months ago

In reply to: Vivek Gadge (#9)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Wed, 10 Sept 2025 at 09:32, Vivek Gadge <vvkgadge56@gmail.com> wrote:

EXPLAIN ANALYZE VERBOSE
SELECT m.txn_date, d.bank_ref
FROM app.main m
JOIN app.detail d ON m.txn_id = d.main_txn_id
WHERE m.txn_id = 9999999999999999999;

Both app.main and app.detail are range partitioned by month (e.g., main_202502, detail_202502, etc.) on a date column,

Attached below is the EXPLAIN ANALYZE VERBOSE output for the query. It shows that PostgreSQL is scanning all partitions in ascending order, even though the matching record is present in the main_202509 partition.

I think you might be under the false impression that the executor is
able to stop searching for the next row to join once it finds the
first matching row. If that were the case, then yes, it might be
faster to somehow scan the partition where the matching row exists
first as then you could forego scanning the remainder. PostgreSQL
does have some ability to do some of this with the "Unique Join"
functionality, but it does not apply to this case since there's no
proof that only a single row can match (this would require some sort
of Unique Index, which cannot exist in your case since we don't have
global indexes that exist over all partitions).

If you're able to patch PostgreSQL and recompile, try the attached
patch away from your production environment to check if there are any
gains to scanning the partitions in another order. I didn't bother to
change all the append paths that were generated, but I think I've
reversed the order of at least the one the planner is using in your
case.

It's probably true that there are some cases (when the row order does
not matter) where scanning partitions that are more present in shared
buffers first would be better as that means using buffers perhaps
before you evict them to make way for the buffers of some other
(normally unused) partition, but I don't believe buffer eviction comes
into effect for your case since you're basically just scanning an
index which doesn't contain any value for the search key. That
requires very few buffers.

A secondary thought here is that perhaps your partitioning strategy
needs revision. What's the reason you opted to partition by date? Does
that benefit some other queries better? Or is it just a case that it
works well for some data retention policy? IMO, partitioning by range
of the transaction ID would fit much better for this particular query.

David

#11

dgrowleyml@gmail.com

6 months ago

In reply to: Andrei Lepikhov (#3)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Mon, 8 Sept 2025 at 22:05, Andrei Lepikhov <lepihov@gmail.com> wrote:

I guess they mentioned that the Postgres optimiser doesn't care about
the order of Append's subplans. It is a little sad in some cases. The
most critical case is when we have a limitation on the number of tuples
returned. In this case, the optimiser could consider the following
strategies:
1. Prefer scanning local partitions to foreign ones.
2. Pick first partitions with less startup costs and 'high probability'
to obtain all necessary tuples from a minimum set of partitions.

Postgres arranges clauses inside a long expression according to
evaluation cost (see order_qual_clauses). So, why not do similar stuff
for subplans?

This seems quite separate from what's being complained about here. It
might be beneficial to reconsider whether we should do some sort of
sorting on startup_subpaths inside add_paths_to_append_rel(). I
imagine that it might make some sense to sort that list so the path
with the cheapest startup cost is first, then put the remainder of the
list in order of cheapest total cost per tuple. I suspect that would
result in Foreign partitions being scanned last...

... However, it's not all that clear to me how often someone would
have a LIMIT without an ORDER BY, as effectively there's nothing there
to determine which rows your query returns, and there's no flexibility
to change which subpaths are first in Append/MergeAppend paths created
in generate_orderedappend_paths().

Also, I wonder if it would make sense to shuffle partitions a little and
let backends scan partitions one-by-one in different orders just to
reduce any sort of contention in case the queries don't fit the
partitioning expression.

I don't follow this part. Are you proposing we randomise subpath list
order? What contention do you aim to fix?

David

#12

ashutosh.bapat@enterprisedb.com

6 months ago

In reply to: David Rowley (#11)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Wed, Sep 10, 2025 at 4:27 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Mon, 8 Sept 2025 at 22:05, Andrei Lepikhov <lepihov@gmail.com> wrote:

I guess they mentioned that the Postgres optimiser doesn't care about
the order of Append's subplans. It is a little sad in some cases. The
most critical case is when we have a limitation on the number of tuples
returned. In this case, the optimiser could consider the following
strategies:
1. Prefer scanning local partitions to foreign ones.
2. Pick first partitions with less startup costs and 'high probability'
to obtain all necessary tuples from a minimum set of partitions.

Postgres arranges clauses inside a long expression according to
evaluation cost (see order_qual_clauses). So, why not do similar stuff
for subplans?

This seems quite separate from what's being complained about here. It
might be beneficial to reconsider whether we should do some sort of
sorting on startup_subpaths inside add_paths_to_append_rel(). I
imagine that it might make some sense to sort that list so the path
with the cheapest startup cost is first, then put the remainder of the
list in order of cheapest total cost per tuple. I suspect that would
result in Foreign partitions being scanned last...

If there's LIMIT without ORDER BY, we could order the list of subpaths
by the number of rows in descending order or cost per row in ascending
order. That way there are more chances of scanning fewer partitions
quicker.

--
Best Wishes,
Ashutosh Bapat

#13

dgrowleyml@gmail.com

6 months ago

In reply to: Ashutosh Bapat (#12)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Wed, 10 Sept 2025 at 16:26, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Wed, Sep 10, 2025 at 4:27 AM David Rowley <dgrowleyml@gmail.com> wrote:

This seems quite separate from what's being complained about here. It
might be beneficial to reconsider whether we should do some sort of
sorting on startup_subpaths inside add_paths_to_append_rel(). I
imagine that it might make some sense to sort that list so the path
with the cheapest startup cost is first, then put the remainder of the
list in order of cheapest total cost per tuple. I suspect that would
result in Foreign partitions being scanned last...

If there's LIMIT without ORDER BY, we could order the list of subpaths
by the number of rows in descending order or cost per row in ascending
order. That way there are more chances of scanning fewer partitions
quicker.

Wouldn't that amount to favouring scanning some large foreign
partition over a smaller local partition? My interpretation of
Andrei's "Prefer scanning local partitions to foreign ones" statement
is that was what we shouldn't be doing!

David

#14

lepihov@gmail.com

6 months ago

In reply to: David Rowley (#11)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On 10/9/2025 00:57, David Rowley wrote:

1. Prefer scanning local partitions to foreign ones.
2. Pick first partitions with less startup costs and 'high probability'
to obtain all necessary tuples from a minimum set of partitions.

Postgres arranges clauses inside a long expression according to
evaluation cost (see order_qual_clauses). So, why not do similar stuff
for subplans?

This seems quite separate from what's being complained about here.

Maybe. I didn't see the reproduction script, which made it hard for me
to understand the origin of the problem clearly.>

... However, it's not all that clear to me how often someone would
have a LIMIT without an ORDER BY, as effectively there's nothing there
to determine which rows your query returns, and there's no flexibility
to change which subpaths are first in Append/MergeAppend paths created
in generate_orderedappend_paths().

Of course, it should be applied to an Append without pathkeys.
However, I still recall user cases where the subtree scan is stopped,
even without any limit, simply because MergeJoin has reached the end of
the inner/outer subtree or in the case of semi- or anti-joins. I wonder
if other cases may exist.>

Also, I wonder if it would make sense to shuffle partitions a little and
let backends scan partitions one-by-one in different orders just to
reduce any sort of contention in case the queries don't fit the
partitioning expression.

I don't follow this part. Are you proposing we randomise subpath list
order? What contention do you aim to fix?I have no specific case for now. From the top of my mind, it may help to

prevent flushing out some partitions from the shared buffers...

--
regards, Andrei Lepikhov

#15

ashutosh.bapat@enterprisedb.com

6 months ago

In reply to: David Rowley (#13)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Wed, Sep 10, 2025 at 11:50 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 10 Sept 2025 at 16:26, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

On Wed, Sep 10, 2025 at 4:27 AM David Rowley <dgrowleyml@gmail.com> wrote:

This seems quite separate from what's being complained about here. It
might be beneficial to reconsider whether we should do some sort of
sorting on startup_subpaths inside add_paths_to_append_rel(). I
imagine that it might make some sense to sort that list so the path
with the cheapest startup cost is first, then put the remainder of the
list in order of cheapest total cost per tuple. I suspect that would
result in Foreign partitions being scanned last...

If there's LIMIT without ORDER BY, we could order the list of subpaths
by the number of rows in descending order or cost per row in ascending
order. That way there are more chances of scanning fewer partitions
quicker.

Wouldn't that amount to favouring scanning some large foreign
partition over a smaller local partition? My interpretation of
Andrei's "Prefer scanning local partitions to foreign ones" statement
is that was what we shouldn't be doing!

Generally foreign scans will have a higher cost, including startup
cost. So subpaths with local scans will be preferred. But in case
there's a foreign subpath with a lower cost than local subpath, I
think foreign scan should be preferred.

--
Best Wishes,
Ashutosh Bapat

#16

dgrowleyml@gmail.com

6 months ago

In reply to: Andrei Lepikhov (#14)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Wed, 10 Sept 2025 at 19:41, Andrei Lepikhov <lepihov@gmail.com> wrote:

On 10/9/2025 00:57, David Rowley wrote:

... However, it's not all that clear to me how often someone would
have a LIMIT without an ORDER BY, as effectively there's nothing there
to determine which rows your query returns, and there's no flexibility
to change which subpaths are first in Append/MergeAppend paths created
in generate_orderedappend_paths().

Of course, it should be applied to an Append without pathkeys.
However, I still recall user cases where the subtree scan is stopped,
even without any limit, simply because MergeJoin has reached the end of
the inner/outer subtree or in the case of semi- or anti-joins. I wonder
if other cases may exist.>

Cursors and nested loop semi joins are the only thing that come to
mind for me. I don't see how there could be a Merge Join on an
unordered Append path. Merge Join inputs need to be ordered (and have
PathKeys).

David

#17

dgrowleyml@gmail.com

6 months ago

In reply to: Ashutosh Bapat (#15)

Re: Query Performance Degradation Due to Partition Scan Order – PostgreSQL v17.6

On Wed, 10 Sept 2025 at 21:18, Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:

If there's LIMIT without ORDER BY, we could order the list of subpaths
by the number of rows in descending order or cost per row in ascending
order. That way there are more chances of scanning fewer partitions
quicker.

Wouldn't that amount to favouring scanning some large foreign
partition over a smaller local partition? My interpretation of
Andrei's "Prefer scanning local partitions to foreign ones" statement
is that was what we shouldn't be doing!

Generally foreign scans will have a higher cost, including startup
cost. So subpaths with local scans will be preferred. But in case
there's a foreign subpath with a lower cost than local subpath, I
think foreign scan should be preferred.

I'm still stuck on why to you'd want to sort on the number of rows in
descending order. What does that have to do with foreign scans?

Otherwise, if the foreign scan comes out cheaper per row, then doing
those first sounds ok to me. It's all about cost per row in my view.
Foreign or local scan has no relevance to what I was proposing.

David

#18