Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

Started by Alan Hodgson over 4 years ago · 13 messages · general
#1Alan Hodgson
ahodgson@lists.simkin.ca

I keep running into problems like these:

Devs are using an ORM. It really likes to produce queries like:

SELECT "shipment_import_records".* FROM "shipment_import_records"
WHERE "shipment_import_records"."shipment_import_id" = 5090609 ORDER
BY "shipment_import_records"."id" ASC LIMIT 1;

I don't know why they do this. Usually it's more like 50 for
pagination, which makes more sense. But for whatever reason this keeps
coming up.

The table has nearly 29 million records. 5069 of them match
shipment_import_id = 5090609. There is an index on
shipment_import_id, which the planner happily uses without the
LIMIT specifically. Yet with it the query planner will always do
something like:

# explain SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5090609 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..873.35 rows=1 width=243)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5122227.70 rows=5868 width=243)
         Filter: (shipment_import_id = 5090609)

.. which takes minutes.

I know I can work around this. Generally I would just drop the index
on shipment_import_id and create one on (shipment_import_id, id). Or
if I can get the devs to wrap their query in an inner select with a
fake offset to fool the query planner, that works too. But both seem
hacky.
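[Concretely, the two workarounds described above would look something like this; the index name is illustrative:]

```sql
-- Workaround 1: a composite index matching both the filter and the sort
CREATE INDEX shipment_import_records_import_id_id_idx
    ON shipment_import_records (shipment_import_id, id);

-- Workaround 2: an inner select with OFFSET 0, which acts as an
-- optimization fence and keeps the subquery from being flattened
SELECT * FROM (
    SELECT "shipment_import_records".*
    FROM "shipment_import_records"
    WHERE "shipment_import_records"."shipment_import_id" = 5090609
    OFFSET 0
) AS x
ORDER BY "id" ASC LIMIT 1;
```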

Just wondering if there's a knob I can turn to make these more likely
to work without constantly implementing workarounds?

Thanks for any help.

#2Michael Lewis
mlewis@entrata.com
In reply to: Alan Hodgson (#1)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

What is your default_statistics_target and how accurate is that estimate of
5868 rows? What is random_page_cost set to, by the way?

More importantly, what is the better plan that you'd like the planner to
use with your existing indexes? It would seem logical to me to scan for the
matching shipment_import_id if the estimate is saying 5868 out of 29
million should match and then sort and only get the smallest ID. Doing an
index scan on ID and looking up in the table to see if shipment_import_id
matches when the planner expects that to be about a .0001 chance... I can't
imagine that plan performing well at all.

Certainly a composite index would be very helpful here. Using explain
analyze and sharing the output would give more info to go on.
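[For reference, a quick sketch of how to check those settings and compare the estimate against reality:]

```sql
-- Current planner-relevant settings
SHOW default_statistics_target;
SHOW random_page_cost;

-- Compare the planner's row estimate with the actual row count
-- for the problem key (without the ORDER BY/LIMIT)
EXPLAIN ANALYZE
SELECT * FROM shipment_import_records
WHERE shipment_import_id = 5090609;
```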

#3Rob Sargent
robjsargent@gmail.com
In reply to: Alan Hodgson (#1)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On 12/6/21 10:03 AM, Alan Hodgson wrote:

I keep running into problems like these:

Devs are using an ORM. It really likes to produce queries like:

SELECT "shipment_import_records".* FROM "shipment_import_records"
WHERE shipment_import_records"."shipment_import_id" = 5090609 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;

I don't know why they do this. Usually it's more like 50 for
pagination which make more sense. But for whatever reason this keeps
coming up.

To be clear, is it the devs or the ORM that's adding the ORDER and the
LIMIT?  I'm betting on devs.  Do they need the smallest id (first
occurrence?) or do they need data common to all 5069 entries (Name?) and
any record will do?  For the former they might be better off asking for
just the attributes they need, and for the latter you need to provide an
option which gets them that single record.  Of course, if they have the
"smallest id" in hand they should request that.

#4Francisco Olarte
folarte@peoplecall.com
In reply to: Alan Hodgson (#1)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On Mon, 6 Dec 2021 at 18:03, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:
...

The table has nearly 29 million records. 5069 of them match shipment_import_id = 5090609. There is an index on shipment_import_id, which the planner happily uses without the LIMIT specifically. Yet with it the query planner will always do something like:

# explain SELECT "shipment_import_records".* FROM "shipment_import_records" WHERE "shipment_import_records"."shipment_import_id" = 5090609 ORDER BY "shipment_import_records"."id" ASC LIMIT 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.44..873.35 rows=1 width=243)
-> Index Scan using shipment_import_records_pkey on shipment_import_records (cost=0.44..5122227.70 rows=5868 width=243)
Filter: (shipment_import_id = 5090609)
.. which takes minutes.

Can you post an explain analyze? To me it seems like the planner
thinks shipment_import_id is randomly distributed and the table is
well correlated with its PK, so scanning it for the first id should
be fast.

But from the names of the fields you may have a correlation between
shipment_import_id and id hidden somewhere (like they are two serials
growing together, you query for the latest shipment ids, and it scans
all the table). An explain analyze should show that (or three: one
for that shipment import id, one for 1, one for a really big one).

Just wondering if there's a knob I can turn to make these more likely to work without constantly implementing workarounds?

You may try a composite index.

Francisco Olarte.

#5Alan Hodgson
ahodgson@lists.simkin.ca
In reply to: Michael Lewis (#2)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On Mon, 2021-12-06 at 10:18 -0700, Michael Lewis wrote:

What is your default_statistics_target and how accurate is that
estimate of 5868 rows? What is random_page_cost set to by the way?

default_statistics_target = 1000
random_page_cost = 2.0 (it's on AWS on a 9000 iops gp2 volume)

Postgresql 13.5 btw.

The estimate was reasonably accurate, there were 5069 actual rows
matching.

More importantly, what is the better plan that you'd like the planner
to use with your existing indexes?

Well, it takes a few ms to grab all 5000 rows by shipment_import_id
and then sort/limit them. It takes 30 seconds to do what it is doing
instead, and only when the table is mostly cached already, more like
4-5 minutes otherwise.

# explain analyze SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5090609 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..873.08 rows=1 width=243) (actual time=31689.725..31689.726 rows=1 loops=1)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5122405.71 rows=5870 width=243) (actual time=31689.723..31689.724 rows=1 loops=1)
         Filter: (shipment_import_id = 5090609)
         Rows Removed by Filter: 28710802
 Planning Time: 0.994 ms
 Execution Time: 31689.744 ms
(6 rows)

Just with a kludge to force the better index:

# explain analyze SELECT * FROM (SELECT "shipment_import_records".*
FROM "shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5090609 OFFSET 0) AS
x ORDER BY "id" ASC LIMIT 1;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=10655.34..10655.34 rows=1 width=243) (actual time=4.868..4.869 rows=1 loops=1)
   ->  Sort  (cost=10655.34..10670.02 rows=5870 width=243) (actual time=4.867..4.868 rows=1 loops=1)
         Sort Key: shipment_import_records.id
         Sort Method: top-N heapsort  Memory: 27kB
         ->  Index Scan using index_shipment_import_records_on_shipment_import_id on shipment_import_records  (cost=0.44..10567.29 rows=5870 width=243) (actual time=0.037..3.560 rows=5069 loops=1)
               Index Cond: (shipment_import_id = 5090609)
 Planning Time: 0.135 ms
 Execution Time: 4.885 ms
(8 rows)

Certainly a composite index would be very helpful here. Using explain
analyze and sharing the output would give more info to go on.

Yeah I am going to just do the composite index for now, but was
hoping for a more generic option.

Thanks for looking at it.

#6Alan Hodgson
ahodgson@lists.simkin.ca
In reply to: Francisco Olarte (#4)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On Mon, 2021-12-06 at 18:20 +0100, Francisco Olarte wrote:

Can you post an explain analyze? To me it seems like the planner
thinks shipment_import_id is randomly distributed and the table is
well correlated with its PK, so scanning it for the first id should
be fast.

# explain analyze SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5090609 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..873.08 rows=1 width=243) (actual time=31689.725..31689.726 rows=1 loops=1)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5122405.71 rows=5870 width=243) (actual time=31689.723..31689.724 rows=1 loops=1)
         Filter: (shipment_import_id = 5090609)
         Rows Removed by Filter: 28710802
 Planning Time: 0.994 ms
 Execution Time: 31689.744 ms
(6 rows)

The biggest one (but yes, "earlier"):

# explain analyze SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 1247888 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..426.59 rows=1 width=243) (actual time=8007.069..8007.070 rows=1 loops=1)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5126628.40 rows=12030 width=243) (actual time=8007.068..8007.068 rows=1 loops=1)
         Filter: (shipment_import_id = 1247888)
         Rows Removed by Filter: 10929193
 Planning Time: 0.584 ms
 Execution Time: 8007.086 ms
(6 rows)

And the smallest/latest, which actually uses the "right" index:

# explain analyze SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5116174 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=145.44..145.44 rows=1 width=243) (actual time=0.018..0.018 rows=1 loops=1)
   ->  Sort  (cost=145.44..145.64 rows=79 width=243) (actual time=0.017..0.018 rows=1 loops=1)
         Sort Key: id
         Sort Method: quicksort  Memory: 26kB
         ->  Index Scan using index_shipment_import_records_on_shipment_import_id on shipment_import_records  (cost=0.44..145.05 rows=79 width=243) (actual time=0.013..0.014 rows=1 loops=1)
               Index Cond: (shipment_import_id = 5116174)
 Planning Time: 0.104 ms
 Execution Time: 0.032 ms
(8 rows)

But from the names of the field you may have correlation between
shipment_import_id and id hidden somewhere ( like they are two serial
growing together, you query for the latest shipment ids and it scans
all the table ). An explain analyze should show that ( or three, one
for that shipment import id, one for 1, one for a really big one )

This is definitely the case. And we are generally looking for newer
data for most operations.

Thanks for looking at it.

#7Alan Hodgson
ahodgson@lists.simkin.ca
In reply to: Rob Sargent (#3)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On Mon, 2021-12-06 at 10:19 -0700, Rob Sargent wrote:

To be clear, is it the devs or the ORM that's adding the ORDER and the
LIMIT?  I'm betting on devs.  Do they need the smallest id (first
occurrence?) or do they need data common to all 5069 entries (Name?) and
any record will do?  For the former they might be better off asking for
just the attributes they need, and for the latter you need to provide an
option which gets them that single record.  Of course, if they have the
"smallest id" in hand they should request that.

That assumes I could figure out what bit of ORM code is generating
this, talk to them, and then get them to actually think about what
data they're looking for and its impact on the database. :/ Given my
25 year track record with devs, I'm thinking of that as plan B.
Hopefully though, if they're looking for something common to all the
records, they would look at the parent table instead.

I do expect the dev actually specified the order/limit for some
reason.

Thank you for the suggestions.

#8Adrian Klaver
adrian.klaver@aklaver.com
In reply to: Alan Hodgson (#7)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On 12/6/21 10:02, Alan Hodgson wrote:

On Mon, 2021-12-06 at 10:19 -0700, Rob Sargent wrote:

To be clear, is it the devs or the ORM that's adding the ORDER and the
LIMIT?  I'm betting on devs.  Do they need the smallest id (first
occurrence?) or do they need data common to all 5069 entries (Name?) and
any record will do?  For the former they might be better off asking for
just the attributes they need, and for the latter you need to provide an
option which gets them that single record.  Of course, if they have the
"smallest id" in hand they should request that.

That assumes I could figure out what bit of ORM code is generating this,
talk to them, and then get them to actually think about what data
they're looking for and its impact on the database. :/ Given my 25 year
track record with devs, I'm thinking of that as plan B. Hopefully though
if they're looking for something common to all the records they would
look at the parent table instead.

I do expect the dev actually specified the order/limit for some reason.

Maybe I'm silly, but why is asking them a Plan B?

Thank you for the suggestions.

--
Adrian Klaver
adrian.klaver@aklaver.com

#9Pavel Stehule
pavel.stehule@gmail.com
In reply to: Francisco Olarte (#4)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

po 6. 12. 2021 v 18:21 odesílatel Francisco Olarte <folarte@peoplecall.com>
napsal:

On Mon, 6 Dec 2021 at 18:03, Alan Hodgson <ahodgson@lists.simkin.ca>
wrote:
...

The table has nearly 29 million records. 5069 of them match
shipment_import_id = 5090609. There is an index on shipment_import_id,
which the planner happily uses without the LIMIT specifically. Yet with it
the query planner will always do something like:

# explain SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5090609 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..873.35 rows=1 width=243)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5122227.70 rows=5868 width=243)
         Filter: (shipment_import_id = 5090609)
.. which takes minutes.

Can you post an explain analyze? To me it seems like the planner
thinks shipment_import_id is randomly distributed and the table is
well correlated with it's PK, so scanning it for the first id should
be fast.

But from the names of the field you may have correlation between
shipment_import_id and id hidden somewhere ( like they are two serial
growing together, you query for the latest shipment ids and it scans
all the table ). An explain analyze should show that ( or three, one
for that shipment import id, one for 1, one for a really big one )

Just wondering if there's a knob I can turn to make these more likely to
work without constantly implementing workarounds?

You may try a composite index.

+1 These issues can be solved by composite indexes. A low LIMIT clause
distorts the cost estimates, and when the data are not really random the
index scan can run far too long.


Francisco Olarte.

#10Rob Sargent
robjsargent@gmail.com
In reply to: Alan Hodgson (#7)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On 12/6/21 11:02 AM, Alan Hodgson wrote:

On Mon, 2021-12-06 at 10:19 -0700, Rob Sargent wrote:

To be clear, is it the devs or the ORM that's adding the ORDER and the
LIMIT?  I'm betting on devs.  Do they need the smallest id (first
occurrence?) or do they need data common to all 5069 entries (Name?) and
any record will do?  For the former they might be better off asking for
just the attributes they need, and for the latter you need to provide an
option which gets them that single record.  Of course, if they have the
"smallest id" in hand they should request that.

That assumes I could figure out what bit of ORM code is generating this,
talk to them, and then get them to actually think about what data
they're looking for and its impact on the database. :/ Given my 25
year track record with devs, I'm thinking of that as plan B. Hopefully
though if they're looking for something common to all the records they
would look at the parent table instead.

I do expect the dev actually specified the order/limit for some reason.

Until you know what they're after it's hard to make the correct
adjustment.  Another index is more overhead and may not be necessary or
even the best solution for getting that one tuple they're after.

Thank you for the suggestions.

Up thread you hoped for a general solution.  The devs writing better
queries is your best bet.
If you cannot or choose not to talk with the devs or at least their
manager you'll be chasing these for a long time.  Those dealing with ORM
code are server-side: they are supposed to be the smart ones! Maybe they
can be taught.

#11Laurenz Albe
laurenz.albe@cybertec.at
In reply to: Pavel Stehule (#9)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On Mon, 2021-12-06 at 19:22 +0100, Pavel Stehule wrote:

po 6. 12. 2021 v 18:21 odesílatel Francisco Olarte <folarte@peoplecall.com> napsal:

On Mon, 6 Dec 2021 at 18:03, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:

# explain SELECT "shipment_import_records".* FROM "shipment_import_records" WHERE "shipment_import_records"."shipment_import_id" = 5090609 ORDER BY "shipment_import_records"."id" ASC LIMIT 1;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Limit  (cost=0.44..873.35 rows=1 width=243)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5122227.70 rows=5868 width=243)
         Filter: (shipment_import_id = 5090609)
.. which takes minutes.

Just wondering if there's a knob I can turn to make these more likely to work without constantly implementing workarounds?

You may try a composite index.

+1 These issues can be solved by composite indexes. The low limit clause deforms costs and when the data are not really random, then index scan can be too long.

An ugly alternative is to use "ORDER BY id + 0", which prevents PostgreSQL
from using the index.
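[Applied to the query from this thread, that rewrite would look roughly like:]

```sql
-- "id + 0" turns the sort key into an expression, so the planner can no
-- longer satisfy the ORDER BY by walking the primary-key index.
SELECT "shipment_import_records".*
FROM "shipment_import_records"
WHERE "shipment_import_records"."shipment_import_id" = 5090609
ORDER BY "shipment_import_records"."id" + 0 ASC
LIMIT 1;
```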

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

#12Mladen Gogala
gogala.mladen@gmail.com
In reply to: Laurenz Albe (#11)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

On 12/6/21 22:16, Laurenz Albe wrote:

An ugly alternative is to use "ORDER BY id + 0", which prevents PostgreSQL
from using the index.

That was actually the earliest form of Oracle hints. I remember doing
exactly that in Oracle 5.1.22 on VAX/VMS.

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com

#13Francisco Olarte
folarte@peoplecall.com
In reply to: Alan Hodgson (#6)
Re: Query planner issue with preferring primary key over a better index when using ORDER BY and LIMIT

Alan:

On Mon, 6 Dec 2021 at 18:58, Alan Hodgson <ahodgson@lists.simkin.ca> wrote:

On Mon, 2021-12-06 at 18:20 +0100, Francisco Olarte wrote:
Can you post an explain analyze? To me it seems like the planner
thinks shipment_import_id is randomly distributed and the table is
well correlated with it's PK, so scanning it for the first id should
be fast.
# explain analyze SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5090609 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
---------------------------------------------------------------------
 Limit  (cost=0.44..873.08 rows=1 width=243) (actual time=31689.725..31689.726 rows=1 loops=1)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5122405.71 rows=5870 width=243) (actual time=31689.723..31689.724 rows=1 loops=1)
         Filter: (shipment_import_id = 5090609)
         Rows Removed by Filter: 28710802

This seems to be the problem. IIRC you said you had 29M rows; it is
index-scanning nearly all of the table.

I'm not an explain guru, but it seems to me you have a correlation
undetected by the planner.

The biggest one (but yes "earlier"):

# explain analyze SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 1247888 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
--------
 Limit  (cost=0.44..426.59 rows=1 width=243) (actual time=8007.069..8007.070 rows=1 loops=1)
   ->  Index Scan using shipment_import_records_pkey on shipment_import_records  (cost=0.44..5126628.40 rows=12030 width=243) (actual time=8007.068..8007.068 rows=1 loops=1)
         Filter: (shipment_import_id = 1247888)
         Rows Removed by Filter: 10929193
 Planning Time: 0.584 ms
 Execution Time: 8007.086 ms
(6 rows)

Hmm, the times seem to correlate with rows removed by filter ~ rows
read. To me it seems the planner thinks they are well distributed but
they are not (from the names it seems like you insert a lot of
shipment_import_records with the same shipment_import_id, and since
they get a serial or similar id they are bunched together). The
planner thinks they are distributed and tries to read the table in id
order, probably because the table correlates well with the id index,
but when you select one of the later-inserted imports it fails
miserably and has to scan the whole table, or a lot of it.

And the smallest/latest, which actually uses the "right" index:
# explain analyze SELECT "shipment_import_records".* FROM
"shipment_import_records" WHERE
"shipment_import_records"."shipment_import_id" = 5116174 ORDER BY
"shipment_import_records"."id" ASC LIMIT 1;
------------------------
 Limit  (cost=145.44..145.44 rows=1 width=243) (actual time=0.018..0.018 rows=1 loops=1)
   ->  Sort  (cost=145.44..145.64 rows=79 width=243) (actual time=0.017..0.018 rows=1 loops=1)
         Sort Key: id
         Sort Method: quicksort  Memory: 26kB
         ->  Index Scan using index_shipment_import_records_on_shipment_import_id on shipment_import_records  (cost=0.44..145.05 rows=79 width=243) (actual time=0.013..0.014 rows=1 loops=1)
               Index Cond: (shipment_import_id = 5116174)
Index Cond: (shipment_import_id = 5116174)
Planning Time: 0.104 ms
Execution Time: 0.032 ms

This one found few rows, and probably the planner knew it (from the
rows=79), so it went the import_record_id index route.

But from the names of the field you may have correlation between
shipment_import_id and id hidden somewhere ( like they are two serial
growing together, you query for the latest shipment ids and it scans
all the table ). An explain analyze should show that ( or three, one
for that shipment import id, one for 1, one for a really big one )

This is definitely the case. And we are generally looking for newer
data for most operations.

Thanks for looking at it.

Well, I have no solution for that, but at least I think the problem
is identified.

You could try building an index on (shipment_import_id, id). It would
solve this kind of lookup and would certainly pull its weight if your
ORM makes lots of queries like this. I'm not sure how to make the
planner prefer it, but it may be worth testing.
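[A sketch of that index; the name is illustrative. The equality column goes first so that, within one shipment_import_id, the entries are already ordered by id and no sort is needed:]

```sql
CREATE INDEX shipment_import_records_import_id_id_idx
    ON shipment_import_records (shipment_import_id, id);
```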

Francisco Olarte.