Help needed to understand query planner regression with incremental sort

Started by Henrik Peinar (nodeSWAT.com) · 3 messages · general

*Quick description:* After upgrading our Aurora PostgreSQL engine from v11
to v15, one of our frequently run SQL queries started taking tens of seconds
instead of running in under 100 ms. EXPLAIN ANALYZE showed that the query
planner had switched to using an incremental sort. Running ANALYZE on the
table fixed the issue temporarily, but it came back after a few days.

I'm looking to dig a bit deeper so that I can submit a bug report to the
PostgreSQL team and maybe something can be done to fix this, as it
definitely feels like a bug (the query running times differ by 100x). But I
need a bit of help, as I'm new to this kind of deep PostgreSQL analysis and
I'm struggling to find resources on how to proceed in such a case. It may
also turn out to be an issue with my own setup or database configuration
(i.e. me just being stupid).

I know that I can solve the issue for my own environments by turning
incremental sort off in the DB parameters, but I feel this might be worth
putting a bit of time into, both to find the root cause and to help me
understand the planner a bit better.

*More detailed description:*
Note: I don't have reproduction steps at this time; that is one of the
reasons I'm asking for help figuring out what I could try in order to
reproduce this in a local setup. The info below simply gives as much detail
as I can to help you understand the setup I have.

The query in question is rather simple, at least as far as the SQL itself
goes (attached: original_query.sql):

select * from "products"
inner join "shops" on "shops"."id" = "products"."shop_id"
where "products"."status" = 'published'
  and products.desc_tsv @@ to_tsquery('simple', 'nike:*')
order by "products"."pinned_at" ASC, "products"."listed_at" DESC
limit 61;
Under normal running conditions this query produces the following query
plan: https://explain.dalibo.com/plan/af8ch7a59ch6459d (attached:
good_planner_path.txt)

With the incremental sort path, it produces the following:
https://explain.dalibo.com/plan/428564152aa3ba37 (attached:
incremental_sort_planner_path.txt)

I've tried to include as slim a table / index setup as possible, but the
tables are actually bigger and there are a lot more indexes present; I
don't know how the other indexes and/or columns might affect the planner
path in this case.
https://dbdiagram.io/d/64771e48722eb7749422715e (attached: schema_setup.sql)

In the production DB the products table has around 5 452 840 rows.
According to information_schema, the table is around 8.1 GB in size.

If the planner starts choosing the incremental sort plan, running ANALYZE
products; fixes the issue and the planner switches back to the original
plan.

*What am I asking for?*
Any ideas / help / queries that would get me closer to understanding why
the planner chooses the much slower plan in this case, or ideas on how to
start reproducing this locally, since it seems to happen intermittently in
the production environment and I'm unsure whether I can replicate that type
of load locally. Maybe someone will spot an obvious mistake in the table /
index / query setup that I've missed.

Any help appreciated,
Henrik.

Attachments:

- incremental_sort_planner_path.txt
- original_query.sql
- good_planner_path.txt
- schema_setup.sql
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Henrik Peinar (nodeSWAT.com) (#1)
Re: Help needed to understand query planner regression with incremental sort

"Henrik Peinar (nodeSWAT.com)" <henrik@nodeswat.com> writes:

*Quick description: *After upgrading our Aurora PostgreSQL engine from v11
to v15, one of our often run SQL's started taking tens of seconds instead
of running sub 100ms. Explain analyze showed that the query planner had
switched to using incremental sort. Running ANALYZE on the table fixed the
issue temporarily, but it came back after few days.

Hmm. I think it's quite accidental that you get one plan over the other,
because it looks like the key difference is something the planner doesn't
account for. In the fast case you have

-> Index Scan using products_pinned_at_listed_at_ix on products (cost=0.43..3929423.12 rows=26523 width=1463) (actual time=2.325..32.872 rows=61 loops=1)
Filter: ((desc_tsv @@ '''nike'':*'::tsquery) AND (status = 'published'::prod_status))
Rows Removed by Filter: 3376

The index is only being used to produce ordered output here: the filter
condition isn't related to the index. And what we see is that the
query is fast because the desired rows are found in the first 3376+61
rows visited in this direction. Meanwhile in the slow case you have

-> Index Scan Backward using products_pinned_at_ix on products (cost=0.43..1172249.47 rows=26139 width=1460) (actual time=5.263..5203.180 rows=18411 loops=1)
Filter: ((desc_tsv @@ '''nike'':*'::tsquery) AND (status = 'published'::prod_status))
Rows Removed by Filter: 5415895

Again, the filter condition isn't exploiting the index, we're just using
the index to (partially) satisfy the ORDER BY. This one takes a long time
because it has to trawl through 5415895+61 rows before reaching the LIMIT.

So AFAICS, the runtime differential is mostly/entirely because the rows
satisfying the filter condition are located near one end of the range of
pinned_at. That is not a correlation that the planner knows anything
about, so it's unable to see that these two ways of scanning the table
will have significantly different costs. Moreover, I think you'd be
mistaken to draw any great conclusions from this specific example,
because with some other search term(s) the results might be totally
different due to the required rows not falling in the same place.

What I'd think about if this type of query is important is to set up
an index that can actually be used to satisfy the filter condition,
so that you're not forcing it into "scan the whole table till you
find the rows you want". It looks like you already have such an
index, ie a GIN index on the desc_tsv column, although I don't trust
that your schema attachment is actually accurate because if it is
then you have a bunch of duplicative indexes. You might try
dropping the other indexes to see if you can coerce the planner
into using that one, and then seeing what the cost estimate is.
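
Concretely, that experiment might look like this on a test copy of the
database (a sketch only: the index names are taken from the plans earlier
in the thread, DROP INDEX takes an exclusive lock so this shouldn't be run
against production, and ROLLBACK restores everything untouched):

```sql
-- On a test copy: hide the competing ordering indexes inside a
-- transaction, then see which plan and cost estimate the planner picks.
BEGIN;
DROP INDEX products_pinned_at_ix, products_pinned_at_listed_at_ix;
EXPLAIN
select * from "products"
inner join "shops" on "shops"."id" = "products"."shop_id"
where "products"."status" = 'published'
  and products.desc_tsv @@ to_tsquery('simple', 'nike:*')
order by "products"."pinned_at" ASC, "products"."listed_at" DESC
limit 61;
ROLLBACK;  -- the dropped indexes come back as if nothing happened
```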

regards, tom lane

#3 Henrik Peinar (nodeSWAT.com)
henrik@nodeswat.com
In reply to: Tom Lane (#2)
Re: Help needed to understand query planner regression with incremental sort

Thank you for taking the time to look this through.

After reading your answer it obviously makes perfect sense; I was just
thrown off by the fact that when it started happening, it happened on every
execution of the same test query I was using. I failed to think of trying
different search keywords to see whether that made any difference in the
query plan / estimates.

I'll investigate this further and will try to clean up the indexes and make
sure the filter has a proper index to use.
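
As a starting point for that cleanup, the live index definitions can be
pulled straight from the catalog instead of trusting the DDL export that
turned out to be off (a small sketch using the standard pg_indexes view):

```sql
-- List every index on products as the server actually sees it,
-- to spot duplicate or overlapping definitions.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'products'
ORDER BY indexname;
```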

And yes, the schema attachment seems a bit off; I think the DDL copy didn't
include the ordering of the fields inside the composite indexes, which is
why they appear identical.

Your time is very much appreciated.

With best,
Henrik.

On Wed, May 31, 2023 at 6:31 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: