Significant Execution Time Difference Between PG13.14 and PG16.4 for Query on information_schema Tables.
Hi All,
I've encountered a noticeable difference in execution time and query
execution plan row counts between PostgreSQL 13 and PostgreSQL 16 when
running a query on information_schema tables. Surprisingly, PostgreSQL 16
is performing slower than PostgreSQL 13.
The query executed on both versions is as follows:
SELECT DISTINCT "tc"."constraint_name" AS "ConstraintName",
"ccu"."column_name" AS "ColumnName"
FROM
information_schema.constraint_column_usage AS "ccu" right join
information_schema.table_constraints AS "tc"
ON "tc"."constraint_catalog" = "ccu"."constraint_catalog"
AND "tc"."constraint_name" = "ccu"."constraint_name"
WHERE "tc"."constraint_type" = 'PRIMARY KEY'
AND "ccu"."table_name" = 't_c56ng1_repository'
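For anyone wanting to reproduce the comparison, the plans linked below can be regenerated by wrapping the same query in EXPLAIN; a minimal sketch, nothing here beyond the query above:

```sql
-- Run on each server via psql; the output can be pasted into
-- explain.dalibo.com or explain.depesz.com for a readable rendering.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
SELECT DISTINCT "tc"."constraint_name" AS "ConstraintName",
       "ccu"."column_name"             AS "ColumnName"
FROM information_schema.constraint_column_usage AS "ccu"
RIGHT JOIN information_schema.table_constraints AS "tc"
       ON "tc"."constraint_catalog" = "ccu"."constraint_catalog"
      AND "tc"."constraint_name"   = "ccu"."constraint_name"
WHERE "tc"."constraint_type" = 'PRIMARY KEY'
  AND "ccu"."table_name" = 't_c56ng1_repository';
```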
Here are the details of the PostgreSQL versions and the execution plans:
*PostgreSQL 13.14 (PostgreSQL 13.14 on x86_64-pc-linux-gnu, compiled by
gcc 11.4.0, 64-bit)*
Execution plan: PG13.14 Execution Plan
<https://explain.dalibo.com/plan/ag1a62a9d47dg29d>
*PostgreSQL 16.4 (PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by gcc
11.4.0, 64-bit)*
Execution plan: PG16.4 Execution Plan
<https://explain.dalibo.com/plan/4c66fdfbf2hf9ed2>
Has anyone else experienced similar behavior or could provide insights into
why PostgreSQL 16 might be slower for this query? Any advice or suggestions
for optimization would be greatly appreciated.
Thank you!
NOTE: Please find attached the raw EXPLAIN (ANALYZE) output.
On 8/26/24 14:49, nikhil raj wrote:
Hi All,
I've encountered a noticeable difference in execution time and query
execution plan row counts between PostgreSQL 13 and PostgreSQL 16 when
running a query on |information_schema| tables. Surprisingly, PostgreSQL
16 is performing slower than PostgreSQL 13.
Did you run ANALYZE on the Postgres 16 instance?
*PostgreSQL 13.14 (PostgreSQL 13.14 on x86_64-pc-linux-gnu, compiled by
gcc 11.4.0, 64-bit)*
Execution plan: PG13.14 Execution Plan
<https://explain.dalibo.com/plan/ag1a62a9d47dg29d>
*PostgreSQL 16.4 (PostgreSQL 16.4 on x86_64-pc-linux-gnu, compiled by
gcc 11.4.0, 64-bit)*
Execution plan: PG16.4 Execution Plan
<https://explain.dalibo.com/plan/4c66fdfbf2hf9ed2>
Use:
https://explain.depesz.com/
It is easier to follow its output.
Has anyone else experienced similar behavior or could provide insights
into why PostgreSQL 16 might be slower for this query? Any advice or
suggestions for optimization would be greatly appreciated.
Yes, when ANALYZE was not run on a new instance.
Thank you!
NOTE:- PFA the raw file of explain and analyze below.
--
Adrian Klaver
adrian.klaver@aklaver.com
Hi Adrian,
Thanks for the quick response.
I've already performed a vacuum, reindex, and analyze on the entire
database, but the issue persists. As you can see from the execution plan,
the time difference in PostgreSQL 16 is still significantly higher, even
after all maintenance activities have been completed.
It seems there might be a bug in PostgreSQL 16 where the performance of
queries on *information_schema* tables is degraded, as both of these are
Postgres system tables.
PG13: https://explain.depesz.com/s/bdO6b
PG16: https://explain.depesz.com/s/bpAU
On 8/26/24 15:41, nikhil raj wrote:
What I see is Postgres 13:
Nested Loop (cost=9.54..119.02 rows=1 width=128) (actual
time=1.038..288.777 rows=1 loops=1)
Join Filter: (("*SELECT* 1".constraint_name)::name = "*SELECT*
1_1".conname)
Rows Removed by Join Filter: 935
Buffers: shared hit=34,675
vs Postgres 16
Nested Loop (cost=62.84..538.22 rows=1 width=128) (actual
time=1,905.153..14,006.921 rows=1 loops=1)
Join Filter: ("*SELECT* 1".conname = ("*SELECT*
1_1".constraint_name)::name)
Rows Removed by Join Filter: 997
Buffers: shared hit=5,153,054
So either switching this
("*SELECT* 1".constraint_name)::name = "*SELECT* 1_1".conname
to
"*SELECT* 1".conname = ("*SELECT* 1_1".constraint_name)::name
is more of a change than I would expect.
Or
Buffers: shared hit=34,675
vs
Buffers: shared hit=5,153,054
indicates a hardware/configuration difference.
Are both instances running on the same machine?
Is the configuration for both the same?
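One way to answer the configuration question is to dump the non-default settings on both instances and diff them by hand; a sketch, to be run on each server:

```sql
-- Settings that differ from their defaults, with where each was set.
-- Identical output on both servers means the configurations match for
-- everything the planner cares about (work_mem, shared_buffers, etc.).
SELECT name, setting, unit, source
FROM pg_settings
WHERE source <> 'default'
ORDER BY name;
```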
--
Adrian Klaver
adrian.klaver@aklaver.com
nikhil raj <nikhilraj474@gmail.com> writes:
I've encountered a noticeable difference in execution time and query
execution plan row counts between PostgreSQL 13 and PostgreSQL 16 when
running a query on information_schema tables. Surprisingly, PostgreSQL 16
is performing slower than PostgreSQL 13.
Yeah, it looks like that condition on "table_name" is not getting
pushed down to the scan level anymore. I'm not sure why not,
but will look closer tomorrow.
regards, tom lane
On Tue, 27 Aug 2024 at 13:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, it looks like that condition on "table_name" is not getting
pushed down to the scan level anymore. I'm not sure why not,
but will look closer tomorrow.
I was looking for the offending commit as at first I thought it might
be related to Memoize. It does not seem to be.
I get the following up until 2489d76c, and from then on, it's a subquery filter.
-> Index Scan using pg_class_relname_nsp_index on pg_class r_2
(cost=0.27..8.30 rows=1 width=8) (actual time=0.004..0.004 rows=0
loops=1)
Index Cond: (relname = 't_c56ng1_repository'::name)
Filter: ((relkind = ANY ('{r,p}'::"char"[])) AND
pg_has_role(relowner, 'USAGE'::text))
So it looks like it was the "Make Vars be outer-join-aware." commit that
changed this.
David
David Rowley <dgrowleyml@gmail.com> writes:
On Tue, 27 Aug 2024 at 13:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, it looks like that condition on "table_name" is not getting
pushed down to the scan level anymore. I'm not sure why not,
but will look closer tomorrow.
So it looks like it was the "Make Vars be outer-join-aware." commit that
changed this.
Yeah, I got that same result by bisecting. It seems like it's
somehow related to the cast to information_schema.sql_identifier:
we are able to get rid of that normally but seem to fail to do so
in this query.
There was a smaller increase in the runtime at dfb75e478 "Add primary
keys and unique constraints to system catalogs", but that seems to
just be due to there being more rows in the relevant catalogs.
(That's from testing the query in an empty database; probably the
effect of dfb75e478 would be swamped in a production DB anyway.)
regards, tom lane
On 2024-08-27 11:50, David Rowley wrote:
On Tue, 27 Aug 2024 at 13:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, it looks like that condition on "table_name" is not getting
pushed down to the scan level anymore. I'm not sure why not,
but will look closer tomorrow.
I was looking for the offending commit as at first I thought it might
be related to Memoize. It does not seem to be.
As a general thought, seeing that this might be an actual problem,
should some kind of automated testing be added that checks for
performance regressions like this?
Regards and best wishes,
Justin Clift
On Tue, 27 Aug 2024 at 18:00, Justin Clift <justin@postgresql.org> wrote:
As a general thought, seeing that this might be an actual problem
should some kind of automated testing be added that checks for
performance regressions like this?
We normally try to catch these sorts of things with regression tests.
Of course, that requires having a test that would catch a particular
problem, which we don't seem to have for this particular case. A
performance test would also require testing a particular scenario, so
I don't see why that's better. A regression test is better suited as
there's no middle ground between pass and fail.
David
On Tue, 27 Aug 2024 at 14:03, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Yeah, I got that same result by bisecting. It seems like it's
somehow related to the cast to information_schema.sql_identifier:
we are able to get rid of that normally but seem to fail to do so
in this query.
In case it saves you a bit of time, I stripped as much of the
unrelated stuff out as I could and got:
create table t (a name, b int);
explain select * from (select a::varchar,b from (select distinct a,b
from t) st) t right join t t2 on t.b=t2.b where t.a='test';
getting rid of the cast or swapping to INNER JOIN rather than RIGHT
JOIN means that qual_is_pushdown_safe() gets a Var rather than a
PlaceHolderVar.
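For readers following along, the reproducer plus the two variations described above can be written out like this (the two variants are inferred from the description, not copied from a posted script):

```sql
-- Original reproducer: the WHERE qual reaches qual_is_pushdown_safe()
-- as a PlaceHolderVar and is not pushed into the DISTINCT subquery.
create table t (a name, b int);
explain select * from (select a::varchar, b from (select distinct a, b
from t) st) t right join t t2 on t.b = t2.b where t.a = 'test';

-- Variant 1: drop the cast; the qual is now a plain Var and is pushed down.
explain select * from (select a, b from (select distinct a, b
from t) st) t right join t t2 on t.b = t2.b where t.a = 'test';

-- Variant 2: keep the cast but use an inner join; likewise no
-- PlaceHolderVar, so the qual is pushed down.
explain select * from (select a::varchar, b from (select distinct a, b
from t) st) t inner join t t2 on t.b = t2.b where t.a = 'test';
```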
David
[ switching to -hackers list ]
David Rowley <dgrowleyml@gmail.com> writes:
In case it saves you a bit of time, I stripped as much of the
unrelated stuff out as I could and got:
create table t (a name, b int);
explain select * from (select a::varchar,b from (select distinct a,b
from t) st) t right join t t2 on t.b=t2.b where t.a='test';
getting rid of the cast or swapping to INNER JOIN rather than RIGHT
JOIN means that qual_is_pushdown_safe() gets a Var rather than a
PlaceHolderVar.
Thanks. So it seems that what's happening is that we stick a
PlaceHolderVar on the intermediate subquery's output ("a::varchar"),
and then later when we realize that the RIGHT JOIN can be reduced to
an inner join we run around and remove the right join from the
PlaceHolderVar's nullingrels, leaving a useless PHV with no
nullingrels. remove_nulling_relids explains
* Note: it might seem desirable to remove the PHV altogether if
* phnullingrels goes to empty. Currently we dare not do that
* because we use PHVs in some cases to enforce separate identity
* of subexpressions; see wrap_non_vars usages in prepjointree.c.
However, then when we consider whether the upper WHERE condition
can be pushed down into the unflattened lower subquery,
qual_is_pushdown_safe punts:
* XXX Punt if we find any PlaceHolderVars in the restriction clause.
* It's not clear whether a PHV could safely be pushed down, and even
* less clear whether such a situation could arise in any cases of
* practical interest anyway. So for the moment, just refuse to push
* down.
We didn't see this particular behavior before 2489d76c49 because
pullup_replace_vars avoided inserting a PHV:
* If it contains a Var of the subquery being pulled up, and
* does not contain any non-strict constructs, then it's
* certainly nullable so we don't need to insert a
* PlaceHolderVar.
I dropped that case in 2489d76c49 because now we need to attach
nullingrels to the expression. You could imagine attaching the
nullingrels to the contained Var(s) instead of putting a PHV on top,
but that seems like a mess and I'm not quite sure it's semantically
the same. In any case it wouldn't fix adjacent cases where there is
a non-strict construct in the subquery output expression.
So it seems like we need to fix one or the other of these
implementation shortcuts to restore the previous behavior.
I'm wondering if it'd be okay for qual_is_pushdown_safe to accept
PHVs that have no nullingrels. I'm not really thrilled about trying
to back-patch any such fix though --- the odds of introducing new bugs
seem nontrivial, and the problem case seems rather narrow. If we
are willing to accept a HEAD-only fix, it'd likely be better to
attack the other end and make it possible to remove no-op PHVs.
I think that'd require marking PHVs that need to be kept because
they are serving to isolate subexpressions.
regards, tom lane
I wrote:
We didn't see this particular behavior before 2489d76c49 because
pullup_replace_vars avoided inserting a PHV:
* If it contains a Var of the subquery being pulled up, and
* does not contain any non-strict constructs, then it's
* certainly nullable so we don't need to insert a
* PlaceHolderVar.
I dropped that case in 2489d76c49 because now we need to attach
nullingrels to the expression. You could imagine attaching the
nullingrels to the contained Var(s) instead of putting a PHV on top,
but that seems like a mess and I'm not quite sure it's semantically
the same. In any case it wouldn't fix adjacent cases where there is
a non-strict construct in the subquery output expression.
I realized that actually we do have the mechanism for making that
work: we could apply add_nulling_relids to the expression, if it
meets those same conditions. This is a kluge really, but it would
restore the status quo ante in a fairly localized fashion that
seems like it might be safe enough to back-patch into v16.
Here's a WIP patch that does it like that. One problem with it
is that it requires rcon->relids to be calculated in cases where
we didn't need that before, which is probably not *that* expensive
but it's annoying. If we go forward with this, I'm thinking about
changing add_nulling_relids' API contract to say "if target_relid
is NULL then all level-zero Vars/PHVs are modified", so that we
don't need that relid set in non-LATERAL cases.
The other problem with this is that it breaks one test case in
memoize.sql: a query that formerly generated a memoize plan
now does not use memoize. I am not sure why not --- does that
mean anything to you?
regards, tom lane
Attachments:
avoid-unnecessary-PHV-during-pullup-wip.patch (text/x-diff; charset=us-ascii) +53 −14
On Wed, 28 Aug 2024 at 09:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The other problem with this is that it breaks one test case in
memoize.sql: a query that formerly generated a memoize plan
now does not use memoize. I am not sure why not --- does that
mean anything to you?
The reason it works in master is that get_memoize_path() calls
extract_lateral_vars_from_PHVs() and finds PlaceHolderVars to use as
the Memoize keys. With your patch PlannerInfo.placeholder_list is
empty.
The commit that made this work is 069d0ff02. Richard might be able to
explain better. I don't quite understand why RelOptInfo.lateral_vars
don't contain these in the first place.
David
David Rowley <dgrowleyml@gmail.com> writes:
On Wed, 28 Aug 2024 at 09:52, Tom Lane <tgl@sss.pgh.pa.us> wrote:
The other problem with this is that it breaks one test case in
memoize.sql: a query that formerly generated a memoize plan
now does not use memoize. I am not sure why not --- does that
mean anything to you?
The reason it works in master is that get_memoize_path() calls
extract_lateral_vars_from_PHVs() and finds PlaceHolderVars to use as
the Memoize keys. With your patch PlannerInfo.placeholder_list is
empty.
That seems like a pretty fishy way to do it. Are you saying that
Memoize is never applicable if there aren't outer joins in the
query? Without OJs there probably won't be any PHVs.
regards, tom lane
I wrote:
That seems like a pretty fishy way to do it. Are you saying that
Memoize is never applicable if there aren't outer joins in the
query? Without OJs there probably won't be any PHVs.
Oh, scratch that, I see you mean this is an additional way to do it
not the only way to do it. But I'm confused why it works for
t1.two+1 AS c1
but not
t1.two+t2.two AS c1
Those ought to look pretty much the same for this purpose.
regards, tom lane
On Wed, 28 Aug 2024 at 11:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Oh, scratch that, I see you mean this is an additional way to do it
not the only way to do it. But I'm confused why it works for
t1.two+1 AS c1
but not
t1.two+t2.two AS c1
Those ought to look pretty much the same for this purpose.
The bms_overlap(pull_varnos(rcon->root, newnode), rcon->relids) test
is false with t1.two+1. Looks like there needs to be a Var from t2
for the bms_overlap to be true.
David
On Wed, Aug 28, 2024 at 5:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I realized that actually we do have the mechanism for making that
work: we could apply add_nulling_relids to the expression, if it
meets those same conditions.
I think this should work, as long as we apply add_nulling_relids only
to Vars/PHVs that belong to the subquery in this case, because only
those Vars/PHVs would be nulled by the outer joins contained in the
nullingrels.
If we go forward with this, I'm thinking about
changing add_nulling_relids' API contract to say "if target_relid
is NULL then all level-zero Vars/PHVs are modified", so that we
don't need that relid set in non-LATERAL cases.
+1. In LATERAL case, we can always find the subquery's relids in
rcon->relids. In non-lateral case, any level-zero Vars/PHVs must
belong to the subquery - so if we change add_nulling_relids' API to be
so, we do not need to have rcon->relids set.
Thanks
Richard
On Wed, Aug 28, 2024 at 11:30 AM Richard Guo <guofenglinux@gmail.com> wrote:
On Wed, Aug 28, 2024 at 5:52 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
I realized that actually we do have the mechanism for making that
work: we could apply add_nulling_relids to the expression, if it
meets those same conditions.
I think this should work, as long as we apply add_nulling_relids only
to Vars/PHVs that belong to the subquery in this case, because only
those Vars/PHVs would be nulled by the outer joins contained in the
nullingrels.
To be more concrete, I know theoretically it is the whole expression
that is nullable by the outer joins, not its individual vars. But in
this case if the contained vars (that belong to the subquery) become
NULL, the whole expression would be NULL too, because it does not
contain any non-strict constructs. That's why I think this approach
should work.
Thanks
Richard
On 2024-08-27 20:14, David Rowley wrote:
On Tue, 27 Aug 2024 at 18:00, Justin Clift <justin@postgresql.org> wrote:
As a general thought, seeing that this might be an actual problem
should some kind of automated testing be added that checks for
performance regressions like this?
We normally try to catch these sorts of things with regression tests.
Of course, that requires having a test that would catch a particular
problem, which we don't seem to have for this particular case. A
performance test would also require testing a particular scenario, so
I don't see why that's better. A regression test is better suited as
there's no middle ground between pass and fail.
Yeah, that's the kind of thing I was thinking.
Any idea who normally does those, and if it would be reasonable to add
test(s) for the internal information tables?
Regards and best wishes,
Justin Clift
On Wed, Aug 28, 2024 at 12:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
If we
are willing to accept a HEAD-only fix, it'd likely be better to
attack the other end and make it possible to remove no-op PHVs.
I think that'd require marking PHVs that need to be kept because
they are serving to isolate subexpressions.
I think it's always desirable to remove no-op PHVs, even if we end up
with a different approach to fix the issue discussed here. Doing that
could potentially open up opportunities for optimization in other
cases. For example:
explain (costs off)
select * from t t1 left join
lateral (select t1.a as x, * from t t2) s on true
where t1.a = s.a;
QUERY PLAN
----------------------------
Nested Loop
-> Seq Scan on t t1
-> Seq Scan on t t2
Filter: (t1.a = a)
(4 rows)
The target entry s.x is wrapped in a PHV that contains lateral
reference to t1, which forces us to resort to nestloop join. However,
since the left join has been reduced to an inner join, and it is
removed from the PHV's nullingrels, leaving the nullingrels being
empty, we should be able to remove this PHV and use merge or hash
joins, depending on which is cheaper.
I think there may be more cases where no-op PHVs constrain
optimization opportunities.
In [1], when working on the fix-grouping-sets patch, I included a
mechanism in 0003 to remove no-op PHVs by including a flag in
PlaceHolderVar to indicate whether it is safe to remove the PHV when
its phnullingrels becomes empty. In that patch this flag is only set
in cases where the PHV is used to carry the nullingrel bit that
represents the grouping step. Maybe we can extend its use to remove
all no-op PHVs, except those that are serving to isolate
subexpressions.
Any thoughts on this?
[1]: /messages/by-id/CAMbWs4_2t2pqqCFdS3NYJLwMMkAzYQKBOhKweFt-wE3YOi7rGg@mail.gmail.com
Thanks
Richard