BUG #19106: Potential regression with CTE materialization planning in Postgres 18
The following bug has been logged on the website:
Bug reference: 19106
Logged by: Kamil Monicz
Email address: kamil@monicz.dev
PostgreSQL version: 18.0
Operating system: NixOS unstable ffcdcf99d65c61956d882df249a9be53e59
Description:
After upgrading from Postgres 17 to 18, one of my queries started raising an
error:
"unexpected outer reference in CTE query"
The problematic query is:
https://github.com/openstreetmap-ng/openstreetmap-ng/blob/eb805d8766fb4b359b96eb6b50acc8c2a835a165/app/services/element_spatial_service.py#L82-L215
Specifically, the `WITH member_geoms ...` part inside the `LEFT JOIN
LATERAL`.
I was able to resolve the issue by forcing the LATERAL CTEs as NOT
MATERIALIZED:
--- app/services/element_spatial_service.py
+++ app/services/element_spatial_service.py
@@ -155,3 +155,3 @@ rels_computed AS (
LEFT JOIN LATERAL (
- WITH member_geoms AS (
+ WITH member_geoms AS NOT MATERIALIZED (
SELECT ST_Collect(geom_val) AS geom
@@ -179,3 +179,3 @@ rels_computed AS (
),
- noded_geoms AS (
+ noded_geoms AS NOT MATERIALIZED (
SELECT ST_UnaryUnion(ST_Collect(
@@ -186,3 +186,3 @@ rels_computed AS (
),
- polygon_geoms AS (
+ polygon_geoms AS NOT MATERIALIZED (
SELECT ST_UnaryUnion(ST_Collect(
This seems like a regression because in cases where a CTE has an outer
reference, it simply shouldn't be materialized (I don't really know the
Postgres internals). I never expected these CTEs to be materialized. I
simply use them for improved readability.
PG Bug reporting form <noreply@postgresql.org> writes:
After upgrading from Postgres 17 to 18, one of my queries started raising an
error:
"unexpected outer reference in CTE query"
I agree that sounds like a bug ...
The problematic query is:
https://github.com/openstreetmap-ng/openstreetmap-ng/blob/eb805d8766fb4b359b96eb6b50acc8c2a835a165/app/services/element_spatial_service.py#L82-L215
... but I am not going to spend time trying to reproduce it given
this amount of detail. There's too much missing context, like what
data you were running the query on. I could spend all day, not
see the failure, and be left no wiser than before as to whether
it's already fixed or I just didn't duplicate your context closely
enough. Please see if you can reduce the problem case to a
self-contained SQL script.
regards, tom lane
On Sun, Nov 9, 2025, at 16:53, Tom Lane wrote:
PG Bug reporting form <noreply@postgresql.org> writes:
After upgrading from Postgres 17 to 18, one of my queries started raising an
error:
"unexpected outer reference in CTE query"I agree that sounds like a bug ...
The problematic query is:
https://github.com/openstreetmap-ng/openstreetmap-ng/blob/eb805d8766fb4b359b96eb6b50acc8c2a835a165/app/services/element_spatial_service.py#L82-L215... but I am not going to spend time trying to reproduce it given
this amount of detail. There's too much missing context, like what
data you were running the query on. I could spend all day, not
see the failure, and be left no wiser than before as to whether
it's already fixed or I just didn't duplicate your context closely
enough. Please see if you can reduce the problem case to a
self-contained SQL script.regards, tom lane
Hello Tom,
It's my first time here (and realistically on a proper mailing list), so please excuse me. Here's the small, self-contained reproduction:
```
EXPLAIN
SELECT *
FROM (
SELECT ARRAY[1, 2] AS arr
) r
CROSS JOIN LATERAL (
WITH a AS (
SELECT
CASE
WHEN id = 1 THEN ST_GeomFromText('LINESTRING(0 0,1 0,1 1)')
ELSE ST_GeomFromText('POINT(0 0)')
END AS geom
FROM unnest(r.arr) AS id
),
b AS (
SELECT ST_Polygonize(
(SELECT ST_UnaryUnion(ST_Collect(geom)) FROM a)
) AS st_polygonize
)
SELECT
(SELECT st_polygonize FROM b),
(SELECT st_polygonize FROM b)
) s;
```
```
ERROR: unexpected outer reference in CTE query
SQL state: XX000
```
It depends on PostGIS being installed and loaded. In my case, it's version 3.6.0. I tried to make it work without it, but I couldn't figure it out.
-Kamil Monicz
"Kamil Monicz" <kamil@monicz.dev> writes:
It's my first time here (and realistically on a proper mailing list), so please excuse me. Here's the small, self-contained reproduction:
Thanks. After a bit of fooling around I was able to convert this
to something without any PostGIS dependency:
EXPLAIN
SELECT *
FROM (
SELECT ARRAY[1, 2] AS arr
) r
CROSS JOIN LATERAL (
WITH a AS (
SELECT id FROM unnest(r.arr) AS id
),
b AS (
SELECT max((SELECT sum(id) FROM a)) AS agg
)
SELECT
(SELECT agg FROM b)
) s;
This worked up until commit b0cc0a71e, and since then it hits an
assertion failure in check_agglevels_and_constraints(); or if you
don't have asserts enabled then the planner gets confused, because
the max() aggregate function is given the wrong agglevelsup.
I need to think through what is the correct behavior for cross-CTE
references like these. Sadly, this is too late for this week's
releases ...
regards, tom lane
[ cc'ing Peter and Vik for possible input on SQL-standard question ]
I wrote:
This worked up until commit b0cc0a71e, and since then it hits an
assertion failure in check_agglevels_and_constraints(); or if you
don't have asserts enabled then the planner gets confused, because
the max() aggregate function is given the wrong agglevelsup.
I need to think through what is the correct behavior for cross-CTE
references like these. Sadly, this is too late for this week's
releases ...
Here is a draft fix for this. What it basically decides is that
commit b0cc0a71e was in error to suppose that an outer CTE reference
should work like an outer Var reference. In this even-more-simplified
test case:
WITH a AS (
SELECT id FROM (VALUES (1), (2)) AS v(id)
),
b AS (
SELECT max((SELECT sum(id) FROM a)) AS agg
)
SELECT agg FROM b;
it's pretty obvious that the reference to "a" should not cause us to
act as though the max() aggregate belongs to the outer "WITH ...
SELECT agg FROM b" query level. However, if it doesn't belong there,
where does it belong?
The draft patch takes the approach of forcing agglevelsup to zero
anytime we find a CTE reference within an aggregate's arguments
(unless it is to a CTE defined inside those arguments). That
means such an aggregate always has its syntactic level and cannot
be treated as an outer aggregate even if it otherwise references
(only) outer Vars.
I don't have a huge amount of confidence in this being the final
answer; it seems like it might still be too simplistic. But I'm
not sure what to do instead. It's at least less likely to break
cases that worked before b0cc0a71e, since "it's level zero" is
pretty much the default answer anyway.
I looked at the SQL standard for possible guidance and found none:
they disallow subqueries altogether within aggregate arguments,
so they need not consider such cases. I am curious though whether
Peter or Vik know if the committee ever considered relaxing that
restriction, and if so whether they stopped to think about this
particular point.
regards, tom lane
Attachments:
v1-fix-CTEs-in-aggregates-some-more.patchtext/x-diff; charset=us-ascii; name=v1-fix-CTEs-in-aggregates-some-more.patchDownload+53-6
On 10/11/2025 22:05, Tom Lane wrote:
[ cc'ing Peter and Vik for possible input on SQL-standard question ]
Thanks!
WITH a AS (
SELECT id FROM (VALUES (1), (2)) AS v(id)
),
b AS (
SELECT max((SELECT sum(id) FROM a)) AS agg
)
SELECT agg FROM b;
snip
I looked at the SQL standard for possible guidance and found none:
they disallow subqueries altogether within aggregate arguments,
so they need not consider such cases. I am curious though whether
Peter or Vik know if the committee ever considered relaxing that
restriction, and if so whether they stopped to think about this
particular point.
I am not seeing that restriction in the standard. For this test case,
we have MAX which has the lineage:
<aggregate function>
<general set function>
<set function type>
<computational operation>
MAX
Its argument, (SELECT SUM(id) FROM a), has this lineage:
<value expression>
<common value expression>
<numeric value expression>
<term>
<factor>
<numeric primary>
<value expression primary>
<non-parenthesized value expression primary>
<scalar subquery>
<subquery>
Since there are no outer column references, the subquery should be
independent. And if we inline it:
WITH
b (agg) AS (
SELECT MAX((
SELECT SUM(id)
FROM (VALUES (1), (2)) AS v (id)
))
-- FROM nothing
)
TABLE b
then the query works as expected. MATERIALIZEDing either or both CTEs
has no effect, which I find strange.
--
Vik Fearing
Vik Fearing <vik@postgresfriends.org> writes:
On 10/11/2025 22:05, Tom Lane wrote:
I looked at the SQL standard for possible guidance and found none:
they disallow subqueries altogether within aggregate arguments,
so they need not consider such cases.
I am not seeing that restriction in the standard.
Maybe I'm misunderstanding what I read, but in SQL:2021
6.9 <set function specification> SR1 says
If <aggregate function> specifies a <general set function>, then
the <value expression> simply contained in the <general set
function> shall not contain a <set function specification>
or a <query expression>.
The predecessor text in SQL99 says
4) The <value expression> simply contained in <set function
specification> shall not contain a <set function specification>
or a <subquery>.
I don't think replacing <subquery> with <query expression> moved the
goalposts at all, but maybe I'm missing something.
... MATERIALIZEDing either or both CTEs
has no effect, which I find strange.
The fundamental problem is that the parser is mis-assigning
agglevelsup; given that, the planner is very likely to get
confused no matter what other details there are.
regards, tom lane
On 11/11/2025 16:24, Tom Lane wrote:
Vik Fearing <vik@postgresfriends.org> writes:
On 10/11/2025 22:05, Tom Lane wrote:
I looked at the SQL standard for possible guidance and found none:
they disallow subqueries altogether within aggregate arguments,
so they need not consider such cases.I am not seeing that restriction in the standard.
Maybe I'm misunderstanding what I read, but in SQL:2021
6.9 <set function specification> SR1 saysIf <aggregate function> specifies a <general set function>, then
the <value expression> simply contained in the <general set
function> shall not contain a <set function specification>
or a <query expression>.The predecessor text in SQL99 says
4) The <value expression> simply contained in <set function
specification> shall not contain a <set function specification>
or a <subquery>.I don't think replacing <subquery> with <query expression> moved the
goalposts at all, but maybe I'm missing something.
I don't think you are. I was missing that you can't get to <aggregate
function> without going through <set function specification> (or a
window) so I did not see that rule.
I had a rummage through the archives but couldn't easily find the paper
introducing aggregates so I can't see what the justification for that
rule was. This language was not in 1989 but is in 1992. It may just be a
case of "this is what we've implemented so this is what we are specifying."
... MATERIALIZEDing either or both CTEs
has no effect, which I find strange.The fundamental problem is that the parser is mis-assigning
agglevelsup; given that, the planner is very likely to get
confused no matter what other details there are.
Thank you for the explanation.
--
Vik Fearing
I wrote:
Here is a draft fix for this. What it basically decides is that
commit b0cc0a71e was in error to suppose that an outer CTE reference
should work like an outer Var reference. In this even-more-simplified
test case:
WITH a AS (
SELECT id FROM (VALUES (1), (2)) AS v(id)
),
b AS (
SELECT max((SELECT sum(id) FROM a)) AS agg
)
SELECT agg FROM b;
I wanted to post a little more analysis of what's happening here,
mostly for the archives' sake. Since b0cc0a71e,
check_agg_arguments_walker mistakenly decides that the max()
aggregate ought to belong to the query level where the "a" CTE
is, that is the outer "WITH ... SELECT agg FROM b". In an
assert-enabled build, check_agglevels_and_constraints promptly
crashes at
switch (pstate->p_expr_kind)
{
case EXPR_KIND_NONE:
Assert(false); /* can't happen */
break;
because it's looking at the ParseState for that outer query
level, where we are not examining any particular subexpression.
In a non-assert build, we more-or-less-accidentally get through
check_agglevels_and_constraints unscathed, but then the parser
fails with
ERROR: column "b.agg" must appear in the GROUP BY clause or be used in an aggregate function
LINE 7: SELECT agg FROM b;
^
because we've marked the outer query level with p_hasAggs = true.
(This seems like sufficient proof that we're assigning the aggregate
to the wrong level, if you were doubting that conclusion.)
The submitted problem query is shaped a little differently, though.
It's more nearly
WITH a AS (
SELECT id FROM (VALUES (1), (2)) AS v(id)
),
b AS (
SELECT max((SELECT sum(id) FROM a)) AS agg
)
SELECT (SELECT agg FROM b);
and that extra level of sub-select avoids the aforesaid error message,
since the sub-select where the "agg" reference is doesn't get marked
p_hasAggs. (I wonder whether the OP introduced that extra sub-select
in trying to work around this bug.)
Rather remarkably, this formulation actually gets the expected answer
"3", although if you look at the generated plan it's fairly wacko:
postgres=# explain (verbose, costs off) WITH a AS (
SELECT id FROM (VALUES (1), (2)) AS v(id)
),
b AS (
SELECT max((SELECT sum(id) FROM a)) AS agg
)
SELECT (SELECT agg FROM b);
QUERY PLAN
-----------------------------------------------
Aggregate
Output: (SubPlan expr_1)
InitPlan expr_2
-> Aggregate
Output: sum("*VALUES*".column1)
-> Values Scan on "*VALUES*"
Output: "*VALUES*".column1
-> Result
SubPlan expr_1
-> Result
Output: max((InitPlan expr_2).col1)
(11 rows)
I don't quite understand how that works, since the max() call is
not within the upper Aggregate's expression trees: why isn't it
triggering "Aggref found in non-Agg plan node"? It's probably not
worth figuring out though; this is all garbage-in-garbage-out behavior
so far as the planner is concerned. Variants of this, such as
inserting MATERIALIZED for "b", lead to planner failures like
ERROR: unexpected outer reference in CTE query
which is similar to the original report.
Anyway, the point is that we are assigning the wrong semantic level
to the max() aggregate, and all the rest of this behavior is just
follow-on effects of that.
regards, tom lane
Vik Fearing <vik@postgresfriends.org> writes:
I had a rummage through the archives but couldn't easily find the paper
introducing aggregates so I can't see what the justification for that
rule was. This language was not in 1989 but is in 1992. It may just be a
case of "this is what we've implemented so this is what we are specifying."
Thanks for doing that research. It's not at all surprising if back
in the early 90's nobody had tried to make it work for sub-selects
(or at least had not succeeded), so they just wrote the spec to not
require it.
After sleeping on it I feel that my proposal of "force the aggregate
to have agglevelsup = 0" is a reasonably sane solution. I'd
originally sought some minimal adjustment to the
make-CTE-refs-work-like-Vars approach of b0cc0a71e, like moving the
aggregate down one level if we detect that the CTE reference is to a
sibling CTE. However, that wouldn't be sufficient to deal with
nested WITHs, for example
WITH a AS (
SELECT id FROM (VALUES (1), (2)) AS v(id)
),
b AS (
WITH b1 AS ( SELECT max((SELECT sum(id) FROM a)) AS agg )
SELECT * FROM b1
)
SELECT agg FROM b;
"Move down one level" would assign the max() to the
"WITH b1 ... SELECT * FROM b1" level, so we still have the bug.
"Move down through all levels of WITH" might do the trick,
or it might not. In any case it's feeling arbitrary and rather
far away from what we do for Vars. So I feel like the analogy
to Vars was fatally flawed to start with.
However, the real reason why I'm feeling good about the simplistic
solution is: if we invent some more-complex rule, whose life are
we making better? I think the submitted problem query is an example
of a common SQL programming idiom, which is to use a series of WITH
CTEs to chop up a complex computation into small independent black
boxes. It's the exact opposite of independence if the positioning of
a CTE reference in a later CTE affects the semantics of aggregates in
that CTE. Now, I fear we can't solve the problem that b0cc0a71e
set out to solve without having some impact of that sort, but we
should keep it as minimal and surprise-free as possible. Forcing
affected aggregates to have their semantic level equal to their
syntactic level seems about as surprise-free as we can get.
Especially since that would have been the result we produced before
b0cc0a71e in all non-contrived cases. The bug report that induced
us to do b0cc0a71e was a pretty contrived case if you ask me:
why would you combine an outer Var reference with a select from
a CTE? And if you did, why would you expect the surrounding
aggregate to be considered non-local?
regards, tom lane
On Wed, Nov 12, 2025 at 5:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Thanks for doing that research. It's not at all surprising if back
in the early 90's nobody had tried to make it work for sub-selects
(or at least had not succeeded), so they just wrote the spec to not
require it.
I played with this patch, but I couldn't quite wrap my head around the
expected behavior of using subqueries as arguments to aggregate
functions. The outputs of the following queries are confusing to me.
create table t (a int);
insert into t values (1), (2);
Query 1:
select (select sum((select a from t t1 limit 1))) from t t2;
sum
-----
1
1
(2 rows)
As I understand it, a query of the form:
SELECT <scalar_expression> FROM table;
... produces one output row for each row in the table, with the value
of <scalar_expression> evaluated for that row. Thus, the output of
Query 1 makes sense to me.
Query 2:
select (select sum((select a from t t1 where a = t2.a or true limit
1))) from t t2;
sum
-----
2
(1 row)
I don't quite understand the output of Query 2. The subquery is now
correlated with the outer table t2, but I believe it's still in the
same form as Query 1, so I would expect it to also produce one output
row per table row. Moreover, IIUC, the "or true" clause should make
the two queries semantically equivalent.
Query 3:
with t as (select a from (values (1), (2)) as v(a))
select (select sum((select a from t t1 where a = t2.a or true limit
1))) from t t2;
sum
-----
1
1
(2 rows)
Query 3 replaces the physical table with a CTE that produces the same
logical table content, so I would expect the query's output to remain
unchanged. So the differing outputs of Query 2 and Query 3 are also
confusing to me.
- Richard
Richard Guo <guofenglinux@gmail.com> writes:
I played with this patch, but I couldn't quite wrap my head around the
expected behavior of using subqueries as arguments to aggregate
functions. The outputs of the following queries are confusing to me.
...
Query 2:
select (select sum((select a from t t1 where a = t2.a or true limit
1))) from t t2;
sum
-----
2
(1 row)
I don't quite understand the output of Query 2. The subquery is now
correlated with the outer table t2, but I believe it's still in the
same form as Query 1, so I would expect it to also produce one output
row per table row.
I believe the critical point about Q2 is that the presence of the
reference to t2.a causes the sum() aggregate to be assigned to the
outer query level. Now, that outer query is an aggregation query
so it will produce only one row, aggregated over both rows of t2
(for each of which, the bottom sub-select produces the value "1").
I've never totally understood the rationale for the SQL standard
to assign aggregates to outer query levels, but it's definitely
their fault not ours.
Query 3:
with t as (select a from (values (1), (2)) as v(a))
select (select sum((select a from t t1 where a = t2.a or true limit
1))) from t t2;
sum
-----
1
1
(2 rows)
Actually, as of HEAD we produce:
regression=# with t as (select a from (values (1), (2)) as v(a))
select (select sum((select a from t t1 where a = t2.a or true limit
1))) from t t2;
sum
-----
2
(1 row)
and it's the same if you remove the t2.a reference:
regression=# with t as (select a from (values (1), (2)) as v(a))
select (select sum((select a from t t1 where true limit
1))) from t t2;
sum
-----
2
(1 row)
In both cases we assign the aggregate to the outer-level SELECT.
I assume you were testing with my patch, which forces the sum()
to be level zero, that is belonging to the intermediate sub-select.
Query 3 replaces the physical table with a CTE that produces the same
logical table content, so I would expect the query's output to remain
unchanged. So the differing outputs of Query 2 and Query 3 are also
confusing to me.
The sticky point here is that a CTE reference isn't quite as absolute
as a physical-table reference: the CTE name only has meaning within
a portion of the query. So the problem that b0cc0a71e tried to solve
is "what do we do if the SQL-standard rules about semantic level of
an aggregate would result in putting the aggregate outside of the
scope of a CTE it references?"
I suppose another answer would be to throw up our hands and give
an error if that happens, rather than trying to fix the level
assigned to the aggregate.
regards, tom lane
On Thu, Nov 13, 2025 at 4:32 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
The sticky point here is that a CTE reference isn't quite as absolute
as a physical-table reference: the CTE name only has meaning within
a portion of the query. So the problem that b0cc0a71e tried to solve
is "what do we do if the SQL-standard rules about semantic level of
an aggregate would result in putting the aggregate outside of the
scope of a CTE it references?"
So, IIUC, the confusion arises in cases where an aggregation is to be
assigned to the outer side of its syntactic level. With the current
patch, if the aggregation does not reference any CTEs, it would be
evaluated at the outer query level. If the aggregation references any
CTEs, it'd be evaluated at its syntactic query level.
However, I still find this behavior somewhat confusing. For example,
one might expect that an inlined CTE should be semantically equivalent
to a subquery, yet the following two queries can produce different
results.
create table t (a int);
insert into t values (1), (2);
with ss as not materialized (select * from t)
select (select sum((select a from ss where a = t.a limit 1))) from t;
sum
-----
1
2
(2 rows)
select (select sum((select a from (select * from t) ss where a = t.a
limit 1))) from t;
sum
-----
3
(1 row)
I don't have much experience reading the SQL spec, but from the
discussions, it seems that the spec does not provide guidance on this
case. So the current behavior may be acceptable. I think it might be
helpful to explicitly document this behavior somewhere.
- Richard
Richard Guo <guofenglinux@gmail.com> writes:
However, I still find this behavior somewhat confusing. For example,
one might expect that an inlined CTE should be semantically equivalent
to a subquery, yet the following two queries can produce different
results.
create table t (a int);
insert into t values (1), (2);
with ss as not materialized (select * from t)
select (select sum((select a from ss where a = t.a limit 1))) from t;
sum
-----
1
2
(2 rows)
select (select sum((select a from (select * from t) ss where a = t.a
limit 1))) from t;
sum
-----
3
(1 row)
That's with my v1 patch? I agree it's pretty bogus.
The more I think about this, the more I like the other idea of just
throwing an error rather than trying to fix up cases like bug #19055.
I don't think we have much evidence that anyone is trying to do that
in the real world (otherwise reports would have surfaced years ago).
And this discussion is making it clear that fixing it up is harder
than it sounds.
Also: if we throw an error, and someone shows up with a real-world
example that triggers the error, we'd then have some more evidence
about what would be a plausible interpretation.
So now I'm thinking about something more like the attached. It
causes the #19055 query to throw an error, and restores our current
test query to working (I added said query to the test so we can't
break it again).
Producing a decently localized error turned out to be harder than
I expected, because RangeTableEntrys don't have location fields so
it's impossible to put an error cursor on the troublesome reference.
I ended up putting the cursor on the aggregate function itself
and giving the name of the CTE in errdetail.
regards, tom lane
Attachments:
v2-fix-CTEs-in-aggregates-some-more.patchtext/x-diff; charset=us-ascii; name=v2-fix-CTEs-in-aggregates-some-more.patchDownload+87-36
I wrote:
The more I think about this, the more I like the other idea of just
throwing an error rather than trying to fix up cases like bug #19055.
I don't think we have much evidence that anyone is trying to do that
in the real world (otherwise reports would have surfaced years ago).
And this discussion is making it clear that fixing it up is harder
than it sounds.
Hearing no further comments, I've pushed the v2 patch that does it
that way, restoring the previous behavior in cases that would not
have failed before #19055.
regards, tom lane