Unfortunate pushing down of expressions below sort
Hi,
I was recently [1]/messages/by-id/wgf63h3doepg2jnmofzbygrg7jujbjvxwkvoc7arej2zqcuf6c@3tzz22tizuew reminded of something I've seen be problematic before:
We push down expressions below a sort node, even though they could be
evaluated above. That can very substantially increase the space needed for the
sort.
A simplified (and extreme-y-fied) example:
EXPLAIN (VERBOSE, ANALYZE) SELECT repeat(g.i::text, 1000) FROM generate_series(1, 10000) g(i) ORDER BY g.i;
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Sort (cost=839.39..864.39 rows=10000 width=36) (actual time=65.905..66.552 rows=10000.00 loops=1) │
│ Output: (repeat((i)::text, 1000)), i │
│ Sort Key: g.i │
│ Sort Method: quicksort Memory: 38601kB │
│ -> Function Scan on pg_catalog.generate_series g (cost=0.00..175.00 rows=10000 width=36) (actual time=0.896..48.459 rows=10000.00 loops=1) │
│ Output: repeat((i)::text, 1000), i │
│ Function Call: generate_series(1, 10000) │
│ Planning Time: 0.063 ms │
│ Execution Time: 69.253 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)
I can manually rewrite that be executed better:
EXPLAIN (VERBOSE, ANALYZE) SELECT repeat(i::text, 1000) FROM (SELECT * FROM generate_series(1, 10000) g(i) ORDER BY g.i);
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Subquery Scan on unnamed_subquery (cost=764.39..864.39 rows=10000 width=32) (actual time=2.642..50.738 rows=10000.00 loops=1) │
│ Output: repeat((unnamed_subquery.i)::text, 1000) │
│ -> Sort (cost=764.39..789.39 rows=10000 width=4) (actual time=2.633..3.342 rows=10000.00 loops=1) │
│ Output: g.i │
│ Sort Key: g.i │
│ Sort Method: quicksort Memory: 385kB │
│ -> Function Scan on pg_catalog.generate_series g (cost=0.00..100.00 rows=10000 width=4) (actual time=0.999..1.690 rows=10000.00 loops=1) │
│ Output: g.i │
│ Function Call: generate_series(1, 10000) │
│ Planning Time: 0.063 ms │
│ Execution Time: 51.648 ms │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(11 rows)
Note that the runtime as well as the memory usage are reduced noticeably.
It's even worse when there is a LIMIT above the sort, because it leads to
evaluating the expression way more often than needed:
EXPLAIN (VERBOSE, ANALYZE) SELECT repeat(g.i::text, 1000) FROM generate_series(1, 10000) g(i) ORDER BY g.i LIMIT 10;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit (cost=391.10..391.12 rows=10 width=36) (actual time=50.910..50.912 rows=10.00 loops=1) │
│ Output: (repeat((i)::text, 1000)), i │
│ -> Sort (cost=391.10..416.10 rows=10000 width=36) (actual time=50.908..50.909 rows=10.00 loops=1) │
│ Output: (repeat((i)::text, 1000)), i │
│ Sort Key: g.i │
│ Sort Method: top-N heapsort Memory: 36kB │
│ -> Function Scan on pg_catalog.generate_series g (cost=0.00..175.00 rows=10000 width=36) (actual time=0.869..47.820 rows=10000.00 loops=1) │
│ Output: repeat((i)::text, 1000), i │
│ Function Call: generate_series(1, 10000) │
│ Planning Time: 0.074 ms │
│ Execution Time: 50.938 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(11 rows)
vs:
EXPLAIN (VERBOSE, ANALYZE) SELECT repeat(i::text, 1000) FROM (SELECT * FROM generate_series(1, 10000) g(i) ORDER BY g.i LIMIT 10);
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Subquery Scan on unnamed_subquery (cost=316.10..316.20 rows=10 width=32) (actual time=3.098..3.149 rows=10.00 loops=1) │
│ Output: repeat((unnamed_subquery.i)::text, 1000) │
│ -> Limit (cost=316.10..316.12 rows=10 width=4) (actual time=3.086..3.090 rows=10.00 loops=1) │
│ Output: g.i │
│ -> Sort (cost=316.10..341.10 rows=10000 width=4) (actual time=3.083..3.085 rows=10.00 loops=1) │
│ Output: g.i │
│ Sort Key: g.i │
│ Sort Method: top-N heapsort Memory: 25kB │
│ -> Function Scan on pg_catalog.generate_series g (cost=0.00..100.00 rows=10000 width=4) (actual time=1.482..2.244 rows=10000.00 loops=1) │
│ Output: g.i │
│ Function Call: generate_series(1, 10000) │
│ Planning Time: 0.073 ms │
│ Execution Time: 3.185 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Now, a repeat(,1000) is obviously a silly example, but I think this is a real
issue. In the case in [1]/messages/by-id/wgf63h3doepg2jnmofzbygrg7jujbjvxwkvoc7arej2zqcuf6c@3tzz22tizuew, deferring the evaluation of acldefault() till after
the sort reduces memory consumption by ~38%
Why are we evaluating the expression below the sort instead of above? I can
maybe see an argument for doing that if it's volatile, but it's not.
Interestingly we seem to do the sane thing for aggregation:
EXPLAIN (VERBOSE, ANALYZE) SELECT repeat(g.i::text, 1000) FROM generate_series(1, 10000) g(i) GROUP BY g.i;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ HashAggregate (cost=125.00..128.50 rows=200 width=36) (actual time=4.575..52.142 rows=10000.00 loops=1) │
│ Output: repeat((i)::text, 1000), i │
│ Group Key: g.i │
│ Batches: 1 Memory Usage: 553kB │
│ -> Function Scan on pg_catalog.generate_series g (cost=0.00..100.00 rows=10000 width=4) (actual time=0.897..1.518 rows=10000.00 loops=1) │
│ Output: i │
│ Function Call: generate_series(1, 10000) │
│ Planning Time: 0.042 ms │
│ Execution Time: 53.126 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)
Note that the repeat() is computed above the aggregate. That also is true if
it's a sort based agg...
Greetings,
Andres Freund
[1]: /messages/by-id/wgf63h3doepg2jnmofzbygrg7jujbjvxwkvoc7arej2zqcuf6c@3tzz22tizuew
Andres Freund <andres@anarazel.de> writes:
We push down expressions below a sort node, even though they could be
evaluated above.
Yeah. That's a hangover from an ancient decision that sort/limit
would always be applied at the top of the plan tree. I'm too
lazy to check the details right now, but I think we already relaxed
that in some cases (or maybe it was about evaluating stuff
before/after GROUP BY?).
That can very substantially increase the space needed for the
sort.
Could decrease it too, eg if what you are outputting is
md5(some-wide-column). Not sure we are smart enough to tell
which way is better.
regards, tom lane
I wrote:
Yeah. That's a hangover from an ancient decision that sort/limit
would always be applied at the top of the plan tree. I'm too
lazy to check the details right now, but I think we already relaxed
that in some cases (or maybe it was about evaluating stuff
before/after GROUP BY?).
Ah, right: see make_sort_input_target() and its very extensive
comment in planner.c. I wonder why that didn't trigger in
your example.
regards, tom lane
Hi,
I took a closer look at make_sort_input_target() in
src/backend/optimizer/plan/planner.c.
The current heuristics only defer targetlist expressions past a Sort
when they are:
* volatile, or
* set-returning (due to SRF synchronization constraints), or
* considered “expensive” (cost > 10 * cpu_operator_cost) and there is a
LIMIT.
Functions like repeat() and acldefault() have the default procost = 1,
so they don’t meet the “expensive” threshold and therefore remain below
the Sort, which is what leads to the tuple width inflation seen in these
examples.
Grouping behaves differently: make_group_input_target() unconditionally
flattens non-grouping expressions to Vars, which is why repeat() ends up
above the Aggregate node there.
Tom mentioned the md5(widecol) counterexample, where evaluating the
expression before the sort can actually reduce memory usage. The key
distinction seems to be whether the expression depends solely on sort
keys.
The approach I’m experimenting with is to defer an expression only when
all the Vars it depends on are sort keys. That gives the desired
behavior in both cases:
* repeat(i, 1000) ORDER BY i: i is the sort key, so we defer and keep
the sort tuples narrow.
* md5(widecol) ORDER BY id: widecol is not a sort key, so we keep the
expression below the sort and avoid carrying the wide column.
This seems to address the cases discussed in this thread and should be
low-risk for the common case.
I’m working on a patch along these lines; any thoughts?
--
Best regards,
Chengpeng Yan
Hi,
Following up on the discussion below, I now have a patch.
The patch extends make_sort_input_target() with a conservative rule:
defer additional non-sort targetlist expressions past Sort only when
doing so does not require carrying any additional Vars/PlaceHolderVars
through Sort. This way, Sort input width never increases.
This still allows cases like repeat(i::text, ...) ORDER BY i to be
projected above Sort, while avoiding the md5(widecol) counterexample
mentioned earlier, since such expressions are not deferred when they
would force a wide non-sort column through Sort.
One limitation remains: this can't help queries like
'SELECT repeat(i::text, ...) FROM t ORDER BY othercols;'
where the output expression depends on Vars that are not sort keys. In
that case we still have to carry i through the Sort to be able to
compute the final targetlist, so the patch cannot avoid inflating Sort's
input width (and may still evaluate repeat() before Sort, depending on
the existing projection placement rules).
The existing volatile/SRF/expensive behavior is unchanged (expensive
exprs are still postponed once a post-sort projection is needed).
I also added regression coverage (including an md5(widecol)-style case).
Some existing EXPLAIN outputs (e.g. join/groupingsets) now show a Result
above Sort, which is expected from postponing additional non-sort
outputs.
Patch attached. Comments welcome.
--
Best regards,
Chengpeng Yan
Attachments:
v1-0001-planner-postpone-some-non-sort-output-expressions.patchapplication/octet-stream; name=v1-0001-planner-postpone-some-non-sort-output-expressions.patchDownload+315-51
Hi,
I've attached v2 of the patch.
Changes since v1:
- No planner logic changes.
- Fixed one regression-test fallout in contrib/postgres_fdw by updating
expected EXPLAIN output. With the planner change, the deparsed remote
SQL can omit a redundant constant there, so GROUP BY positional
references shift accordingly; query results are unchanged.
Behavior is unchanged from v1:
- Postpone additional non-sort target expressions only when doing so
does not require carrying any extra Vars/PlaceHolderVars through Sort.
- Existing volatile/SRF/expensive-expression behavior is unchanged.
Validation:
- make check-world passes.
Would especially appreciate feedback on whether the “no extra Vars/PHVs
through sort” rule is the right safety boundary.
Thanks!
--
Best regards,
Chengpeng Yan