scale parallel_tuple_cost by tuple width
While investigating a performance issue, I found that it was extremely
difficult to get a parallel plan in some cases due to the fixed
parallel_tuple_cost. But this cost is not really fixed - it's going to
be larger for larger tuples. So this proposal adjusts the cost used
according to how large we expect the results to be. The result is that
in the common case where, say, you're getting a group id and some
aggregates, a parallel plan is more likely to be chosen. By contrast,
queries that generate very wide results will be less likely to choose
parallel plans. The formula chosen does have a fixed cost piece built
into it, which accounts for the shm_mq_sendv() and shm_mq_receive()
synchronization that occurs regardless of width.
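To make the shape of the adjustment concrete, here is an illustrative model (not the patch's actual code) built from the pieces described above: the 100-byte reference width and the 0.10 fixed fraction are the constants mentioned later in the thread, but the exact combining formula here is my guess.

```python
# Illustrative model of a width-scaled parallel_tuple_cost.  Not the
# actual patch code: the combining formula is an assumption; the
# reference width (100) and fixed fraction (0.10) are the constants
# discussed in this thread.

PARALLEL_TUPLE_COST = 0.1  # existing GUC default
PARALLEL_TUPLE_COST_REF_WIDTH = 100  # reference tuple width, bytes
PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10  # width-independent share

def scaled_parallel_tuple_cost(width):
    """Per-tuple cost of pushing a `width`-byte tuple through Gather."""
    # Fixed piece: shm_mq_sendv()/shm_mq_receive() synchronization,
    # paid regardless of tuple width.
    fixed = PARALLEL_TUPLE_COST * PARALLEL_TUPLE_COST_FIXED_FRAC
    # Variable piece: scales linearly with width relative to the
    # reference width, so cost(REF_WIDTH) equals the GUC default.
    variable = (PARALLEL_TUPLE_COST
                * (1 - PARALLEL_TUPLE_COST_FIXED_FRAC)
                * width / PARALLEL_TUPLE_COST_REF_WIDTH)
    return fixed + variable

# Narrow results (group id + a few aggregates) get cheaper ...
print(round(scaled_parallel_tuple_cost(16), 4))    # 0.0244
# ... while very wide results get pricier, discouraging parallelism.
print(round(scaled_parallel_tuple_cost(1024), 4))  # 0.9316
```

At the reference width the scaled cost equals the plain `parallel_tuple_cost`, so the adjustment is neutral for "average" tuples.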
The patch itself is pretty simple.
Also attached is a benchmark report that I had Claude create. Its main
result shows a speedup of about 2.7x.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
0001-Scale-parallel_tuple_cost-by-tuple-width-at-Gather-n.patch
parallel-tuple-cost-benchmark-2026-03-30.txt
Andrew Dunstan <andrew@dunslane.net> writes:
While investigating a performance issue, I found that it was extremely
difficult to get a parallel plan in some cases due to the fixed
parallel_tuple_cost. But this cost is not really fixed - it's going to
be larger for larger tuples. So this proposal adjusts the cost used
according to how large we expect the results to be.
Unfortunately, I'm afraid that this is going to produce mostly
"garbage in, garbage out" estimates, because our opinion of how wide
tuples-in-flight are is pretty shaky. (See get_expr_width and
particularly get_typavgwidth, and note that we only have good
statistics-based numbers for plain Vars not function outputs.)
I agree that it could be useful to have some kind of adjustment here,
but I fear that linear scaling is putting way too much faith in the
quality of the data.
How many cases have you checked with this modified code? Did it
make the plan worse in any cases?
regards, tom lane
On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
While investigating a performance issue, I found that it was extremely
difficult to get a parallel plan in some cases due to the fixed
parallel_tuple_cost. But this cost is not really fixed - it's going to
be larger for larger tuples. So this proposal adjusts the cost used
according to how large we expect the results to be.

Unfortunately, I'm afraid that this is going to produce mostly
"garbage in, garbage out" estimates, because our opinion of how wide
tuples-in-flight are is pretty shaky. (See get_expr_width and
particularly get_typavgwidth, and note that we only have good
statistics-based numbers for plain Vars not function outputs.)
I agree that it could be useful to have some kind of adjustment here,
but I fear that linear scaling is putting way too much faith in the
quality of the data.
(I suspect you're saying this because of the "Benchmark 2" in the text
file, which contains aggregates that return a varlena type, whose width
we won't estimate very well...)
Sure, it's certainly true that there are cases where we don't get the
width estimate right, but that's not stopped us from using them before.
So why is this case so much more critical? We now also have
GROUP-BY-before-join abilities in the planner, which I suspect must also
be putting trust in the very same thing. Also, varlena-returning
Aggrefs aren't going to be in the Gather/GatherMerge targetlist, so why
avoid making improvements in this area because we're not great at one
of the things that could be in the targetlist?
For the patch and the analysis: This reminds me of [1], where some
reverse-engineering of costs from query run-times was done, which
ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
get that for this case would require various tests with different
tuple widths and ensuring that the costs scale linearly with the
run-time of the query with the patched version. Of course, the test
query would have to have perfect width estimates, but that could be
easy enough to do by populating a text typed GROUP BY column and
populating that with all the same width of text for each of the tests
before increasing the width for the next test, using a fixed-width
aggregate each time, e.g. count(*). The "#define
PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
number. It would be good to know how close this is to reality.
Ideally, it would be good to see results from an Apple M<something>
chip and recent x86. In my experience, these produce very different
performance results, so it might be nice to find a value that is
somewhere in the middle of what we get from those machines. I suspect
having the GROUP BY column with text widths from 8 to 1024, increasing
in powers of two would be enough data points.
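The setup sketched above can be expressed programmatically; a minimal sketch emitting the per-width SQL, using the `ptc_bench_W` naming and `lpad()` padding that appear later in the thread (row and group counts match the later benchmark; everything else is illustrative):

```python
# Emit setup and benchmark SQL for the width-calibration test David
# describes: a text GROUP BY column lpad()'ed to an exact width so the
# planner's width estimate is exact, a fixed-width aggregate
# (count(*)), and widths stepping up in powers of two from 8 to 1024.

WIDTHS = [2 ** n for n in range(3, 11)]  # 8, 16, ..., 1024 bytes

def setup_sql(width, rows=10_000_000, groups=100_000):
    # Every txt value is padded to exactly `width` ASCII characters,
    # so estimated and actual tuple widths agree.
    return (
        f"CREATE TABLE ptc_bench_{width} AS "
        f"SELECT lpad((i % {groups})::text, {width}, 'x') AS txt "
        f"FROM generate_series(1, {rows}) AS i;"
    )

def bench_sql(width):
    # Fixed-width aggregate keeps the Gather targetlist width dominated
    # by the text column.
    return f"SELECT txt, count(*) FROM ptc_bench_{width} GROUP BY txt;"

for w in WIDTHS:
    print(setup_sql(w))
    print(bench_sql(w))
```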
David
[1]: /messages/by-id/CAKJS1f9UXdk6ZYyqbJnjFO9a9hyHKGW7B=ZRh-rxy9qxfPA5Gw@mail.gmail.com
David Rowley <dgrowleyml@gmail.com> writes:
On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Unfortunately, I'm afraid that this is going to produce mostly
"garbage in, garbage out" estimates, because our opinion of how wide
tuples-in-flight are is pretty shaky.
Sure, it's certainly true that there are cases where we don't get the
width estimate right, but that's not stopped us using them before. So
why is this case so much more critical?
What I'm concerned about is that the estimated cost's dependency on
tuple width may be much stronger here than it has been in other uses.
That impression might be false, of course.
For the patch and the analysis: This reminds me of [1], where some
reverse-engineering of costs from query run-times was done, which
ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
get that for this case would require various tests with different
tuple widths and ensuring that the costs scale linearly with the
run-time of the query with the patched version.
Right, I think we really need more test cases to base these magic
numbers on. Such testing would likely also alleviate (or not...)
my discomfort with the width estimates being so poor.
regards, tom lane
On Tue, 31 Mar 2026 at 12:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
What I'm concerned about is that the estimated cost's dependency on
tuple width may be much stronger here than it has been in other uses.
That impression might be false, of course.
I think it's good to be concerned, but I think this is far from the
worst place to put trust in the width estimates. We also use them in
Memoize, and if we underestimate there, then we might end up with a
Nested Loop -> Memoize plan instead of a Hash or Merge Join. If the
actual Memoize cache hit ratio ends up much worse than expected due to
wider-than-expected tuples, then the chosen plan might be far from
optimal. The execution costs of running a poorly chosen
Nested Loop with a poorly caching Memoize can become quadratic. I
think the parallel vs non-parallel problem is much more linear.
I'm more concerned about the opposite problem of being too liberal and
choosing parallel plans too often, resulting in worker exhaustion and
poorer performance as a result of serially executing parallel plans. I
suppose people could fix that by bumping up the parallel_setup_cost so
that the planner favours reserving parallel workers for plans that get
much bigger benefits from parallelisation.
David
On 2026-03-30 Mo 6:51 PM, David Rowley wrote:
On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
While investigating a performance issue, I found that it was extremely
difficult to get a parallel plan in some cases due to the fixed
parallel_tuple_cost. But this cost is not really fixed - it's going to
be larger for larger tuples. So this proposal adjusts the cost used
according to how large we expect the results to be.

Unfortunately, I'm afraid that this is going to produce mostly
"garbage in, garbage out" estimates, because our opinion of how wide
tuples-in-flight are is pretty shaky. (See get_expr_width and
particularly get_typavgwidth, and note that we only have good
statistics-based numbers for plain Vars not function outputs.)
I agree that it could be useful to have some kind of adjustment here,
but I fear that linear scaling is putting way too much faith in the
quality of the data.

(I suspect you're saying this because of the "Benchmark 2" in the text
file, which contains aggregates that return a varlena type, whose width
we won't estimate very well...)

Sure, it's certainly true that there are cases where we don't get the
width estimate right, but that's not stopped us from using them before.
So why is this case so much more critical? We now also have
GROUP-BY-before-join abilities in the planner, which I suspect must also
be putting trust in the very same thing. Also, varlena-returning
Aggrefs aren't going to be in the Gather/GatherMerge targetlist, so why
avoid making improvements in this area because we're not great at one
of the things that could be in the targetlist?

For the patch and the analysis: This reminds me of [1], where some
reverse-engineering of costs from query run-times was done, which
ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
get that for this case would require various tests with different
tuple widths and ensuring that the costs scale linearly with the
run-time of the query with the patched version. Of course, the test
query would have to have perfect width estimates, but that could be
easy enough to do by populating a text typed GROUP BY column and
populating that with all the same width of text for each of the tests
before increasing the width for the next test, using a fixed-width
aggregate each time, e.g. count(*). The "#define
PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
number. It would be good to know how close this is to reality.
Ideally, it would be good to see results from an Apple M<something>
chip and recent x86. In my experience, these produce very different
performance results, so it might be nice to find a value that is
somewhere in the middle of what we get from those machines. I suspect
having the GROUP BY column with text widths from 8 to 1024, increasing
in powers of two would be enough data points.
I followed your suggested methodology to measure how Gather IPC
cost actually scales with tuple width.
Setup: 10M rows, 100K distinct text values per table, text column
padded to a fixed width with lpad(). Query: SELECT txt, count(*)
FROM ptc_bench_W GROUP BY txt. This produces Partial HashAggregate
in workers, then Gather passes ~240K partial-aggregate tuples whose
width is dominated by the text column. 2 workers, work_mem=256MB,
cassert=off, debugoptimized build, aarch64 Linux.
I tested widths from 8 to 1024 bytes (10 data points). For each
width, I ran 5 iterations of both parallel and serial execution,
and computed the Gather overhead as:
overhead = T_parallel - (T_serial / 3)
This isolates the IPC cost: the serial time captures pure scan +
aggregate work, and dividing by 3 gives the ideal parallel time
(2 workers + leader). The excess is Gather overhead.
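Spelled out as code, the reduction from raw timings to a per-tuple overhead looks like this (the example timings are placeholders, not measurements from this thread):

```python
# Reduce (parallel time, serial time, Gather row count) to a per-tuple
# Gather overhead, per the formula above.  The example numbers are
# placeholders for illustration only.

def gather_overhead_us_per_tuple(t_parallel_ms, t_serial_ms,
                                 gather_rows, workers=2):
    # Ideal parallel time: serial work split evenly across the workers
    # plus the leader (3 backends for 2 workers).
    ideal_ms = t_serial_ms / (workers + 1)
    # Everything beyond the ideal time is attributed to Gather IPC.
    overhead_ms = t_parallel_ms - ideal_ms
    return overhead_ms * 1000.0 / gather_rows  # ms -> us, per tuple

# e.g. 600 ms parallel, 1500 ms serial, 240000 tuples through Gather:
# ideal = 500 ms, excess = 100 ms, i.e. ~0.42 us per tuple.
print(round(gather_overhead_us_per_tuple(600.0, 1500.0, 240000), 2))
```

Note this attributes all deviation from perfect scaling to Gather, so it slightly overstates the IPC cost if the workers themselves scale imperfectly.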
Results (microseconds per tuple through Gather, median of 5 runs):
Width(B) us/tuple Implied ptc (if ptc=0.1 at w=100)
-------- -------- ----------------------------------
8 0.30 0.032
16 0.24 0.025
32 0.77 0.083
64 0.72 0.078
128 1.03 0.111
256 1.62 0.175
384 2.90 0.313
512 3.21 0.346
768 4.12 0.445
1024 5.56 0.600
The best-fit models:
Linear: cost(w) = 0.42 + 0.0051 * w R² = 0.983
Power law: cost(w) = 0.061 * w^0.63 R² = 0.966
Linear fits best.
One notable finding: at the proposed reference width of 100 bytes,
the total predicted cost is 0.42 + 0.51 = 0.93 us/tuple, of
which 0.42 is fixed.
The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10,
which substantially underestimates the width-independent component.
A higher fixed fraction would dampen the width adjustment, which
also partly addresses Tom's concern about sensitivity to width
estimate errors: with ~45% of the cost being fixed, even a 2x
error in width only translates to a ~1.5x error in total cost.
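The dampening claim is simple arithmetic, sketched here with the ~45% fixed share measured above and the patch's original 10% for comparison:

```python
# With a fraction f of the per-tuple cost being width-independent,
# a width misestimate only scales the remaining (1 - f) share.

def relative_cost(width_ratio, fixed_frac):
    """Cost relative to the correctly-estimated cost, given that the
    true width is width_ratio times the estimated width."""
    return fixed_frac + (1 - fixed_frac) * width_ratio

# ~45% fixed (as measured): a 2x width error -> ~1.55x cost error.
print(round(relative_cost(2.0, 0.45), 2))  # 1.55
# The original 10% fixed share would give 1.9x instead.
print(round(relative_cost(2.0, 0.10), 2))  # 1.9
```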
The script used to get the timings is attached.
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Attachments:
ptc_calibrate.sh
Andrew Dunstan <andrew@dunslane.net> writes:
I followed your suggested methodology to measure how Gather IPC
cost actually scales with tuple width.
I ran your test script on two of my development machines and got:
Linux/x86_64:
Width Parallel(ms) Serial(ms) Speedup Gather rows
----- ------------ ---------- ------- -----------
8 510.976 1219.453 2.39x 235706
16 532.123 1271.692 2.39x 235603
32 588.826 1356.428 2.30x 242062
64 674.485 1570.758 2.33x 239561
128 817.417 1887.202 2.31x 243158
256 1066.836 2548.100 2.39x 242304
384 1293.900 3038.905 2.35x 243941
512 1515.822 3573.144 2.36x 239064
768 1998.638 4725.448 2.36x 247558
1024 9865.460 22779.795 2.31x 10000000
macOS/M4-Pro:
Width Parallel(ms) Serial(ms) Speedup Gather rows
----- ------------ ---------- ------- -----------
8 299.464 769.130 2.57x 242549
16 310.361 787.629 2.54x 243643
32 344.541 839.589 2.44x 242419
64 413.330 967.512 2.34x 238771
128 519.794 1185.757 2.28x 241440
256 1479.766 1823.559 1.23x 238615
384 2022.882 2326.823 1.15x 240617
512 2423.938 2778.995 1.15x 244752
768 3511.425 3934.384 1.12x 235814
1024 9905.073 12214.577 1.23x 10000000
It's not entirely clear to me how you reduced these numbers
to a ptc formula, but we should do that and see how the
results compare to your machine.
The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10,
which substantially underestimates the width-independent component.
A higher fixed fraction would dampen the width adjustment, which
also partly addresses Tom's concern about sensitivity to width
estimate errors: with ~45% of the cost being fixed, even a 2x
error in width only translates to a ~1.5x error in total cost.
That does make me feel better, assuming that we come out with
similar results on several machines.
The script used to get the timings is attached.
If anyone else wants to try this with a platform having non-GNU
grep, you'll need these changes:
$ diff -pud ptc_calibrate.sh~ ptc_calibrate.sh
--- ptc_calibrate.sh~ 2026-04-01 10:02:53.000000000 -0400
+++ ptc_calibrate.sh 2026-04-01 13:31:06.873772739 -0400
@@ -32,7 +32,8 @@ psql_cmd() {
psql_cmd_timing() {
"$PGBIN/psql" -h /tmp -p $PORT -d "$DB" -o /dev/null -qAt \
- -c '\timing on' "$@" 2>&1 | grep -oP 'Time: \K[\d.]+' | tail -1
+ -c '\timing on' "$@" 2>&1 | \
+ sed -n 's/.*Time: \([0-9.][0-9.]*\).*/\1/p' | tail -1
}
# Create the database if needed
@@ -114,7 +115,7 @@ for W in $WIDTHS; do
# Get Gather row count from EXPLAIN
gather_rows=$(psql_cmd -c "$SET_CMDS EXPLAIN (COSTS ON) $Q;" \
- | grep -oP 'Gather.*rows=\K\d+' | head -1)
+ | sed -n 's/.*Gather.*rows=\([0-9][0-9]*\).*/\1/p' | head -1)
gather_rows=${gather_rows:-"?"}
# Warm up (2 runs each)
regards, tom lane
I wrote:
macOS/M4-Pro:
Width Parallel(ms) Serial(ms) Speedup Gather rows
----- ------------ ---------- ------- -----------
8 299.464 769.130 2.57x 242549
16 310.361 787.629 2.54x 243643
32 344.541 839.589 2.44x 242419
64 413.330 967.512 2.34x 238771
128 519.794 1185.757 2.28x 241440
256 1479.766 1823.559 1.23x 238615
384 2022.882 2326.823 1.15x 240617
512 2423.938 2778.995 1.15x 244752
768 3511.425 3934.384 1.12x 235814
1024 9905.073 12214.577 1.23x 10000000
On closer review, it looks like I carelessly allowed this test
to run in parallel with a buildfarm run. Here are numbers
with an idle machine:
Width Parallel(ms) Serial(ms) Speedup Gather rows
----- ------------ ---------- ------- -----------
8 281.881 758.167 2.69x 242549
16 300.997 791.184 2.63x 243643
32 340.815 842.715 2.47x 242419
64 401.282 985.711 2.46x 238771
128 507.066 1183.727 2.33x 241440
256 718.008 1667.830 2.32x 238615
384 1774.601 2224.726 1.25x 240617
512 2439.593 2784.242 1.14x 244752
768 3254.088 3698.615 1.14x 235814
1024 8990.584 12176.341 1.35x 10000000
This is interesting because while the speedup ratio was
pretty insensitive to row width on the x86_64 box, that's
far from true on the Apple box.
regards, tom lane