benchmark results comparing versions 15.2 and 16

Started by MARK CALLAGHAN, almost 3 years ago, 29 messages, pgsql-hackers
#1 MARK CALLAGHAN
mdcallag@gmail.com

Let me know if results like this shouldn't be posted here.

This is mostly a hobby project for me - my other hobby is removing
invasive weeds. I am happy to answer questions and run more tests, but
turnaround for answers won't be instant. Getting results from Linux perf
for these tests is on my TODO list. For now I am just re-running a subset
of these to get more certainty that the regressions are real and not noise.

These are results for the insert benchmark on a small server comparing
performance for versions 15.2 and 16. For version 16 I built from source at
git sha 1ab763fc.
https://github.com/postgres/postgres/commit/1ab763fc22adc88e5d779817e7b42b25a9dd7c9e

Late in the version 15 beta release cycle this benchmark found a
significant regression. I don't see anything significant this time, but
there are potential small regressions.

More detail on how I run the benchmark and on the hardware is here; the
server is small -- a Beelink with 8 cores, 16G of RAM and a 1T NVMe SSD.
http://smalldatum.blogspot.com/2023/05/the-insert-benchmark-postgres-versions.html

Performance reports are linked below. But first, disclaimers:
* the goal is to determine whether there are CPU improvements or
regressions. To make that easier to spot I disable fsync on commit.
* my scripts compute CPU/operation where operation is a SQL statement.
However, CPU in this case is measured by vmstat and includes CPU from the
benchmark client and Postgres server
* the regressions here are small, usually less than 5%, which can make it
hard to distinguish normal variance from a regression, but the results
are repeatable
* the links below are to the Summary section, which includes throughput
(absolute and relative). The relative throughput is (throughput for PG 16 /
throughput for PG 15.2)
* I used the same compiler options for the builds of 15.2 and 16
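The CPU/operation metric described above can be sketched like this. This is a
hedged illustration, not my actual scripts: the sample vmstat lines, the us/sy
column positions, and the operation count are all assumptions for the example.

```shell
# Sketch: derive CPU/operation from vmstat samples taken during a benchmark step.
# A real run would capture samples with: vmstat 1 > vmstat.log
cat > vmstat.log <<'EOF'
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 812345 123456 654321    0    0     0  1200 5000 9000 40 10 49  1  0
 5  0      0 811000 123456 654900    0    0     0  1100 5100 9100 42 11 46  1  0
EOF
OPS=100000   # SQL statements executed during the sampled interval (assumed)
NCPU=8       # the Beelink has 8 cores
# us (field 13) and sy (field 14) are percentages across all CPUs per 1s sample;
# the two header lines are skipped. This counts client and server CPU together,
# as noted above.
awk -v ops="$OPS" -v ncpu="$NCPU" '
  NR > 2 { cpu_secs += ($13 + $14) / 100 * ncpu }
  END   { printf "CPU seconds: %.2f, CPU usecs/op: %.2f\n", cpu_secs, cpu_secs * 1e6 / ops }
' vmstat.log
```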

Summary of the results:
* from r1 - insert-heavy (l.i0, l.i1) and create indexes (l.x) steps are
~2% slower in PG 16
* from r2 - create index (l.x) step is ~4% slower in PG 16
* from r3 - regressions are similar to r1
* from r4, r5 and r6 - regressions are mostly worse than in r1, r2 and r3.
Note that r4, r5 and r6 run the same workload as r1, r2 and r3, except that
the database is cached by Postgres only for r1, r2 and r3, so the r4, r5
and r6 benchmarks do much more copying from the OS page cache into the
Postgres buffer pool.

I will repeat r1, r2, r4 and r5 but with the tests run in a different order
to confirm this isn't just noise.

Database cached by Postgres:
r1) 1 table, 1 client -
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.cached/all.html#summary
r2) 4 tables, 4 clients -
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tno.cached/all.html#summary
r3) 1 table, 4 clients -
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.cached/all.html#summary

Database cached by OS but not by Postgres:
r4) 1 table, 1 client -
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.1g/all.html#summary
r5) 4 tables, 4 clients -
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tno.1g/all.html#summary
r6) 1 table, 4 clients -
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/all.html#summary

--
Mark Callaghan
mdcallag@gmail.com

#2 Andres Freund
andres@anarazel.de
In reply to: MARK CALLAGHAN (#1)
Re: benchmark results comparing versions 15.2 and 16

Hi,

On 2023-05-05 10:45:12 -0700, MARK CALLAGHAN wrote:

> This is mostly a hobby project for me - my other hobby is removing
> invasive weeds.

Hah :)

> Summary of the results:
> * from r1 - insert-heavy (l.i0, l.i1) and create indexes (l.x) steps are
> ~2% slower in PG 16
> * from r2 - create index (l.x) step is ~4% slower in PG 16
> * from r3 - regressions are similar to r1
> * from r4, r5 and r6 - regressions are mostly worse than r1, r2, r3. Note
> r4, r5, r6 are the same workload as r1, r2, r3 except the database is
> cached by PG for r1, r2, r3 so the r4, r5 and r6 benchmarks will do much
> more copying from the OS page cache into the Postgres buffer pool.

One thing that's somewhat odd is that there are very marked changes in l.i0's
p99 latency for the four-client cases - but whether 15 or 16 is better
differs between the runs.

r2)
p99
20m.pg152_o3_native_lto.cx7 300
20m.pg16prebeta.cx7 23683

r3)
p99
20m.pg152_o3_native_lto.cx7 70245
20m.pg16prebeta.cx7 8191

r5)
p99
20m.pg152_o3_native_lto.cx7 11188
20m.pg16prebeta.cx7 72720

r6)
p99
20m.pg152_o3_native_lto.cx7 1898
20m.pg16prebeta.cx7 31666

I do wonder if there's something getting scheduled in some of these runs
increasing latency?

Or what we're seeing depends on the time between the start of the server and
the start of the benchmark? It is interesting that the per-second throughput
graph shows a lot of up/down at the end:
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/tput.l.i0.html

Both 15 and 16 have one very high result at 70s, 15 then has one low, but 16
has two low results.

Greetings,

Andres Freund

#3 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: Andres Freund (#2)
Re: benchmark results comparing versions 15.2 and 16

I have two more runs of the benchmark in progress so we will have 3 results
for each of the test cases to confirm that the small regressions are
repeatable.

On Fri, May 5, 2023 at 1:34 PM Andres Freund <andres@anarazel.de> wrote:

> One thing that's somewhat odd is that there's very marked changes in l.i0's
> p99 latency for the four clients cases - but whether 15 or 16 are better
> differs between the runs.

From the response time sections for l.i0 the [1ms, 4ms) bucket has 99% or
more for all 6 cases.
For example,
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.cached/all.html#l.i0.rt
But yes, the p99 is as you state. I will wade through my test scripts
tomorrow to see how the p99 is computed.

> I do wonder if there's something getting scheduled in some of these runs
> increasing latency?

Do you mean interference from other processes? Both the big and small
servers run Ubuntu 22.04 server (no X) and there shouldn't be many extra
things, although Google Cloud adds a few extra things that run in the
background.

> Or what we're seeing depends on the time between the start of the server and
> the start of the benchmark? It is interesting that the per-second throughput
> graph shows a lot of up/down at the end:
> https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/tput.l.i0.html
>
> Both 15 and 16 have one very high result at 70s, 15 then has one low, but 16
> has two low results.

Immediately prior to l.i0 the database directory is wiped and then Postgres
is initialized and started.

The IPS vs time graphs are more interesting for benchmark steps that run
longer. Alas, this can't run too long if the resulting database is to fit
in <= 16G. But that is a problem for another day. The IPS vs time graphs
are not a flat line, but I am not ready to pursue that as a problem unless
it shows multi-second write-stalls (fortunately it does not).

--
Mark Callaghan
mdcallag@gmail.com

#4 Alexander Lakhin
exclusion@gmail.com
In reply to: MARK CALLAGHAN (#1)
Re: benchmark results comparing versions 15.2 and 16

Hello Mark,

05.05.2023 20:45, MARK CALLAGHAN wrote:

> This is mostly a hobby project for me - my other hobby is removing invasive weeds. I am happy to answer questions and
> run more tests, but turnaround for answers won't be instant. Getting results from Linux perf for these tests is on my
> TODO list. For now I am just re-running a subset of these to get more certainty that the regressions are real and not
> noise.

It's a very interesting topic to me, too. I have developed some scripts to
measure and compare Postgres's performance using miscellaneous public
benchmarks (ycsb, tpcds, benchmarksql_tpcc, htapbench, benchbase, gdprbench,
s64da-benchmark, ...). Having compared 15.3 (56e869a09) with master
(58f5edf84), I haven't seen significant regressions except a few minor ones.
The first regression was observed with a simple pgbench test:
pgbench -i benchdb
pgbench -c 10 -T 300 benchdb
(with default compilation options and fsync = off)

On master I get:
tps = 10349.826645 (without initial connection time)
On 15.3:
tps = 11296.064992 (without initial connection time)

This difference is confirmed by multiple test runs. `git bisect` for this
regression pointed at f193883fc.
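For reference, a bisect like this can be automated with `git bisect run`; the
helper script below is my own hedged sketch (the threshold, build commands and
file name are illustrative, not the setup actually used):

```shell
# Hypothetical bisect helper: git bisect runs this once per commit;
# exit 0 = good (tps near the 15.3 level), exit 1 = bad (tps regressed).
cat > bisect-tps.sh <<'EOF'
#!/bin/sh
set -e
make -j8 install >/dev/null 2>&1        # rebuild the commit under test
# ... restart the cluster and re-run pgbench -i benchdb here (details elided) ...
tps=$(pgbench -c 10 -T 300 benchdb | awk '/^tps/ { print int($3) }')
[ "$tps" -ge 11000 ]                    # assumed cutoff between ~11300 and ~10350
EOF
chmod +x bisect-tps.sh
# Usage: git bisect start <bad-sha> <good-sha>; git bisect run ./bisect-tps.sh
```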

Best regards,
Alexander

#5 Andres Freund
andres@anarazel.de
In reply to: Alexander Lakhin (#4)
Re: benchmark results comparing versions 15.2 and 16

Hi,

On 2023-05-08 16:00:01 +0300, Alexander Lakhin wrote:

> This difference is confirmed by multiple test runs. `git bisect` for this
> regression pointed at f193883fc.

I can reproduce a significant regression due to f193883fc of a workload just
running
SELECT CURRENT_TIMESTAMP;

A single session running it on my workstation via pgbench -Mprepared gets
before:
tps = 89359.128359 (without initial connection time)
after:
tps = 83843.585152 (without initial connection time)

Obviously this is an extreme workload, but that nevertheless seems too large
to just accept...
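A minimal sketch of this reproduction (hedged: it assumes a local non-assert
build on the default port and a database named postgres; the script file name
is arbitrary):

```shell
# Write the single-statement workload that pgbench will loop over.
echo 'SELECT CURRENT_TIMESTAMP;' > ts.sql
# Single client, prepared protocol, no vacuum; compare tps across builds.
# Guarded so the snippet is a no-op where pgbench is not installed.
if command -v pgbench >/dev/null; then
    pgbench -n -M prepared -c 1 -T 30 -f ts.sql postgres
fi
```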

Michael, the commit message notes that there were no measured performance
regression - yet I see one in a trivial test. What were you measuring?

I'm a bit surprised by the magnitude of the regression, but it's not
surprising that there is a performance effect. You're replacing something that
doesn't go through the whole generic function rigamarole, and replace it with
something that does...

Looking at two perf profiles, the biggest noticeable difference is

Before:

-    5.51%     0.13%  postgres  postgres  [.] ExecInitResult
   - 5.38% ExecInitResult
      + 2.29% ExecInitResultTupleSlotTL
      - 2.22% ExecAssignProjectionInfo
         - 2.19% ExecBuildProjectionInfo
              0.47% ExecReadyInterpretedExpr
            - 0.43% ExecInitExprRec
               - 0.10% palloc
                    AllocSetAlloc.localalias (inlined)
            + 0.32% expression_tree_walker_impl.localalias (inlined)
            + 0.28% get_typlen
              0.09% ExecPushExprSlots
            + 0.06% MemoryContextAllocZeroAligned
            + 0.04% MemoryContextAllocZeroAligned
              0.02% exprType.localalias (inlined)
      + 0.41% ExecAssignExprContext
      + 0.35% MemoryContextAllocZeroAligned
        0.11% ExecInitQual.localalias (inlined)
   + 0.11% _start
   + 0.02% 0x55b89c764d7f

After:

-    6.57%     0.17%  postgres  postgres  [.] ExecInitResult
   - 6.40% ExecInitResult
      - 3.00% ExecAssignProjectionInfo
         - ExecBuildProjectionInfo
            - 0.91% ExecInitExprRec
               - 0.65% ExecInitFunc
                    0.23% fmgr_info_cxt_security
                  + 0.18% palloc0
                  + 0.07% object_aclcheck
                    0.04% fmgr_info
                 0.05% check_stack_depth
               + 0.05% palloc
            + 0.58% expression_tree_walker_impl.localalias (inlined)
            + 0.55% get_typlen
              0.37% ExecReadyInterpretedExpr
            + 0.11% MemoryContextAllocZeroAligned
              0.09% ExecPushExprSlots
              0.04% exprType.localalias (inlined)
      + 2.77% ExecInitResultTupleSlotTL
      + 0.50% ExecAssignExprContext
      + 0.09% MemoryContextAllocZeroAligned
        0.05% ExecInitQual.localalias (inlined)
   + 0.10% _start

I.e. we spend more time building the expression state for expression
evaluation, because we now go through the generic ExecInitFunc(), instead of
something dedicated. We also now need to do permission checking etc.

I don't think that's the entirety of the regression...

Greetings,

Andres Freund

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#5)
Re: benchmark results comparing versions 15.2 and 16

Hi,

On 2023-05-08 12:11:17 -0700, Andres Freund wrote:

> Hi,
>
> On 2023-05-08 16:00:01 +0300, Alexander Lakhin wrote:
> > This difference is confirmed by multiple test runs. `git bisect` for this
> > regression pointed at f193883fc.
>
> I can reproduce a significant regression due to f193883fc of a workload just
> running
> SELECT CURRENT_TIMESTAMP;
>
> A single session running it on my workstation via pgbench -Mprepared gets
> before:
> tps = 89359.128359 (without initial connection time)
> after:
> tps = 83843.585152 (without initial connection time)
>
> Obviously this is an extreme workload, but that nevertheless seems too large
> to just accept...

Added an open item for this.

Greetings,

Andres Freund

#7 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: MARK CALLAGHAN (#3)
Re: benchmark results comparing versions 15.2 and 16

On Fri, May 5, 2023 at 10:01 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:

> I have two more runs of the benchmark in progress so we will have 3
> results for each of the test cases to confirm that the small regressions
> are repeatable.

They get similar results. Then I tried Linux perf, but the hierarchical call
stacks, to be used for flamegraphs, have too many "[unknown]" entries.
I was using: ./configure --prefix=$pfx --enable-debug CFLAGS="-O3
-march=native -mtune=native -flto" LDFLAGS="-flto" > o.cf.$x 2> e.cf.$x
Adding -fno-omit-frame-pointer fixes the problem, so I am repeating the
benchmark with that to confirm there are still regressions, and then I will
get flamegraphs.
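The full pipeline looks roughly like this -- a sketch, not the exact commands
used: it assumes the GCC spelling of the flag (-fno-omit-frame-pointer) and
Brendan Gregg's FlameGraph scripts cloned at ~/FlameGraph.

```shell
# Rebuild with frame pointers preserved so perf can walk the call stacks.
CFLAGS="-O3 -march=native -mtune=native -flto -fno-omit-frame-pointer"
./configure --prefix=$pfx --enable-debug CFLAGS="$CFLAGS" LDFLAGS="-flto"
make -j"$(nproc)" install
# While a benchmark step runs, sample all CPUs at 99 Hz with call graphs:
perf record -F 99 -a -g -- sleep 60
# Fold the stacks and render the flamegraph:
perf script | ~/FlameGraph/stackcollapse-perf.pl \
            | ~/FlameGraph/flamegraph.pl > flamegraph.svg
```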

--
Mark Callaghan
mdcallag@gmail.com

#8 Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#4)
Re: benchmark results comparing versions 15.2 and 16

08.05.2023 16:00, Alexander Lakhin wrote:

> ... Having compared 15.3 (56e869a09) with master
> (58f5edf84) I haven't seen significant regressions except a few minor ones.
> First regression observed with a simple pgbench test:

Another noticeable, but not critical performance degradation is revealed by
query 87 from TPC-DS (I use s64da-benchmark):
https://github.com/swarm64/s64da-benchmark-toolkit/blob/master/benchmarks/tpcds/queries/queries_10/87.sql

With `prepare_benchmark --scale-factor=2`, `run_benchmark --scale-factor=10`
I get on master:
2023-05-10 09:27:52,888 INFO    : finished 80/103: query 87 of stream  0: 2.26s OK
but on REL_15_STABLE:
2023-05-10 08:13:40,648 INFO    : finished 80/103: query 87 of stream  0: 1.94s OK

This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
(both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):
->  Subquery Scan on "*SELECT* 1"  (cost=149622.00..149958.68 rows=16834 width=21) (actual time=1018.606..1074.468 rows=93891 loops=1)
 ->  Unique  (cost=149622.00..149790.34 rows=16834 width=17) (actual time=1018.604..1064.790 rows=93891 loops=1)
  ->  Sort  (cost=149622.00..149664.09 rows=16834 width=17) (actual time=1018.603..1052.591 rows=94199 loops=1)
   ->  Gather  (cost=146588.59..148440.33 rows=16834 width=17) (actual time=880.899..913.978 rows=94199 loops=1)
vs
->  Subquery Scan on "*SELECT* 1"  (cost=147576.79..149829.53 rows=16091 width=21) (actual time=1126.489..1366.751 rows=93891 loops=1)
 ->  Unique  (cost=147576.79..149668.62 rows=16091 width=17) (actual time=1126.487..1356.938 rows=93891 loops=1)
  ->  Gather Merge  (cost=147576.79..149547.94 rows=16091 width=17) (actual time=1126.487..1345.253 rows=94204 loops=1)
   ->  Unique  (cost=146576.78..146737.69 rows=16091 width=17) (actual time=1124.426..1306.532 rows=47102 loops=2)
    ->  Sort  (cost=146576.78..146617.01 rows=16091 width=17) (actual time=1124.424..1245.110 rows=533434 loops=2)

->  Subquery Scan on "*SELECT* 2"  (cost=52259.82..52428.16 rows=8417 width=21) (actual time=653.640..676.879 rows=62744 loops=1)
 ->  Unique  (cost=52259.82..52343.99 rows=8417 width=17) (actual time=653.639..670.405 rows=62744 loops=1)
  ->  Sort  (cost=52259.82..52280.86 rows=8417 width=17) (actual time=653.637..662.428 rows=62744 loops=1)
   ->  Gather  (cost=50785.20..51711.07 rows=8417 width=17) (actual time=562.158..571.737 rows=62744 loops=1)
    ->  HashAggregate  (cost=49785.20..49869.37 rows=8417 width=17) (actual time=538.263..544.336 rows=31372 loops=2)
     ->  Nested Loop  (cost=0.85..49722.07 rows=8417 width=17) (actual time=2.049..469.747 rows=284349 loops=2)
vs
->  Subquery Scan on "*SELECT* 2"  (cost=48503.68..49630.12 rows=8046 width=21) (actual time=700.050..828.388 rows=62744 loops=1)
 ->  Unique  (cost=48503.68..49549.66 rows=8046 width=17) (actual time=700.047..821.836 rows=62744 loops=1)
  ->  Gather Merge  (cost=48503.68..49489.31 rows=8046 width=17) (actual time=700.047..814.136 rows=62744 loops=1)
   ->  Unique  (cost=47503.67..47584.13 rows=8046 width=17) (actual time=666.348..763.403 rows=31372 loops=2)
    ->  Sort  (cost=47503.67..47523.78 rows=8046 width=17) (actual time=666.347..730.336 rows=284349 loops=2)
     ->  Nested Loop  (cost=0.85..46981.72 rows=8046 width=17) (actual time=1.852..454.111 rows=284349 loops=2)

->  Subquery Scan on "*SELECT* 3"  (cost=50608.83..51568.70 rows=7165 width=21) (actual time=302.571..405.305 rows=23737 loops=1)
 ->  Unique  (cost=50608.83..51497.05 rows=7165 width=17) (actual time=302.568..402.818 rows=23737 loops=1)
  ->  Gather Merge  (cost=50608.83..51443.31 rows=7165 width=17) (actual time=302.567..372.246 rows=287761 loops=1)
   ->  Sort  (cost=49608.81..49616.27 rows=2985 width=17) (actual time=298.204..310.075 rows=95920 loops=3)
    ->  Nested Loop  (cost=2570.65..49436.52 rows=2985 width=17) (actual time=3.205..229.192 rows=95920 loops=3)
vs
->  Subquery Scan on "*SELECT* 3"  (cost=50541.84..51329.11 rows=5708 width=21) (actual time=302.042..336.820 rows=23737 loops=1)
 ->  Unique  (cost=50541.84..51272.03 rows=5708 width=17) (actual time=302.039..334.329 rows=23737 loops=1)
  ->  Gather Merge  (cost=50541.84..51229.22 rows=5708 width=17) (actual time=302.039..331.296 rows=24128 loops=1)
   ->  Unique  (cost=49541.81..49570.35 rows=2854 width=17) (actual time=298.771..320.560 rows=8043 loops=3)
    ->  Sort  (cost=49541.81..49548.95 rows=2854 width=17) (actual time=298.770..309.603 rows=95920 loops=3)
     ->  Nested Loop  (cost=2570.52..49378.01 rows=2854 width=17) (actual time=3.209..230.291 rows=95920 loops=3)

From the commit message and the discussion [1], it's not clear to me whether
this plan change is expected. Perhaps it's too minor an issue to bring attention
to, but maybe this information could be useful for v16 performance analysis.

[1]: /messages/by-id/CAApHDvo8Lz2H=42urBbfP65LTcEUOh288MT7DsG2_EWtW1AXHQ@mail.gmail.com

Best regards,
Alexander

Attachments:

tpcds-q87-3c6fc5820-plan.txt (text/plain)
tpcds-q87-e5b8a4c09-plan.txt (text/plain)
#9 David Rowley
dgrowleyml@gmail.com
In reply to: Alexander Lakhin (#8)
Re: benchmark results comparing versions 15.2 and 16

On Thu, 11 May 2023 at 01:00, Alexander Lakhin <exclusion@gmail.com> wrote:

> This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
> (both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):

Based on what you've sent, I'm uninspired to want to try to do
anything about it. The patched version finds a plan that's cheaper.
The row estimates are miles off with both plans. I'm not sure what
we're supposed to do here. It's pretty hard to make changes to the
planner's path generation without risking that a bad plan is chosen
when it wasn't beforehand with bad row estimates.

Is the new plan still slower if you increase work_mem so that the sort
no longer goes to disk? Maybe the planner would have picked Hash
Aggregate if the row estimates had been such that cost_tuplesort()
knew that the sort would have gone to disk.

David

#10 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#6)
Re: benchmark results comparing versions 15.2 and 16

On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:

> On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
> > I can reproduce a significant regression due to f193883fc of a workload just
> > running
> > SELECT CURRENT_TIMESTAMP;
> >
> > A single session running it on my workstation via pgbench -Mprepared gets
> > before:
> > tps = 89359.128359 (without initial connection time)
> > after:
> > tps = 83843.585152 (without initial connection time)
> >
> > Obviously this is an extreme workload, but that nevertheless seems too large
> > to just accept...
>
> Added an open item for this.

Thanks for the report, I'll come back to it and look at it at the
beginning of next week. In the worst case, that would mean a revert
of this refactoring, I assume.
--
Michael

#11 Alexander Lakhin
exclusion@gmail.com
In reply to: David Rowley (#9)
Re: benchmark results comparing versions 15.2 and 16

11.05.2023 01:27, David Rowley wrote:

> On Thu, 11 May 2023 at 01:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> > This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
> > (both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):
>
> Based on what you've sent, I'm uninspired to want to try to do
> anything about it. The patched version finds a plan that's cheaper.
> The row estimates are miles off with both plans.

I've made sure that s64da-benchmark performs analyze before running the
queries (the pg_class.reltuples fields for the tables in question contain
actual counts), so it seems that nothing can be done on the benchmark side
to improve those estimates.

> ... It's pretty hard to make changes to the
> planner's path generation without risking that a bad plan is chosen
> when it wasn't beforehand with bad row estimates.

Yeah, I see. It's also interesting to me which tests perform better after
that commit. It takes several hours to run all the tests, so I can't present
results quickly, but I'll try to collect this information next week.

> Is the new plan still slower if you increase work_mem so that the sort
> no longer goes to disk? Maybe the planner would have picked Hash
> Aggregate if the row estimates had been such that cost_tuplesort()
> knew that the sort would have gone to disk.

Yes, increasing work_mem to 50MB doesn't affect the plans (new plans
attached), though the sort method changed to quicksort. The former plan is
still executed slightly faster.

Best regards,
Alexander

Attachments:

tpcds-q87-3c6fc5820-wm50-plan.txt (text/plain)
tpcds-q87-e5b8a4c09-wm50-plan.txt (text/plain)
#12 Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#10)
Re: benchmark results comparing versions 15.2 and 16

On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:

> On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
> > On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
> > > I can reproduce a significant regression due to f193883fc of a workload just
> > > running
> > > SELECT CURRENT_TIMESTAMP;
> > >
> > > A single session running it on my workstation via pgbench -Mprepared gets
> > > before:
> > > tps = 89359.128359 (without initial connection time)
> > > after:
> > > tps = 83843.585152 (without initial connection time)
> > >
> > > Obviously this is an extreme workload, but that nevertheless seems too large
> > > to just accept...

Extreme is adapted for a worst-case scenario. Looking at my notes
from a few months back, that's kind of what I did on my laptop, which
was the only machine I had at hand back then:
- Compilation of code with -O2.
- Prepared statement of a simple SELECT combined with a DO block that
executed the query in a loop a few million times on a single
backend:
PREPARE test AS SELECT CURRENT_TIMESTAMP;
DO $$
BEGIN
FOR i IN 1..10000000
LOOP
EXECUTE 'EXECUTE test';
END LOOP;
END $$;
- The second test is mentioned at [1], with a generate_series() on a
keyword.
- And actually I recall some pgbench runs similar to that... But I
don't have any traces of that in my notes.

This was not showing much difference, and it does not now, either.
Funny thing is that the pre-patch period was showing signs of being a
bit slower in this environment. Anyway, I have just tested the DO
case in a second "bigbox" environment that I have set up for
benchmarking a few days ago, and the DO test is showing me a 1%~1.5%
regression in runtime. That's repeatable so that's not noise.

I have re-run a bit more pgbench (1 client, prepared query with a
single SELECT on a SQL keyword, etc.). And, TBH, I am not seeing as
much difference as you do (nothing with default pgbench setup, FWIW),
still that's able to show a bit more difference than the other two
cases. HEAD shows me an average output close to 43900 TPS (3 runs of
60s each, for instance), while relying on SQLValueFunction shows an
average of 45000 TPS. That counts for a ~2.4% regression here
on bigbox based on these numbers. Not a regression as high as
mentioned above, still that's visible.

> > Added an open item for this.
>
> Thanks for the report, I'll come back to it and look at it at the
> beginning of next week. In the worst case, that would mean a revert
> of this refactoring, I assume.

So, this involves commits 7aa81c6, f193883 and fb32748. 7aa81c6 has
added some tests, so I would let the tests it added stay on HEAD, as the
precision was not previously tested for the SQL keywords it adds coverage
for.

fb32748 has removed the dependency of SQLValueFunction on collations
by making the name functions use COERCE_SQL_SYNTAX. One case where
these could be heavily used is RLS, for example, so that could be
visible. f193883 has removed the rest and the timestamp keywords.

I am not going to let that hang in the air with beta1 getting released
next week, though, so attached are two patches to revert the change
(these would be combined; they are just posted this way for clarity). The
only conflict I could see is caused by the query jumbling, where a
SQLValueFunction needs "typmod" and "op" in the computation, ignoring
the rest, so the node definition in primnodes.h gains the required
node_attr. Making the execution path cheaper while avoiding the
collation tweaks for SQLValueFunction would be nice, but not at this
time of the year on this branch.

[1]: /messages/by-id/Y0+dHDYA46UnnLs/@paquier.xyz
--
Michael

Attachments:

v1-0001-Revert-Replace-SQLValueFunction-by-COERCE_SQL_SYN.patch (text/x-diff, +385 -217)
v1-0002-Revert-Switch-SQLValueFunction-on-name-to-use-COE.patch (text/x-diff, +92 -56)
#13 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#12)
Re: benchmark results comparing versions 15.2 and 16

Hi,

On 2023-05-15 14:20:24 +0900, Michael Paquier wrote:

> On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
> > On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
> > > On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
> > > > I can reproduce a significant regression due to f193883fc of a workload just
> > > > running
> > > > SELECT CURRENT_TIMESTAMP;
> > > >
> > > > A single session running it on my workstation via pgbench -Mprepared gets
> > > > before:
> > > > tps = 89359.128359 (without initial connection time)
> > > > after:
> > > > tps = 83843.585152 (without initial connection time)
> > > >
> > > > Obviously this is an extreme workload, but that nevertheless seems too large
> > > > to just accept...
>
> Extreme is adapted for a worst-case scenario. Looking at my notes
> from a few months back, that's kind of what I did on my laptop, which
> was the only machine I had at hand back then:
> - Compilation of code with -O2.

I assume without assertions as well?

> I have re-run a bit more pgbench (1 client, prepared query with a
> single SELECT on a SQL keyword, etc.). And, TBH, I am not seeing as
> much difference as you do (nothing with default pgbench setup, FWIW),
> still that's able to show a bit more difference than the other two
> cases.
>
> HEAD shows me an average output close to 43900 TPS (3 run of
> 60s each, for instance), while relying on SQLValueFunction shows an
> average of 45000TPS. That counts for ~2.4% output regression here
> on bigbox based on these numbers. Not a regression as high as
> mentioned above, still that's visible.

45k seems too low for a modern machine, given that I get > 80k in such a
workload, on a workstation with server CPUs (i.e. many cores, but not that
fast individually). Hence wondering about assertions being enabled...

I get quite variable performance if I don't pin client / server to the same
core, but even the slow performance is faster than 45k.

Greetings,

Andres Freund

#14 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#13)
Re: benchmark results comparing versions 15.2 and 16

On Mon, May 15, 2023 at 05:14:47PM -0700, Andres Freund wrote:
> On 2023-05-15 14:20:24 +0900, Michael Paquier wrote:
> > On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
> > Extreme is adapted for a worst-case scenario. Looking at my notes
> > from a few months back, that's kind of what I did on my laptop, which
> > was the only machine I had at hand back then:
> > - Compilation of code with -O2.
>
> I assume without assertions as well?

Yup, no assertions.

> 45k seems too low for a modern machine, given that I get > 80k in such a
> workload, on a workstation with server CPUs (i.e. many cores, but not that
> fast individually). Hence wondering about assertions being enabled...

Nope, disabled.

> I get quite variable performance if I don't pin client / server to the same
> core, but even the slow performance is faster than 45k.

Okay. You mean with something like taskset or similar, I guess?
--
Michael

#15 Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#14)
Re: benchmark results comparing versions 15.2 and 16

Hi,

On 2023-05-16 09:42:31 +0900, Michael Paquier wrote:

> > I get quite variable performance if I don't pin client / server to the same
> > core, but even the slow performance is faster than 45k.
>
> Okay. You mean with something like taskset or similar, I guess?

Yes. numactl --physcpubind ... in my case. Linux has an optimization where it
does not need to send an IPI when the client and server are scheduled on the
same core. For single threaded ping-pong tasks like pgbench -c1, that can make
a huge difference, particularly on larger CPUs. So you get a lot better
performance when forcing things to be colocated.
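Concretely, the pinning can look like this -- a hedged sketch (the core id,
$pfx, $PGDATA and the pgbench invocation are illustrative, not the exact
commands used):

```shell
# Run the server and the single-client benchmark on the same physical core,
# so the kernel can skip the cross-core IPI for each ping-pong wakeup.
numactl --physcpubind=2 "$pfx/bin/pg_ctl" -D "$PGDATA" start
numactl --physcpubind=2 pgbench -n -M prepared -c 1 -T 30 postgres
# taskset -c 2 <cmd> is an equivalent alternative where numactl is unavailable.
```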

Greetings,

Andres Freund

#16 Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#15)
Re: benchmark results comparing versions 15.2 and 16

On Mon, May 15, 2023 at 05:54:53PM -0700, Andres Freund wrote:

> Yes. numactl --physcpubind ... in my case. Linux has an optimization where it
> does not need to send an IPI when the client and server are scheduled on the
> same core. For single threaded ping-pong tasks like pgbench -c1, that can make
> a huge difference, particularly on larger CPUs. So you get a lot better
> performance when forcing things to be colocated.

Yes, that's not bringing the numbers higher with the simple cases I
reported previously, either.

Anyway, even if I cannot see such a high difference, I don't see how
to bring back the original numbers you are reporting without doing
more inlining and tying COERCE_SQL_SYNTAX more tightly within the
executor's portions for the FuncExprs, and there are the collation
assumptions as well. Perhaps that's not the correct thing to do with
SQLValueFunction remaining around, but nothing can be done for v16, so
I am planning to just revert the change before beta1, and look at it
again later, from scratch.
--
Michael

#17 Alexander Lakhin
exclusion@gmail.com
In reply to: Michael Paquier (#12)
Re: benchmark results comparing versions 15.2 and 16

15.05.2023 08:20, Michael Paquier wrote:

> I am not going to let that hang in the air with beta1 getting released
> next week, though, so attached are two patches to revert the change
> (these would be combined, just posted this way for clarity).

I can confirm that the patches improve (restore) performance for my test:
pgbench -i benchdb
pgbench -c 10 -T 300 benchdb

tps (over three runs):
HEAD (08c45ae23):
10238.441580, 10697.202119, 10706.764703

HEAD with the patches:
11134.510118, 11176.554996, 11150.338488

f193883fc~1 (240e0dbac)
11082.561388, 11233.604446, 11087.071768

15.3 (8382864eb)
11328.699555, 11128.057883, 11057.934392

Best regards,
Alexander

#18 Michael Paquier
michael@paquier.xyz
In reply to: Alexander Lakhin (#17)
Re: benchmark results comparing versions 15.2 and 16

On Tue, May 16, 2023 at 06:00:00PM +0300, Alexander Lakhin wrote:

> I can confirm that the patches improve (restore) performance for my test:
> pgbench -i benchdb
> pgbench -c 10 -T 300 benchdb

Thanks for running these!

> tps (over three runs):
> HEAD (08c45ae23):
> 10238.441580, 10697.202119, 10706.764703
>
> HEAD with the patches:
> 11134.510118, 11176.554996, 11150.338488
>
> f193883fc~1 (240e0dbac)
> 11082.561388, 11233.604446, 11087.071768
>
> 15.3 (8382864eb)
> 11328.699555, 11128.057883, 11057.934392

The numbers between f193883fc~1 and HEAD+patch are close to each
other. It does not seem to make the whole difference with 15.3, but
most of it. Could the remaining difference be explained by noise,
given the number patterns of the third runs?

I have now applied the revert, ready for beta1. Thanks for the
feedback!
--
Michael

#19 Alexander Lakhin
exclusion@gmail.com
In reply to: Michael Paquier (#18)
Re: benchmark results comparing versions 15.2 and 16

17.05.2023 04:25, Michael Paquier wrote:

> The numbers between f193883fc~1 and HEAD+patch are close to each
> other. It does not seem to make the whole difference with 15.3, but
> most of it. The difference can also be explained with some noise,
> based on the number patterns of the third runs?
>
> I have now applied the revert, ready for beta1. Thanks for the
> feedback!

Thank you for paying attention to it!

Yes, I ran the benchmark on my workstation, so the numbers could vary due to
parallel activity. Now I've compared 15.3 (8382864eb) with d8c3106bb and
1d369c9e9, this time with the CPU boost mode disabled:
1d369c9e9:
10007.130326, 10047.722122, 9920.612426, 10016.053846, 10060.606408
d8c3106bb:
10492.100485, 10505.326827, 10535.918137, 10625.904871, 10573.608859
15.3:
10458.752330, 10308.677192, 10366.496526, 10489.395275, 10319.458041

Best "1d369c9e9" is worse than "15.3" by 4.1 percent (10060.61 < 10489.40)
Average "1d369c9e9" is worse than "15.3" by 3.6 percent (10010.43 < 10388.56)

Best "d8c3106bb" is better than "15.3" by 1.3 percent (10625.90 > 10489.40)
Average "d8c3106bb" is better than "15.3" by 1.5 percent (10546.57 > 10388.56)

So it seems that there is nothing left on this plate.

Best regards,
Alexander

#20 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: MARK CALLAGHAN (#7)
Re: benchmark results comparing versions 15.2 and 16

On Tue, May 9, 2023 at 10:36 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:

> On Fri, May 5, 2023 at 10:01 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> > I have two more runs of the benchmark in progress so we will have 3
> > results for each of the test cases to confirm that the small regressions
> > are repeatable.

I repeated the benchmark a few times using a more recent PG16 build (git
sha 08c45ae2) and have yet to see any significant changes. So that is good
news. My testing scripts have been improved so I should be able to finish
the next round of tests in less time.

--
Mark Callaghan
mdcallag@gmail.com

#21 Andres Freund
andres@anarazel.de
In reply to: MARK CALLAGHAN (#20)

#22 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: Andres Freund (#21)

#23 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: MARK CALLAGHAN (#22)

#24 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: MARK CALLAGHAN (#23)

#25 David Rowley
dgrowleyml@gmail.com
In reply to: MARK CALLAGHAN (#23)

#26 Peter Geoghegan
In reply to: David Rowley (#25)

#27 MARK CALLAGHAN
mdcallag@gmail.com
In reply to: Peter Geoghegan (#26)

#28
In reply to: MARK CALLAGHAN (#27)

#29 Alexander Lakhin
exclusion@gmail.com
In reply to: Alexander Lakhin (#11)