Re: Index tuple killing code committed

Started by Tatsuo Ishii over 22 years ago · 3 messages
#1 Tatsuo Ishii
t-ishii@sra.co.jp

I found the following mail in my mail archive and tried the same test with
7.4 current. Contrary to my expectation, 7.4 showed some performance
degradation with continuous pgbench runs:

$ pgbench -c 5 -t 1000 -n test
tps = 57.444037 (including connections establishing)
tps = 57.455300 (excluding connections establishing)

$ pgbench -c 5 -t 1000 -n test
tps = 54.125785 (including connections establishing)
tps = 54.134871 (excluding connections establishing)

$ pgbench -c 5 -t 1000 -n test
tps = 51.116465 (including connections establishing)
tps = 51.124878 (excluding connections establishing)

$ pgbench -c 5 -t 1000 -n test
tps = 50.410659 (including connections establishing)
tps = 50.420215 (excluding connections establishing)

$ pgbench -c 5 -t 1000 test
tps = 46.791980 (including connections establishing)
tps = 46.799837 (excluding connections establishing)

Any idea?

The data is initialized with "pgbench -i -s 10".
--
Tatsuo Ishii

From: Tom Lane <tgl@sss.pgh.pa.us>
To: pgsql-hackers@postgresql.org
Date: Fri, 24 May 2002 16:42:55 -0400

Per previous discussion, I have committed changes that cause the btree
and hash index methods to mark index tuples "killed" the first time they
are fetched after becoming globally dead. Subsequently the killed
entries are not returned out of indexscans, saving useless heap fetches.
(I haven't changed rtree and gist yet; they will need some internal
restructuring to do this efficiently. Perhaps Oleg or Teodor would like
to take that on.)
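The bookkeeping described above can be sketched as a toy model (all names here are illustrative, not PostgreSQL's actual structures; the real code sets a hint bit on the index item):

```python
# Toy model of the "killed index tuple" optimization: the first index scan
# after a tuple becomes globally dead pays one wasted heap fetch and marks
# the index entry killed; later scans skip it without touching the heap.

class IndexEntry:
    def __init__(self, heap_tid):
        self.heap_tid = heap_tid
        self.killed = False  # analogous to a dead-item hint on the index entry

def index_scan(entries, heap_is_dead, stats):
    """Return live heap TIDs, killing entries found to be globally dead."""
    results = []
    for e in entries:
        if e.killed:
            continue                 # skip with no heap access at all
        stats["heap_fetches"] += 1   # expensive: visit the heap tuple
        if heap_is_dead(e.heap_tid):
            e.killed = True          # first fetch after death: mark it
        else:
            results.append(e.heap_tid)
    return results

entries = [IndexEntry(t) for t in range(10)]
dead = {2, 5, 7}
stats = {"heap_fetches": 0}

index_scan(entries, lambda t: t in dead, stats)
first = stats["heap_fetches"]        # first scan fetches all 10
index_scan(entries, lambda t: t in dead, stats)
second = stats["heap_fetches"] - first  # second scan skips the 3 killed
print(first, second)  # 10 7
```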

This seems to make a useful improvement in pgbench results. Yesterday's
CVS tip gave me these results:

(Running postmaster with "-i -F -B 1024", other parameters at defaults,
and pgbench initialized with "pgbench -i -s 10 bench")

$ time pgbench -c 5 -t 1000 -n bench
tps = 26.428787(including connections establishing)
tps = 26.443410(excluding connections establishing)
real 3:09.74
$ time pgbench -c 5 -t 1000 -n bench
tps = 18.838304(including connections establishing)
tps = 18.846281(excluding connections establishing)
real 4:26.41
$ time pgbench -c 5 -t 1000 -n bench
tps = 13.541641(including connections establishing)
tps = 13.545646(excluding connections establishing)
real 6:10.19

Note the "-n" switches here to prevent vacuums between runs; the point
is to observe the degradation as more and more dead tuples accumulate.

With the just-committed changes I get (starting from a fresh start):

$ time pgbench -c 5 -t 1000 -n bench
tps = 28.393271(including connections establishing)
tps = 28.410117(excluding connections establishing)
real 2:56.53
$ time pgbench -c 5 -t 1000 -n bench
tps = 23.498645(including connections establishing)
tps = 23.510134(excluding connections establishing)
real 3:33.89
$ time pgbench -c 5 -t 1000 -n bench
tps = 18.773239(including connections establishing)
tps = 18.780936(excluding connections establishing)
real 4:26.84

The remaining degradation is actually in seqscan performance, not
indexscan --- unless one uses a much larger -s setting, the planner will
think it ought to use seqscans for updating the "branches" and "tellers"
tables, since those nominally have just a few rows; and there's no way
to avoid scanning lots of dead tuples in a seqscan. Forcing indexscans
helps some in the former CVS tip:
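The seqscan-versus-indexscan point can be put as a toy cost model (numbers are illustrative guesses for a "-s 10", 5-client, 1000-transaction run, not measurements):

```python
# Toy cost model: a sequential scan must read every dead tuple version on
# every run, while an index scan that kills dead entries pays for each dead
# tuple only once, so its per-run cost stays flat.

def seqscan_cost(live, dead):
    # a seqscan has no way to avoid dead tuples
    return live + dead

def indexscan_cost(live, new_dead):
    # entries killed in earlier runs are skipped; only this run's newly
    # dead tuples cost a wasted heap fetch
    return live + new_dead

live = 10                 # "branches" nominally has just a few rows
dead_per_run = 5 * 1000   # 5 clients x 1000 transactions, one update each

seq = [seqscan_cost(live, dead_per_run * (run + 1)) for run in range(3)]
idx = [indexscan_cost(live, dead_per_run) for run in range(3)]
print(seq)  # grows run over run: [5010, 10010, 15010]
print(idx)  # stays flat:         [5010, 5010, 5010]
```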

$ PGOPTIONS="-fs" time pgbench -c 5 -t 1000 -n bench
tps = 28.840678(including connections establishing)
tps = 28.857442(excluding connections establishing)
real 2:53.9
$ PGOPTIONS="-fs" time pgbench -c 5 -t 1000 -n bench
tps = 25.670674(including connections establishing)
tps = 25.684493(excluding connections establishing)
real 3:15.7
$ PGOPTIONS="-fs" time pgbench -c 5 -t 1000 -n bench
tps = 22.593429(including connections establishing)
tps = 22.603928(excluding connections establishing)
real 3:42.7

and with the changes I get:

$ PGOPTIONS=-fs time pgbench -c 5 -t 1000 -n bench
tps = 29.445004(including connections establishing)
tps = 29.463948(excluding connections establishing)
real 2:50.3
$ PGOPTIONS=-fs time pgbench -c 5 -t 1000 -n bench
tps = 30.277968(including connections establishing)
tps = 30.301363(excluding connections establishing)
real 2:45.6
$ PGOPTIONS=-fs time pgbench -c 5 -t 1000 -n bench
tps = 30.209377(including connections establishing)
tps = 30.230646(excluding connections establishing)
real 2:46.0

This is the first time I have ever seen repeated pgbench runs without
substantial performance degradation. Not a bad result for a Friday
afternoon...

regards, tom lane

#2 Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tatsuo Ishii (#1)

Forgot to mention an important thing: I set enable_seqscan = off in
postgresql.conf.

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#1)

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I found the following mail in my mail archive and tried the same test with
7.4 current. Contrary to my expectation, 7.4 showed some performance
degradation with continuous pgbench runs:

I did some profiling and found that with repeated runs, an increasing
amount of time was being spent in _bt_check_unique, verifying that
updates aren't inserting duplicate keys. This is not really surprising
when running without vacuums between runs --- the number of dead tuples
will increase from run to run. I am not sure why the effect was not
visible in my old post that you quote. Perhaps it was masked by some
other inefficiency and is now visible in 7.4.
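The cost pattern, and why reordering the tests inside the uniqueness check helps, can be sketched with a toy model (hypothetical structure, not the real function, which works on btree pages and heap tuples):

```python
# Sketch of why test ordering in a uniqueness check matters: testing a
# cheap index-local "killed" flag before the expensive heap visibility
# fetch avoids visiting entries already known to be dead.

def check_unique(entries, stats, cheap_first):
    for e in entries:
        if cheap_first and e["killed"]:
            continue                    # cheap test first: no heap access
        stats["heap_fetches"] += 1      # simulate expensive heap fetch
        if not cheap_first and e["killed"]:
            continue                    # cheap test done too late
        if e["visible"]:
            return False                # a live duplicate exists
    return True

# 999 long-dead versions of the key plus one dead-but-not-yet-killed one
entries = [{"killed": True, "visible": False} for _ in range(999)]
entries.append({"killed": False, "visible": False})

expensive = {"heap_fetches": 0}
cheap = {"heap_fetches": 0}
assert check_unique(entries, expensive, cheap_first=False)
assert check_unique(entries, cheap, cheap_first=True)
print(expensive["heap_fetches"], cheap["heap_fetches"])  # 1000 1
```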

I was able to reduce the cost somewhat by changing the order of tests
in _bt_check_unique. With CVS tip, using default postmaster settings
(except fsync off) and "pgbench -i -s 10 bench" followed by a manual
CHECKPOINT command, I get this for repeated runs of
PGOPTIONS=-fs time pgbench -c 5 -t 1000 -n bench

tps = 113.387182 (including connections establishing)
tps = 113.433897 (excluding connections establishing)
0.95user 0.86system 0:44.51elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k

tps = 72.989851 (including connections establishing)
tps = 73.010209 (excluding connections establishing)
0.86user 0.89system 1:09.17elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 88.848571 (including connections establishing)
tps = 88.878369 (excluding connections establishing)
0.82user 0.92system 0:57.19elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 89.156320 (including connections establishing)
tps = 89.186299 (excluding connections establishing)
0.88user 0.92system 0:57.08elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 92.708733 (including connections establishing)
tps = 92.742335 (excluding connections establishing)
0.85user 0.83system 0:54.76elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 93.533412 (including connections establishing)
tps = 93.566385 (excluding connections establishing)
0.92user 0.87system 0:54.33elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 94.987538 (including connections establishing)
tps = 95.021715 (excluding connections establishing)
0.91user 0.88system 0:53.44elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 98.119179 (including connections establishing)
tps = 98.157492 (excluding connections establishing)
0.93user 0.94system 0:51.58elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 108.404186 (including connections establishing)
tps = 108.448242 (excluding connections establishing)
0.85user 0.90system 0:47.30elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 102.725845 (including connections establishing)
tps = 102.767392 (excluding connections establishing)
0.91user 0.90system 0:49.93elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

Not sure why the apparent up-trend in performance after the second run.
Maybe we've done about as much btree index splitting as we need, after
a few runs? Odd.

Same conditions, but with a vacuum between runs, ie no "-n":

tps = 112.266969 (including connections establishing)
tps = 112.314017 (excluding connections establishing)
0.91user 0.89system 0:44.62elapsed 4%CPU (0avgtext+0avgdata 0maxresident)k

tps = 77.936183 (including connections establishing)
tps = 77.959810 (excluding connections establishing)
0.90user 0.92system 1:07.20elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 81.669010 (including connections establishing)
tps = 81.694586 (excluding connections establishing)
0.88user 0.87system 1:02.23elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 86.493578 (including connections establishing)
tps = 86.522430 (excluding connections establishing)
0.87user 0.83system 0:59.61elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 100.759980 (including connections establishing)
tps = 100.797922 (excluding connections establishing)
0.90user 0.89system 0:50.10elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 77.215387 (including connections establishing)
tps = 77.238131 (excluding connections establishing)
0.95user 0.88system 1:06.35elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 83.298182 (including connections establishing)
tps = 83.324866 (excluding connections establishing)
0.88user 0.89system 1:01.23elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 90.534451 (including connections establishing)
tps = 90.566022 (excluding connections establishing)
0.95user 0.86system 0:56.40elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 84.085577 (including connections establishing)
tps = 84.113060 (excluding connections establishing)
0.93user 0.89system 1:05.35elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 95.637302 (including connections establishing)
tps = 95.672293 (excluding connections establishing)
0.92user 0.85system 0:52.80elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

Same with vacuum and checkpoint between runs:

tps = 107.715226 (including connections establishing)
tps = 107.758258 (excluding connections establishing)
0.86user 0.88system 0:46.51elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 78.816549 (including connections establishing)
tps = 78.839575 (excluding connections establishing)
0.87user 0.87system 1:03.68elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 90.150489 (including connections establishing)
tps = 90.184264 (excluding connections establishing)
0.90user 0.89system 0:55.74elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 81.648951 (including connections establishing)
tps = 81.674276 (excluding connections establishing)
0.93user 0.94system 1:01.54elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 86.850517 (including connections establishing)
tps = 86.879127 (excluding connections establishing)
0.84user 0.88system 0:57.86elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 85.776988 (including connections establishing)
tps = 85.809491 (excluding connections establishing)
0.83user 0.94system 0:58.53elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 80.293178 (including connections establishing)
tps = 80.317318 (excluding connections establishing)
0.90user 0.86system 1:02.49elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 80.691960 (including connections establishing)
tps = 80.716409 (excluding connections establishing)
0.84user 0.93system 1:02.19elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

tps = 85.067751 (including connections establishing)
tps = 85.095110 (excluding connections establishing)
0.92user 0.91system 0:59.02elapsed 3%CPU (0avgtext+0avgdata 0maxresident)k

tps = 80.134553 (including connections establishing)
tps = 80.158595 (excluding connections establishing)
0.87user 0.97system 1:02.61elapsed 2%CPU (0avgtext+0avgdata 0maxresident)k

It appears that much of the run-to-run variability is accounted for by
background checkpoints; forcing a checkpoint between runs helps but
doesn't eliminate the noise. (Probably if I'd changed the checkpoint
frequency settings to prevent intra-run checkpoints from happening,
I could get stable numbers. I'm out of time to play with it for now
though.)
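For what it's worth, on a server of that vintage the checkpoint-frequency knobs were checkpoint_segments and checkpoint_timeout; something like the following postgresql.conf fragment would be one way to try pushing checkpoints outside a roughly one-minute run (the values are untested guesses, not recommendations):

```
# postgresql.conf -- spread checkpoints out so none lands mid-run
checkpoint_segments = 30    # default was 3; allow more WAL between checkpoints
checkpoint_timeout = 1800   # seconds; default was 300
```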

regards, tom lane