7.1 vs. 7.2 on AIX 5L

Started by Tatsuo Ishii about 24 years ago, 25 messages
#1Tatsuo Ishii
t-ishii@sra.co.jp
2 attachment(s)

Hi,

I have made a new version of pgbench that can skip the updates to the
branches and tellers tables, which should significantly reduce
contention. (See the attached patches against current. Note that the
patches also remove the CHECKPOINT command that was run in
initialization mode (pgbench -i).) With the patches applied, you can
specify the -N option to skip updating the branches and tellers tables.
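
As an illustration only, the effect of -N on the statements of one transaction can be sketched as below. The helper steps_for_ttype and the step labels are hypothetical names chosen for this sketch; the patch itself makes the decision inside doOne() with the ttype argument.

```c
#include <stddef.h>

/* Transaction types, mirroring the ttype values used in the patch. */
enum
{
    TTYPE_TPCB = 0,
    TTYPE_SELECT_ONLY = 1,
    TTYPE_SKIP_BRANCH_TELLER = 2    /* selected by -N */
};

/* Tables touched by one TPC-B-like transaction, in order (labels only). */
static const char *const tpcb_steps[] = {
    "update accounts",
    "select accounts",
    "update tellers",
    "update branches",
    "insert history"
};

#define NSTEPS (sizeof(tpcb_steps) / sizeof(tpcb_steps[0]))

/*
 * Hypothetical helper: copy the steps a given ttype would run into out[]
 * and return how many were stored. With -N the two updates on the small,
 * heavily contended tables are dropped.
 */
static size_t
steps_for_ttype(int ttype, const char *out[], size_t max)
{
    size_t n = 0;
    size_t i;

    for (i = 0; i < NSTEPS; i++)
    {
        int is_select = (i == 1);
        int contended = (i == 2 || i == 3);    /* tellers, branches */

        if (ttype == TTYPE_SELECT_ONLY && !is_select)
            continue;
        if (ttype == TTYPE_SKIP_BRANCH_TELLER && contended)
            continue;
        if (n < max)
            out[n++] = tpcb_steps[i];
    }
    return n;
}
```

Every client updates the same few rows in branches and tellers, so dropping those two steps leaves only the accounts update and the history insert as writers.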

With the new pgbench, I ran a test with current and 7.1 and saw
not-so-small differences. Any idea to get better performance on 7.2
and AIX 5L combo?

7.2 with lwlock.patch rev.2
7.1.3
AIX 5L 4way with 4GB RAM
The testing script is the same as in my previous postings (except for -N
for pgbench, of course).

Attachments:

pgbench.patch (text/plain; charset=us-ascii)
Index: pgbench/README.pgbench
===================================================================
RCS file: /cvsroot/pgsql/contrib/pgbench/README.pgbench,v
retrieving revision 1.4
diff -c -r1.4 README.pgbench
*** pgbench/README.pgbench	2001/10/24 08:07:19	1.4
--- pgbench/README.pgbench	2002/01/09 01:31:26
***************
*** 122,127 ****
--- 122,130 ----
  	-S
  		Perform select only transactions instead of TPC-B.
  
+ 	-N
+ 		Skip updation of "branches" and "tellers" table. This
+ 		would result in lower contention of updates.
          -C
  		Establish connection for each transaction, rather than
  		doing it just once at begining of pgbench in the normal
Index: pgbench/pgbench.c
===================================================================
RCS file: /cvsroot/pgsql/contrib/pgbench/pgbench.c,v
retrieving revision 1.13
diff -c -r1.13 pgbench.c
*** pgbench/pgbench.c	2001/11/05 17:46:23	1.13
--- pgbench/pgbench.c	2002/01/09 01:31:26
***************
*** 4,10 ****
   * pgbench: a simple TPC-B like benchmark program for PostgreSQL
   * written by Tatsuo Ishii
   *
!  * Copyright (c) 2000  Tatsuo Ishii
   *
   * Permission to use, copy, modify, and distribute this software and
   * its documentation for any purpose and without fee is hereby
--- 4,10 ----
   * pgbench: a simple TPC-B like benchmark program for PostgreSQL
   * written by Tatsuo Ishii
   *
!  * Copyright (c) 2000-2002  Tatsuo Ishii
   *
   * Permission to use, copy, modify, and distribute this software and
   * its documentation for any purpose and without fee is hereby
***************
*** 95,101 ****
  static void
  usage()
  {
! 	fprintf(stderr, "usage: pgbench [-h hostname][-p port][-c nclients][-t ntransactions][-s scaling_factor][-n][-C][-v][-S][-U login][-P password][-d][dbname]\n");
  	fprintf(stderr, "(initialize mode): pgbench -i [-h hostname][-p port][-s scaling_factor][-U login][-P password][-d][dbname]\n");
  }
  
--- 95,101 ----
  static void
  usage()
  {
! 	fprintf(stderr, "usage: pgbench [-h hostname][-p port][-c nclients][-t ntransactions][-s scaling_factor][-n][-C][-v][-S][-N][-U login][-P password][-d][dbname]\n");
  	fprintf(stderr, "(initialize mode): pgbench -i [-h hostname][-p port][-s scaling_factor][-U login][-P password][-d][dbname]\n");
  }
  
***************
*** 168,174 ****
  
  /* process a transaction */
  static void
! doOne(CState * state, int n, int debug)
  {
  	char		sql[256];
  	PGresult   *res;
--- 168,174 ----
  
  /* process a transaction */
  static void
! doOne(CState * state, int n, int debug, int ttype)
  {
  	char		sql[256];
  	PGresult   *res;
***************
*** 295,306 ****
  			sprintf(sql, "select abalance from accounts where aid = %d", st->aid);
  			break;
  		case 3:
! 			sprintf(sql, "update tellers set tbalance = tbalance + %d where tid = %d\n",
! 					st->delta, st->tid);
! 			break;
  		case 4:
! 			sprintf(sql, "update branches set bbalance = bbalance + %d where bid = %d", st->delta, st->bid);
! 			break;
  		case 5:
  			sprintf(sql, "insert into history(tid,bid,aid,delta,mtime) values(%d,%d,%d,%d,'now')",
  					st->tid, st->bid, st->aid, st->delta);
--- 295,312 ----
  			sprintf(sql, "select abalance from accounts where aid = %d", st->aid);
  			break;
  		case 3:
! 			if (ttype == 0)
! 			{
! 			    sprintf(sql, "update tellers set tbalance = tbalance + %d where tid = %d\n",
! 				    st->delta, st->tid);
! 			    break;
! 			}
  		case 4:
! 			if (ttype == 0)
! 			{
! 			    sprintf(sql, "update branches set bbalance = bbalance + %d where bid = %d", st->delta, st->bid);
! 			    break;
! 			}
  		case 5:
  			sprintf(sql, "insert into history(tid,bid,aid,delta,mtime) values(%d,%d,%d,%d,'now')",
  					st->tid, st->bid, st->aid, st->delta);
***************
*** 549,554 ****
--- 555,561 ----
  				exit(1);
  			}
  
+ #ifdef NOT_USED
  			/*
  			 * do a checkpoint to purge the old WAL logs
  			 */
***************
*** 558,563 ****
--- 565,571 ----
  				fprintf(stderr, "%s", PQerrorMessage(con));
  				exit(1);
  			}
+ #endif /* NOT_USED */
  		}
  	}
  
***************
*** 585,590 ****
--- 593,599 ----
  				t2;
  	int			i;
  	int			normal_xacts = 0;
+ 	char	*s;
  
  	for (i = 0; i < nclients; i++)
  		normal_xacts += state[i].cnt;
***************
*** 594,601 ****
  
  	t2 = (tv3->tv_sec - tv2->tv_sec) * 1000000.0 + (tv3->tv_usec - tv2->tv_usec);
  	t2 = normal_xacts * 1000000.0 / t2;
  
! 	printf("transaction type: %s\n", ttype == 0 ? "TPC-B (sort of)" : "SELECT only");
  	printf("scaling factor: %d\n", tps);
  	printf("number of clients: %d\n", nclients);
  	printf("number of transactions per client: %d\n", nxacts);
--- 603,617 ----
  
  	t2 = (tv3->tv_sec - tv2->tv_sec) * 1000000.0 + (tv3->tv_usec - tv2->tv_usec);
  	t2 = normal_xacts * 1000000.0 / t2;
+ 
+ 	if (ttype == 0)
+ 	    s = "TPC-B (sort of)";
+ 	else if (ttype == 2)
+ 	    s = "Update only accounts";
+ 	else
+ 	    s = "SELECT only";
  
! 	printf("transaction type: %s\n", s);
  	printf("scaling factor: %d\n", tps);
  	printf("number of clients: %d\n", nclients);
  	printf("number of transactions per client: %d\n", nxacts);
***************
*** 619,625 ****
  	int			is_full_vacuum = 0;		/* do full vacuum before testing? */
  	int			debug = 0;		/* debug flag */
  	int			ttype = 0;		/* transaction type. 0: TPC-B, 1: SELECT
! 								 * only */
  
  	static CState state[MAXCLIENTS];	/* clients status */
  
--- 635,642 ----
  	int			is_full_vacuum = 0;		/* do full vacuum before testing? */
  	int			debug = 0;		/* debug flag */
  	int			ttype = 0;		/* transaction type. 0: TPC-B, 1: SELECT
! 								 * only 
! 				 2: skip updation of branches and tellers */
  
  	static CState state[MAXCLIENTS];	/* clients status */
  
***************
*** 641,647 ****
  	PGconn	   *con;
  	PGresult   *res;
  
! 	while ((c = getopt(argc, argv, "ih:nvp:dc:t:s:U:P:CS")) != EOF)
  	{
  		switch (c)
  		{
--- 658,664 ----
  	PGconn	   *con;
  	PGresult   *res;
  
! 	while ((c = getopt(argc, argv, "ih:nvp:dc:t:s:U:P:CNS")) != EOF)
  	{
  		switch (c)
  		{
***************
*** 666,671 ****
--- 683,691 ----
  			case 'S':
  				ttype = 1;
  				break;
+ 			case 'N':
+ 				ttype = 2;
+ 				break;
  			case 'c':
  				nclients = atoi(optarg);
  				if (nclients <= 0 || nclients > MAXCLIENTS)
***************
*** 851,858 ****
  	/* send start up quries in async manner */
  	for (i = 0; i < nclients; i++)
  	{
! 		if (ttype == 0)
! 			doOne(state, i, debug);
  		else if (ttype == 1)
  			doSelectOnly(state, i, debug);
  	}
--- 871,878 ----
  	/* send start up quries in async manner */
  	for (i = 0; i < nclients; i++)
  	{
! 		if (ttype == 0 || ttype == 2)
! 			doOne(state, i, debug, ttype);
  		else if (ttype == 1)
  			doSelectOnly(state, i, debug);
  	}
***************
*** 915,922 ****
  		{
  			if (state[i].con && FD_ISSET(PQsocket(state[i].con), &input_mask))
  			{
! 				if (ttype == 0)
! 					doOne(state, i, debug);
  				else if (ttype == 1)
  					doSelectOnly(state, i, debug);
  			}
--- 935,942 ----
  		{
  			if (state[i].con && FD_ISSET(PQsocket(state[i].con), &input_mask))
  			{
! 				if (ttype == 0 || ttype == 2)
! 					doOne(state, i, debug, ttype);
  				else if (ttype == 1)
  					doSelectOnly(state, i, debug);
  			}
result-Jan-09.png (image/png)
#2Hannu Krosing
hannu@tm.ee
In reply to: Tatsuo Ishii (#1)
Re: 7.1 vs. 7.2 on AIX 5L

Could you add some labels to lines as Tom did ?

We can only guess which line is which.

--------------
Hannu

#3Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Hannu Krosing (#2)
Re: 7.1 vs. 7.2 on AIX 5L

Could you add some labels to lines as Tom did ?

We can only guess which line is which.

I thought I already added labels. 7.1 is "+"(green one), and 7.2 is
"rhombus"(red one).
--
Tatsuo Ishii

#4Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#1)
Re: 7.1 vs. 7.2 on AIX 5L

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I have made a new version of pgbench that can skip the updates to the
branches and tellers tables, which should significantly reduce
contention.

I used this version of pgbench in some fresh runs on RedHat's 4-way SMP
Linux box. I did several test runs under varying conditions (pgbench
scale 500 or 50, checkpoint_segments/wal_files either default 3/0 or
30/5, fsync on or off). I compared current CVS tip (including the
now-committed lwlock rev 2 patch) to 7.1.3. The results are attached.
As you can see, current beats 7.1 pretty much across the board on that
hardware. The reason seems to be revealed by looking at vmstat output.
Typical "vmstat 5" output for 7.1.3 (here in a 6-client pgbench -N
run) is

   procs                         memory    swap          io      system         cpu
 r  b  w   swpd    free   buff    cache  si  so    bi    bo   in     cs  us  sy  id
 1  0  0      0  108444   8920  4917092   0   0   213     0  170   4814   0   1  99
 1  0  0      0  103592   8948  4921912   0   0   234   357  230   4811   1   1  98
 0  0  0      0   98776   8968  4926704   0   0   233   428  235   4854   1   1  97
 0  0  0      0   94300   8980  4931168   0   0   216   423  229   4809   1   2  97
 0  0  0      0   89960   8984  4935504   0   0   209   771  421   4723   2   2  96
 0  0  0      0   69280   9016  4956140   0   0   205   842  457   4645   1   2  96

The system is capable of much greater I/O rates, so neither disks nor
CPUs are exactly exerting themselves here. In contrast, 7.2 shows:

   procs                         memory    swap          io      system         cpu
 r  b  w   swpd    free   buff    cache  si  so    bi    bo   in     cs  us  sy  id
 2  0  0      0 2927344   9148  1969356   0   0     0  5772  102  13753  61  32   7
 7  0  0      0 3042272   9148  1969716   0   0     0  2267 2400  14083  58  32  10
 5  0  0      0 3042168   9148  1970100   0   0     0  2734 1028  12994  53  37  11

I think that 7.1's poor showing here is undoubtedly due to the spinlock
backoff algorithm it used --- there is no other way to explain 99% idle
CPU than that all of the backends are caught in 10-msec select() waits.
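
To make the mechanism concrete, here is a minimal sketch of a spin-then-sleep lock in which a waiter that fails to get the lock falls into a 10 ms select() wait. It is not the 7.1 s_lock.c code: the names bo_lock and bo_unlock, the spin count, and the use of C11 atomic_flag in place of the platform TAS are all assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

#define BO_SPINS_BEFORE_SLEEP 100     /* illustrative value only */
#define BO_SLEEP_USEC         10000   /* the 10 ms wait discussed above */

typedef atomic_flag bo_slock_t;       /* stand-in for the platform slock_t */

/* Sleep without using CPU: select() with no descriptors acts as a timer. */
static void
bo_sleep(long usec)
{
    struct timeval delay;

    delay.tv_sec = usec / 1000000;
    delay.tv_usec = usec % 1000000;
    (void) select(0, NULL, NULL, NULL, &delay);
}

/* Acquire the lock; returns how many times the caller had to sleep. */
static int
bo_lock(bo_slock_t *lock)
{
    int spins = 0;
    int sleeps = 0;

    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
    {
        if (++spins >= BO_SPINS_BEFORE_SLEEP)
        {
            bo_sleep(BO_SLEEP_USEC);
            sleeps++;
            spins = 0;
        }
    }
    return sleeps;
}

static void
bo_unlock(bo_slock_t *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

If most waiters end up in bo_sleep(), the CPUs sit idle even though work is queued, which is the pattern the 7.1.3 vmstat output suggests.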

With the new pgbench, I ran a test with current and 7.1 and saw
not-so-small differences. Any idea to get better performance on 7.2
and AIX 5L combo?

I'm thinking more and more that there must be something weird about the
cs() routine that we use for spinlocks on AIX. Could someone dig into
that and find exactly what it does and whether it's got any performance
issues?

regards, tom lane

#5Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#4)
Re: 7.1 vs. 7.2 on AIX 5L

Tom Lane wrote:

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

I have made a new version of pgbench that can skip the updates to the
branches and tellers tables, which should significantly reduce
contention.

I used this version of pgbench in some fresh runs on RedHat's 4-way SMP
Linux box. I did several test runs under varying conditions (pgbench
scale 500 or 50, checkpoint_segments/wal_files either default 3/0 or
30/5, fsync on or off). I compared current CVS tip (including the
now-committed lwlock rev 2 patch) to 7.1.3. The results are attached.
As you can see, current beats 7.1 pretty much across the board on that
hardware. The reason seems to be revealed by looking at vmstat output.
Typical "vmstat 5" output for 7.1.3 (here in a 6-client pgbench -N
run) is

Those are dramatic graphs. Is it the WAL increase that made 7.2 much
faster?

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
#6Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Bruce Momjian (#5)
Re: 7.1 vs. 7.2 on AIX 5L

With the new pgbench, I ran a test with current and 7.1 and saw
not-so-small differences. Any idea to get better performance on 7.2
and AIX 5L combo?

I'm thinking more and more that there must be something weird about the
cs() routine that we use for spinlocks on AIX. Could someone dig into
that and find exactly what it does and whether it's got any performance
issues?

The manual page says:

Note: The cs subroutine is only provided to support binary compatibility with
AIX Version 3 applications. When writing new applications, it is not
recommended to use this subroutine; it may cause reduced performance in the
future. Applications should use the compare_and_swap subroutine, unless they
need to use unaligned memory locations.

I once tried to replace cs() with compare_and_swap() but saw worse performance
for the limited testing I did (probably on a single CPU). Maybe the "threat"
that performance will be reduced is actually true on AIX 5 now.

The thing would imho now be for Tatsuo to try to replace cs with compare_and_swap,
and see what happens on AIX 5.

Andreas

PS: Would the __powerpc__ assembly work on AIX machines ?

#7Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Zeugswetter Andreas SB SD (#6)
Re: 7.1 vs. 7.2 on AIX 5L

I'm thinking more and more that there must be something weird about the
cs() routine that we use for spinlocks on AIX. Could someone dig into
that and find exactly what it does and whether it's got any performance
issues?

The manual page says:

Note: The cs subroutine is only provided to support binary compatibility with
AIX Version 3 applications. When writing new applications, it is not
recommended to use this subroutine; it may cause reduced performance in the
future. Applications should use the compare_and_swap subroutine, unless they
need to use unaligned memory locations.

I once tried to replace cs() with compare_and_swap() but saw worse performance
for the limited testing I did (probably on a single CPU). Maybe the "threat"
that performance will be reduced is actually true on AIX 5 now.

The thing would imho now be for Tatsuo to try to replace cs with compare_and_swap,
and see what happens on AIX 5.

Andreas

PS: Would the __powerpc__ assembly work on AIX machines ?

I wish I could do that but...

From the manual page of compare_and_swap (see below):

What I'm not sure about is this part:

Note: If compare_and_swap is used as a locking primitive, insert an
isync at the start of any critical sections.

What is "isync"? Also, how can I implement calling compare_and_swap in
the assembly language?
--
Tatsuo Ishii

-----------------------------------------------------------------------
boolean_t compare_and_swap (word_addr, old_val_addr, new_val)

atomic_p word_addr;

int *old_val_addr;

int new_val;

Description

The compare_and_swap subroutine performs an atomic operation which compares
the contents of a single word variable with a stored old value. If the
values are equal, a new value is stored in the single word variable and TRUE
is returned; otherwise, the old value is set to the current value of the
single word variable and FALSE is returned.

The compare_and_swap subroutine is useful when a word value must be updated
only if it has not been changed since it was last read.

Note: The word containing the single word variable must be aligned on
a full word boundary.

Note: If compare_and_swap is used as a locking primitive, insert an
isync at the start of any critical sections.

Parameters

word_addr Specifies the address of the single word variable.

old_val_addr Specifies the address of the old value to be checked against
(and conditionally updated with) the value of the single word variable.

new_val Specifies the new value to be conditionally assigned to the single
word variable.

Return Values

TRUE Indicates that the single word variable was equal to the old value, and
has been set to the new value.

FALSE Indicates that the single word variable was not equal to the old value,
and that its current value has been returned in the location where the old
value was previously stored.

Implementation Specifics

The compare_and_swap subroutine is part of the Base Operating System (BOS)
Runtime.

Related Information

The fetch_and_add, fetch_and_and, and fetch_and_or subroutines.
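
The semantics described in the man page can be sketched portably with C11 atomics. This is a stand-in, not the AIX routine: demo_compare_and_swap and demo_cas_tas are hypothetical names, and the C11 call used here is sequentially consistent by default, so it does not reproduce the "insert an isync" caveat of the AIX subroutine.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Portable stand-in for the documented behaviour: if *word equals
 * *old_val_addr, store new_val and return true; otherwise copy the
 * current value of *word into *old_val_addr and return false.
 */
static bool
demo_compare_and_swap(atomic_int *word, int *old_val_addr, int new_val)
{
    return atomic_compare_exchange_strong(word, old_val_addr, new_val);
}

/*
 * A test-and-set built on top of it, using the TAS convention of
 * returning 0 when the lock was obtained and nonzero when it was held.
 */
static int
demo_cas_tas(atomic_int *lock)
{
    int expected = 0;

    return demo_compare_and_swap(lock, &expected, 1) ? 0 : 1;
}
```

Note that a TAS only needs the success or failure result; the updated old value is discarded, which is part of why the interface looks heavier than a lock needs.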

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tatsuo Ishii (#7)
Re: 7.1 vs. 7.2 on AIX 5L

Tatsuo Ishii <t-ishii@sra.co.jp> writes:

[ compare_and_swap man page ]

Looks kinda baroque. What about the referenced fetch_and_or routine?
If that's atomic it might be closer to TAS semantics.
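
For comparison, a fetch-and-or based test-and-set might look like the sketch below. It uses C11 atomic_fetch_or as a portable stand-in and assumes, as a TAS requires, that the call returns the previous value of the word; fo_tas and fo_unlock are hypothetical names, and nothing here says whether the AIX routine provides the memory synchronization a lock needs.

```c
#include <stdatomic.h>

/*
 * Set the low bit of the lock word and return its previous value:
 * 0 means the caller obtained the lock, nonzero means it was already held.
 */
static unsigned int
fo_tas(atomic_uint *lock)
{
    return atomic_fetch_or(lock, 1u);
}

/* Release the lock by clearing the word. */
static void
fo_unlock(atomic_uint *lock)
{
    atomic_store(lock, 0u);
}
```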

regards, tom lane

#9Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#8)
Re: 7.1 vs. 7.2 on AIX 5L

What is "isync"? Also, how can I implement calling

sorry no idea :-(

compare_and_swap in the assembly language?

In assembly language you would do the locking yourself,
the code would be identical, or at least very similar to
the __APPLE__ __ppc__ code.

sample lock code supplied in the PowerPC
Architecture book (page 254):

unlock: sync
        stw 0, lock_location
        blr

In the unlock case the sync is all that is necessary to make all changes
protected by the lock globally visible. Note that no lwarx or stwcx. is
needed.

lock:
1:      lwarx r5, lock_location
        cmpiw r5, 0
        bne 2f:
        stwcx. 1, lock_location
        bne 1b
        isync
        blr
2:      need to indicate the lock is already locked (could spin if you want to
        in this case or put on a sleep queue)
        blr

isync only affects the running processor.

I have tried LinuxPPC's TAS code, but AIX's assembler complains that
lwarx and stwcx are unsupported ops. So it seems that we actually need
to tweak your code.
--
Tatsuo Ishii

#10Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tatsuo Ishii (#9)
Re: 7.1 vs. 7.2 on AIX 5L

isync only affects the running processor.

I have tried LinuxPPC's TAS code, but AIX's assembler complains that
lwarx and stwcx are unsupported ops. So it seems that we actually need
to tweak your code.

The problem is that the default on AIX is to produce architecture-independent
code (arch=COM). Unfortunately, not all AIX architectures seem to have these
instructions. With arch=ppc it works (two lines adjusted: .globl .tas and .tas:).
My worry is that the Architecture book says that the isync is necessary on SMP.
I wonder why that would not also apply to LinuxPPC or Apple.

Andreas

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#10)
Re: 7.1 vs. 7.2 on AIX 5L

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

The problem is that the default on AIX is to produce
architecture-independent code (arch=COM). Unfortunately, not all AIX
architectures seem to have these instructions.

AIX does more than one architecture? Hmm, s_lock.h doesn't know that...

With arch=ppc it works (two lines
adjusted: .globl .tas and .tas:). My worry is that the Architecture
book says that the isync is necessary on SMP. I wonder why that would
not also apply to LinuxPPC or Apple.

I doubt we've had anyone test on SMP PPC machines, other than Tatsuo's
tests on AIX. Worse, I'd imagine that any failures from a missing sync
instruction would be rare and tough to reproduce. So there may indeed
be a lurking problem here.

regards, tom lane

#12Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#11)
Re: 7.1 vs. 7.2 on AIX 5L

Tom writes:

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

The problem is that the default on AIX is to produce
architecture-independent code (arch=COM). Unfortunately, not all AIX
architectures seem to have these instructions.

AIX does more than one architecture? Hmm, s_lock.h doesn't
know that...

It does not need to, since all of them currently use cs().
The compilers by default generate executables that run on all
of the different processors (they are all Risc).

Andreas

#13Alex Avriette
a_avriette@acs.org
In reply to: Zeugswetter Andreas SB SD (#12)
Re: 7.1 vs. 7.2 on AIX 5L

Um, and then there's darwin on the dual g4's.... SMP POSIX code on PPC is
not limited to AIX and Linux (ugh). freebsd 5.0 is rumored to be smp and run
on ppc as well.

alex

#14Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Alex Avriette (#13)
Re: 7.1 vs. 7.2 on AIX 5L

Tom wrote:

[ compare_and_swap man page ]

Looks kinda baroque. What about the referenced fetch_and_or routine?
If that's atomic it might be closer to TAS semantics.

Thanks for the hint!
Tatsuo, can you try the performance of fetch_and_or on your machine ?
Replace the cs() line in s_lock.h with the following:

<-- #define TAS(lock) cs((int *) (lock), 0, 1)
--> #define TAS(lock) fetch_and_or(lock, 1)

On my machine the various implementations have the following runtimes
(with a modified s_lock test that does not sleep, and SPINS_PER_DELAY 100000):

with LinuxPPC asm: 1m5.16s (which may not work reliably on SMP)
with cs: 1m12.25s
with fetch_and_or: 1m26.71s

I don't know if that is enough difference to worry about.

Andreas

#15Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Zeugswetter Andreas SB SD (#14)
Re: 7.1 vs. 7.2 on AIX 5L

On my machine the various implementations have the following runtimes
(with a modified s_lock test that does not sleep, and SPINS_PER_DELAY 100000):

with LinuxPPC asm: 1m5.16s (which may not work reliably on SMP)
with cs: 1m12.25s
with fetch_and_or: 1m26.71s

Aah, there we have it. Same test on a slower 4Way SMP:

with LinuxPPC asm: 2m9.340s
with cs: 10m11.15s
with fetch_and_or: 3m55.19s

These numbers look more alarming.
Unfortunately the man page for fetch_and_or does not mention anything
about using it as a locking primitive. It is documented atomic though,
so I guess that is enough.
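
To put the timings on a common scale, the runtimes can be converted to seconds and divided by the LinuxPPC asm baseline. The two helpers below are hypothetical names used only for this arithmetic.

```c
/* Convert a timing given as minutes and seconds (e.g. 10m11.15s) to seconds. */
static double
to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Slowdown factor of one run relative to a baseline run. */
static double
slowdown(double run_sec, double baseline_sec)
{
    return run_sec / baseline_sec;
}
```

For example, slowdown(to_seconds(10, 11.15), to_seconds(2, 9.340)) gives roughly 4.7 for cs on the 4-way machine, and the same calculation for fetch_and_or gives roughly 1.8. On the first machine the corresponding factors are only about 1.1 and 1.3.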

Andreas

#16Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Zeugswetter Andreas SB SD (#15)
Re: 7.1 vs. 7.2 on AIX 5L

On my machine the various implementations have the following runtimes
(with a modified s_lock test that does not sleep, and SPINS_PER_DELAY 100000):

with LinuxPPC asm: 1m5.16s (which may not work reliably on SMP)
with cs: 1m12.25s
with fetch_and_or: 1m26.71s

Aah, there we have it. Same test on a slower 4Way SMP:

with LinuxPPC asm: 2m9.340s
with cs: 10m11.15s
with fetch_and_or: 3m55.19s

These numbers look more alarming.
Unfortunately the man page for fetch_and_or does not mention anything
about using it as a locking primitive. It is documented atomic though,
so I guess that is enough.

Andreas

I ran pgbench -c 10 -t 20 several times with the modification you
suggested and once got a hang (all backends sleeping with the status
showing "COMMIT"). I also got an error:

NOTICE: LockRelease: no such lock

I'm afraid now the locking is broken. Will look into more.
--
Tatsuo Ishii

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#15)
Re: 7.1 vs. 7.2 on AIX 5L

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

Unfortunately the man page for fetch_and_or does not mention anything
about using it as a locking primitive. It is documented atomic though,
so I guess that is enough.

You could disassemble it and see if it includes that "isync" instruction
or not.

Given Tatsuo's later report, I'm afraid the answer is "not".

regards, tom lane

#18Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Tom Lane (#17)
Re: 7.1 vs. 7.2 on AIX 5L

NOTICE: LockRelease: no such lock

I'm afraid now the locking is broken. Will look into more.

Next try would be _check_lock (found in sys/atomic_op.h):

Description (from man page)

The _check_lock subroutine performs an atomic (uninterruptible) sequence of
operations. The compare_and_swap subroutine is similar, but does not issue
synchronization instructions and therefore is inappropriate for updating lock
words.

replace the TAS define with:
#define TAS(lock) _check_lock(lock, 0, 1)

with LinuxPPC asm: 2m9.340s
with cs: 10m11.15s
with fetch_and_or: 3m55.19s

with _check_lock: 3m29.990s

I think that the culprit may also be the S_UNLOCK (use _clear_lock) ?
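
A matched lock/unlock pair along these lines could be sketched as follows. On AIX it uses _check_lock and _clear_lock from sys/atomic_op.h, assuming _check_lock returns FALSE (0) when the swap succeeded; the DEMO_ macro names are hypothetical, and the non-AIX branch is only a C11 fallback so the sketch compiles elsewhere.

```c
#ifdef _AIX
#include <sys/atomic_op.h>

typedef int demo_slock_t;

/* Nonzero result means the word was not 0, i.e. the lock was already held. */
#define DEMO_TAS(lock)      _check_lock((atomic_p) (lock), 0, 1)
#define DEMO_S_UNLOCK(lock) _clear_lock((atomic_p) (lock), 0)

#else                           /* fallback for non-AIX systems, illustration only */
#include <stdatomic.h>

typedef atomic_int demo_slock_t;

static int
demo_tas_fallback(demo_slock_t *lock)
{
    int expected = 0;

    /* Returns 0 when the lock was obtained, 1 when it was already held. */
    return !atomic_compare_exchange_strong(lock, &expected, 1);
}

#define DEMO_TAS(lock)      demo_tas_fallback(lock)
#define DEMO_S_UNLOCK(lock) atomic_store((lock), 0)
#endif
```

The point of pairing the two is that the release side goes through a routine meant for lock words instead of a plain store.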

Andreas

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: Zeugswetter Andreas SB SD (#18)
Re: 7.1 vs. 7.2 on AIX 5L

"Zeugswetter Andreas SB SD" <ZeugswetterA@spardat.at> writes:

I think that the culprit may also be the S_UNLOCK (use _clear_lock) ?

If isync is anything like the MB instruction on Alpha (ie, force memory
updates to occur before proceeding), then it must be used in S_UNLOCK
as well as S_LOCK. All updates done within the spinlocked section must
reach memory before another processor is allowed to acquire the lock.
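
The requirement stated here, that writes made inside the locked section are visible before another processor can take the lock, corresponds to what C11 calls acquire and release ordering. The sketch below uses that as a portable analogy; it makes no claim about what isync does on PowerPC, and guarded_t and its functions are hypothetical names.

```c
#include <stdatomic.h>

typedef struct
{
    atomic_int  lock;       /* 0 = free, 1 = held */
    long        balance;    /* plain data protected by the lock */
} guarded_t;

static void
guarded_lock(guarded_t *g)
{
    int expected = 0;

    /* Acquire: accesses after this point may not be moved before it. */
    while (!atomic_compare_exchange_weak_explicit(&g->lock, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0;       /* a failed exchange overwrote it; reset and retry */
}

static void
guarded_unlock(guarded_t *g)
{
    /* Release: earlier writes must be visible before the lock reads as free. */
    atomic_store_explicit(&g->lock, 0, memory_order_release);
}

/* Update the protected value under the lock and return the new value. */
static long
guarded_add(guarded_t *g, long delta)
{
    long result;

    guarded_lock(g);
    g->balance += delta;
    result = g->balance;
    guarded_unlock(g);
    return result;
}
```

If the unlock were a plain store with no ordering, another processor could see the lock as free before it sees the updated balance, which would produce rare, hard-to-reproduce failures.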

regards, tom lane

#20Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Zeugswetter Andreas SB SD (#18)
Re: 7.1 vs. 7.2 on AIX 5L

NOTICE: LockRelease: no such lock

I'm afraid now the locking is broken. Will look into more.

Next try would be _check_lock (found in sys/atomic_op.h):

Description (from man page)

The _check_lock subroutine performs an atomic (uninterruptible) sequence of
operations. The compare_and_swap subroutine is similar, but does not issue
synchronization instructions and therefore is inappropriate for updating lock
words.

replace the TAS define with:
#define TAS(lock) _check_lock(lock, 0, 1)

with LinuxPPC asm: 2m9.340s
with cs: 10m11.15s
with fetch_and_or: 3m55.19s

with _check_lock: 3m29.990s

I think that the culprit may also be the S_UNLOCK (use _clear_lock) ?

Thanks. I will try with it.

BTW, I'm still wondering why 7.2 is slower than 7.1 on AIX. Tom said
cs() is responsible for that. But 7.1 uses cs() just as 7.2 does, so
cs() does not seem to explain the performance difference.
--
Tatsuo Ishii

#21Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tatsuo Ishii (#20)
Re: 7.1 vs. 7.2 on AIX 5L

Thanks. I will try with it.

BTW, I'm still wondering why 7.2 is slower than 7.1 on AIX. Tom said
cs() is responsible for that. But 7.1 uses cs() just as 7.2 does, so
cs() does not seem to explain the performance difference.

cs() may be used more heavily on 7.2 --- not sure.

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#21)
Re: 7.1 vs. 7.2 on AIX 5L

Bruce Momjian <pgman@candle.pha.pa.us> writes:

BTW, I'm still wondering why 7.2 is slower than 7.1 on AIX. Tom said
cs() is responsible for that. But 7.1 uses cs() just as 7.2 does, so
cs() does not seem to explain the performance difference.

cs() may be used more heavily on 7.2 --- not sure.

Most of the places that were SpinLockAcquire ... SpinRelease on 7.1 are
now LWLockAcquire ... LWLockRelease on 7.2. And each of LWLockAcquire
and LWLockRelease does a SpinLockAcquire + SpinRelease + some other
computation. So there's no doubt that we expend more cycles; cycles
that are wasted in a pure-single-backend scenario. However, if the
spinlock operations are as cheap as they should be, it's hard to believe
that the extra cost would be enough to notice. And indeed we haven't
been able to measure any penalty on Linux, HPUX, nor BSD (right Bruce?).
So I'm still suspicious that our cs()-based spinlock for AIX is carrying
some unexpected cost.
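
The cost structure described here can be illustrated with a toy shared/exclusive lock whose state is protected by a spinlock. This is a deliberately simplified sketch, not the 7.2 lwlock code: demo_lwlock and its functions are hypothetical, waiting is replaced by a try-style call, and the spin_ops counter exists only to show that every operation pays for one spinlock acquire and release.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_flag mutex;      /* spinlock protecting the fields below */
    bool        exclusive;  /* held in exclusive mode? */
    int         shared;     /* number of shared holders */
    long        spin_ops;   /* spinlock acquisitions, for illustration */
} demo_lwlock;

static void
demo_lw_init(demo_lwlock *l)
{
    atomic_flag_clear(&l->mutex);
    l->exclusive = false;
    l->shared = 0;
    l->spin_ops = 0;
}

static void
lw_spin_acquire(demo_lwlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->mutex, memory_order_acquire))
        ;                   /* busy-wait; a real lock would back off */
    l->spin_ops++;
}

static void
lw_spin_release(demo_lwlock *l)
{
    atomic_flag_clear_explicit(&l->mutex, memory_order_release);
}

/* Try to take the lock; returns false on conflict instead of sleeping. */
static bool
demo_lw_try_acquire(demo_lwlock *l, bool exclusive)
{
    bool ok;

    lw_spin_acquire(l);
    if (exclusive)
    {
        ok = !l->exclusive && l->shared == 0;
        if (ok)
            l->exclusive = true;
    }
    else
    {
        ok = !l->exclusive;
        if (ok)
            l->shared++;
    }
    lw_spin_release(l);
    return ok;
}

static void
demo_lw_release(demo_lwlock *l)
{
    lw_spin_acquire(l);
    if (l->exclusive)
        l->exclusive = false;
    else if (l->shared > 0)
        l->shared--;
    lw_spin_release(l);
}
```

One acquire plus one release of the toy lock costs two spinlock round trips where a bare spinlock would cost one, so a slow TAS primitive is paid for twice as often.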

regards, tom lane

#23Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#22)
Re: 7.1 vs. 7.2 on AIX 5L

that the extra cost would be enough to notice. And indeed we haven't
been able to measure any penalty on Linux, HPUX, nor BSD (right Bruce?).

Right.

#24Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tatsuo Ishii (#20)
Re: 7.1 vs. 7.2 on AIX 5L

Added to TODO:

* Evaluate AIX cs() spinlock macro for performance optimizations (Tatsuo)

#25Tatsuo Ishii
t-ishii@sra.co.jp
In reply to: Tom Lane (#4)
Re: 7.1 vs. 7.2 on AIX 5L

Tom,

Can I use the fourth graph (scale=50, fsync=on) to show how 7.2 could
outperform 7.1 on SMP boxes? I'm going to make a presentation at
Net&Com 2002 (http://expo.nikkeibp.co.jp/netcom/web/e/index.html) the
day after tomorrow.
--
Tatsuo Ishii
