update i386 spinlock for hyperthreading
Hi,
Intel recommends to add a special pause instruction into spinlock busy
loops. It's necessary for hyperthreading - without it, the cpu can't
figure out that a logical thread does no useful work and incorrectly
awards lots of execution resources to that thread. Additionally, it's
supposed to reduce the time the cpu needs to recover from the
(mispredicted) branch after the spinlock was obtained.
The attached patch adds a new platform hook and implements it for i386.
The new instruction is backward compatible, thus no cpu detection is
necessary.
Additionally I've increased the number of loops from 100 to 1000 - a 3
GHz Pentium 4 might execute 100 loops faster than a single bus
transaction. I don't know if this change is appropriate for all
platforms, or if SPINS_PER_DELAY should be made platform specific.
Mark did a test run with his dbt-2 benchmark on a 4-way Xeon with HT
enabled, and the patch resulted in a 10% performance increase:
Before:
http://developer.osdl.org/markw/dbt2-pgsql/284/
After:
http://developer.osdl.org/markw/dbt2-pgsql/300/
--
Manfred
Attachments:
patch-spinlock-i386 (text/plain), +31 -2
Manfred Spraul <manfred@colorfullife.com> writes:
Intel recommends to add a special pause instruction into spinlock busy
loops. It's necessary for hyperthreading - without it, the cpu can't
figure out that a logical thread does no useful work and incorrectly
awards lots of execution resources to that thread. Additionally, it's
supposed to reduce the time the cpu needs to recover from the
(mispredicted) branch after the spinlock was obtained.
Don't you have to put it in a specific place in the loop to make that
work? If not, why not? I doubt that rep;nop is magic enough to
recognize the loop that will be generated from s_lock()'s code.
My guess is that it'd be more useful to insert the rep;nop into the
failure branch of the TAS macro and forget about the separate CPU_DELAY
construct. This would allow you to control where exactly rep;nop
appears relative to the xchgb.
Additionally I've increased the number of loops from 100 to 1000
I think this change is almost certainly counterproductive; for any
platform other than the Xeon, remove "almost".
+ #ifndef HAS_CPU_DELAY
+ #define CPU_DELAY() cpu_delay()
+
+ static __inline__ void
+ cpu_delay(void)
+ {
+ }
+ #endif
This breaks every non-gcc compiler in the world (or at least all those
that don't recognize __inline__). If you really want to keep CPU_DELAY,
consider
#ifndef CPU_DELAY
#define CPU_DELAY()
#endif
but as stated above, I'm dubious that the bottom of the s_lock loop
is the place to be adding anything anyway.
regards, tom lane
Tom Lane wrote:
Manfred Spraul <manfred@colorfullife.com> writes:
Intel recommends to add a special pause instruction into spinlock busy
loops. It's necessary for hyperthreading - without it, the cpu can't
figure out that a logical thread does no useful work and incorrectly
awards lots of execution resources to that thread. Additionally, it's
supposed to reduce the time the cpu needs to recover from the
(mispredicted) branch after the spinlock was obtained.
Don't you have to put it in a specific place in the loop to make that
work? If not, why not? I doubt that rep;nop is magic enough to
recognize the loop that will be generated from s_lock()'s code.
Rep;nop is just a short delay - that's all. It means that the cpu
pipelines have a chance to drain, and that the other thread gets enough
cpu resources. Below is the full instruction documentation, from the
latest ia32 doc set from Intel:
<<<
Improves the performance of spin-wait loops. When executing a spin-wait
loop, a Pentium 4 or Intel Xeon processor suffers a severe performance
penalty when exiting the loop because it detects a possible memory order
violation. The PAUSE instruction provides a hint to the processor that
the code sequence is a spin-wait loop. The processor uses this hint to
avoid the memory order violation in most situations, which greatly
improves processor performance. For this reason, it is recommended that
a PAUSE instruction be placed in all spin-wait loops. An additional
function of the PAUSE instruction is to reduce the power consumed by a
Pentium 4 processor while executing a spin loop. The Pentium 4 processor
can execute a spin-wait loop extremely quickly, causing the processor to
consume a lot of power while it waits for the resource it is spinning on
to become available. Inserting a pause instruction in a spin-wait loop
greatly reduces the processor's power consumption. This instruction was
introduced in the Pentium 4 processors, but is backward compatible with
all IA-32 processors. In earlier IA-32 processors, the PAUSE instruction
operates like a NOP instruction. The Pentium 4 and Intel Xeon processors
implement the PAUSE instruction as a pre-defined delay. The delay is
finite and can be zero for some processors. This instruction does not
change the architectural state of the processor (that is, it performs
essentially a delaying noop operation).
>>>
I think a separate function is better than adding it into TAS: if it's
part of tas, then it would automatically be included by every
SpinLockAcquire call - unnecessary .text bloat. Additionally, there
might be other busy loops, in addition to TAS, that could use a delay
function.
I'll post a new patch that doesn't rely on __inline__ in the
i386-independent part.
--
Manfred
Manfred Spraul <manfred@colorfullife.com> writes:
Tom Lane wrote:
Don't you have to put it in a specific place in the loop to make that
work? If not, why not?
Rep;nop is just a short delay - that's all.
That view seems to me to be directly contradicted by this statement:
The PAUSE instruction provides a hint to the processor that
the code sequence is a spin-wait loop. The processor uses this hint to
avoid the memory order violation in most situations, which greatly
improves processor performance.
It's not apparent to me how a short delay translates into avoiding a
memory order violation (possibly some docs on what that means exactly
might help...). I suspect strongly that there needs to be some near
proximity between the PAUSE instruction and the lock-test instruction
for this to work as advertised. It would help if Intel were less coy
about what the instruction really does.
This instruction does not change the architectural state of the
processor (that is, it performs essentially a delaying noop
operation).
This can be rephrased as "we're not telling you what this instruction
really does, because its interesting effects are below the level of the
instruction set architecture". Great. How are we supposed to know
how to use it?
I think a separate function is better than adding it into TAS: if it's
part of tas, then it would automatically be included by every
SpinLockAcquire call - unnecessary .text bloat.
Why do you think it's unnecessary? One thing that I find particularly
vague in the quoted documentation is the statement that the PAUSE
instruction is needed to avoid a delay when *exiting* the spin-wait
loop. Doesn't this mean that a PAUSE is needed in the success path
when the first TAS succeeds (i.e., the normal no-contention path)?
If not, why not? If so, does it go before or after the lock
instruction?
Also, if the principal effect is a "short delay", do we really need it
at all considering that our inner loop in s_lock is rather more than
an "xchgb" followed by a conditional branch? There will be time for
the write queue to drain while we're incrementing and testing our
spin counter (which I trust is in a register...).
The reason I'm so full of questions is that I spent some time several
days ago looking at exactly this issue, and came away with only the
conclusion that I had to find some more-detailed documentation before
I could figure out what we should do about the spinlocks for Xeons.
You have not convinced me that you know more about the issue than I do.
A "10% speedup" is nice, but how do we know that that's what we should
expect to get? Maybe there's a lot more to be won by doing it correctly
(for some value of "correctly").
regards, tom lane
Tom Lane wrote:
Manfred Spraul <manfred@colorfullife.com> writes:
Tom Lane wrote:
Don't you have to put it in a specific place in the loop to make that
work? If not, why not?
Rep;nop is just a short delay - that's all.
That view seems to me to be directly contradicted by this statement:
The PAUSE instruction provides a hint to the processor that
the code sequence is a spin-wait loop. The processor uses this hint to
avoid the memory order violation in most situations, which greatly
improves processor performance.
It's not apparent to me how a short delay translates into avoiding a
memory order violation (possibly some docs on what that means exactly
might help...). I suspect strongly that there needs to be some near
proximity between the PAUSE instruction and the lock-test instruction
for this to work as advertised. It would help if Intel were less coy
about what the instruction really does.
My guess: a Pentium 4 cpu supports something like 250 uops in flight - it
will have a dozen of the spinlock loops in its pipeline. When the
spinlock is released, it must figure out which of the loops should get
it, and gets lost. My guess is that rep;nop delays the cpu by at least
100 cpu ticks, and thus the pipeline will be empty before it proceeds. I
don't have a Pentium 4, and the HP testdrive is down. Is someone around
who could run my test app?
This instruction does not change the architectural state of the
processor (that is, it performs essentially a delaying noop
operation).
This can be rephrased as "we're not telling you what this instruction
really does, because its interesting effects are below the level of the
instruction set architecture". Great. How are we supposed to know
how to use it?
There was a w_spinlock.pdf document with reference code. google still
finds it, but the links are dead :-(
I think a separate function is better than adding it into TAS: if it's
part of tas, then it would automatically be included by every
SpinLockAcquire call - unnecessary .text bloat.
Why do you think it's unnecessary? One thing that I find particularly
vague in the quoted documentation is the statement that the PAUSE
instruction is needed to avoid a delay when *exiting* the spin-wait
loop. Doesn't this mean that a PAUSE is needed in the success path
when the first TAS succeeds (i.e, the normal no-contention path)?
IIRC: No.
If not, why not? If so, does it go before or after the lock
instruction?
Neither: somewhere in the failure path.
Also, if the principal effect is a "short delay", do we really need it
at all considering that our inner loop in s_lock is rather more than
an "xchgb" followed by a conditional branch? There will be time for
the write queue to drain while we're incrementing and testing our
spin counter (which I trust is in a register...).
The reason I'm so full of questions is that I spent some time several
days ago looking at exactly this issue, and came away with only the
conclusion that I had to find some more-detailed documentation before
I could figure out what we should do about the spinlocks for Xeons.
I'll try to find some more docs and post links.
The 2nd thing I would change is to add a nonatomic test in the slow
path: locked instructions generate lots of bus traffic, and that's a
waste of resources.
Another question: regardless of the placement of rep;nop - 10% speedup
means that postgres spends far too much time in the spinlock code.
I've looked at the oprofile dumps, and something like 1.2% of the total
cpu time is spent in the TAS macro in LWLockAcquire. That's the hottest
instruction in the whole profile; it eats more cpu cycles than the
memcpy() calls that transfer data to/from the kernel.
Is there an easy way to find out which LWLock is contended?
--
Manfred
I have a bi-Xeon 2.6G hyperthreaded if it helps... feel free
Regards
On Sat, 27 Dec 2003, Manfred Spraul wrote:
Date: Sat, 27 Dec 2003 11:34:16 +0100
From: Manfred Spraul <manfred@colorfullife.com>
To: Tom Lane <tgl@sss.pgh.pa.us>
Cc: PostgreSQL-patches <pgsql-patches@postgresql.org>
Subject: Re: [PATCHES] update i386 spinlock for hyperthreading
--
Olivier PRENANT Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou +33-5-61-50-97-01 (Fax)
31190 AUTERIVE +33-6-07-63-80-64 (GSM)
FRANCE Email: ohp@pyrenet.fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)
Hi Manfred,
I'm using UnixWare 7 but couldn't compile your source with the native
cc; I had to compile it with gcc.
here are the results:
Script started on Sat Dec 27 17:50:49 2003
/tmp 17:50:50: ./a.out
MOVETEST called by non-superuser, running with normal priority.
zerotest: -1073736628 ticks;
zerotest: -4372 ticks;
zerotest: -52 ticks;
rep nop: 508 ticks;
rep nop: 88 ticks;
rep nop: 60 ticks;
nop: 256 ticks;
nop: 224 ticks;
nop: 92 ticks;
nop: 64 ticks;
nop: 12 ticks;
nop: 8 ticks;
nop: 0 ticks;
nop: -4 ticks;
rep nop: 56 ticks;
rep nop: 44 ticks;
rep nop: 40 ticks;
zerotest: -4 ticks;
rep nop: 24 ticks;
zerotest: -4 ticks;
nop: 0 ticks;
zerotest: -4 ticks;
nop: -8 ticks;
zerotest: -8 ticks;
rep nop: 32 ticks;
nop: -16 ticks;
zerotest: -4 ticks;
zerotest: -4 ticks;
rep nop: 24 ticks;
zerotest: -4 ticks;
nop: -12 ticks;
^?
/tmp 17:53:34: exit
script done on Sat Dec 27 17:53:38 2003
Regards
On Sat, 27 Dec 2003, Manfred Spraul wrote:
Date: Sat, 27 Dec 2003 12:22:45 +0100
From: Manfred Spraul <manfred@colorfullife.com>
To: ohp@pyrenet.fr
Subject: Re: [PATCHES] update i386 spinlock for hyperthreading
Hi Olivier,
I have attached rep_nop.cpp. It's a simple C program (unfortunately
filled with Linux-specific gccisms) that benchmarks the time for a
single rep;nop instruction, compared to a normal nop and nothing at all.
Which OS do you use? I can try to build an app for FreeBSD, and it
should be possible to compile it under Windows with cygwin as well.
Under Linux:
$ gcc -O2 -o rep_nop rep_nop.cpp
$ ./rep_nop
Just run it for a few seconds and send back the output. Please add the
mailing list back into the cc list.
--
Manfred
ohp@pyrenet.fr wrote:
Hi Manfred,
I'm using UnixWare 7 but couldn't compile your source with native cc; I
had to compile it with gcc.
Here are the results:
Thanks. The test app compares the time needed for three different short
loops: a loop with six empty function calls, a loop with six function
calls and one nop in the middle, and a loop with a "rep;nop;" in the middle.
Result:
- nop needs 0 cycles - executed in parallel.
- rep;nop between 24 and 60 cycles - long enough that the pipeline is
emptied.
I've searched around for further info regarding the recommended spinlock
algorithm:
- The optimization manual (google for "Intel 248966") contains a section
about pause instructions: The memory ordering violation is from the
multiple simultaneous reads that are executed due to pipelining the busy
loop.
- It references the Application Note AP-949 "Using Spin-Loops on Intel
Pentium 4 Processor and Intel Xeon Processor" for further details.
Unfortunately the app notes are stored on cedar.intel.com, and that
server appears to be down :-(
--
Manfred
How do you connect to postgres if it is running on a different machine. I
am looking for a telnet command in place of postmaster -D data
On Sat 27 Dec 2003 16:08, Nailah Ogeer wrote:
How do you connect to postgres if it is running on a different machine. I
am looking for a telnet command in place of postmaster -D data
Depends on what you want to do.
One way around would be:
1) Edit the pg_hba.conf where the PG server is running so that you can
actually connect to the DB.
2) Run this:
$ psql -d database -h host -U user
where database, host and user are self explained. :-)
If what you want is something different please explain yourself.
--
select 'mmarques' || '@' || 'unl.edu.ar' AS email;
-----------------------------------------------------------------
Martín Marqués | mmarques@unl.edu.ar
Programador, Administrador, DBA | Centro de Telemática
Universidad Nacional
del Litoral
-----------------------------------------------------------------
Manfred Spraul <manfred@colorfullife.com> writes:
My guess: a Pentium 4 cpu supports something like 250 uops in flight - it
will have a dozen of the spinlock loops in its pipeline. When the
spinlock is released, it must figure out which of the loops should get
it, and gets lost. My guess is that rep;nop delays the cpu by at least
100 cpu ticks, and thus the pipeline will be empty before it proceeds.
After digging some more in Intel's documentation, it seems that indeed
PAUSE is defined to delay just long enough to empty the pipeline. So it
doesn't really matter where you put it in the wait loop, and there is no
point in inserting it in the success path; that answers my concerns from
before.
There was a w_spinlock.pdf document with reference code. google still
finds it, but the links are dead :-(
I was able to find it as a link from another application note at Intel's
documentation site. Try going to
http://appzone.intel.com/literature/index.asp and searching for AP-949.
Anyway, I've committed your patch with some changes.
The 2nd thing I would change is to add a nonatomic test in the slow
path: locked instructions generate lots of bus traffic, and that's a
waste of resources.
Agreed, but I did not like the way you did it; this concern does not
necessarily apply to all processors, and since we are not using
S_LOCK_FREE at all, it's dubious that it's correctly implemented
everywhere. I modified the IA32 TAS() macro instead.
BTW, I noticed a lot of concern in the Intel app notes about reserving
64 or even 128 bytes for each spinlock to avoid cache line conflicts.
That seems excessive to me (we use a lot of spinlocks for buffers), but
perhaps it is worth looking into.
Is there an easy way to find out which LWLock is contended?
Not from oprofile output, as far as I can think. I've suspected for
some time that the BufMgrLock is a major bottleneck, but have no proof.
regards, tom lane
Tom Lane wrote:
Anyway, I've committed your patch with some changes.
Thanks.
BTW, I noticed a lot of concern in the Intel app notes about reserving
64 or even 128 bytes for each spinlock to avoid cache line conflicts.
That seems excessive to me (we use a lot of spinlocks for buffers), but
perhaps it is worth looking into.
This recommendation is usually ignored in the Linux kernel. A few very hot
spinlocks have an exclusive cacheline, but most don't.
Is there an easy way to find out which LWLock is contended?
Not from oprofile output, as far as I can think. I've suspected for
some time that the BufMgrLock is a major bottleneck, but have no proof.
I'll try to write a patch that dumps the LWLock usage and ask Mark to
run it.
--
Manfred
In postgresql.conf the first entry should be something like 'tcpip'.
Set it to 'true'. That's the equivalent of postmaster -i.
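For the 7.x series the setting in question is the tcpip_socket boolean; a sketch of the relevant postgresql.conf line (check your version's default file):

```
# enable TCP/IP connections (otherwise only local Unix sockets work)
tcpip_socket = true
```

You still need a matching pg_hba.conf entry for the remote host, as noted above.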
HTH,
-Tony
ogeer@cs.queensu.ca (Nailah Ogeer) wrote in message news:<Pine.SOL.4.33.0312271407030.3848-100000@innovate>...
How do you connect to postgres if it is running on a different machine. I
am looking for a telnet command in place of postmaster -D data
Tom Lane wrote:
Is there an easy way find out which LWLock is contended?
Not from oprofile output, as far as I can think. I've suspected for
some time that the BufMgrLock is a major bottleneck, but have no proof.
Mark ran a DBT-2 test run with the attached statistics patch applied: it
collects stats about all lightweight locks and dumps them during
shutdown. The hottest locks are:
Lock                Acquires  %contention  sleep calls
8(WALInsertLock)     8679205     0.030410       263934
1(LockMgrLock)      64089418     0.079783      5113215
5(SInvalLock)       68396470     0.001298        88812
0(BufMgrLock)      246307425     0.120293     29629089
The lock numbers are from 7.4, i.e. without the patch that removes
ShmemIndexLock. I've checked that 8 is really WALInsertLock in the
assembly output.
The scary part from the system perspective is the 35 million context
switches that were generated by the BufMgrLock and the LockMgrLock. I
remember there were patches that tried other algorithms instead of the
simple LRU for the buffer manager. Has anyone tried to change the
locking of the buffer manager?
The effect of padding the lightweight locks to a full cacheline appears
to be negligible: with the padding, there were around 4 million
performance monitor hits on the 'lock xchg' instructions. Without it
(test run 300), there were 4.2 million hits.
The complete data is at
http://developer.osdl.org/markw/dbt2-pgsql/303/
The db log with the lock stats is at
http://developer.osdl.org/markw/dbt2-pgsql/303/db/log
(Warning: 6.9 MB)
--
Manfred
Attachments:
patch-lwl-stat (text/plain), +34 -1
Manfred Spraul <manfred@colorfullife.com> writes:
Mark ran a DBT-2 testrun with the attached statistics patch applied: It
collects stats about all lightweight locks and dumps them during
shutdown. The hottest locks are
Lock                Acquires  %contention  sleep calls
8(WALInsertLock)     8679205     0.030410       263934
5(SInvalLock)       68396470     0.001298        88812
1(LockMgrLock)      64089418     0.079783      5113215
0(BufMgrLock)      246307425     0.120293     29629089
Okay, that more or less squares with my gut feelings about the system.
Good to have some proof.
I remember there were patches that tried other algorithms instead of the
simple LRU for the buffer manager. Has anyone tried to change the
locking of the buffer manager?
LRU etc have nothing to do with this. The majority of trips into the
buffer manager are for ReadBuffer/ReleaseBuffer. What we need is to
figure a way for those operations to use finer-grain locks so they don't
contend so much. However, given that they need to access and possibly
modify global structures (the buffer lookup hashtable and free list),
it's not obvious how to do it. If we could think of a way to divide
those structures into separately lockable portions, we might get somewhere.
The effect of padding the lightweight locks to a full cacheline appears
to be negligible:
Good, that squares with what the Linux kernel people seem to think.
regards, tom lane
1. In keeping with the recent discussion that there should be more
said about views, stored procedures, and triggers in the tutorial, I
have added a bit of verbiage to that end.
2. Some formatting changes to the datetime discussion, as well as
addition of a citation of a relevant book on calendars.
Index: advanced.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql-server/doc/src/sgml/advanced.sgml,v
retrieving revision 1.38
diff -c -u -r1.38 advanced.sgml
--- advanced.sgml 29 Nov 2003 19:51:36 -0000 1.38
+++ advanced.sgml 30 Dec 2003 01:58:24 -0000
@@ -65,10 +65,24 @@
<para>
Views can be used in almost any place a real table can be used.
- Building views upon other views is not uncommon.
+ Building views upon other views is not uncommon. You may cut down
+ on the difficulty of building complex queries by constructing them
+ in smaller, easier-to-verify pieces, using views. Views may be
+ used to reveal specific table columns to users that legitimately
+ need access to some of the data, but who shouldn't be able to look
+ at the whole table.
</para>
- </sect1>
+ <para>
+ Views differ from <quote> real tables </quote> in that they are
+ not, by default, updatable. If they join together several tables,
+ it may be troublesome to update certain columns since the
+ <emphasis>real</emphasis> update that must take place requires
+ identifying the relevant rows in the source tables. This is
+ discussed further in <xref linkend="rules-views-update">.
+ </para>
+
+ </sect1>
<sect1 id="tutorial-fk">
<title>Foreign Keys</title>
@@ -387,6 +401,169 @@
</para>
</sect1>
+ <sect1 id="tutorial-storedprocs">
+ <title> Stored Procedures </title>
+
+ <indexterm zone="tutorial-storedprocs">
+ <primary>stored procedures</primary>
+ </indexterm>
+
+ <para> Stored procedures are code that runs inside the database
+ system. Numerous languages may be used to implement functions and
+ procedures; most built-in code is implemented in C. The
+ <quote>basic</quote> loadable procedural language for
+ <productname>PostgreSQL</productname> is <xref linkend="plpgsql">.
+ Numerous other languages may also be used, including <xref
+ linkend="plperl">, <xref linkend="pltcl">, and <xref
+ linkend="plpython">.
+ </para>
+
+ <para> There are several ways that stored procedures are really
+ helpful:
+
+ <itemizedlist>
+
+ <listitem><para> To centralize data validation code into the
+ database </para>
+
+ <para> Your system may use client software written in several
+ languages, perhaps with a <quote>web application</quote>
+ implemented in PHP, a <quote>server application</quote> implemented
+ in Java, and a <quote> report writer</quote> implemented in Perl.
+ In the absence of stored procedures, you will likely find that data
+ validation code must be implemented multiple times, in multiple
+ languages, once for each application.</para>
+
+ <para> By implementing data validation in stored procedures,
+ running in the database, it can behave uniformly for all these
+ systems, and you do not need to worry about synchronizing
+ validation procedures across the languages.</para>
+
+ </listitem>
+
+ <listitem><para> Reducing round trips between client and server
+ </para>
+
+ <para>A stored procedure may submit multiple queries, looking up
+ information and adding in links to additional tables. This takes
+ place without requiring that the client submit multiple queries,
+ and without requiring any added network traffic.
+ </para>
+
+ <para> As a matter of course, the queries share a single
+ transaction context, and there may also be savings in the
+ evaluation of query plans, that will be similar between invocations
+ of a given stored procedure. </para></listitem>
+
+ <listitem><para> To simplify queries. </para>
+
+ <para> For instance, if you are commonly checking the TLD on domain
+ names, you might create a stored procedure for this purpose, and so
+ be able to use queries such as <command> select domain, tld(domain)
+ from domains; </command> instead of having to put verbose code
+ using <function>substr()</function> into each query.
+ </para>
+
+ <para> It is particularly convenient to use scripting languages
+ like Perl, Tcl, and Python to <quote>grovel through strings</quote>
+ since they are designed for <quote>text processing.</quote></para>
+
+ <para> The binding to the R statistical language allows
+ implementing complex statistical queries inside the database,
+ instead of having to draw the data out.
+ </para></listitem>
+
+ <listitem><para> Increasing the level of abstraction</para>
+
+ <para> If data is accessed exclusively through stored procedures,
+ then the structures of tables may be changed without there needing
+ to be any visible change in the API used by programmers. In some
+ systems, users are <emphasis>only</emphasis> allowed access to
+ stored procedures to update data, and cannot do direct updates to
+ tables.
+ </para>
+
+ </listitem>
+
+ </itemizedlist>
+ </para>
+
+ <para> These benefits build on one another: careful use of stored
+ procedures can simultaneously improve reliability and performance,
+ whilst simplifying database access code and improving portability
+ across client platforms and languages. For instance, consider that
+ a stored procedure can cheaply query tables in the database to
+ validate the correctness of data provided as input. </para>
+
+ <para> Instead of requiring a whole series of queries to create an
+ object, and to look up parent/subsidiary objects to link it to, a
+ stored procedure can do all of this efficiently in the database
+ server, improving performance, and eliminating whole classes of
+ errors. </para>
+
+ </sect1>
+
+ <sect1 id="tutorial-triggers">
+ <title> Triggers </title>
+
+ <indexterm zone="tutorial-triggers">
+ <primary>triggers</primary>
+ </indexterm>
+
+ <para> Triggers allow a function to be run either before or after
+ data modification (<command>INSERT</command>,
+ <command>UPDATE</command>, <command>DELETE</command>) operations,
+ which allows you to do some very clever things. </para>
+
+ <itemizedlist>
+
+ <listitem><para> Data Validation </para>
+
+ <para> Instead of explicitly coding validation checks as part of a
+ stored procedure, they may be introduced as <command>BEFORE</command>
+ triggers. The trigger function checks the input values, raising an
+ exception if it finds invalid input.</para>
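+
+ <para> As an illustrative sketch (the table and column names here
+ are invented for the example), a <command>BEFORE</command> trigger
+ rejecting non-positive prices might look like:
+<programlisting>
+CREATE FUNCTION check_price() RETURNS trigger AS '
+BEGIN
+    IF NEW.price IS NULL OR NEW.price &lt;= 0 THEN
+        RAISE EXCEPTION ''price must be a positive number'';
+    END IF;
+    RETURN NEW;
+END;
+' LANGUAGE plpgsql;
+
+CREATE TRIGGER check_price BEFORE INSERT OR UPDATE ON items
+    FOR EACH ROW EXECUTE PROCEDURE check_price();
+</programlisting></para>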
+
+ <para> Note that this is how foreign key checks are implemented in
+ <productname>PostgreSQL</productname>; when you define a foreign
+ key, you will see a message similar to the following:
+<screen>
+NOTICE: CREATE TABLE will create implicit trigger(s) for FOREIGN KEY check(s)
+</screen></para>
+
+ <para> In some cases, it may be appropriate for a trigger function
+ to insert data in order to <emphasis>make</emphasis> the input valid. For
+ instance, if a newly created object needs a status code in a status
+ table, the trigger might insert it automatically.</para>
+ </listitem>
+
+ <listitem><para> Audit logs </para>
+
+ <para> One may use <command>AFTER</command> triggers to monitor updates to
+ vital tables, and <command>INSERT</command> entries into log tables to
+ provide a more permanent record of those updates. </para>
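+
+ <para> A sketch of such an audit trigger (again with invented table
+ and column names) might be:
+<programlisting>
+CREATE FUNCTION log_account_change() RETURNS trigger AS '
+BEGIN
+    -- record who changed the row, and when
+    INSERT INTO accounts_log (account_id, changed_by, changed_at)
+        VALUES (NEW.id, current_user, current_timestamp);
+    RETURN NEW;
+END;
+' LANGUAGE plpgsql;
+
+CREATE TRIGGER log_account_change AFTER UPDATE ON accounts
+    FOR EACH ROW EXECUTE PROCEDURE log_account_change();
+</programlisting></para>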
+ </listitem>
+
+ <listitem><para> Replication </para>
+
+ <para> The <application>RServ</application> replication system uses
+ <command>AFTER</command> triggers to track which rows have changed on the
+ <quote>master</quote> system and therefore need to be copied over to
+ <quote>slave</quote> systems.</para>
+
+ <para> For example, the trigger defined on a replicated table
+ looks like this:
+<programlisting>
+CREATE TRIGGER "_rserv_trigger_t_" AFTER INSERT OR DELETE OR UPDATE ON "my_table"
+    FOR EACH ROW EXECUTE PROCEDURE "_rserv_log_" ('10');
+</programlisting></para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para> Notice that there are strong parallels between what can be
+ accomplished using triggers and stored procedures, particularly
+ with regard to data validation. </para>
+
+ </sect1>
<sect1 id="tutorial-conclusion">
<title>Conclusion</title>
Index: datetime.sgml
===================================================================
RCS file: /projects/cvsroot/pgsql-server/doc/src/sgml/datetime.sgml,v
retrieving revision 2.39
diff -c -u -r2.39 datetime.sgml
--- datetime.sgml 1 Dec 2003 20:34:53 -0000 2.39
+++ datetime.sgml 30 Dec 2003 01:58:25 -0000
@@ -11,8 +11,8 @@
strings, and are broken up into distinct fields with a preliminary
determination of what kind of information may be in the
field. Each field is interpreted and either assigned a numeric
- value, ignored, or rejected.
- The parser contains internal lookup tables for all textual fields,
+ value, ignored, or rejected.</para>
+ <para> The parser contains internal lookup tables for all textual fields,
including months, days of the week, and time
zones.
</para>
@@ -1056,21 +1056,21 @@
years.
</para>
- <para>
- The papal bull of February 1582 decreed that 10 days should be dropped
- from October 1582 so that 15 October should follow immediately after
- 4 October.
- This was observed in Italy, Poland, Portugal, and Spain. Other Catholic
- countries followed shortly after, but Protestant countries were
- reluctant to change, and the Greek orthodox countries didn't change
- until the start of the 20th century.
-
- The reform was observed by Great Britain and Dominions (including what is
- now the USA) in 1752.
- Thus 2 September 1752 was followed by 14 September 1752.
+ <para> The papal bull of February 1582 decreed that 10 days should
+ be dropped from October 1582 so that 15 October should follow
+ immediately after 4 October.</para>
+
+ <para> This was observed in Italy, Poland, Portugal, and Spain.
+ Other Catholic countries followed shortly after, but Protestant
+ countries were reluctant to change, and the Greek orthodox countries
+ didn't change until the start of the 20th century.</para>
+
+ <para> The reform was observed by Great Britain and Dominions
+ (including what is now the USA) in 1752. Thus 2 September 1752 was
+ followed by 14 September 1752.</para>
- This is why Unix systems have the <command>cal</command> program
- produce the following:
+ <para> This is why Unix systems have the <command>cal</command>
+ program produce the following:
<screen>
$ <userinput>cal 9 1752</userinput>
@@ -1094,19 +1094,24 @@
</para>
</note>
- <para>
- Different calendars have been developed in various parts of the
- world, many predating the Gregorian system.
+ <para> Different calendars have been developed in various parts of
+ the world, many predating the Gregorian system.</para>
- For example,
- the beginnings of the Chinese calendar can be traced back to the 14th
- century BC. Legend has it that the Emperor Huangdi invented the
- calendar in 2637 BC.
+ <para> For example, the beginnings of the Chinese calendar can be
+ traced back to the 14th century BC. Legend has it that the Emperor
+ Huangdi invented the calendar in 2637 BC.</para>
- The People's Republic of China uses the Gregorian calendar
- for civil purposes. The Chinese calendar is used for determining
- festivals.
+ <para> The People's Republic of China uses the Gregorian calendar
+ for civil purposes. The Chinese calendar is used for determining
+ festivals.
</para>
+
+ <para> If you are interested in this sort of thing, <citation>
+ Calendrical Calculations: The Millennium Edition </citation> by
+ Edward M. Reingold and Nachum Dershowitz is an excellent reference,
+ describing some 25 calendars, and providing software for displaying
+ them and converting between them.</para>
+
</sect1>
</appendix>
--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/>
Christopher Browne
(416) 646 3304 x124 (land)
Manfred Spraul wrote:
The scary part from the system perspective is the 35 million context
switches that were generated by the BufMgrLock and the LockMgrLock. I
remember there were patches that tried other algorithms instead of the
simple LRU for the buffer manager. Has anyone tried to change the
locking of the buffer manager?
CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
replacement.
--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Manfred Spraul wrote:
I remember there were patches that tried other algorithms instead of the
simple LRU for the buffer manager. Has anyone tried to change the
locking of the buffer manager?
CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
replacement.
That's irrelevant to the problem, though. Unless the ARC code uses data
structures that are more amenable to localized locking than the old
global buffer freelist. (Jan?)
regards, tom lane
Tom Lane wrote:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
Manfred Spraul wrote:
I remember there were patches that tried other algorithms instead of the
simple LRU for the buffer manager. Has anyone tried to change the
locking of the buffer manager?
CVS HEAD already has an Adaptive Replacement Cache (ARC) for buffer
replacement.
That's irrelevant to the problem, though. Unless the ARC code uses data
structures that are more amenable to localized locking than the old
global buffer freelist. (Jan?)
regards, tom lane
Not that I know of. The new strategy uses one shared hash table like the
old, and one buffer pool as well. It grabs the same old Bufmgr lock
during the lookup+replacement decision process, gives it up during
eventual IO, grabs it again when done with the IO. As a matter of fact,
the strategy itself does no locking at all. Like the old LRU code it
simply assumes that the buffer manager holds the lock during calls.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:
That's irrelevant to the problem, though. Unless the ARC code uses data
structures that are more amenable to localized locking than the old
global buffer freelist. (Jan?)
the strategy itself does no locking at all. Like the old LRU code it
simply assumes that the buffer manager holds the lock during calls.
Okay, I suspected as much but wasn't sure.
Manfred's numbers definitely say that we need to find a way to break
down the BufMgrLock into multiple finer-grain locks. We already have
all those per-buffer LWLocks, but I don't see how to apply those to
the problem of managing the global lookup and replacement datastructures.
Anyone see an attack path here?
regards, tom lane