poor performance with Context Switch Storm at TPC-W.

Started by Katsuhiko Okano over 19 years ago, 52 messages
#1Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp

Hi, all.

The problem occurred at a customer site.
Poor performance with a context switch storm occurred
on the following configuration.
Normally, CS is about 5000 and WIPS=360;
during a CS storm, CS is about 100000 and WIPS=60 or less.
(WIPS = number of web interactions per second)

We are investigating it using a patch
that collects LWLock statistics.

I suspected contention on BufMappingLock,
but looking at the collected results,
the occurrence of the CS storm and the increase in BufMappingLock counts
do not seem to correspond.
Instead, SubtransControlLock and SubTrans were increasing.
I do not understand what causes the CS storm.

[DB server]*1
Intel Xeon 3.0GHz*4(2CPU * H/T ON)
4GB Memory
Red Hat Enterprise Linux ES release 4(Nahant Update 3)
Linux version 2.6.9-34.ELsmp
PostgreSQL 8.1.3 (the problem also occurred on version 8.2 (HEAD as of 6/15))
shared_buffers=131072
temp_buffers=1000
max_connections=300

[AP server]*2
200 connection pooling.
TPC-W model workload

[Client]*4
TPC-W model workload

(1)
The following discussion was read:
http://archives.postgresql.org/pgsql-hackers/2006-05/msg01003.php
From: Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us>
To: josh ( at ) agliodbs ( dot ) com
Subject: Re: Further reduction of bufmgr lock contention
Date: Wed, 24 May 2006 15:25:26 -0400

If there is a patch or technique for investigating this,
could someone point me to it?

(2)
It seems that many sequential scans occur during the CS storm.
When reading a tuple, the visibility check is performed, and
it seems that a subtransaction is generated for every transaction.
How likely is it that the LWLocks related to subtransactions
cause the CS storm?

best regards.
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#2Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Katsuhiko Okano (#1)
Re: poor performance with Context Switch Storm at TPC-W.

"Katsuhiko Okano" <okano.katsuhiko@oss.ntt.co.jp> wrote

The problem occurred at a customer site.
Poor performance with a context switch storm occurred
on the following configuration.
Normally, CS is about 5000 and WIPS=360;
during a CS storm, CS is about 100000 and WIPS=60 or less.

Intel Xeon 3.0GHz*4(2CPU * H/T ON)
4GB Memory

Do you have bgwriter on, and what are its parameters? I read a theory somewhere
that bgwriter scans a large portion of memory and causes L1/L2 thrashing, so
with HT on, the other backends sharing the physical processor with it also
get thrashed ... So try turning bgwriter off or turning HT off and see what
difference it makes.

Regards,
Qingqing

#3Alvaro Herrera
alvherre@commandprompt.com
In reply to: Katsuhiko Okano (#1)
Re: poor performance with Context Switch Storm at TPC-W.

Katsuhiko Okano wrote:

I suspected contention on BufMappingLock,
but looking at the collected results,
the occurrence of the CS storm and the increase in BufMappingLock counts
do not seem to correspond.
Instead, SubtransControlLock and SubTrans were increasing.
I do not understand what causes the CS storm.

Please see this thread:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
(actually it's a single message AFAICT)

This was applied on the 8.2dev code, so I'm surprised that 8.2dev
behaves the same as 8.1.

Does your problem have any relationship to what's described there?

I also wondered whether the problem may be that the number of SLRU
buffers we use for subtrans is too low. But the number was increased
from the default 8 to 32 in 8.2dev as well. Maybe you could try
increasing that even further; say 128 and see if the problem is still
there. (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS).
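
For reference, a minimal sketch of the change being suggested (the exact
original value depends on the branch, as noted above; a recompile is needed
for it to take effect):

    /* src/include/access/subtrans.h */
    #define NUM_SUBTRANS_BUFFERS	128		/* try 128 instead of the default */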

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#4Josh Berkus
josh@agliodbs.com
In reply to: Katsuhiko Okano (#1)
Re: poor performance with Context Switch Storm at TPC-W.

Katsuhiko,

Have you tried turning HT off? HT is not generally considered (even by
Intel) a good idea for database applications.

--Josh

#5Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Qingqing Zhou (#2)
Re: poor performance with Context Switch Storm at TPC-W.

hello.

Do you have bgwriter on, and what are its parameters? I read a theory somewhere
that bgwriter scans a large portion of memory and causes L1/L2 thrashing, so
with HT on, the other backends sharing the physical processor with it also
get thrashed ... So try turning bgwriter off or turning HT off and see what
difference it makes.

bgwriter is on.
From postgresql.conf:

# - Background writer -

bgwriter_delay = 200 # 10-10000 milliseconds between rounds
bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round

I tried turning H/T off, but the CS storm still occurred.
Normally, CS is about 5000;
during the CS storm, CS is about 70000.
(CS is lower than with H/T on;
I think that is because the CPU throughput dropped.)

Regards
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#6Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Alvaro Herrera (#3)
Re: poor performance with Context Switch Storm at TPC-W.

Hi.

Alvaro Herrera wrote:

Katsuhiko Okano wrote:

I suspected contention on BufMappingLock,
but looking at the collected results,
the occurrence of the CS storm and the increase in BufMappingLock counts
do not seem to correspond.
Instead, SubtransControlLock and SubTrans were increasing.
I do not understand what causes the CS storm.

Please see this thread:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
(actually it's a single message AFAICT)

This was applied on the 8.2dev code, so I'm surprised that 8.2dev
behaves the same as 8.1.

Does your problem have any relationship to what's described there?

It is probably related.
I cannot tell whether the problem is that the locking method itself is bad,
or simply that the number of LRU buffers is too small.

I also wondered whether the problem may be that the number of SLRU
buffers we use for subtrans is too low. But the number was increased
from the default 8 to 32 in 8.2dev as well. Maybe you could try
increasing that even further; say 128 and see if the problem is still
there. (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS).

In PostgreSQL 8.2, I changed NUM_SUBTRANS_BUFFERS to 128,
recompiled, and measured again.
The CS storm did not occur. The value of WIPS was about 400.
(But WIPS dropped to about 320 at intervals of 4 to 6 minutes.)

If the number of SLRU buffers is too low,
then increasing the number of buffers in PostgreSQL 8.1.4
should, I think, bring the same result.
(Although the buffers for CLOG and multixact also increase,
I think that effect is small.)

NUM_SLRU_BUFFERS has now been changed to 128 in PostgreSQL 8.1.4,
and measurement is under way.

regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#7Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Katsuhiko Okano (#6)
CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)

Katsuhiko Okano wrote:

In PostgreSQL 8.2, I changed NUM_SUBTRANS_BUFFERS to 128,
recompiled, and measured again.
The CS storm did not occur. The value of WIPS was about 400.

I measured again.
The storm did not occur in a 30-minute run,
but in a 3-hour run it occurred after 1 hour and 10 minutes.
Increasing NUM_SUBTRANS_BUFFERS does not solve the problem;
it only postpones it.

If the number of SLRU buffers is too low,
then increasing the number of buffers in PostgreSQL 8.1.4
should, I think, bring the same result.
(Although the buffers for CLOG and multixact also increase,
I think that effect is small.)

NUM_SLRU_BUFFERS has now been changed to 128 in PostgreSQL 8.1.4,
and measurement is under way.

On version 8.1.4, the CS storm likewise occurred after
1 hour and 10 minutes.

A strange point:
the number of LWLock acquisitions on the LRU buffers is zero
until the CS storm occurs.
After the CS storm starts, both share and exclusive locks are requested and
most of them are kept waiting.
(Exclusive locks on SubtransControlLock increase rapidly after the CS storm starts.)

Is different processing done depending on whether the CS storm has occurred or not?

regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#8Tom Lane
tgl@sss.pgh.pa.us
In reply to: Katsuhiko Okano (#7)
Re: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)

Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:

Increasing NUM_SUBTRANS_BUFFERS does not solve the problem;
it only postpones it.

Can you provide a reproducible test case for this?

regards, tom lane

#9Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Tom Lane (#8)
Re: CSStorm occurred again by postgreSQL8.2. (Re: poor performance with Context Switch Storm at TPC-W.)

"Tom Lane <tgl@sss.pgh.pa.us>" wrote:

Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:

Increasing NUM_SUBTRANS_BUFFERS does not solve the problem;
it only postpones it.

Can you provide a reproducible test case for this?

Seven machines are required to perform the measurement.
(DB*1, AP*2, Client*4)
Enough workload could not be generated with two machines.
(DB*1, {AP+CL}*1)

I could not reproduce it with a multiplexed run of pgbench
or with a simple SELECT query.
The TPC-W workload tool used this time was written from scratch;
regrettably, it cannot be made public.
If there is a workload tool with a free license, I would like to try it.

I will provide more information if anything else is required.

A patch that counts LWLock acquisitions was used this time.
The following is an old example of its output, FYI.

# SELECT * FROM pg_stat_lwlocks;
kind | pg_stat_get_lwlock_name | sh_call | sh_wait | ex_call | ex_wait | sleep
------+----------------------------+------------+-----------+-----------+-----------+-------
0 | BufMappingLock | 559375542 | 33542 | 320092 | 24025 | 0
1 | BufFreelistLock | 0 | 0 | 370709 | 47 | 0
2 | LockMgrLock | 0 | 0 | 41718885 | 734502 | 0
3 | OidGenLock | 33 | 0 | 0 | 0 | 0
4 | XidGenLock | 12572279 | 10095 | 11299469 | 20089 | 0
5 | ProcArrayLock | 8371330 | 72052 | 16965667 | 603294 | 0
6 | SInvalLock | 38822428 | 435 | 25917 | 128 | 0
7 | FreeSpaceLock | 0 | 0 | 16787 | 4 | 0
8 | WALInsertLock | 0 | 0 | 1239911 | 885 | 0
9 | WALWriteLock | 0 | 0 | 69907 | 5589 | 0
10 | ControlFileLock | 0 | 0 | 16686 | 1 | 0
11 | CheckpointLock | 0 | 0 | 34 | 0 | 0
12 | CheckpointStartLock | 69509 | 0 | 34 | 1 | 0
13 | CLogControlLock | 0 | 0 | 236763 | 183 | 0
14 | SubtransControlLock | 0 | 0 | 753773945 | 205273395 | 0
15 | MultiXactGenLock | 66 | 0 | 0 | 0 | 0
16 | MultiXactOffsetControlLock | 0 | 0 | 35 | 0 | 0
17 | MultiXactMemberControlLock | 0 | 0 | 34 | 0 | 0
18 | RelCacheInitLock | 0 | 0 | 0 | 0 | 0
19 | BgWriterCommLock | 0 | 0 | 61457 | 1 | 0
20 | TwoPhaseStateLock | 33 | 0 | 0 | 0 | 0
21 | TablespaceCreateLock | 0 | 0 | 0 | 0 | 0
22 | BufferIO | 0 | 0 | 695627 | 16 | 0
23 | BufferContent | 3568231805 | 1897 | 1361394 | 829 | 0
24 | CLog | 0 | 0 | 0 | 0 | 0
25 | SubTrans | 138571621 | 143208883 | 8122181 | 8132646 | 0
26 | MultiXactOffset | 0 | 0 | 0 | 0 | 0
27 | MultiXactMember | 0 | 0 | 0 | 0 | 0
(28 rows)

I would be pleased if this is of interest.

regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#10Stefan Kaltenbrunner
stefan@kaltenbrunner.cc
In reply to: Katsuhiko Okano (#9)
Re: CSStorm occurred again by postgreSQL8.2. (Re: poor

Katsuhiko Okano wrote:

"Tom Lane <tgl@sss.pgh.pa.us>" wrote:

Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:

Increasing NUM_SUBTRANS_BUFFERS does not solve the problem;
it only postpones it.

Can you provide a reproducible test case for this?

Seven machines are required to perform the measurement.
(DB*1, AP*2, Client*4)
Enough workload could not be generated with two machines.
(DB*1, {AP+CL}*1)

I could not reproduce it with a multiplexed run of pgbench
or with a simple SELECT query.
The TPC-W workload tool used this time was written from scratch;
regrettably, it cannot be made public.
If there is a workload tool with a free license, I would like to try it.

FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/

Stefan

#11Masanori ITOH
ito.masanori@oss.ntt.co.jp
In reply to: Stefan Kaltenbrunner (#10)
Re: CSStorm occurred again by postgreSQL8.2. (Re: poor

Hi folks,

From: Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: [HACKERS] poor
Date: Wed, 19 Jul 2006 12:53:53 +0200

Katsuhiko Okano wrote:

"Tom Lane <tgl@sss.pgh.pa.us>" wrote:

Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:

Increasing NUM_SUBTRANS_BUFFERS does not solve the problem;
it only postpones it.

Can you provide a reproducible test case for this?

Seven machines are required to perform the measurement.
(DB*1, AP*2, Client*4)
Enough workload could not be generated with two machines.
(DB*1, {AP+CL}*1)

I could not reproduce it with a multiplexed run of pgbench
or with a simple SELECT query.
The TPC-W workload tool used this time was written from scratch;
regrettably, it cannot be made public.
If there is a workload tool with a free license, I would like to try it.

FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/

FYI(2):

There is one more (pseudo) TPC-W implementation by OSDL.

http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/

One more comment: Katsuhiko's team is using their own TPC-W-like
benchmark suite, and he cannot make it public.

Also, his point is that he tried to reproduce the CSS phenomenon using
pgbench and a program issuing many concurrent SELECT queries
on a single table, but neither of them reproduced the CSS.

Regards,
Masanori

Stefan

---
Masanori ITOH NTT OSS Center, Nippon Telegraph and Telephone Corporation
e-mail: ito.masanori@oss.ntt.co.jp
phone : +81-3-5860-5015

#12Jim C. Nasby
jnasby@pervasive.com
In reply to: Katsuhiko Okano (#6)
Re: poor performance with Context Switch Storm at TPC-W.

On Fri, Jul 14, 2006 at 02:58:36PM +0900, Katsuhiko Okano wrote:

The CS storm did not occur. The value of WIPS was about 400.
(But WIPS dropped to about 320 at intervals of 4 to 6 minutes.)

If you haven't changed checkpoint timeout, this drop-off every 4-6
minutes indicates that you need to make the bgwriter more aggressive.
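
For illustration, "more aggressive" with the 8.1 bgwriter settings quoted
earlier in this thread might mean something like the following (example
values only, not a tested recommendation):

    bgwriter_delay = 200            # unchanged
    bgwriter_lru_percent = 5.0      # scan more of the LRU list per round
    bgwriter_lru_maxpages = 100     # allow more LRU writes per round
    bgwriter_all_percent = 5.0      # scan more of the whole pool per round
    bgwriter_all_maxpages = 200     # allow more writes per round

This trades somewhat more steady background write traffic for smaller bursts
at checkpoint time.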
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461

#13Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Masanori ITOH (#11)
Re: CSStorm occurred again by postgreSQL8.2. (Re: poor

If there is a workload tool with a free license, I would like to try it.

FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/

FYI(2):

There is one more (pseudo) TPC-W implementation by OSDL.

http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/

Thank you for the information.
I'll try it.

Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#14Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Jim C. Nasby (#12)
Re: poor performance with Context Switch Storm at TPC-W.

"Jim C. Nasby" wrote:

If you haven't changed checkpoint timeout, this drop-off every 4-6
minutes indicates that you need to make the bgwriter more aggressive.

I will mention this to the customer when proposing and explaining the configuration.
Thank you for the information.

Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#15ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Katsuhiko Okano (#9)
LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)

Hi hackers,

I tackled the performance problem in the SUBTRANS module together with Okano.
He and I reached the conclusion that the SubTrans log is read heavily under a
specific access pattern in my TPC-W implementation. There seems to be awful
traffic on SUBTRANS when checking the visibility of tuples in
HeapTupleSatisfiesSnapshot(). I'll report more details later.

BTW, I wrote a patch that collects statistics on lightweight locks for analysis.
We already have the Trace_lwlocks option, but this patch can collect statistics
with less impact. The following is an output of the patch (on 8.1).
If you are interested in the feature, I'll port it to HEAD and post it.

# SELECT * FROM pg_stat_lwlocks;
kind | pg_stat_get_lwlock_name | sh_call | sh_wait | ex_call | ex_wait |
------+----------------------------+------------+-----------+-----------+-----------+-
0 | BufMappingLock | 559375542 | 33542 | 320092 | 24025 |
1 | BufFreelistLock | 0 | 0 | 370709 | 47 |
2 | LockMgrLock | 0 | 0 | 41718885 | 734502 |
3 | OidGenLock | 33 | 0 | 0 | 0 |
4 | XidGenLock | 12572279 | 10095 | 11299469 | 20089 |
5 | ProcArrayLock | 8371330 | 72052 | 16965667 | 603294 |
6 | SInvalLock | 38822428 | 435 | 25917 | 128 |
7 | FreeSpaceLock | 0 | 0 | 16787 | 4 |
8 | WALInsertLock | 0 | 0 | 1239911 | 885 |
9 | WALWriteLock | 0 | 0 | 69907 | 5589 |
10 | ControlFileLock | 0 | 0 | 16686 | 1 |
11 | CheckpointLock | 0 | 0 | 34 | 0 |
12 | CheckpointStartLock | 69509 | 0 | 34 | 1 |
13 | CLogControlLock | 0 | 0 | 236763 | 183 |
14 | SubtransControlLock | 0 | 0 | 753773945 | 205273395 |
15 | MultiXactGenLock | 66 | 0 | 0 | 0 |
16 | MultiXactOffsetControlLock | 0 | 0 | 35 | 0 |
17 | MultiXactMemberControlLock | 0 | 0 | 34 | 0 |
18 | RelCacheInitLock | 0 | 0 | 0 | 0 |
19 | BgWriterCommLock | 0 | 0 | 61457 | 1 |
20 | TwoPhaseStateLock | 33 | 0 | 0 | 0 |
21 | TablespaceCreateLock | 0 | 0 | 0 | 0 |
22 | BufferIO | 0 | 0 | 695627 | 16 |
23 | BufferContent | 3568231805 | 1897 | 1361394 | 829 |
24 | CLog | 0 | 0 | 0 | 0 |
25 | SubTrans | 138571621 | 143208883 | 8122181 | 8132646 |
26 | MultiXactOffset | 0 | 0 | 0 | 0 |
27 | MultiXactMember | 0 | 0 | 0 | 0 |
(28 rows)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#16ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: ITAGAKI Takahiro (#15)
1 attachment(s)
LWLock statistics collector

Hi,

Here is a patch to collect statistics on LWLocks.
The following information is available in the pg_stat_lwlocks view:
- sh_call : number of acquisitions in share mode
- sh_wait : number of blocks (waits) in share mode
- ex_call : number of acquisitions in exclusive mode
- ex_wait : number of blocks (waits) in exclusive mode

This feature is for developers, so it is available only if LWLOCK_STAT is
defined (default off). Otherwise, stub functions are installed.

There is room for discussion:
- Should we skip collecting information for shared buffers?
If we skip it, we can reduce the memory consumed by the stats.
- Information for LWLockConditionalAcquire()
It is not gathered now because of the size limitation of the LWLockStat
struct. I intended the total size of LWLock to be less than 64 bytes.
- Information for spinlocks
The number of sleeps in s_lock() might be useful for analyzing
spinlock-level lock contention.
- Documentation
Where should such a developers-only feature be documented?
...

Comments welcome.

# SELECT * FROM pg_stat_lwlocks;
kind | lwlock | sh_call | sh_wait | ex_call | ex_wait
------+----------------------------+---------+---------+---------+---------
0 | BufFreelistLock | 0 | 0 | 1237 | 0
1 | ShmemIndexLock | 0 | 0 | 437 | 0
2 | OidGenLock | 0 | 0 | 0 | 0
3 | XidGenLock | 80086 | 9 | 8174 | 0
4 | ProcArrayLock | 162491 | 59 | 16231 | 38
5 | SInvalLock | 163423 | 0 | 180 | 0
6 | FreeSpaceLock | 0 | 0 | 399 | 0
7 | WALInsertLock | 0 | 0 | 67214 | 247
8 | WALWriteLock | 0 | 0 | 8028 | 12
9 | ControlFileLock | 0 | 0 | 16 | 0
10 | CheckpointLock | 0 | 0 | 0 | 0
11 | CheckpointStartLock | 8028 | 0 | 0 | 0
12 | CLogControlLock | 22449 | 9 | 8028 | 4
13 | SubtransControlLock | 32731 | 0 | 4 | 0
14 | MultiXactGenLock | 0 | 0 | 0 | 0
15 | MultiXactOffsetControlLock | 0 | 0 | 0 | 0
16 | MultiXactMemberControlLock | 0 | 0 | 0 | 0
17 | RelCacheInitLock | 0 | 0 | 0 | 0
18 | BgWriterCommLock | 0 | 0 | 1185 | 0
19 | TwoPhaseStateLock | 0 | 0 | 0 | 0
20 | TablespaceCreateLock | 0 | 0 | 0 | 0
21 | BtreeVacuumLock | 0 | 0 | 12 | 0
22 | BufMappingLock | 322414 | 1 | 315 | 0
23 | LockMgrLock | 0 | 0 | 207585 | 6927
24 | BufferIO | 0 | 0 | 2295 | 0
25 | BufferContent | 522714 | 362 | 83375 | 324
26 | CLogBuffer | 0 | 0 | 0 | 0
27 | SubTransBuffer | 0 | 0 | 0 | 0
28 | MultiXactOffsetBuffer | 0 | 0 | 0 | 0
29 | MultiXactMemberBuffer | 0 | 0 | 0 | 0
(30 rows)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

pg_stat_lwlocks.patch (application/octet-stream)
diff -cpr pgsql-orig/src/backend/access/transam/slru.c pgsql/src/backend/access/transam/slru.c
*** pgsql-orig/src/backend/access/transam/slru.c	Mon Jul 31 09:22:04 2006
--- pgsql/src/backend/access/transam/slru.c	Mon Jul 31 09:37:25 2006
*************** SimpleLruInit(SlruCtl ctl, const char *n
*** 185,190 ****
--- 185,191 ----
  		char	   *ptr;
  		Size		offset;
  		int			slotno;
+ 		LWLockKind	kind;
  
  		Assert(!found);
  
*************** SimpleLruInit(SlruCtl ctl, const char *n
*** 214,226 ****
  		offset += MAXALIGN(nslots * sizeof(LWLockId));
  		ptr += BUFFERALIGN(offset);
  
  		for (slotno = 0; slotno < nslots; slotno++)
  		{
  			shared->page_buffer[slotno] = ptr;
  			shared->page_status[slotno] = SLRU_PAGE_EMPTY;
  			shared->page_dirty[slotno] = false;
  			shared->page_lru_count[slotno] = 0;
! 			shared->buffer_locks[slotno] = LWLockAssign();
  			ptr += BLCKSZ;
  		}
  	}
--- 215,247 ----
  		offset += MAXALIGN(nslots * sizeof(LWLockId));
  		ptr += BUFFERALIGN(offset);
  
+ 		switch (ctllock)
+ 		{
+ 			case CLogControlLock:
+ 				kind = LW_KIND_CLOG;
+ 				break;
+ 			case SubtransControlLock:
+ 				kind = LW_KIND_SUBTRANS;
+ 				break;
+ 			case MultiXactOffsetControlLock:
+ 				kind = LW_KIND_MULTIXACTOFFSET;
+ 				break;
+ 			case MultiXactMemberControlLock:
+ 				kind = LW_KIND_MULTIXACTMEMBER;
+ 				break;
+ 			default:
+ 				elog(ERROR, "invalid SLRU lockid: %d", (int) ctllock);
+ 				kind = 0;
+ 				break;
+ 		}
+ 
  		for (slotno = 0; slotno < nslots; slotno++)
  		{
  			shared->page_buffer[slotno] = ptr;
  			shared->page_status[slotno] = SLRU_PAGE_EMPTY;
  			shared->page_dirty[slotno] = false;
  			shared->page_lru_count[slotno] = 0;
! 			shared->buffer_locks[slotno] = LWLockAssign(kind);
  			ptr += BLCKSZ;
  		}
  	}
diff -cpr pgsql-orig/src/backend/catalog/system_views.sql pgsql/src/backend/catalog/system_views.sql
*** pgsql-orig/src/backend/catalog/system_views.sql	Mon Jul 31 09:22:04 2006
--- pgsql/src/backend/catalog/system_views.sql	Mon Jul 31 09:37:25 2006
*************** CREATE VIEW pg_stat_database AS 
*** 355,360 ****
--- 355,375 ----
              pg_stat_get_db_blocks_hit(D.oid) AS blks_hit 
      FROM pg_database D;
  
+ CREATE VIEW pg_stat_lwlocks AS
+     SELECT
+         kind,
+         pg_stat_get_lwlock_name(kind) AS lwlock,
+         sum(sh_call) AS sh_call,
+         sum(sh_wait) AS sh_wait,
+         sum(ex_call) AS ex_call,
+         sum(ex_wait) AS ex_wait
+     FROM pg_stat_get_lwlocks() AS L(
+          id int, kind int,
+          sh_call int8, sh_wait int8,
+          ex_call int8, ex_wait int8)
+     GROUP BY kind
+     ORDER BY kind;
+ 
  --
  -- Fix up built-in functions that make use of OUT parameters.
  -- We can't currently fill these values in during bootstrap, because
diff -cpr pgsql-orig/src/backend/storage/buffer/buf_init.c pgsql/src/backend/storage/buffer/buf_init.c
*** pgsql-orig/src/backend/storage/buffer/buf_init.c	Mon Jul 31 09:22:05 2006
--- pgsql/src/backend/storage/buffer/buf_init.c	Mon Jul 31 09:37:25 2006
*************** InitBufferPool(void)
*** 125,132 ****
  			 */
  			buf->freeNext = i + 1;
  
! 			buf->io_in_progress_lock = LWLockAssign();
! 			buf->content_lock = LWLockAssign();
  		}
  
  		/* Correct last entry of linked list */
--- 125,132 ----
  			 */
  			buf->freeNext = i + 1;
  
! 			buf->io_in_progress_lock = LWLockAssign(LW_KIND_BUFFERIO);
! 			buf->content_lock = LWLockAssign(LW_KIND_BUFFERCONTENT);
  		}
  
  		/* Correct last entry of linked list */
diff -cpr pgsql-orig/src/backend/storage/lmgr/lwlock.c pgsql/src/backend/storage/lmgr/lwlock.c
*** pgsql-orig/src/backend/storage/lmgr/lwlock.c	Mon Jul 31 09:22:05 2006
--- pgsql/src/backend/storage/lmgr/lwlock.c	Mon Jul 31 09:37:25 2006
*************** typedef struct LWLock
*** 41,46 ****
--- 41,50 ----
  	int			shared;			/* # of shared holders (0..MaxBackends) */
  	PGPROC	   *head;			/* head of list of waiting PGPROCs */
  	PGPROC	   *tail;			/* tail of list of waiting PGPROCs */
+ #ifdef LWLOCK_STAT
+ 	LWLockStat	stat;
+ #endif
+ 
  	/* tail is undefined when head is NULL */
  } LWLock;
  
*************** typedef struct LWLock
*** 53,62 ****
   * Opterons.  (Of course, we have to also ensure that the array start
   * address is suitably aligned.)
   *
!  * LWLock is between 16 and 32 bytes on all known platforms, so these two
!  * cases are sufficient.
!  */
! #define LWLOCK_PADDED_SIZE	(sizeof(LWLock) <= 16 ? 16 : 32)
  
  typedef union LWLockPadded
  {
--- 57,69 ----
   * Opterons.  (Of course, we have to also ensure that the array start
   * address is suitably aligned.)
   *
!  * LWLock is between 16 and 32 bytes on all known platforms if LWLOCK_STAT
!  * is not defined, and less than 64 bytes if defined. So these three cases
!  * are sufficient.
!  */
! #define LWLOCK_PADDED_SIZE \
! 	(sizeof(LWLock) <= 16 ? 16 : \
! 	(sizeof(LWLock) <= 32 ? 32 : 64))
  
  typedef union LWLockPadded
  {
*************** typedef union LWLockPadded
*** 71,76 ****
--- 78,84 ----
   */
  NON_EXEC_STATIC LWLockPadded *LWLockArray = NULL;
  
+ #define GetLWLockCounter()	((int *) ((char *) LWLockArray - 2 * sizeof(int)))
  
  /*
   * We use this structure to keep track of locked LWLocks for release
*************** static int *ex_acquire_counts;
*** 90,95 ****
--- 98,149 ----
  static int *block_counts;
  #endif
  
+ #ifdef LWLOCK_STAT
+ 
+ static char* LWLockIdNames[] = 
+ {
+ 	/* System fixed locks */
+ 	"BufFreelistLock",
+ 	"ShmemIndexLock",
+ 	"OidGenLock",
+ 	"XidGenLock",
+ 	"ProcArrayLock",
+ 	"SInvalLock",
+ 	"FreeSpaceLock",
+ 	"WALInsertLock",
+ 	"WALWriteLock",
+ 	"ControlFileLock",
+ 	"CheckpointLock",
+ 	"CheckpointStartLock",
+ 	"CLogControlLock",
+ 	"SubtransControlLock",
+ 	"MultiXactGenLock",
+ 	"MultiXactOffsetControlLock",
+ 	"MultiXactMemberControlLock",
+ 	"RelCacheInitLock",
+ 	"BgWriterCommLock",
+ 	"TwoPhaseStateLock",
+ 	"TablespaceCreateLock",
+ 	"BtreeVacuumLock",
+ 	"BufMappingLock",
+ 	"LockMgrLock",
+ 	/* Dynamic locks */
+ 	"BufferIO",
+ 	"BufferContent",
+ 	"CLogBuffer",
+ 	"SubTransBuffer",
+ 	"MultiXactOffsetBuffer",
+ 	"MultiXactMemberBuffer",
+ };
+ 
+ #define StaticAssert(cond)	extern char static_assertion_failure[(cond) ? 1 : -1]
+ 
+ StaticAssert( lengthof(LWLockIdNames) == NUM_LW_KINDS );
+ 
+ #define LWLockKinds	((LWLockKind *) \
+ 	((char *) LWLockArray + sizeof(LWLockPadded) * GetLWLockCounter()[1]))
+ 
+ #endif /* LWLOCK_STAT */
  
  #ifdef LOCK_DEBUG
  bool		Trace_lwlocks = false;
*************** static void
*** 121,127 ****
  print_lwlock_stats(int code, Datum arg)
  {
  	int			i;
! 	int		   *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
  	int			numLocks = LWLockCounter[1];
  
  	/* Grab an LWLock to keep different backends from mixing reports */
--- 175,181 ----
  print_lwlock_stats(int code, Datum arg)
  {
  	int			i;
! 	int		   *LWLockCounter = GetLWLockCounter();
  	int			numLocks = LWLockCounter[1];
  
  	/* Grab an LWLock to keep different backends from mixing reports */
*************** LWLockShmemSize(void)
*** 193,198 ****
--- 247,257 ----
  	/* Space for dynamic allocation counter, plus room for alignment. */
  	size = add_size(size, 2 * sizeof(int) + LWLOCK_PADDED_SIZE);
  
+ #ifdef LWLOCK_STAT
+ 	/* Space for the LWLockKind array. */
+ 	size = add_size(size, numLocks * sizeof(LWLockKind));
+ #endif
+ 
  	return size;
  }
  
*************** CreateLWLocks(void)
*** 232,246 ****
  		lock->lock.shared = 0;
  		lock->lock.head = NULL;
  		lock->lock.tail = NULL;
  	}
  
  	/*
  	 * Initialize the dynamic-allocation counter, which is stored just before
  	 * the first LWLock.
  	 */
! 	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
  	LWLockCounter[0] = (int) NumFixedLWLocks;
  	LWLockCounter[1] = numLocks;
  }
  
  
--- 291,318 ----
  		lock->lock.shared = 0;
  		lock->lock.head = NULL;
  		lock->lock.tail = NULL;
+ #ifdef LWLOCK_STAT
+ 		memset(&lock->lock.stat, 0, sizeof(LWLockStat));
+ #endif
  	}
  
  	/*
  	 * Initialize the dynamic-allocation counter, which is stored just before
  	 * the first LWLock.
  	 */
! 	LWLockCounter = GetLWLockCounter();
  	LWLockCounter[0] = (int) NumFixedLWLocks;
  	LWLockCounter[1] = numLocks;
+ 
+ #ifdef LWLOCK_STAT
+ 	/* Initialize the named fixed locks, BufMappingLocks and LockMgrLocks. */
+ 	for (id = 0; id <= (int) LW_KIND_LASTFIXED; id++)
+ 		LWLockKinds[id] = (LWLockKind) id;
+ 	for (id = 0; id < NUM_BUFFER_PARTITIONS; id++)
+ 		LWLockKinds[FirstBufMappingLock + id] = LW_KIND_BUFFER;
+ 	for (id = 0; id < NUM_LOCK_PARTITIONS; id++)
+ 		LWLockKinds[FirstLockMgrLock + id] = LW_KIND_LOCK;
+ #endif
  }
  
  
*************** CreateLWLocks(void)
*** 253,266 ****
   * LWLocks after startup.
   */
  LWLockId
! LWLockAssign(void)
  {
  	LWLockId	result;
  
  	/* use volatile pointer to prevent code rearrangement */
  	volatile int *LWLockCounter;
  
! 	LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
  	SpinLockAcquire(ShmemLock);
  	if (LWLockCounter[0] >= LWLockCounter[1])
  	{
--- 325,338 ----
   * LWLocks after startup.
   */
  LWLockId
! LWLockAssign(LWLockKind kind)
  {
  	LWLockId	result;
  
  	/* use volatile pointer to prevent code rearrangement */
  	volatile int *LWLockCounter;
  
! 	LWLockCounter = GetLWLockCounter();
  	SpinLockAcquire(ShmemLock);
  	if (LWLockCounter[0] >= LWLockCounter[1])
  	{
*************** LWLockAssign(void)
*** 269,274 ****
--- 341,351 ----
  	}
  	result = (LWLockId) (LWLockCounter[0]++);
  	SpinLockRelease(ShmemLock);
+ 
+ #ifdef LWLOCK_STAT
+ 	LWLockKinds[result] = kind;
+ #endif
+ 
  	return result;
  }
  
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 294,300 ****
  	/* Set up local count state first time through in a given process */
  	if (counts_for_pid != MyProcPid)
  	{
! 		int	   *LWLockCounter = (int *) ((char *) LWLockArray - 2 * sizeof(int));
  		int		numLocks = LWLockCounter[1];
  
  		sh_acquire_counts = calloc(numLocks, sizeof(int));
--- 371,377 ----
  	/* Set up local count state first time through in a given process */
  	if (counts_for_pid != MyProcPid)
  	{
! 		int	   *LWLockCounter = GetLWLockCounter();
  		int		numLocks = LWLockCounter[1];
  
  		sh_acquire_counts = calloc(numLocks, sizeof(int));
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 351,356 ****
--- 428,438 ----
  		/* Acquire mutex.  Time spent holding mutex should be short! */
  		SpinLockAcquire(&lock->mutex);
  
+ #ifdef LWLOCK_STAT
+ 		if (!retry)
+ 			lock->stat.ncall[mode]++;
+ #endif /* LWLOCK_STAT */
+ 
  		/* If retrying, allow LWLockRelease to release waiters again */
  		if (retry)
  			lock->releaseOK = true;
*************** LWLockAcquire(LWLockId lockid, LWLockMod
*** 380,385 ****
--- 462,471 ----
  		if (!mustwait)
  			break;				/* got the lock */
  
+ #ifdef LWLOCK_STAT
+ 		lock->stat.nwait[mode]++;
+ #endif /* LWLOCK_STAT */
+ 
  		/*
  		 * Add myself to wait queue.
  		 *
*************** LWLockHeldByMe(LWLockId lockid)
*** 665,667 ****
--- 751,800 ----
  	}
  	return false;
  }
+ 
+ #ifdef LWLOCK_STAT
+ 
+ const char *
+ LWLockName(LWLockKind lwkind)
+ {
+ 	if (0 <= lwkind && lwkind < NUM_LW_KINDS)
+ 		return LWLockIdNames[lwkind];
+ 	return "Unknown";
+ }
+ 
+ LWLockKind
+ LWLockGetStat(LWLockId lockid, LWLockStat *dst)
+ {
+ 	int	   *LWLockCounter = GetLWLockCounter();
+ 	int		nAvailableLWLocks = LWLockCounter[0];
+ 	volatile LWLock *lock;
+ 		
+ 	if (lockid >= nAvailableLWLocks)
+ 		return LW_KIND_INVALID;
+ 
+ 	lock = &LWLockArray[lockid].lock;
+ 	SpinLockAcquire(&lock->mutex);
+ 	memcpy(dst, (const void*) &lock->stat, sizeof(LWLockStat));
+ 	SpinLockRelease(&lock->mutex);
+ 	return LWLockKinds[lockid];
+ }
+ 
+ void
+ LWLockResetStat(void)
+ {
+ 	int	   *LWLockCounter = GetLWLockCounter();
+ 	int		nAvailableLWLocks = LWLockCounter[0];
+ 	int		i;
+ 
+ 	for (i = 0; i < nAvailableLWLocks; i++)
+ 	{
+ 		volatile LWLock *lock = &LWLockArray[i].lock;
+ 		volatile LWLockStat *lwstat = &lock->stat;
+ 
+ 		SpinLockAcquire(&lock->mutex);
+ 		memset((void*) lwstat, 0, sizeof(LWLockStat));
+ 		SpinLockRelease(&lock->mutex);
+ 	}
+ }
+ 
+ #endif /* LWLOCK_STAT */
diff -cpr pgsql-orig/src/backend/utils/adt/pgstatfuncs.c pgsql/src/backend/utils/adt/pgstatfuncs.c
*** pgsql-orig/src/backend/utils/adt/pgstatfuncs.c	Mon Jul 31 09:22:05 2006
--- pgsql/src/backend/utils/adt/pgstatfuncs.c	Mon Jul 31 09:37:25 2006
***************
*** 14,19 ****
--- 14,21 ----
   */
  #include "postgres.h"
  
+ #include "access/heapam.h"
+ #include "catalog/pg_type.h"
  #include "funcapi.h"
  #include "miscadmin.h"
  #include "pgstat.h"
*************** extern Datum pg_stat_get_db_xact_rollbac
*** 53,58 ****
--- 55,63 ----
  extern Datum pg_stat_get_db_blocks_fetched(PG_FUNCTION_ARGS);
  extern Datum pg_stat_get_db_blocks_hit(PG_FUNCTION_ARGS);
  
+ extern Datum pg_stat_get_lwlock_name(PG_FUNCTION_ARGS);
+ extern Datum pg_stat_get_lwlocks(PG_FUNCTION_ARGS);
+ extern Datum pg_stat_reset_lwlocks(PG_FUNCTION_ARGS);
  
  Datum
  pg_stat_get_numscans(PG_FUNCTION_ARGS)
*************** pg_stat_get_db_blocks_hit(PG_FUNCTION_AR
*** 600,602 ****
--- 605,736 ----
  
  	PG_RETURN_INT64(result);
  }
+ 
+ /*
+  * LWLOCK_STAT staff
+  */
+ 
+ #ifdef LWLOCK_STAT
+ 
+ static HeapTuple
+ formLWLockStatTuple(FuncCallContext *funcctx)
+ {
+ 	int32	lockid = funcctx->call_cntr;
+ 	LWLockStat	stat;
+ 	char		nulls[7];
+ 	Datum		values[7];
+ 	LWLockKind	kind;
+ 
+ 	kind = LWLockGetStat(lockid, &stat);
+ 	if (kind == LW_KIND_INVALID)
+ 		return NULL;
+ 	
+ 	MemSet(values, 0, sizeof(values));
+ 	MemSet(nulls, ' ', sizeof(nulls));
+ 
+ 	values[0] = Int32GetDatum(lockid);
+ 	values[1] = Int32GetDatum((int32) kind);
+ 	values[2] = Int64GetDatum(stat.ncall[LW_SHARED]);
+ 	values[3] = Int64GetDatum(stat.nwait[LW_SHARED]);
+ 	values[4] = Int64GetDatum(stat.ncall[LW_EXCLUSIVE]);
+ 	values[5] = Int64GetDatum(stat.nwait[LW_EXCLUSIVE]);
+ 
+ 	return heap_formtuple(funcctx->tuple_desc, values, nulls);
+ }
+ 
+ /* pg_stat_get_lwlock_name(integer) : text */
+ Datum
+ pg_stat_get_lwlock_name(PG_FUNCTION_ARGS)
+ {
+ 	LWLockKind	lwkind;
+ 	const char *name;
+ 	int			name_len;
+ 	int			ret_len;
+ 	text	   *result;
+ 
+ 	lwkind = (LWLockKind) PG_GETARG_INT32(0);
+ 
+ 	name = LWLockName(lwkind);
+ 	name_len = strlen(name);
+ 	ret_len = VARHDRSZ + name_len;
+ 	result = palloc(ret_len);
+ 	VARATT_SIZEP(result) = ret_len;
+ 	memcpy(VARDATA(result), name, name_len);
+ 
+ 	PG_RETURN_TEXT_P(result);
+ }
+ 
+ /* pg_stat_get_lwlock(void) : set of lwlock_stat */
+ Datum
+ pg_stat_get_lwlocks(PG_FUNCTION_ARGS)
+ {
+ 	FuncCallContext	   *funcctx;
+ 	HeapTuple			tuple;
+ 
+ 	if (SRF_IS_FIRSTCALL())
+ 	{
+ 		MemoryContext	oldcontext;
+ 		TupleDesc		desc;
+ 
+ 		funcctx = SRF_FIRSTCALL_INIT();
+ 		oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
+ 		desc = CreateTemplateTupleDesc(6, false);
+ 
+ 		TupleDescInitEntry(desc, (AttrNumber)1, "id", INT4OID, -1, 0);
+ 		TupleDescInitEntry(desc, (AttrNumber)2, "kind", INT4OID, -1, 0);
+ 		TupleDescInitEntry(desc, (AttrNumber)3, "sh_call", INT8OID, -1, 0);
+ 		TupleDescInitEntry(desc, (AttrNumber)4, "sh_wait", INT8OID, -1, 0);
+ 		TupleDescInitEntry(desc, (AttrNumber)5, "ex_call", INT8OID, -1, 0);
+ 		TupleDescInitEntry(desc, (AttrNumber)6, "ex_wait", INT8OID, -1, 0);
+ 
+ 		funcctx->tuple_desc = BlessTupleDesc(desc);
+ 		funcctx->max_calls = (uint32)NumLWLocks();
+ 
+ 		MemoryContextSwitchTo(oldcontext);
+ 	}
+ 
+ 	funcctx = SRF_PERCALL_SETUP();
+ 
+ 	if (funcctx->call_cntr < funcctx->max_calls
+ 		&& (tuple = formLWLockStatTuple(funcctx)) != NULL)
+ 	{
+ 		SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
+ 	}
+ 	else
+ 	{
+ 		SRF_RETURN_DONE(funcctx);
+ 	}
+ }
+ 
+ /* pg_stat_reset_lwlocks(void) : boolean */
+ Datum
+ pg_stat_reset_lwlocks(PG_FUNCTION_ARGS)
+ {
+ 	LWLockResetStat();
+ 	PG_RETURN_BOOL(true);
+ }
+ 
+ #else  /* LWLOCK_STAT */
+ 
+ Datum
+ pg_stat_get_lwlock_name(PG_FUNCTION_ARGS)
+ {
+ 	elog(ERROR, "LWLOCK_STAT is disabled");
+ 	PG_RETURN_NULL();
+ }
+ 
+ Datum
+ pg_stat_get_lwlocks(PG_FUNCTION_ARGS)
+ {
+ 	elog(ERROR, "LWLOCK_STAT is disabled");
+ 	PG_RETURN_NULL();
+ }
+ 
+ Datum
+ pg_stat_reset_lwlocks(PG_FUNCTION_ARGS)
+ {
+ 	elog(ERROR, "LWLOCK_STAT is disabled");
+ 	PG_RETURN_NULL();
+ }
+ 
+ #endif /* LWLOCK_STAT */
diff -cpr pgsql-orig/src/include/catalog/pg_proc.h pgsql/src/include/catalog/pg_proc.h
*** pgsql-orig/src/include/catalog/pg_proc.h	Mon Jul 31 09:22:08 2006
--- pgsql/src/include/catalog/pg_proc.h	Mon Jul 31 09:37:25 2006
*************** DESCR("Statistics: Blocks fetched for da
*** 2917,2922 ****
--- 2917,2929 ----
  DATA(insert OID = 1945 (  pg_stat_get_db_blocks_hit		PGNSP PGUID 12 f f t f s 1 20 "26" _null_ _null_ _null_ pg_stat_get_db_blocks_hit - _null_ ));
  DESCR("Statistics: Blocks found in cache for database");
  
+ DATA(insert OID = 2830 (  pg_stat_get_lwlock_name	PGNSP PGUID 12 f f t f s 1 25 "23" _null_ _null_ _null_ pg_stat_get_lwlock_name - _null_ ));
+ DESCR("Statistics: Name of lwlock");
+ DATA(insert OID = 2831 (  pg_stat_get_lwlocks	PGNSP PGUID 12 f f t t v 0 2249 "" _null_ _null_ _null_ pg_stat_get_lwlocks - _null_ ));
+ DESCR("Statistics: Set of lwlock_stat tuple");
+ DATA(insert OID = 2832 (  pg_stat_reset_lwlocks	PGNSP PGUID 12 f f f f v 0 16  "" _null_ _null_ _null_ 	pg_stat_reset_lwlocks - _null_ ));
+ DESCR("Statistics: Reset collected statistics");
+ 
  DATA(insert OID = 1946 (  encode						PGNSP PGUID 12 f f t f i 2 25 "17 25" _null_ _null_ _null_	binary_encode - _null_ ));
  DESCR("Convert bytea value into some ascii-only text string");
  DATA(insert OID = 1947 (  decode						PGNSP PGUID 12 f f t f i 2 17 "25 25" _null_ _null_ _null_	binary_decode - _null_ ));
diff -cpr pgsql-orig/src/include/pg_config_manual.h pgsql/src/include/pg_config_manual.h
*** pgsql-orig/src/include/pg_config_manual.h	Mon Jul 31 09:22:09 2006
--- pgsql/src/include/pg_config_manual.h	Mon Jul 31 09:37:25 2006
***************
*** 245,250 ****
--- 245,255 ----
  /* #define WAL_DEBUG */
  
  /*
+  * Enable statistics collector for LWLock operations
+  */
+ /*#define LWLOCK_STAT*/
+ 
+ /*
   * Enable tracing of resource consumption during sort operations;
   * see also the trace_sort GUC var.  For 8.1 this is enabled by default.
   */
diff -cpr pgsql-orig/src/include/storage/lwlock.h pgsql/src/include/storage/lwlock.h
*** pgsql-orig/src/include/storage/lwlock.h	Mon Jul 31 09:22:09 2006
--- pgsql/src/include/storage/lwlock.h	Mon Jul 31 09:37:25 2006
*************** typedef enum LWLockMode
*** 76,87 ****
  	LW_SHARED
  } LWLockMode;
  
  
  #ifdef LOCK_DEBUG
  extern bool Trace_lwlocks;
  #endif
  
! extern LWLockId LWLockAssign(void);
  extern void LWLockAcquire(LWLockId lockid, LWLockMode mode);
  extern bool LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode);
  extern void LWLockRelease(LWLockId lockid);
--- 76,102 ----
  	LW_SHARED
  } LWLockMode;
  
+ typedef enum LWLockKind
+ {
+ 	LW_KIND_INVALID = -1,
+ 	LW_KIND_FIRSTFIXED = 0,
+ 	LW_KIND_LASTFIXED = FirstBufMappingLock - 1,
+ 	LW_KIND_BUFFER,			/* BufferMappingLock(s) */
+ 	LW_KIND_LOCK,			/* LockMgrLock(s) */
+ 	LW_KIND_BUFFERIO,		/* BufferDesc.io_in_progress_lock */
+ 	LW_KIND_BUFFERCONTENT,	/* BufferDesc.content_lock */
+ 	LW_KIND_CLOG,			/* ClogCtl */
+ 	LW_KIND_SUBTRANS,		/* SubTransCtl */
+ 	LW_KIND_MULTIXACTOFFSET,/* MultiXactOffsetCtl */
+ 	LW_KIND_MULTIXACTMEMBER,/* MultiXactMemberCtl */
+ 	NUM_LW_KINDS,			/* last kind */
+ } LWLockKind;
  
  #ifdef LOCK_DEBUG
  extern bool Trace_lwlocks;
  #endif
  
! extern LWLockId LWLockAssign(LWLockKind kind);
  extern void LWLockAcquire(LWLockId lockid, LWLockMode mode);
  extern bool LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode);
  extern void LWLockRelease(LWLockId lockid);
*************** extern bool LWLockHeldByMe(LWLockId lock
*** 91,95 ****
--- 106,124 ----
  extern int	NumLWLocks(void);
  extern Size LWLockShmemSize(void);
  extern void CreateLWLocks(void);
+ 
+ #ifdef LWLOCK_STAT
+ 
+ typedef struct LWLockStat
+ {
+ 	int64	ncall[2];	/* # of calls (0:Exclusive, 1:Shared) */
+ 	int64	nwait[2];	/* # of waits (0:Exclusive, 1:Shared) */
+ } LWLockStat;
+ 
+ extern const char *LWLockName(LWLockKind lwkind);
+ extern LWLockKind LWLockGetStat(LWLockId lockid, LWLockStat *dst);
+ extern void LWLockResetStat(void);
+ 
+ #endif /* LWLOCK_STAT */
  
  #endif   /* LWLOCK_H */
#17Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: ITAGAKI Takahiro (#15)
Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)

Hi, all.

We found the cause of the CS storm problem discussed in the previous mails
and made a provisional patch that resolves it, so here is a report.

Subject: [HACKERS] poor performance with Context Switch Storm at TPC-W.
Date: Tue, 11 Jul 2006 20:09:24 +0900
From: Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp>

Poor performance with a context switch storm occurred
on the following configuration.

Background:
SAVEPOINT has been supported since PostgreSQL 8.0.
Any transaction may contain one or more subtransactions.
When judging the visibility of a tuple, we must determine whether the XID
that inserted the tuple is a top-level transaction or a subtransaction
(when checking whether its XMIN committed).
To make that judgment, it is necessary to access SubTrans
(the data structure that tracks the parent of each transaction ID).
SubTrans is accessed through an LRU buffer.
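
As a rough sketch only (not the actual tqual.c code), the lookup described
above looks roughly like this; the function name is invented for illustration:

    #include "postgres.h"
    #include "access/subtrans.h"
    #include "access/transam.h"
    #include "utils/tqual.h"

    /*
     * Before the inserting XID can be compared against the snapshot, it is
     * mapped to its top-level parent.  SubTransGetTopmostTransaction() reads
     * pg_subtrans through SLRU buffers guarded by SubtransControlLock,
     * which is where the contention reported in this thread shows up.
     */
    static TransactionId
    topmost_xid_for_visibility(TransactionId xmin, Snapshot snapshot)
    {
        if (TransactionIdFollowsOrEquals(xmin, snapshot->xmin))
            xmin = SubTransGetTopmostTransaction(xmin);     /* SLRU read */
        return xmin;
    }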

Conditions for this phenomenon to occur:
- There are transactions that read tuples in large volume and at high
frequency (typically seq scans).
- There are updating transactions (at a sufficient frequency).
- There is a long-lived transaction (of sufficient duration).

Points of view:
(A) The buffer replacement algorithm is bad.
The timestamp of a page being swapped out is not updated until the
swap-out completes.
If other pages are accessed while the swap is in progress, the page
being swapped out remains the least recently used one,
so it is chosen again as the swap-out victim
(when the workload is high).

(B) SubTrans is accessed at every tuple visibility check, which is very frequent.
If many processes wait on the LWLock using semop, a CS storm occurs.

Results:
For (A),
I created a patch that does not make pages whose read/write is IN PROGRESS
candidates for replacement.
(It keeps a "betterslot" for the case where all the pages are
IN PROGRESS.)
The patch was applied;
however, the storm recurred, so it is not a fundamental solution.

For (B),
a patch was created that treats all transactions as
top-level transactions.
(Thank you, ITAGAKI.) The patch was applied and measured for 8 hours.
The CS storm problem stopped.

Discussion:
(1) Since neither SAVEPOINT nor PL/pgSQL error trapping is used,
subtransactions are unnecessary for this workload.
Would it be better to implement a mode that does not use subtransactions?

(2) It would be better if the data could be cached in a structure like
CLOG, so that the LRU buffer does not have to be checked on every
occasion.

Are there problems with this, or other ideas?
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#18Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Katsuhiko Okano (#17)
Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)

Katsuhiko Okano wrote:

We found the cause of the CS storm problem discussed in the previous mails
and made a provisional patch that resolves it, so here is a report.

(snip)

(A) The buffer replacement algorithm is bad.
The timestamp of a page being swapped out is not updated until the
swap-out completes.
If other pages are accessed while the swap is in progress, the page
being swapped out remains the least recently used one,
so it is chosen again as the swap-out victim
(when the workload is high).

The following is the patch.

diff -cpr postgresql-8.1.4-orig/src/backend/access/transam/slru.c postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c
*** postgresql-8.1.4-orig/src/backend/access/transam/slru.c	2006-01-21 13:38:27.000000000 +0900
--- postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c	2006-07-25 18:02:49.000000000 +0900
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 703,710 ****
  	for (;;)
  	{
  		int			slotno;
! 		int			bestslot = 0;
  		unsigned int bestcount = 0;
  
  		/* See if page already has a buffer assigned */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
--- 703,712 ----
  	for (;;)
  	{
  		int			slotno;
! 		int			bestslot = -1;
! 		int			betterslot = -1;
  		unsigned int bestcount = 0;
+ 		unsigned int bettercount = 0;
  
  		/* See if page already has a buffer assigned */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 720,732 ****
  		 */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
  		{
! 			if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
! 				return slotno;
! 			if (shared->page_lru_count[slotno] > bestcount &&
! 				shared->page_number[slotno] != shared->latest_page_number)
! 			{
! 				bestslot = slotno;
! 				bestcount = shared->page_lru_count[slotno];
  			}
  		}
--- 722,746 ----
  		 */
  		for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
  		{
! 			switch (shared->page_status[slotno])
! 			{
! 				case SLRU_PAGE_EMPTY:
! 					return slotno;
! 				case SLRU_PAGE_READ_IN_PROGRESS:
! 				case SLRU_PAGE_WRITE_IN_PROGRESS:
! 					if (shared->page_lru_count[slotno] > bettercount &&
! 						shared->page_number[slotno] != shared->latest_page_number)
! 					{
! 						betterslot = slotno;
! 						bettercount = shared->page_lru_count[slotno];
! 					}
! 				default:	/* SLRU_PAGE_CLEAN, SLRU_PAGE_DIRTY */
! 					if (shared->page_lru_count[slotno] > bestcount &&
! 						shared->page_number[slotno] != shared->latest_page_number)
! 					{
! 						bestslot = slotno;
! 						bestcount = shared->page_lru_count[slotno];
! 					}
  			}
  		}
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 736,741 ****
--- 750,758 ----
  		if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
  			return bestslot;
  
+ 		if (bestslot == -1)
+ 			bestslot = betterslot;
+ 
  		/*
  		 * We need to do I/O.  Normal case is that we have to write it out,
  		 * but it's possible in the worst case to have selected a read-busy

Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#19Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#16)
Re: LWLock statistics collector

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

Here is a patch to collect statistics of LWLocks.

This seems fairly invasive, as well as confused about whether it's an
#ifdef'able thing or not. You can't have system views and pg_proc
entries conditional on a compile-time #ifdef, so in a default build
we would have a lot of nonfunctional cruft exposed to users.

Do we really need this compared to the simplistic dump-to-stderr
counting support that's in there now? That stuff doesn't leave any
cruft behind when not enabled, and it has at least one significant
advantage over your proposal, which is that it's possible to get
per-process statistics when needed.

If I thought that average users would have a need for LWLock statistics,
I'd be more sympathetic to expending effort on a nice frontend for
viewing the statistics, but this is and always will be just a concern
for hardcore hackers ...

regards, tom lane

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Katsuhiko Okano (#18)
Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)

Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:

(A) The buffer replacement algorithm is bad.
The timestamp of a page being swapped out is not updated until the
swap-out completes.
If other pages are accessed while the swap is in progress, the page
being swapped out remains the least recently used one,
so it is chosen again as the swap-out victim
(when the workload is high).

The following is the patch.

I'm confused ... is this patch being proposed for inclusion? I
understood your previous message to say that it didn't help much.

The patch is buggy as posted, because it will try to do this:
if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
return bestslot;
while bestslot could still be -1.

I see your concern about multiple processes selecting the same buffer
for replacement, but what will actually happen is that all but the first
will block for the first one's I/O to complete using SimpleLruWaitIO,
and then all of them will repeat the outer loop and recheck what to do.
If they were all trying to swap in the same page this is actually
optimal. If they were trying to swap in different pages then the losing
processes will again try to initiate I/O on a different buffer. (They
will pick a different buffer, because the guy who got the buffer will
have done SlruRecentlyUsed on it before releasing the control lock ---
so I don't believe the worry that we get a buffer thrash scenario here.
Look at the callers of SlruSelectLRUPage not just the function itself.)

It's possible that letting different processes initiate I/O on different
buffers would be a win, but it might just result in excess writes,
depending on the relative probability of requests for the same page
vs. requests for different pages.

Also, I think the patch as posted would still cause processes to gang up
on the same buffer, it would just be a different one from before. The
right thing would be to locate the overall-oldest buffer and return it
if clean; otherwise to initiate I/O on the oldest buffer that isn't
either clean or write-busy, if there is one; otherwise just do WaitIO
on the oldest buffer. This would ensure that different processes try
to push different buffers to disk. They'd still go back and make their
decisions from the top after doing their I/O. Whether this is a win or
not is not clear to me, but at least it would attack the guessed-at
problem correctly.
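
Expressed as a sketch (not a tested patch; the names follow the 8.1 slru.c
code quoted above, and the control flow only illustrates the rule described
in the previous paragraph):

    /*
     * Illustrative fragment, meant to sit inside SlruSelectLRUPage()'s outer
     * loop: take the overall-oldest buffer if it is clean; otherwise start
     * I/O on the oldest buffer that is neither clean nor write-busy;
     * otherwise fall back to waiting on the overall-oldest buffer.  Edge
     * cases (e.g. every slot holding the latest page) are glossed over.
     */
    int         slotno;
    int         victim;
    int         oldest = -1;        /* overall-oldest replaceable slot */
    int         oldest_io = -1;     /* oldest slot neither clean nor write-busy */
    unsigned int oldest_count = 0;
    unsigned int oldest_io_count = 0;

    for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
    {
        if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
            return slotno;
        if (shared->page_number[slotno] == shared->latest_page_number)
            continue;
        if (oldest < 0 || shared->page_lru_count[slotno] > oldest_count)
        {
            oldest = slotno;
            oldest_count = shared->page_lru_count[slotno];
        }
        if (shared->page_status[slotno] != SLRU_PAGE_CLEAN &&
            shared->page_status[slotno] != SLRU_PAGE_WRITE_IN_PROGRESS &&
            (oldest_io < 0 || shared->page_lru_count[slotno] > oldest_io_count))
        {
            oldest_io = slotno;
            oldest_io_count = shared->page_lru_count[slotno];
        }
    }

    if (oldest >= 0 && shared->page_status[oldest] == SLRU_PAGE_CLEAN)
        return oldest;                      /* overall-oldest is clean: done */
    victim = (oldest_io >= 0) ? oldest_io : oldest;
    /* ... then write out or SimpleLruWaitIO() on 'victim' and retry, as now */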

regards, tom lane

#21Kevin Brown
kevin@sysexperts.com
In reply to: Tom Lane (#19)
Re: [PATCHES] LWLock statistics collector

Tom Lane wrote:

If I thought that average users would have a need for LWLock statistics,
I'd be more sympathetic to expending effort on a nice frontend for
viewing the statistics, but this is and always will be just a concern
for hardcore hackers ...

That may be true of the output, but that's not a very strong argument
against making it much easier to gather and display the LWLock
statistics. I can easily imagine the patch be a useful performance
troubleshooting tool in a high load environment. Depends on how
easy/intrusive it is to enable/use the stderr method on a production
system, though, as well as how much of a performance impact the
measurements have on overall operation...

--
Kevin Brown kevin@sysexperts.com

#22ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Tom Lane (#19)
Re: LWLock statistics collector

Tom Lane <tgl@sss.pgh.pa.us> wrote:

This seems fairly invasive, as well as confused about whether it's an
#ifdef'able thing or not. You can't have system views and pg_proc
entries conditional on a compile-time #ifdef, so in a default build
we would have a lot of nonfunctional cruft exposed to users.

Would it be acceptable if the pg_stat_lwlocks view and the related functions
were not installed, and therefore invisible, when LWLOCK_STAT is not defined?
We don't have such a mechanism now, but we could add one.

Do we really need this compared to the simplistic dump-to-stderr
counting support that's in there now? That stuff doesn't leave any
cruft behind when not enabled, and it has at least one significant
advantage over your proposal, which is that it's possible to get
per-process statistics when needed.

There is one advantage in my proposal: we can easily watch the
time-varying activity of LWLocks.
Are per-process statistics actually needed? If we use connection
pooling, we lose that advantage, and I think connection pooling is commonly
used in the kind of high-load cases where we encounter LWLock contention.

If I thought that average users would have a need for LWLock statistics,
I'd be more sympathetic to expending effort on a nice frontend for
viewing the statistics, but this is and always will be just a concern
for hardcore hackers ...

I assume the people who analyze the output of the patch are hardcore hackers,
but the people who actually run it and collect the information are average users.
The dump-to-stderr method is hard to use because it inflates the syslogs
and requires extra parsing effort.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#23Katsuhiko Okano
okano.katsuhiko@oss.ntt.co.jp
In reply to: Tom Lane (#20)
Re: LWLock statistics collector (was: CSStorm occurred again by postgreSQL8.2)

Tom Lane wrote:

I'm confused ... is this patch being proposed for inclusion? I
understood your previous message to say that it didn't help much.

This is only a patch for narrowing down where the problem is.

The patch is buggy as posted, because it will try to do this:
if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
return bestslot;
while bestslot could still be -1.

A check is required; understood.

(They
will pick a different buffer, because the guy who got the buffer will
have done SlruRecentlyUsed on it before releasing the control lock ---
so I don't believe the worry that we get a buffer thrash scenario here.
Look at the callers of SlruSelectLRUPage not just the function itself.)

Umm, I will read the code again.

otherwise to initiate I/O on the oldest buffer that isn't
either clean or write-busy, if there is one;

That point is difficult for me to understand, although it is important.
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#22)
Re: LWLock statistics collector

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

Tom Lane <tgl@sss.pgh.pa.us> wrote:

This seems fairly invasive, as well as confused about whether it's an
#ifdef'able thing or not. You can't have system views and pg_proc
entries conditional on a compile-time #ifdef, so in a default build
we would have a lot of nonfunctional cruft exposed to users.

Is it acceptable if pg_stat_lwlocks view and other functions are not
installed and invisible when LWLOCK_STAT is not defined? We don't have
such a feature now, but we can.

Then you'd need an initdb to go between having stats and not, which
carries its own downsides. If I thought there were a wide market for
this then I'd say sure, let's just have it there ... but I don't.

I think the actual wave of the future for analyzing behavior at the
LWLock level is going to be DTrace. It seems way more flexible than
an aggregate-statistics view can be.

regards, tom lane

#25Robert Lor
Robert.Lor@Sun.COM
In reply to: Tom Lane (#24)
Re: LWLock statistics collector

Tom Lane wrote:

I think the actual wave of the future for analyzing behavior at the
LWLock level is going to be DTrace. It seems way more flexible than
an aggregate-statistics view can be.

CVS head now has the following LWLock probes, and more can easily be
added. These probes can be enabled using the sample DTrace scripts at
http://pgfoundry.org/projects/dtrace/

lwlock-acquire
lwlock-release
lwlock-startwait
lwlock-endwait
lwlock-condacquire
lwlock-condacquire-fail

If anyone wants to have access to a Solaris machine to play with DTrace,
let me know.

Regards,
-Robert

#26ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Robert Lor (#25)
Re: LWLock statistics collector

Robert Lor <Robert.Lor@Sun.COM> wrote:

CVS head now has the following LWLock probes, and more can easily be
added. These probes can be enabled using the sample DTrace scripts at
http://pgfoundry.org/projects/dtrace/

lwlock-acquire
lwlock-release
lwlock-startwait
lwlock-endwait
lwlock-condacquire
lwlock-condacquire-fail

If anyone wants to have access to a Solaris machine to play with DTrace,
let me know.

I am sure that DTrace is a powerful tool, but can we gather LWLock
statistics well with it? There is a "color coding" problem in both the
present dump-to-stderr method and the DTrace method: I assume we want to
gather the statistics per resource (represented by LWLockKind in my patch),
not per LWLockId.

Even if we use DTrace, do we still need some support for coloring LWLocks?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#26)
Re: LWLock statistics collector

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

I assume we want to gather the statistics per resource (represented by
LWLockKind in my patch), not per LWLockId.

Why? I haven't yet seen any problem where that looked useful.
The named LWLocks are certainly sui generis, and as for things like
per-buffer locks I haven't seen any need for aggregate statistics ---
I'd rather know about "hot spots" if there are any buffers that are
not like the rest.

regards, tom lane

#28ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Tom Lane (#8)
Re: CSStorm occurred again by postgreSQL8.2

Tom Lane <tgl@sss.pgh.pa.us> wrote:

It does not solve the problem even if we increase NUM_SUBTRANS_BUFFERS;
the problem is only postponed.

Can you provide a reproducible test case for this?

This is a reproducible test case:
- It occurs on both 8.1.4 and HEAD.
- On an SMP machine. I used dual Opterons;
  the CSStorm becomes worse on dual Xeons with hyper-threading.
- Tuning parameters are the defaults. The whole data set is cached in shared buffers.
  (shared_buffers=32MB; the pgbench data at scale=1 is less than 15MB.)
- Using a custom pgbench workload: one client doing UPDATEs with index scans
  and multiple clients doing SELECTs with seqscans/index scans.

$ pgbench -i
$ pgbench -n -c 1 -t 100000 -f cs_update.sql &
$ pgbench -n -c 50 -t 100000 -f cs_indexscan.sql &
$ pgbench -n -c 35 -t 100000 -f cs_seqscan.sql &
(The scripts are attached at the end of this message.)

In the above workload, context switches are 2000-10000/sec and CPU usage is
user=100%. Then start a long open transaction on another connection.

$ psql
# begin; -- Long open transaction

After 30-60 seconds, context switches exceed 50000/sec
(over 120000 on the Xeons) and CPU usage is user=66% / sys=21% / idle=13%.
If we increase the frequency of the UPDATEs, that delay becomes shorter.

This is an artificial workload, but I can see the same condition in
TPC-W -- even though it is a benchmark. TPC-W requires full-text search,
which is implemented using "LIKE %foo%" in my implementation (in DBT-1, too).
It also requires periodic aggregations. These might behave as long
transactions.

The cause seems to be lock contention. The lock counts on
SubtransControlLock and SubTransBuffer increase significantly
compared with those on BufMappingLock.

# Before starting a long transaction.
 kind |       lwlock        |  sh_call | sh_wait | ex_call | ex_wait
------+---------------------+----------+---------+---------+---------
   13 | SubtransControlLock |    28716 |       2 |      54 |       0
   22 | BufMappingLock      | 11637884 |       0 |    2492 |       0
   27 | SubTransBuffer      |        0 |       0 |      11 |       0

# After
 kind |       lwlock        |  sh_call | sh_wait | ex_call | ex_wait
------+---------------------+----------+---------+---------+---------
   13 | SubtransControlLock |  4139111 |   65059 | 3926691 |  390838
   22 | BufMappingLock      | 32348073 |       0 |    2509 |       0
   27 | SubTransBuffer      |   939646 |  960341 | 1419152 |      61

The callers of the SubTrans module are the two SubTransGetTopmostTransaction()
calls in HeapTupleSatisfiesSnapshot(). When I disabled those calls, the CSStorm
did not occur. SubTransGetTopmostTransaction() returns its argument unchanged
when no SAVEPOINTs are used.
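
For reference, here is a simplified sketch of what SubTransGetTopmostTransaction()
does; each SubTransGetParent() call takes SubtransControlLock and reads a
pg_subtrans page, which is where the contention shows up:

    TransactionId
    SubTransGetTopmostTransaction(TransactionId xid)
    {
        TransactionId parentXid = xid;
        TransactionId previousXid = xid;

        /* follow parent links until we reach the top-level transaction */
        while (TransactionIdIsValid(parentXid))
        {
            previousXid = parentXid;
            parentXid = SubTransGetParent(parentXid);   /* hits pg_subtrans */
        }

        return previousXid;
    }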

If we optimize for the no-subtransaction case, we can avoid locking SubTrans
when checking the visibility of tuples inserted by top-level transactions.
If we want to resolve the problem fundamentally, we might have to
improve SubTrans with a better buffer management algorithm or the like.

Do you have any idea to avoid such a problem?

-- cs_update.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
\setrandom delta -5000 5000
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT pg_sleep(0.1);

-- cs_seqscan.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid::int8 = :aid; -- cast to force seqscan

-- cs_indexscan.sql
\set naccounts 100000 * :tps
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid = :aid;

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#28)
Re: CSStorm occurred again by postgreSQL8.2

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

The invokers of SubTrans module are two SubTransGetTopmostTransaction()
in HeapTupleSatisfiesSnapshot(). When I disabled the calls, CSStorm did
not occur. SubTransGetTopmostTransaction returns the argument without
change when we don't use SAVEPOINTs.

If we optimize for non-subtransactions, we can avoid to lock SubTrans
for check visiblities of tuples inserted by top transactions.

Only for top transactions still in progress, so I doubt that would
help much.

I'm wondering about doing something similar to what
TransactionIdIsInProgress does, ie, make use of the PGPROC lists
of live subtransactions. Suppose that GetSnapshotData copies not
only top xids but live subxids into the snapshot, and adds a flag
indicating whether the subxids are complete (ie, none of the subxid
lists have overflowed). Then if the flag is set, tqual.c doesn't
need to do SubTransGetTopmostTransaction() before searching the
list.
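
Roughly, the tuple-visibility code could then use something like this
(sketch only; the subxids_complete flag and field names are illustrative):

    /* Is xid, known to satisfy xmin <= xid < xmax, still in progress? */
    static bool
    XidInSnapshot(TransactionId xid, Snapshot snapshot)
    {
        int     i;

        if (!snapshot->subxids_complete)
        {
            /* some subxid cache overflowed: fall back to pg_subtrans */
            xid = SubTransGetTopmostTransaction(xid);
        }
        else
        {
            /* subxip[] is complete, so any live subxact must appear in it */
            for (i = 0; i < snapshot->subxcnt; i++)
                if (TransactionIdEquals(xid, snapshot->subxip[i]))
                    return true;
            /* otherwise treat xid as a top-level XID */
        }

        for (i = 0; i < snapshot->xcnt; i++)
            if (TransactionIdEquals(xid, snapshot->xip[i]))
                return true;

        return false;
    }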

regards, tom lane

#30Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#29)
Re: CSStorm occurred again by postgreSQL8.2

Tom Lane wrote:

I'm wondering about doing something similar to what
TransactionIdIsInProgress does, ie, make use of the PGPROC lists
of live subtransactions. Suppose that GetSnapshotData copies not
only top xids but live subxids into the snapshot, and adds a flag
indicating whether the subxids are complete (ie, none of the subxid
lists have overflowed). Then if the flag is set, tqual.c doesn't
need to do SubTransGetTopmostTransaction() before searching the
list.

Well, that sounds awfully more expensive than setting
local-to-my-database Xmins as well as global (all databases) Xmins :-)

On the other hand, ISTM as soon as one cache overflows, you have to go
check pg_subtrans which means the entire optimization buys nothing in
that case. It would be nice if the optimization degraded more
gracefully. I don't have any concrete suggestion though. The changes
proposed in the other CS-storm thread by the NTT person may help.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#31Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#30)
Re: CSStorm occurred again by postgreSQL8.2

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

I'm wondering about doing something similar to what
TransactionIdIsInProgress does, ie, make use of the PGPROC lists
of live subtransactions.

Well, that sounds awfully more expensive than setting
local-to-my-database Xmins as well as global (all databases) Xmins :-)

Only when you've got a lot of subtransactions. The point of this
discussion is to optimize for the few-or-none case. In any case,
the problem with the local/global bit was that you'd be expending
foreground-process cycles without any foreground-process return.
Being able to use a snapshot without consulting pg_subtrans will
certainly buy back some cycles...

regards, tom lane

#32Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#31)
Re: CSStorm occurred again by postgreSQL8.2

Tom Lane wrote:

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

I'm wondering about doing something similar to what
TransactionIdIsInProgress does, ie, make use of the PGPROC lists
of live subtransactions.

Well, that sounds awfully more expensive than setting
local-to-my-database Xmins as well as global (all databases) Xmins :-)

Only when you've got a lot of subtransactions. The point of this
discussion is to optimize for the few-or-none case. In any case,
the problem with the local/global bit was that you'd be expending
foreground-process cycles without any foreground-process return.
Being able to use a snapshot without consulting pg_subtrans will
certainly buy back some cycles...

I can buy that. Some idle thoughts:

I was thinking about when the most appropriate time is to insert or remove
an Xid from the cache. We can do without any excl-locking because 1) we
already assume the storing of an Xid to be atomic, and 2) no one can be
interested in querying for an Xid before the originating transaction has
had the chance to write a tuple with that Xid anyway.

OTOH I think we only need to store live Xids and those finished that
wrote a WAL record; we can drop subaborted and subcommitted if they
didn't.

On the third hand, are we going to sh-acquire the ProcArray lock while a
GetSnapshotData copies all subxact Xids of all running transactions?
ProcArrayLock will become more of a contention point than it already is.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#32)
Re: CSStorm occurred again by postgreSQL8.2

Alvaro Herrera <alvherre@commandprompt.com> writes:

I was thinking at what time was the most appropiate to insert or remove
an Xid from the cache. We can do without any excl-locking because 1) we
already assume the storing of an Xid to be atomic, and 2) no one can be
interested in querying for an Xid before the originating transaction has
had the chance to write a tuple with that Xid anyway.

Actually ... that fails if GetSnapshotData is going to copy subtrans
XIDs. So this area needs more thought.

On the third hand, are we going to sh-acquire the ProcArray lock while a
GetSnapshotData copies all subxact Xids of all running transactions?
ProcArrayLock will become more of a contention point than it already is.

Yeah, but sharelock is better than exclusive lock ...

regards, tom lane

#34ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: ITAGAKI Takahiro (#28)
Re: CSStorm occurred again by postgreSQL8.2

This is some additional information.

I wrote:

If we want to resolve the problem fundamentally, we might have to
improve SubTrans with a better buffer management algorithm or the like.

The above is probably wrong. I checked each LWLock of pg_subtrans's buffers.
All of them are acquired uniformly and I could not see any differences
among buffers. So the cause seems to be not the buffer management algorithm,
but simply a lack of SLRU buffer pages.

NUM_SUBTRANS_BUFFERS is defined as 32 in HEAD. If we increase it,
we can stave off the old-transaction problem for a while.
However, it doesn't help much under high load -- for example, on a workload
with 2000 tps we will use up 1000 pg_subtrans pages in 15 minutes
(pg_subtrans stores one 4-byte parent XID per transaction, so an 8kB page
covers about 2000 XIDs). I suppose that is not enough for a mix of online
and batch/maintenance work.

Also, the simple buffer lookup in SLRU will likely cause another performance
issue if we increase the number of buffers too much. SLRU scans its buffers
sequentially, so it does not scale well to many buffers.
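
The lookup is essentially this shape (an illustrative sketch only, not the
actual slru.c code or field names):

    typedef struct
    {
        int     nslots;         /* e.g. NUM_SUBTRANS_BUFFERS */
        int    *page_number;    /* page loaded in each slot */
        bool   *page_valid;     /* is the slot in use? */
    } SlruCacheSketch;

    static int
    slru_lookup(SlruCacheSketch *cache, int pageno)
    {
        int     slotno;

        /* every lookup walks all slots, so cost grows linearly with nslots */
        for (slotno = 0; slotno < cache->nslots; slotno++)
        {
            if (cache->page_valid[slotno] &&
                cache->page_number[slotno] == pageno)
                return slotno;
        }
        return -1;      /* miss: caller must evict a slot and read the page */
    }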

I hope for some care in an upper layer -- snapshots, hint bits, or
something -- as is being discussed in the recent thread.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

#35Bruce Momjian
bruce@momjian.us
In reply to: ITAGAKI Takahiro (#34)
Re: CSStorm occurred again by postgreSQL8.2

Is there anything to do for 8.2 here?

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:

This is some additional information.

I wrote:

If we want to resolve the problem fundamentally, we might have to
improve SubTrans with a better buffer management algorithm or the like.

The above is probably wrong. I checked each LWLock of pg_subtrans's buffers.
All of them are acquired uniformly and I could not see any differences
among buffers. So the cause seems to be not the buffer management algorithm,
but simply a lack of SLRU buffer pages.

NUM_SUBTRANS_BUFFERS is defined as 32 in HEAD. If we increase it,
we can stave off the old-transaction problem for a while.
However, it doesn't help much under high load -- for example, on a workload
with 2000 tps we will use up 1000 pg_subtrans pages in 15 minutes.
I suppose that is not enough for a mix of online and batch/maintenance work.

Also, the simple buffer lookup in SLRU will likely cause another performance
issue if we increase the number of buffers too much. SLRU scans its buffers
sequentially, so it does not scale well to many buffers.

I hope for some care in an upper layer -- snapshots, hint bits, or
something -- as is being discussed in the recent thread.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#36ITAGAKI Takahiro
itagaki.takahiro@oss.ntt.co.jp
In reply to: Bruce Momjian (#35)
1 attachment(s)
Re: [HACKERS] CSStorm occurred again by postgreSQL8.2

Bruce Momjian <bruce@momjian.us> wrote:

Is there anything to do for 8.2 here?

I'm working on Tom's idea. It is not a new feature and does not change
the on-disk structures, so I hope it can meet the 8.2 deadline...

Tom Lane <tgl@sss.pgh.pa.us> wrote:

I'm wondering about doing something similar to what
TransactionIdIsInProgress does, ie, make use of the PGPROC lists
of live subtransactions. Suppose that GetSnapshotData copies not
only top xids but live subxids into the snapshot, and adds a flag
indicating whether the subxids are complete (ie, none of the subxid
lists have overflowed). Then if the flag is set, tqual.c doesn't
need to do SubTransGetTopmostTransaction() before searching the
list.

I added a subxid array to Snapshot; running subxids are gathered from the
PGPROC->subxids caches. There are two overflow cases: any of the PGPROC->subxids
caches has overflowed, or the total number of subxids exceeds the pre-allocated
buffer. If overflowed, we cannot avoid calling SubTransGetTopmostTransaction().

I tested the patch and did not see any context switch storm coming from
pg_subtrans, but there may be some bugs in the visibility checking.
It would be very nice if you could review or test the patch.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachments:

snapshot_subtrans.patch (application/octet-stream)
diff -cpr pgsql-orig/src/backend/storage/ipc/procarray.c pgsql/src/backend/storage/ipc/procarray.c
*** pgsql-orig/src/backend/storage/ipc/procarray.c	Mon Aug 28 09:36:08 2006
--- pgsql/src/backend/storage/ipc/procarray.c	Mon Aug 28 17:38:05 2006
*************** GetSnapshotData(Snapshot snapshot, bool 
*** 499,504 ****
--- 499,505 ----
  	TransactionId globalxmin;
  	int			index;
  	int			count = 0;
+ 	int			subcount = 0;
  
  	Assert(snapshot != NULL);
  
*************** GetSnapshotData(Snapshot snapshot, bool 
*** 599,604 ****
--- 600,642 ----
  		if (TransactionIdIsNormal(xid))
  			if (TransactionIdPrecedes(xid, globalxmin))
  				globalxmin = xid;
+ 
+ 		/* Query subtransaction ids */
+ 		if (subcount >= 0)
+ 		{
+ 			if (proc->subxids.overflowed)
+ 			{
+ 				subcount = -1;	/* overflowed */
+ 			}
+ 			else if (proc->subxids.nxids > 0)
+ 			{
+ 				int		sub;
+ 
+ 				/* XXX: or use some multiplied value? */
+ 				int		maxSubCount = arrayP->maxProcs;
+ 
+ 				if (snapshot->subxip == NULL)
+ 				{
+ 					snapshot->subxip = (TransactionId *)
+ 						malloc(maxSubCount * sizeof(TransactionId));
+ 					if (snapshot->subxip == NULL)
+ 						ereport(ERROR,
+ 								(errcode(ERRCODE_OUT_OF_MEMORY),
+ 								errmsg("out of memory")));
+ 				}
+ 				
+ 				for (sub = 0; sub < proc->subxids.nxids; sub++)
+ 				{
+ 					snapshot->subxip[subcount] = proc->subxids.xids[sub];
+ 					subcount++;
+ 					if (subcount >= maxSubCount)
+ 					{
+ 						subcount = -1;	/* overflowed */
+ 						break;
+ 					}
+ 				}
+ 			}
+ 		}
  	}
  
  	if (serializable)
*************** GetSnapshotData(Snapshot snapshot, bool 
*** 621,626 ****
--- 659,665 ----
  	snapshot->xmin = xmin;
  	snapshot->xmax = xmax;
  	snapshot->xcnt = count;
+ 	snapshot->subxcnt = subcount;
  
  	snapshot->curcid = GetCurrentCommandId();
  
diff -cpr pgsql-orig/src/backend/utils/time/tqual.c pgsql/src/backend/utils/time/tqual.c
*** pgsql-orig/src/backend/utils/time/tqual.c	Mon Aug 28 09:36:10 2006
--- pgsql/src/backend/utils/time/tqual.c	Mon Aug 28 17:55:58 2006
*************** TransactionId TransactionXmin = InvalidT
*** 72,77 ****
--- 72,78 ----
  TransactionId RecentXmin = InvalidTransactionId;
  TransactionId RecentGlobalXmin = InvalidTransactionId;
  
+ static bool XidInSnapshot(TransactionId xid, Snapshot snapshot);
  
  /*
   * HeapTupleSatisfiesItself
*************** HeapTupleSatisfiesSnapshot(HeapTupleHead
*** 954,979 ****
  	if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmin(tuple),
  									 snapshot->xmin))
  	{
- 		TransactionId parentXid;
- 
  		if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmin(tuple),
  										 snapshot->xmax))
  			return false;
  
! 		parentXid = SubTransGetTopmostTransaction(HeapTupleHeaderGetXmin(tuple));
! 
! 		if (TransactionIdFollowsOrEquals(parentXid, snapshot->xmin))
! 		{
! 			uint32		i;
! 
! 			/* no point in checking parentXid against xmax here */
! 
! 			for (i = 0; i < snapshot->xcnt; i++)
! 			{
! 				if (TransactionIdEquals(parentXid, snapshot->xip[i]))
! 					return false;
! 			}
! 		}
  	}
  
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
--- 955,966 ----
  	if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmin(tuple),
  									 snapshot->xmin))
  	{
  		if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmin(tuple),
  										 snapshot->xmax))
  			return false;
  
! 		if (XidInSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
! 			return false;
  	}
  
  	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
*************** HeapTupleSatisfiesSnapshot(HeapTupleHead
*** 1023,1048 ****
  	if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmax(tuple),
  									 snapshot->xmin))
  	{
- 		TransactionId parentXid;
- 
  		if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmax(tuple),
  										 snapshot->xmax))
  			return true;
  
! 		parentXid = SubTransGetTopmostTransaction(HeapTupleHeaderGetXmax(tuple));
! 
! 		if (TransactionIdFollowsOrEquals(parentXid, snapshot->xmin))
! 		{
! 			uint32		i;
! 
! 			/* no point in checking parentXid against xmax here */
! 
! 			for (i = 0; i < snapshot->xcnt; i++)
! 			{
! 				if (TransactionIdEquals(parentXid, snapshot->xip[i]))
! 					return true;
! 			}
! 		}
  	}
  
  /* This is to be used only for disaster recovery and requires serious analysis. */
--- 1010,1021 ----
  	if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmax(tuple),
  									 snapshot->xmin))
  	{
  		if (TransactionIdFollowsOrEquals(HeapTupleHeaderGetXmax(tuple),
  										 snapshot->xmax))
  			return true;
  
! 		if (XidInSnapshot(HeapTupleHeaderGetXmax(tuple), snapshot))
! 			return true;
  	}
  
  /* This is to be used only for disaster recovery and requires serious analysis. */
*************** Snapshot
*** 1299,1309 ****
  CopySnapshot(Snapshot snapshot)
  {
  	Snapshot	newsnap;
  
  	/* We allocate any XID array needed in the same palloc block. */
! 	newsnap = (Snapshot) palloc(sizeof(SnapshotData) +
! 								snapshot->xcnt * sizeof(TransactionId));
  	memcpy(newsnap, snapshot, sizeof(SnapshotData));
  	if (snapshot->xcnt > 0)
  	{
  		newsnap->xip = (TransactionId *) (newsnap + 1);
--- 1272,1290 ----
  CopySnapshot(Snapshot snapshot)
  {
  	Snapshot	newsnap;
+ 	Size		subxipoff;
+ 	Size		size;
+ 
+ 	size = subxipoff = sizeof(SnapshotData) +
+ 		snapshot->xcnt * sizeof(TransactionId);
+ 	if (snapshot->subxcnt > 0)
+ 		size += snapshot->subxcnt * sizeof(TransactionId);
  
  	/* We allocate any XID array needed in the same palloc block. */
! 	newsnap = (Snapshot) palloc(size);
  	memcpy(newsnap, snapshot, sizeof(SnapshotData));
+ 
+ 	/* setup XID array */
  	if (snapshot->xcnt > 0)
  	{
  		newsnap->xip = (TransactionId *) (newsnap + 1);
*************** CopySnapshot(Snapshot snapshot)
*** 1313,1318 ****
--- 1294,1309 ----
  	else
  		newsnap->xip = NULL;
  
+ 	/* setup sub XID array */
+ 	if (snapshot->subxcnt > 0)
+ 	{
+ 		newsnap->subxip = (TransactionId *) ((char *) newsnap + subxipoff);
+ 		memcpy(newsnap->subxip, snapshot->subxip,
+ 			   snapshot->subxcnt * sizeof(TransactionId));
+ 	}
+ 	else
+ 		newsnap->subxip = NULL;
+ 
  	return newsnap;
  }
  
*************** FreeXactSnapshot(void)
*** 1346,1349 ****
--- 1337,1386 ----
  	SerializableSnapshot = NULL;
  	LatestSnapshot = NULL;
  	ActiveSnapshot = NULL;		/* just for cleanliness */
+ }
+ 
+ static bool
+ XidInSnapshot(TransactionId xid, Snapshot snapshot)
+ {
+ 	TransactionId parentXid;
+ 
+ 	if (snapshot->subxcnt < 0)
+ 	{
+ 		/*
+ 		 * The snapshot does not have enough information. We will get
+ 		 * information from pg_subtrans log.
+ 		 */
+ 		parentXid = SubTransGetTopmostTransaction(xid);
+ 	}
+ 	else
+ 	{
+ 		int32		i;
+ 
+ 		for (i = 0; i < snapshot->subxcnt; i++)
+ 		{
+ 			if (TransactionIdEquals(xid, snapshot->subxip[i]))
+ 				return true;
+ 		}
+ 
+ 		/*
+ 		 * XID may not be a top transaction id. However, they will not match
+ 		 * any top XIDs. So we can merely ignore any subxids not in subxip.
+ 		 */
+ 		parentXid = xid;
+ 	}
+ 
+ 	if (TransactionIdFollowsOrEquals(parentXid, snapshot->xmin))
+ 	{
+ 		uint32		i;
+ 
+ 		/* no point in checking parentXid against xmax here */
+ 
+ 		for (i = 0; i < snapshot->xcnt; i++)
+ 		{
+ 			if (TransactionIdEquals(parentXid, snapshot->xip[i]))
+ 				return true;
+ 		}
+ 	}
+ 
+ 	return false;
  }
diff -cpr pgsql-orig/src/include/utils/tqual.h pgsql/src/include/utils/tqual.h
*** pgsql-orig/src/include/utils/tqual.h	Mon Aug 28 09:36:13 2006
--- pgsql/src/include/utils/tqual.h	Mon Aug 28 17:24:04 2006
*************** typedef struct SnapshotData
*** 39,44 ****
--- 39,46 ----
  	TransactionId xmax;			/* XID >= xmax are invisible to me */
  	uint32		xcnt;			/* # of xact ids in xip[] */
  	TransactionId *xip;			/* array of xact IDs in progress */
+ 	int32		subxcnt;		/* # of xact ids in subxip[], -1 on overflow */
+ 	TransactionId *subxip;		/* array of subxact IDs in progress */
  	/* note: all ids in xip[] satisfy xmin <= xip[i] < xmax */
  	CommandId	curcid;			/* in my xact, CID < curcid are visible */
  } SnapshotData;
#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: ITAGAKI Takahiro (#36)
Re: CSStorm occurred again by postgreSQL8.2

ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:

I added a subxid array to Snapshot and running subxids are gathered from
PGPROC->subxids cache. There are two overflowed case; any of PGPROC->subxids
are overflowed or the number of total subxids exceeds pre-allocated buffers.
If overflowed, we cannot avoid to call SubTransGetTopmostTransaction.

Applied after some editorialization (you really need to pay more
attention to keeping comments in sync with code ;-))

I cannot measure any consistent speed difference in plain pgbench
scenarios with and without the patch, so at least as a rough
approximation the extra cycles in GetSnapshotData aren't hurting.
And I confirm that the test case you posted before no longer exhibits
a context-swap storm.

This change makes it even more obvious than before that we really want
to stay out of the subxids-overflowed regime. I don't especially want
to make those arrays even bigger, but I wonder if there isn't more we
can do to use them efficiently. In particular, it seems like in the
case where RecordSubTransactionCommit detects that the subxact has not
stored its XID anywhere, we could immediately remove the XID from
the PGPROC array, just as if it had aborted. This would avoid chewing
subxid slots for cases such as exception blocks in plpgsql that are
not modifying the database, but just catching computational errors.
Comments?

regards, tom lane

#38Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#32)
Re: CSStorm occurred again by postgreSQL8.2

Alvaro Herrera <alvherre@commandprompt.com> writes:

OTOH I think we only need to store live Xids and those finished that
wrote a WAL record; we can drop subaborted and subcommitted if they
didn't.

While reviewing this thread, I see Alvaro already had the idea I just
came to...

regards, tom lane

#39Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#37)
Re: CSStorm occurred again by postgreSQL8.2

I wrote:

... it seems like in the
case where RecordSubTransactionCommit detects that the subxact has not
stored its XID anywhere, we could immediately remove the XID from
the PGPROC array, just as if it had aborted. This would avoid chewing
subxid slots for cases such as exception blocks in plpgsql that are
not modifying the database, but just catching computational errors.

(and later realized that Alvaro had had the same idea awhile back, but
I don't have his message at hand).

I looked into this a bit more; it seems like basically it should only
take addition of

else
XidCacheRemoveRunningXids(xid, 0, NULL);

to the bottom of RecordSubTransactionCommit(), plus suitable adjustment
of the comments in both routines. However, there's a problem: if we
delete a second-level subxact's XID from PGPROC, and later its upper
subtransaction aborts, XidCacheRemoveRunningXids will emit scary
warnings when it doesn't find the sub-subxact in PGPROC. This could
doubtless be fixed with sufficient jiggery-pokery --- simply removing
the debug warnings would be a brute-force answer, but I'd like to find
something a bit less brute-force. Maybe drop the sub-subxact from its
parent's list immediately, instead of carrying it forward?

Anyway, given that there's this one nonobvious gotcha, there might be
others. My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

regards, tom lane

#40Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#39)
Re: CSStorm occurred again by postgreSQL8.2

Tom Lane <tgl@sss.pgh.pa.us> writes:

Anyway, given that there's this one nonobvious gotcha, there might be
others. My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

I wonder if Theo's recently reported problem with 4.3M child xids changes
the calculus on this.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#41Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#40)
Re: CSStorm occurred again by postgreSQL8.2

Gregory Stark <stark@enterprisedb.com> writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

Anyway, given that there's this one nonobvious gotcha, there might be
others. My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

I wonder if Theo's recently reported problem with 4.3M child xids changes
the calculus on this.

Yeah, I was just looking at that. Removing useless entries from the
child-xid list would presumably help him. Considering we're not even
formally in beta yet, I'm probably being too conservative to recommend
we not touch it.

regards, tom lane

#42Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#41)
Re: CSStorm occurred again by postgreSQL8.2

I wrote:

Yeah, I was just looking at that. Removing useless entries from the
child-xid list would presumably help him. Considering we're not even
formally in beta yet, I'm probably being too conservative to recommend
we not touch it.

Actually ... wait a minute. We do not assign an XID to a subtransaction
at all unless it writes a tuple to disk (see GetCurrentTransactionId
and its callers). So this whole "optimization" idea is redundant.

I see a bug though, which is that RecordSubTransactionAbort() calls
GetCurrentTransactionId() before having verified that it needs to do
anything. This means that we'll generate and then discard an XID
uselessly in a failed subxact that didn't touch disk. Worth fixing,
but it doesn't look like this is Theo's problem.

Unless I'm missing something, Theo's problem must involve having done
tuple updates in 4.6M different subtransactions.

regards, tom lane

#43Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#42)
Re: CSStorm occurred again by postgreSQL8.2

I wrote:

I see a bug though, which is that RecordSubTransactionAbort() calls
GetCurrentTransactionId() before having verified that it needs to do
anything. This means that we'll generate and then discard an XID
uselessly in a failed subxact that didn't touch disk.

Well, it would be a bug except that RecordSubTransactionAbort isn't
called unless the current subxact has an XID. Perhaps a comment would
be appropriate but there's nothing to fix here.

I think Theo's problem is probably somewhere else, too --- apparently
it's not so much that TransactionIdIsCurrentTransactionId takes a long
time as that something is calling it lots of times with no check for
interrupt.

regards, tom lane

#44Alvaro Herrera
alvherre@commandprompt.com
In reply to: Tom Lane (#43)
Re: CSStorm occurred again by postgreSQL8.2

Tom Lane wrote:

I wrote:

I see a bug though, which is that RecordSubTransactionAbort() calls
GetCurrentTransactionId() before having verified that it needs to do
anything. This means that we'll generate and then discard an XID
uselessly in a failed subxact that didn't touch disk.

Well, it would be a bug except that RecordSubTransactionAbort isn't
called unless the current subxact has an XID. Perhaps a comment would
be appropriate but there's nothing to fix here.

I think Theo's problem is probably somewhere else, too --- apparently
it's not so much that TransactionIdIsCurrentTransactionId takes a long
time as that something is calling it lots of times with no check for
interrupt.

Could it be something like heap_lock_tuple? It calls MultiXactIdWait,
which calls GetMultiXactIdMembers and TransactionIdIsCurrentTransactionId
on each member. (heap_update and heap_delete do the same thing.) I
must admit I didn't read Theo's description of his scenario though.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

#45Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#44)
Re: CSStorm occurred again by postgreSQL8.2

Alvaro Herrera <alvherre@commandprompt.com> writes:

Tom Lane wrote:

I think Theo's problem is probably somewhere else, too --- apparently
it's not so much that TransactionIdIsCurrentTransactionId takes a long
time as that something is calling it lots of times with no check for
interrupt.

Could it be something like heap_lock_tuple? It calls MultiXactIdWait,
which calls GetMultXactIdMembers and TransactionIdIsCurrentTransactionId
on each member. (heap_update and heap_delete do the same thing). I
must admit I didn't read Theo's description on his scenario though.

He shows HeapTupleSatisfiesSnapshot as the next thing down the call
stack, so those scenarios don't seem quite right. I'm wondering about a
CHECK_FOR_INTERRUPTS-free loop in either plperl or trigger handling,
myself.

Anyway, I was thinking some more about Theo's original suggestion that
the linked-list representation of childXids was too inefficient.  I'm
disinclined to use a hash as he suggests, but it strikes me that we
could very easily change the list into a dynamically extended array
--- and because the entries are surely added in increasing XID order,
such an array could be binary-searched.  This wouldn't be a win for
very small numbers of child XIDs, but for large numbers it would.
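
Something along these lines (a sketch only, not a worked-out implementation):

    /*
     * childXids kept as a sorted, dynamically grown array, so that
     * TransactionIdIsCurrentTransactionId can binary-search it.
     */
    static bool
    ChildXidIsMine(TransactionId xid, TransactionId *childXids, int nChildXids)
    {
        int     low = 0;
        int     high = nChildXids - 1;

        while (low <= high)
        {
            int     mid = low + (high - low) / 2;

            if (TransactionIdEquals(childXids[mid], xid))
                return true;
            if (TransactionIdPrecedes(childXids[mid], xid))
                low = mid + 1;
            else
                high = mid - 1;
        }
        return false;
    }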

OTOH, there are probably enough other inefficiencies in handling large
numbers of subxact XIDs that speeding up TransactionIdIsCurrentTransactionId
might be a useless exercise. It would be good to profile a test case
before spending much effort here.

regards, tom lane

#46Gregory Stark
stark@enterprisedb.com
In reply to: Tom Lane (#45)
Re: CSStorm occurred again by postgreSQL8.2

Tom Lane <tgl@sss.pgh.pa.us> writes:

--- and because the entries are surely added in increasing XID order,
such an array could be binary-searched.  

If they're only added if they write to disk then isn't it possible to add them
out of order? Start a child transaction, start a child of that one and write
to disk, then exit the grandchild and write to disk in the first child? I'm
just going on your description, I'm not familiar with this part of the code at
all.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

#47Tom Lane
tgl@sss.pgh.pa.us
In reply to: Gregory Stark (#46)
Re: CSStorm occurred again by postgreSQL8.2

Gregory Stark <stark@enterprisedb.com> writes:

Tom Lane <tgl@sss.pgh.pa.us> writes:

--- and because the entries are surely added in increasing XID order,
such an array could be binary-searched.  

If they're only added if they write to disk then isn't it possible to add them
out of order? Start a child transaction, start a child of that one and write
to disk, then exit the grandchild and write to disk in the first
child?

No, because we enforce child XID > parent XID. In the case above, the
child xact would be given an XID when the grandchild needs one --- see
recursion in AssignSubTransactionId(). The actually slightly shaky
assumption above is that children of the same parent xact must subcommit
in numerical order ... but as long as we have strict nesting of subxacts
I think this must be so.
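
That recursion is, in outline (a simplified sketch, omitting the
surrounding bookkeeping):

    static void
    AssignSubTransactionId(TransactionState s)
    {
        Assert(s->parent != NULL);

        /* make sure the parent has an XID before we take one ourselves */
        if (!TransactionIdIsValid(s->parent->transactionId))
            AssignSubTransactionId(s->parent);

        /* our XID is therefore always greater than the parent's */
        s->transactionId = GetNewTransactionId(true);

        SubTransSetParent(s->transactionId, s->parent->transactionId);
    }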

regards, tom lane

#48Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#39)
Re: CSStorm occurred again by postgreSQL8.2

On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:

Anyway, given that there's this one nonobvious gotcha, there might be
others. My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

Well, it's still 8.3, just...

As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
and Scalability", XidCacheRemoveRunningXids() is now the only holder of
an X lock during normal processing, so I would like to remove it.
Here's how:

Currently, we take the lock, remove the subxact and then shuffle down
all the other subxactIds so that the subxact cache is contiguous.

I propose that we simply zero out the subxact entry without re-arranging
the cache; this will be atomic, so we need not acquire an X lock. We
then increment ndeletedxids. When we enter a new subxact into the cache,
if ndeletedxids > 0 we scan the cache to find an InvalidTransactionId
that we can use, then decrement ndeletedxids. So ndeletedxids is just a
hint, not an absolute requirement. nxids then becomes the number of
cache entries and never goes down until EOXact. The subxact cache is no
longer in order, but then it doesn't need to be either.
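
In code terms, roughly (a sketch only; ndeletedxids is the proposed new
field, and overflow handling is omitted):

    /* subxact abort: no exclusive ProcArrayLock needed */
    static void
    SubxidCacheRemove(PGPROC *proc, int slot)
    {
        /* storing a TransactionId is assumed atomic, so readers see either
         * the old XID or InvalidTransactionId, never garbage */
        proc->subxids.xids[slot] = InvalidTransactionId;
        proc->subxids.ndeletedxids++;       /* just a hint */
    }

    /* new subxact: reuse a zeroed slot if the hint says one exists */
    static void
    SubxidCacheAdd(PGPROC *proc, TransactionId xid)
    {
        if (proc->subxids.ndeletedxids > 0)
        {
            int     i;

            for (i = 0; i < proc->subxids.nxids; i++)
            {
                if (proc->subxids.xids[i] == InvalidTransactionId)
                {
                    proc->subxids.xids[i] = xid;
                    proc->subxids.ndeletedxids--;
                    return;
                }
            }
        }
        /* nxids only grows until end of top-level transaction */
        proc->subxids.xids[proc->subxids.nxids++] = xid;
    }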

When we take a snapshot we will end up taking a copy of zeroed cache
entries, so the snapshots will be slightly larger than previously.
Though still no larger than the max. The size reduction was not so large
as to make a significant difference across the whole array, so
scalability is the main issue to resolve.

The snapshots will be valid with no change, since InvalidTransactionId
will never match against any recorded Xid.

I would also like to make the size of the subxact cache configurable
with a parameter such as subtransaction_cache_size = 64 (default), valid
range 4-256.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#49Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#48)
Re: CSStorm occurred again by postgreSQL8.2

Simon Riggs <simon@2ndquadrant.com> writes:

As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
and Scalability", XidCacheRemoveRunningXids() is now the only holder of
an X lock during normal processing,

Nonsense. Main transaction exit also takes an exclusive lock, and is
far more likely to be exercised in typical workloads than a
subtransaction abort.

In any case: there has still not been any evidence presented by anyone
that optimizing XidCacheRemoveRunningXids will help one bit. Given the
difficulty of measuring any benefit from the last couple of
optimizations in this general area, I'm thinking that such evidence
will be hard to come by. And we have got way more than enough on our
plates already. Can we let go of this for 8.3, please?

regards, tom lane

#50Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#49)
Re: CSStorm occurred again by postgreSQL8.2

On Tue, 2007-09-11 at 09:58 -0400, Tom Lane wrote:

Can we let go of this for 8.3, please?

OK, we've moved forward, so it's a good place to break.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

#51Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#48)
Re: CSStorm occurred again by postgreSQL8.2

Is this a TODO? Tom's reply was:

Nonsense. Main transaction exit also takes an exclusive lock, and is
far more likely to be exercised in typical workloads than a
subtransaction abort.

In any case: there has still not been any evidence presented by anyone
that optimizing XidCacheRemoveRunningXids will help one bit. Given the
difficulty of measuring any benefit from the last couple of
optimizations in this general area, I'm thinking that such evidence
will be hard to come by. And we have got way more than enough on our
plates already. Can we let go of this for 8.3, please?

---------------------------------------------------------------------------

Simon Riggs wrote:

On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:

Anyway, given that there's this one nonobvious gotcha, there might be
others. My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

Well, it's still 8.3, just...

As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
and Scalability", XidCacheRemoveRunningXids() is now the only holder of
an X lock during normal processing, so I would like to remove it.
Here's how:

Currently, we take the lock, remove the subxact and then shuffle down
all the other subxactIds so that the subxact cache is contiguous.

I propose that we simply zero out the subxact entry without re-arranging
the cache; this will be atomic, so we need not acquire an X lock. We
then increment ndeletedxids. When we enter a new subxact into the cache,
if ndeletedxids > 0 we scan the cache to find an InvalidTransactionId
that we can use, then decrement ndeletedxids. So ndeletedxids is just a
hint, not an absolute requirement. nxids then becomes the number of
cache entries and never goes down until EOXact. The subxact cache is no
longer in order, but then it doesn't need to be either.

When we take a snapshot we will end up taking a copy of zeroed cache
entries, so the snapshots will be slightly larger than previously.
Though still no larger than the max. The size reduction was not so large
as to make a significant difference across the whole array, so
scalability is the main issue to resolve.

The snapshots will be valid with no change, since InvalidTransactionId
will never match against any recorded Xid.

I would also like to make the size of the subxact cache configurable
with a parameter such as subtransaction_cache_size = 64 (default), valid
range 4-256.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#52Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#51)
Re: CSStorm occurred again by postgreSQL8.2

On Wed, 2008-03-12 at 20:13 -0400, Bruce Momjian wrote:

Is this a TODO? Tom's reply was:

The general topic, yes. The caveats still apply.

Nonsense. Main transaction exit also takes an exclusive lock, and is
far more likely to be exercised in typical workloads than a
subtransaction abort.

In any case: there has still not been any evidence presented by anyone
that optimizing XidCacheRemoveRunningXids will help one bit. Given the
difficulty of measuring any benefit from the last couple of
optimizations in this general area, I'm thinking that such evidence
will be hard to come by. And we have got way more than enough on our
plates already. Can we let go of this for 8.3, please?

---------------------------------------------------------------------------

Simon Riggs wrote:

On Wed, 2006-09-13 at 21:45 -0400, Tom Lane wrote:

Anyway, given that there's this one nonobvious gotcha, there might be
others. My recommendation is that we take this off the open-items list
for 8.2 and revisit it in the 8.3 cycle when there's more time.

Well, it's still 8.3, just...

As discussed in the other thread "Final Thoughts for 8.3 on LWLocking
and Scalability", XidCacheRemoveRunningXids() is now the only holder of
an X lock during normal processing, so I would like to remove it.
Here's how:

Currently, we take the lock, remove the subxact and then shuffle down
all the other subxactIds so that the subxact cache is contiguous.

I propose that we simply zero out the subxact entry without re-arranging
the cache; this will be atomic, so we need not acquire an X lock. We
then increment ndeletedxids. When we enter a new subxact into the cache,
if ndeletedxids > 0 we scan the cache to find an InvalidTransactionId
that we can use, then decrement ndeletedxids. So ndeletedxids is just a
hint, not an absolute requirement. nxids then becomes the number of
cache entries and never goes down until EOXact. The subxact cache is no
longer in order, but then it doesn't need to be either.

When we take a snapshot we will end up taking a copy of zeroed cache
entries, so the snapshots will be slightly larger than previously.
Though still no larger than the max. The size reduction was not so large
as to make a significant difference across the whole array, so
scalability is the main issue to resolve.

The snapshots will be valid with no change, since InvalidTransactionId
will never match against any recorded Xid.

I would also like to make the size of the subxact cache configurable
with a parameter such as subtransaction_cache_size = 64 (default), valid
range 4-256.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk