poor performance with Context Switch Storm at TPC-W.
Hi, all.
One of our customers has run into a problem: poor performance caused by
a context switch storm (CSStorm), with the configuration shown below.
Usually, CS is about 5000 and WIPS is about 360.
When the CSStorm occurs, CS rises to about 100000 and WIPS drops to 60
or less.
(WIPS = number of web interactions per second; CS = context switches
per second)
We are investigating with a patch that collects LWLock statistics.
I suspected contention on BufMappingLock, but looking at the collected
results, the onset of the CSStorm does not correspond to any increase
in the BufMappingLock counts.
Instead, SubtransControlLock and SubTrans were increasing.
I do not understand what is causing the CSStorm.
[DB server]*1
Intel Xeon 3.0GHz*4(2CPU * H/T ON)
4GB Memory
Red Hat Enterprise Linux ES release 4(Nahant Update 3)
Linux version 2.6.9-34.ELsmp
PostgreSQL 8.1.3 (the problem also occurred on 8.2 HEAD as of 6/15)
shared_buffers=131072
temp_buffers=1000
max_connections=300
[AP server]*2
200 connection pooling.
TPC-W model workload
[Client]*4
TPC-W model workload
(1)
We have read the following discussion:
http://archives.postgresql.org/pgsql-hackers/2006-05/msg01003.php
From: Tom Lane <tgl ( at ) sss ( dot ) pgh ( dot ) pa ( dot ) us>
To: josh ( at ) agliodbs ( dot ) com
Subject: Re: Further reduction of bufmgr lock contention
Date: Wed, 24 May 2006 15:25:26 -0400
If there is a patch or a technique for investigating this,
could someone point me to it?
(2)
A lot of sequential scanning seems to be going on during the CSStorm.
When a tuple is read, its visibility must be checked, and that check
seems to involve a subtransaction lookup for every transaction.
How likely is it that the LWLock protecting the subtransaction log is
causing the CSStorm?
best regards.
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
"Katsuhiko Okano" <okano.katsuhiko@oss.ntt.co.jp> wrote
The problem has occurred in my customer.
poor performance with Context Switch Storm occurred
with the following composition.
Usually, CS is about 5000, WIPS=360.
when CSStorm occurrence, CS is about 100000, WIPS=60 or less.
Intel Xeon 3.0GHz*4(2CPU * H/T ON)
4GB Memory
Do you have bgwriter on, and what are its parameters? I read a theory
somewhere that bgwriter scans a large portion of memory and causes
L1/L2 thrashing, so with HT on, the other backends sharing the physical
processor with it also get thrashed ... So try turning bgwriter off, or
turning HT off, and see what the difference is.
Regards,
Qingqing
Katsuhiko Okano wrote:
I suspected contention on BufMappingLock, but looking at the collected
results, the onset of the CSStorm does not correspond to any increase
in the BufMappingLock counts.
Instead, SubtransControlLock and SubTrans were increasing.
I do not understand what is causing the CSStorm.
Please see this thread:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
(actually it's a single message AFAICT)
This was applied on the 8.2dev code, so I'm surprised that 8.2dev
behaves the same as 8.1.
Does your problem have any relationship to what's described there?
I also wondered whether the problem may be that the number of SLRU
buffers we use for subtrans is too low. But the number was increased
from the default 8 to 32 in 8.2dev as well. Maybe you could try
increasing that even further; say 128 and see if the problem is still
there. (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS).
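For reference, the change suggested here is just a compile-time constant bump; in src/include/access/subtrans.h it would look roughly like this (sketch only; the surrounding context differs between versions):

```c
/* Number of SLRU buffers to use for subtrans;
 * raised from 32 as an experiment, per the suggestion above. */
#define NUM_SUBTRANS_BUFFERS	128
```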
--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.
Katsuhiko,
Have you tried turning HT off? HT is not generally considered (even by
Intel) a good idea for database applications.
--Josh
hello.
Do you have bgwriter on, and what are its parameters? I read a theory
somewhere that bgwriter scans a large portion of memory and causes
L1/L2 thrashing, so with HT on, the other backends sharing the physical
processor with it also get thrashed ... So try turning bgwriter off, or
turning HT off, and see what the difference is.
bgwriter is ON.
at postgresql.conf:
# - Background writer -
bgwriter_delay = 200 # 10-10000 milliseconds between rounds
bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round
I tried turning H/T off, but the CSStorm still occurred.
Usually, CS is about 5000.
When the CSStorm occurs, CS is about 70000.
(CS is lower than with H/T on; I think that is because CPU throughput
dropped.)
Regards
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Hi.
Alvaro Herrera wrote:
Katsuhiko Okano wrote:
I suspected contention on BufMappingLock, but looking at the collected
results, the onset of the CSStorm does not correspond to any increase
in the BufMappingLock counts.
Instead, SubtransControlLock and SubTrans were increasing.
I do not understand what is causing the CSStorm.
Please see this thread:
http://archives.postgresql.org/pgsql-hackers/2005-11/msg01547.php
(actually it's a single message AFAICT)
This was applied on the 8.2dev code, so I'm surprised that 8.2dev
behaves the same as 8.1.
Does your problem have any relationship to what's described there?
It is probably related.
I cannot tell whether the locking method itself is bad, or whether the
number of LRU buffers is simply too small.
I also wondered whether the problem may be that the number of SLRU
buffers we use for subtrans is too low. But the number was increased
from the default 8 to 32 in 8.2dev as well. Maybe you could try
increasing that even further; say 128 and see if the problem is still
there. (src/include/access/subtrans.h, NUM_SUBTRANS_BUFFERS).
On PostgreSQL 8.2 I changed NUM_SUBTRANS_BUFFERS to 128, recompiled,
and measured again.
The CSStorm did not occur, and WIPS was about 400
(although WIPS dropped to about 320 at intervals of 4 to 6 minutes).
If the number of SLRU buffers is what is too low, then increasing the
buffer count on PostgreSQL 8.1.4 should give the same result.
(The buffers for CLOG and multixacts also increase, but I think that
effect is small.)
I have now set NUM_SLRU_BUFFERS to 128 on PostgreSQL 8.1.4 and the
measurement is in progress.
regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Katsuhiko Okano wrote:
By PostgreSQL8.2, NUM_SUBTRANS_BUFFERS was changed into 128
and recompile and measured again.
NOT occurrence of CSStorm. The value of WIPS was about 400.
I measured again.
The CSStorm did not occur in a 30-minute run, but in a 3-hour run it
occurred after 1 hour and 10 minutes.
Increasing NUM_SUBTRANS_BUFFERS does not solve the problem; it only
postpones it.
If the number of SLRU buffers is too low,
also in PostgreSQL8.1.4, if the number of buffers is increased
I think that the same result is brought.
(Although the buffer of CLOG or a multi-transaction also increases,
I think that effect is small)
Now, NUM_SLRU_BUFFERS is changed into 128 in PostgreSQL8.1.4
and is under measurement.
On 8.1.4 the CSStorm likewise occurred after 1 hour and 10 minutes.
One strange point:
the LWLock counts for the LRU buffers are zero until the CSStorm
occurs. Once it occurs, both share and exclusive locks are requested
and most of them are kept waiting
(the exclusive lock count for SubtransControlLock rises rapidly after
the CSStorm starts).
Is different processing being done depending on whether the CSStorm
has occurred or not?
regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
It does not solve, even if it increases the number of NUM_SUBTRANS_BUFFERS.
The problem was only postponed.
Can you provide a reproducible test case for this?
regards, tom lane
"Tom Lane <tgl@sss.pgh.pa.us>" wrote:
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
It does not solve, even if it increases the number of NUM_SUBTRANS_BUFFERS.
The problem was only postponed.
Can you provide a reproducible test case for this?
Seven machines are required to reproduce the measurement
(DB*1, AP*2, Client*4).
Two machines (DB*1, {AP+CL}*1) could not generate enough workload.
I could not reproduce it with a multi-client pgbench run
or with simple SELECT queries.
The TPC-W workload tool used this time was written from scratch;
regrettably it cannot be released to the public.
If there is a freely licensed workload tool, I would like to try it.
I will provide any further information that others need.
This time we used a patch that counts LWLock acquisitions.
The following is an older example of its output, FYI.
# SELECT * FROM pg_stat_lwlocks;
kind | pg_stat_get_lwlock_name | sh_call | sh_wait | ex_call | ex_wait | sleep
------+----------------------------+------------+-----------+-----------+-----------+-------
0 | BufMappingLock | 559375542 | 33542 | 320092 | 24025 | 0
1 | BufFreelistLock | 0 | 0 | 370709 | 47 | 0
2 | LockMgrLock | 0 | 0 | 41718885 | 734502 | 0
3 | OidGenLock | 33 | 0 | 0 | 0 | 0
4 | XidGenLock | 12572279 | 10095 | 11299469 | 20089 | 0
5 | ProcArrayLock | 8371330 | 72052 | 16965667 | 603294 | 0
6 | SInvalLock | 38822428 | 435 | 25917 | 128 | 0
7 | FreeSpaceLock | 0 | 0 | 16787 | 4 | 0
8 | WALInsertLock | 0 | 0 | 1239911 | 885 | 0
9 | WALWriteLock | 0 | 0 | 69907 | 5589 | 0
10 | ControlFileLock | 0 | 0 | 16686 | 1 | 0
11 | CheckpointLock | 0 | 0 | 34 | 0 | 0
12 | CheckpointStartLock | 69509 | 0 | 34 | 1 | 0
13 | CLogControlLock | 0 | 0 | 236763 | 183 | 0
14 | SubtransControlLock | 0 | 0 | 753773945 | 205273395 | 0
15 | MultiXactGenLock | 66 | 0 | 0 | 0 | 0
16 | MultiXactOffsetControlLock | 0 | 0 | 35 | 0 | 0
17 | MultiXactMemberControlLock | 0 | 0 | 34 | 0 | 0
18 | RelCacheInitLock | 0 | 0 | 0 | 0 | 0
19 | BgWriterCommLock | 0 | 0 | 61457 | 1 | 0
20 | TwoPhaseStateLock | 33 | 0 | 0 | 0 | 0
21 | TablespaceCreateLock | 0 | 0 | 0 | 0 | 0
22 | BufferIO | 0 | 0 | 695627 | 16 | 0
23 | BufferContent | 3568231805 | 1897 | 1361394 | 829 | 0
24 | CLog | 0 | 0 | 0 | 0 | 0
25 | SubTrans | 138571621 | 143208883 | 8122181 | 8132646 | 0
26 | MultiXactOffset | 0 | 0 | 0 | 0 | 0
27 | MultiXactMember | 0 | 0 | 0 | 0 | 0
(28 rows)
I would be pleased if this is of interest.
regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Katsuhiko Okano wrote:
"Tom Lane <tgl@sss.pgh.pa.us>" wrote:
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
It does not solve, even if it increases the number of NUM_SUBTRANS_BUFFERS.
The problem was only postponed.
Can you provide a reproducible test case for this?
Seven machines are required in order to perform measurement.
(DB*1,AP*2,CLient*4)
Enough work load was not able to be given in two machines.
(DB*1,{AP+CL}*1)
It was not able to reappear to a multiplex run of pgbench
or a simple SELECT query.
TPC-W of a work load tool used this time is a full scratch.
Regrettably it cannot open to the public.
If there is a work load tool of a free license, I would like to try.
FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/
Stefan
Hi folks,
From: Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>
Subject: Re: CSStorm occurred again by postgreSQL8.2. (Re: [HACKERS] poor
Date: Wed, 19 Jul 2006 12:53:53 +0200
Katsuhiko Okano wrote:
"Tom Lane <tgl@sss.pgh.pa.us>" wrote:
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
It does not solve, even if it increases the number of NUM_SUBTRANS_BUFFERS.
The problem was only postponed.
Can you provide a reproducible test case for this?
Seven machines are required in order to perform measurement.
(DB*1,AP*2,CLient*4)
Enough work load was not able to be given in two machines.
(DB*1,{AP+CL}*1)
It was not able to reappear to a multiplex run of pgbench
or a simple SELECT query.
TPC-W of a work load tool used this time is a full scratch.
Regrettably it cannot open to the public.
If there is a work load tool of a free license, I would like to try.
FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/
FYI(2):
There is one more (pseudo) TPC-W implementation by OSDL.
http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/
One more comment: Katsuhiko's team is using their own TPC-W-like
benchmark suite, and he cannot make it public.
Also, his point is that he tried to reproduce the CSS phenomenon using
pgbench and a program issuing many concurrent SELECT queries on a
single table, but neither worked well for reproducing the CSS.
Regards,
Masanori
---
Masanori ITOH NTT OSS Center, Nippon Telegraph and Telephone Corporation
e-mail: ito.masanori@oss.ntt.co.jp
phone : +81-3-5860-5015
On Fri, Jul 14, 2006 at 02:58:36PM +0900, Katsuhiko Okano wrote:
NOT occurrence of CSStorm. The value of WIPS was about 400.
(but the value of WIPS fell about to 320 at intervals of 4 to 6 minutes.)
If you haven't changed checkpoint timeout, this drop-off every 4-6
minutes indicates that you need to make the bgwriter more aggressive.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
If there is a work load tool of a free license, I would like to try.
FYI: there is a free tpc-w implementation done by Jan available at:
http://pgfoundry.org/projects/tpc-w-php/
FYI(2):
There is one more (pseudo) TPC-W implementation by OSDL.
http://www.osdl.org/lab_activities/kernel_testing/osdl_database_test_suite/osdl_dbt-1/
Thank you for the information.
I'll try it.
Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
"Jim C. Nasby" wrote:
If you haven't changed checkpoint timeout, this drop-off every 4-6
minutes indicates that you need to make the bgwriter more aggressive.
I will pass this advice on when proposing and explaining the
configuration to the customer.
Thank you for the information.
Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Hi hackers,
I tackled the SUBTRANS performance problem together with Okano.
We have reached the conclusion that the SubTrans log is read heavily
under a specific access pattern in my TPC-W implementation: there is
enormous traffic to SUBTRANS when checking tuple visibility in
HeapTupleSatisfiesSnapshot().
I'll report more details later.
BTW, I wrote a patch that collects statistics on lightweight locks for
analysis. We already have the trace_lwlocks option, but this patch
collects the statistics with less overhead. The following is output
from the patch (on 8.1).
If there is interest in the feature, I'll port it to HEAD and post it.
# SELECT * FROM pg_stat_lwlocks;
kind | pg_stat_get_lwlock_name | sh_call | sh_wait | ex_call | ex_wait |
------+----------------------------+------------+-----------+-----------+-----------+-
0 | BufMappingLock | 559375542 | 33542 | 320092 | 24025 |
1 | BufFreelistLock | 0 | 0 | 370709 | 47 |
2 | LockMgrLock | 0 | 0 | 41718885 | 734502 |
3 | OidGenLock | 33 | 0 | 0 | 0 |
4 | XidGenLock | 12572279 | 10095 | 11299469 | 20089 |
5 | ProcArrayLock | 8371330 | 72052 | 16965667 | 603294 |
6 | SInvalLock | 38822428 | 435 | 25917 | 128 |
7 | FreeSpaceLock | 0 | 0 | 16787 | 4 |
8 | WALInsertLock | 0 | 0 | 1239911 | 885 |
9 | WALWriteLock | 0 | 0 | 69907 | 5589 |
10 | ControlFileLock | 0 | 0 | 16686 | 1 |
11 | CheckpointLock | 0 | 0 | 34 | 0 |
12 | CheckpointStartLock | 69509 | 0 | 34 | 1 |
13 | CLogControlLock | 0 | 0 | 236763 | 183 |
14 | SubtransControlLock | 0 | 0 | 753773945 | 205273395 |
15 | MultiXactGenLock | 66 | 0 | 0 | 0 |
16 | MultiXactOffsetControlLock | 0 | 0 | 35 | 0 |
17 | MultiXactMemberControlLock | 0 | 0 | 34 | 0 |
18 | RelCacheInitLock | 0 | 0 | 0 | 0 |
19 | BgWriterCommLock | 0 | 0 | 61457 | 1 |
20 | TwoPhaseStateLock | 33 | 0 | 0 | 0 |
21 | TablespaceCreateLock | 0 | 0 | 0 | 0 |
22 | BufferIO | 0 | 0 | 695627 | 16 |
23 | BufferContent | 3568231805 | 1897 | 1361394 | 829 |
24 | CLog | 0 | 0 | 0 | 0 |
25 | SubTrans | 138571621 | 143208883 | 8122181 | 8132646 |
26 | MultiXactOffset | 0 | 0 | 0 | 0 |
27 | MultiXactMember | 0 | 0 | 0 | 0 |
(28 rows)
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Hi,
Here is a patch to collect statistics on LWLocks.
The following information is available in the pg_stat_lwlocks view:
- sh_call : number of share-mode acquisitions
- sh_wait : number of times blocked waiting for share mode
- ex_call : number of exclusive-mode acquisitions
- ex_wait : number of times blocked waiting for exclusive mode
This feature is for developers, so it is available only if LWLOCK_STAT
is defined (off by default). Otherwise, stub functions are installed.
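As an illustration of the idea, the four counters per lock could be kept in a tiny struct like the one below. This is only a sketch with made-up names, not the posted patch:

```c
#include <stdint.h>

/* Hypothetical per-lock counters, mirroring the columns of the
 * proposed pg_stat_lwlocks view.  Names are illustrative only. */
typedef struct LWLockStat
{
    uint64_t sh_call;           /* share-mode acquisitions */
    uint64_t sh_wait;           /* times blocked waiting for share mode */
    uint64_t ex_call;           /* exclusive-mode acquisitions */
    uint64_t ex_wait;           /* times blocked waiting for exclusive mode */
} LWLockStat;

/* Bump the counters on each acquire: "blocked" means the caller had
 * to sleep at least once before obtaining the lock. */
void
lwstat_record(LWLockStat *st, int exclusive, int blocked)
{
    if (exclusive)
    {
        st->ex_call++;
        if (blocked)
            st->ex_wait++;
    }
    else
    {
        st->sh_call++;
        if (blocked)
            st->sh_wait++;
    }
}
```

The real patch would update such counters inside LWLockAcquire and expose them through set-returning functions backing the view.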
There is room for discussion:
- Should we skip collecting the information for the shared-buffer
  locks? If so, we can reduce the memory consumed by the stats.
- Information for LWLockConditionalAcquire():
  this is not gathered now because of the size limitation of the
  LWLockStat struct; I intended the total size of LWLock to stay under
  64 bytes.
- Information for spinlocks:
  the number of sleeps in s_lock() might be useful for analyzing
  spinlock-level lock contention.
- Documentation:
  where should a developers-only feature like this be documented?
...
Comments welcome.
# SELECT * FROM pg_stat_lwlocks;
kind | lwlock | sh_call | sh_wait | ex_call | ex_wait
------+----------------------------+---------+---------+---------+---------
0 | BufFreelistLock | 0 | 0 | 1237 | 0
1 | ShmemIndexLock | 0 | 0 | 437 | 0
2 | OidGenLock | 0 | 0 | 0 | 0
3 | XidGenLock | 80086 | 9 | 8174 | 0
4 | ProcArrayLock | 162491 | 59 | 16231 | 38
5 | SInvalLock | 163423 | 0 | 180 | 0
6 | FreeSpaceLock | 0 | 0 | 399 | 0
7 | WALInsertLock | 0 | 0 | 67214 | 247
8 | WALWriteLock | 0 | 0 | 8028 | 12
9 | ControlFileLock | 0 | 0 | 16 | 0
10 | CheckpointLock | 0 | 0 | 0 | 0
11 | CheckpointStartLock | 8028 | 0 | 0 | 0
12 | CLogControlLock | 22449 | 9 | 8028 | 4
13 | SubtransControlLock | 32731 | 0 | 4 | 0
14 | MultiXactGenLock | 0 | 0 | 0 | 0
15 | MultiXactOffsetControlLock | 0 | 0 | 0 | 0
16 | MultiXactMemberControlLock | 0 | 0 | 0 | 0
17 | RelCacheInitLock | 0 | 0 | 0 | 0
18 | BgWriterCommLock | 0 | 0 | 1185 | 0
19 | TwoPhaseStateLock | 0 | 0 | 0 | 0
20 | TablespaceCreateLock | 0 | 0 | 0 | 0
21 | BtreeVacuumLock | 0 | 0 | 12 | 0
22 | BufMappingLock | 322414 | 1 | 315 | 0
23 | LockMgrLock | 0 | 0 | 207585 | 6927
24 | BufferIO | 0 | 0 | 2295 | 0
25 | BufferContent | 522714 | 362 | 83375 | 324
26 | CLogBuffer | 0 | 0 | 0 | 0
27 | SubTransBuffer | 0 | 0 | 0 | 0
28 | MultiXactOffsetBuffer | 0 | 0 | 0 | 0
29 | MultiXactMemberBuffer | 0 | 0 | 0 | 0
(30 rows)
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
Attachments:
pg_stat_lwlocks.patch (application/octet-stream, +370 -29)
Hi, all.
The cause of the CSStorm problem from the previous mails has been
found, and a provisional patch resolves it; here is the report.
Subject: [HACKERS] poor performance with Context Switch Storm at TPC-W.
Date: Tue, 11 Jul 2006 20:09:24 +0900
From: Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp>
poor performance with Context Switch Storm occurred
with the following composition.
Background :
SAVEPOINT has been supported since PostgreSQL 8.0.
Every transaction can contain one or more subtransactions.
When judging the visibility of a tuple, we must determine whether the
XID that inserted it belongs to a top-level transaction or to a
subtransaction (if its XMIN is committed).
To make that judgment it is necessary to consult SubTrans
(the data structure that records the parent of each transaction ID).
SubTrans is accessed through an LRU buffer.
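As a rough illustration of why this lookup is hot: finding the top-level transaction means following parent links one XID at a time, and each step is a SubTrans (SLRU) access under SubtransControlLock. A toy version, with an in-memory array standing in for pg_subtrans (the two function names are borrowed from the real code; everything else is made up):

```c
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Toy stand-in for pg_subtrans: parent_of[xid] is xid's parent, with
 * 0 meaning "no parent" (a top-level transaction).  In the server,
 * each of these lookups goes through the SLRU buffers. */
TransactionId parent_of[16];

TransactionId
SubTransGetParent(TransactionId xid)
{
    return parent_of[xid];
}

/* Walk up the parent chain until we reach the top-level transaction,
 * as the visibility check must do for every XMIN that might belong
 * to a subtransaction. */
TransactionId
SubTransGetTopmostTransaction(TransactionId xid)
{
    TransactionId prev = xid;

    while (xid != InvalidTransactionId)
    {
        prev = xid;
        xid = SubTransGetParent(xid);
    }
    return prev;
}
```

With many backends doing this walk for every tuple of a sequential scan, the traffic on the SubTrans SLRU adds up quickly.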
Occurrence conditions :
The phenomenon occurs when the following are present together:
- transactions that read tuples in large volume and at high frequency
  (typically sequential scans);
- updating transactions (at a moderate rate);
- a long-lived transaction (of some length).
Points of view :
(A) The buffer replacement algorithm is bad.
The page being swapped out does not get a fresh timestamp until its
swap-out completes. If other pages are accessed while that swap is in
progress, the swapped-out page still appears to be the least recently
used, so it is chosen for swap-out yet again.
(When the workload is high.)
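A toy model of the effect in (A) (a sketch, not the real slru.c code): the victim is whichever slot has the largest LRU count, and a slot whose write is still in progress keeps its stale count, so it wins the selection over and over:

```c
#define NSLOTS 4

/* Pick the slot with the largest lru_count, like the naive selector.
 * Nothing here excludes a slot whose I/O is still in progress, so a
 * busy page whose count is never refreshed is chosen repeatedly. */
int
naive_victim(const unsigned lru_count[])
{
    int best = 0;
    unsigned bestcount = 0;
    int i;

    for (i = 0; i < NSLOTS; i++)
    {
        if (lru_count[i] > bestcount)
        {
            best = i;
            bestcount = lru_count[i];
        }
    }
    return best;
}
```

Every backend that runs this while the chosen page's write is still pending lands on the same slot, which is what makes them all pile up.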
(B) SubTrans is accessed on every tuple visibility judgment, which is
very frequent. If many processes wait for the LWLock using semop(),
a CSStorm occurs.
Result :
For (A), I created a patch under which a page with read/write I/O in
progress is not made a replacement candidate.
(It keeps a "betterslot" for the case where every page is in
progress.)
The patch was applied; however, the storm recurred, so it is not a
fundamental solution.
For (B), a patch was created that treats all transactions as top-level
transactions. (Thank you, ITAGAKI.) With that patch applied, we
measured for 8 hours and the CSStorm problem was gone.
Discussion :
(1) Since neither SAVEPOINT nor PL/pgSQL error traps are used here,
the subtransactions are unnecessary.
Would it be better to implement a mode that does not use
subtransactions?
(2) It would be better if the lookups could be cached in a structure
like CLOG's, so that the LRU buffer need not be consulted on every
occasion.
Are there problems with these, or other ideas?
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
Katsuhiko Okano wrote:
The cause of the CSStorm problem from the previous mails has been
found, and a provisional patch resolves it; here is the report.
(snip)
(A) The buffer replacement algorithm is bad.
The page being swapped out does not get a fresh timestamp until its
swap-out completes. If other pages are accessed while that swap is in
progress, the swapped-out page still appears to be the least recently
used, so it is chosen for swap-out yet again.
(When the workload is high.)
The following is the patch.
diff -cpr postgresql-8.1.4-orig/src/backend/access/transam/slru.c postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c
*** postgresql-8.1.4-orig/src/backend/access/transam/slru.c 2006-01-21 13:38:27.000000000 +0900
--- postgresql-8.1.4-SlruSelectLRUPage-fix/src/backend/access/transam/slru.c 2006-07-25 18:02:49.000000000 +0900
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 703,710 ****
for (;;)
{
int slotno;
! int bestslot = 0;
unsigned int bestcount = 0;
/* See if page already has a buffer assigned */
for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
--- 703,712 ----
for (;;)
{
int slotno;
! int bestslot = -1;
! int betterslot = -1;
unsigned int bestcount = 0;
+ unsigned int bettercount = 0;
/* See if page already has a buffer assigned */
for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 720,732 ****
*/
for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
{
! if (shared->page_status[slotno] == SLRU_PAGE_EMPTY)
! return slotno;
! if (shared->page_lru_count[slotno] > bestcount &&
! shared->page_number[slotno] != shared->latest_page_number)
! {
! bestslot = slotno;
! bestcount = shared->page_lru_count[slotno];
}
}
--- 722,746 ----
*/
for (slotno = 0; slotno < NUM_SLRU_BUFFERS; slotno++)
{
! switch (shared->page_status[slotno])
! {
! case SLRU_PAGE_EMPTY:
! return slotno;
! case SLRU_PAGE_READ_IN_PROGRESS:
! case SLRU_PAGE_WRITE_IN_PROGRESS:
! if (shared->page_lru_count[slotno] > bettercount &&
! shared->page_number[slotno] != shared->latest_page_number)
! {
! betterslot = slotno;
! bettercount = shared->page_lru_count[slotno];
! }
! default: /* SLRU_PAGE_CLEAN,SLRU_PAGE_DIRTY */
! if (shared->page_lru_count[slotno] > bestcount &&
! shared->page_number[slotno] != shared->latest_page_number)
! {
! bestslot = slotno;
! bestcount = shared->page_lru_count[slotno];
! }
}
}
*************** SlruSelectLRUPage(SlruCtl ctl, int pagen
*** 736,741 ****
--- 750,758 ----
if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
return bestslot;
+ if (bestslot == -1)
+ bestslot = betterslot;
+
/*
* We need to do I/O. Normal case is that we have to write it out,
* but it's possible in the worst case to have selected a read-busy
Regards,
--------
Katsuhiko Okano
okano katsuhiko _at_ oss ntt co jp
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
Here is a patch to collect statistics of LWLocks.
This seems fairly invasive, as well as confused about whether it's an
#ifdef'able thing or not. You can't have system views and pg_proc
entries conditional on a compile-time #ifdef, so in a default build
we would have a lot of nonfunctional cruft exposed to users.
Do we really need this compared to the simplistic dump-to-stderr
counting support that's in there now? That stuff doesn't leave any
cruft behind when not enabled, and it has at least one significant
advantage over your proposal, which is that it's possible to get
per-process statistics when needed.
If I thought that average users would have a need for LWLock statistics,
I'd be more sympathetic to expending effort on a nice frontend for
viewing the statistics, but this is and always will be just a concern
for hardcore hackers ...
regards, tom lane
Katsuhiko Okano <okano.katsuhiko@oss.ntt.co.jp> writes:
(A) The buffer replacement algorithm is bad.
The page being swapped out does not get a fresh timestamp until its
swap-out completes. If other pages are accessed while that swap is in
progress, the swapped-out page still appears to be the least recently
used, so it is chosen for swap-out yet again.
(When the workload is high.)
The following is the patch.
I'm confused ... is this patch being proposed for inclusion? I
understood your previous message to say that it didn't help much.
The patch is buggy as posted, because it will try to do this:
if (shared->page_status[bestslot] == SLRU_PAGE_CLEAN)
return bestslot;
while bestslot could still be -1.
I see your concern about multiple processes selecting the same buffer
for replacement, but what will actually happen is that all but the first
will block for the first one's I/O to complete using SimpleLruWaitIO,
and then all of them will repeat the outer loop and recheck what to do.
If they were all trying to swap in the same page this is actually
optimal. If they were trying to swap in different pages then the losing
processes will again try to initiate I/O on a different buffer. (They
will pick a different buffer, because the guy who got the buffer will
have done SlruRecentlyUsed on it before releasing the control lock ---
so I don't believe the worry that we get a buffer thrash scenario here.
Look at the callers of SlruSelectLRUPage not just the function itself.)
It's possible that letting different processes initiate I/O on different
buffers would be a win, but it might just result in excess writes,
depending on the relative probability of requests for the same page
vs. requests for different pages.
Also, I think the patch as posted would still cause processes to gang up
on the same buffer, it would just be a different one from before. The
right thing would be to locate the overall-oldest buffer and return it
if clean; otherwise to initiate I/O on the oldest buffer that isn't
either clean or write-busy, if there is one; otherwise just do WaitIO
on the oldest buffer. This would ensure that different processes try
to push different buffers to disk. They'd still go back and make their
decisions from the top after doing their I/O. Whether this is a win or
not is not clear to me, but at least it would attack the guessed-at
problem correctly.
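The policy described above can be sketched as follows (illustrative types and names only, not the real slru.c code): take the overall-oldest buffer if it is clean; otherwise start I/O on the oldest dirty, not write-busy buffer; otherwise wait on the oldest buffer's I/O. Ages are assumed nonzero here:

```c
#define NSLOTS 4

enum { PAGE_CLEAN, PAGE_DIRTY, PAGE_WRITE_IN_PROGRESS };

typedef enum { ACT_RETURN, ACT_START_IO, ACT_WAIT_IO } Action;

/* Find the oldest slot, optionally restricted to one status
 * (want_status < 0 means "any status").  Returns -1 if none match. */
int
oldest_matching(const int status[], const unsigned age[], int want_status)
{
    int best = -1;
    unsigned bestage = 0;
    int i;

    for (i = 0; i < NSLOTS; i++)
        if ((want_status < 0 || status[i] == want_status) && age[i] > bestage)
        {
            best = i;
            bestage = age[i];
        }
    return best;
}

/* Three-tier choice: use the oldest buffer if clean; else write out
 * the oldest dirty (not write-busy) buffer; else wait on the oldest
 * buffer's I/O.  Different backends thus push different buffers. */
Action
choose(const int status[], const unsigned age[], int *slot)
{
    int oldest = oldest_matching(status, age, -1);

    if (status[oldest] == PAGE_CLEAN)
    {
        *slot = oldest;
        return ACT_RETURN;
    }
    *slot = oldest_matching(status, age, PAGE_DIRTY);
    if (*slot >= 0)
        return ACT_START_IO;
    *slot = oldest;
    return ACT_WAIT_IO;
}
```

After the I/O or wait, a backend would re-run the choice from the top, just as the existing outer loop in SlruSelectLRUPage does.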
regards, tom lane