Double partition lock in bufmgr
Hi hackers,
I am investigating an incident with one of our customers: the performance of
the system dropped dramatically.
Stack traces of all backends can be found here:
http://www.garret.ru/diag_20201217_102056.stacks_59644
(this file is 6MB, so I have not attached it to this mail).
What I see in these stack traces is that 642 backends are blocked in
LWLockAcquire,
mostly while obtaining a shared buffer mapping lock:
#0 0x00007f0e7fe7a087 in semop () from /lib64/libc.so.6
#1 0x0000000000682fb1 in PGSemaphoreLock
(sema=sema@entry=0x7f0e1c1f63a0) at pg_sema.c:387
#2 0x00000000006ed60b in LWLockAcquire (lock=lock@entry=0x7e8b6176d800,
mode=mode@entry=LW_SHARED) at lwlock.c:1338
#3 0x00000000006c88a7 in BufferAlloc (foundPtr=0x7ffcc3c8de9b "\001",
strategy=0x0, blockNum=997, forkNum=MAIN_FORKNUM, relpersistence=112
'p', smgr=0x2fb2df8) at bufmgr.c:1177
#4 ReadBuffer_common (smgr=0x2fb2df8, relpersistence=<optimized out>,
relkind=<optimized out>, forkNum=forkNum@entry=MAIN_FORKNUM,
blockNum=blockNum@entry=997, mode=RBM_NORMAL, strategy=0x0,
hit=hit@entry=0x7ffcc3c8df97 "") at bufmgr.c:894
#5 0x00000000006c928b in ReadBufferExtended (reln=0x32c7ed0,
forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=997,
mode=mode@entry=RBM_NORMAL, strategy=strategy@entry=0x0) at bufmgr.c:753
#6 0x00000000006c93ab in ReadBuffer (blockNum=<optimized out>,
reln=<optimized out>) at bufmgr.c:685
...
Only 11 of these 642 locks are unique.
Moreover, 358 backends are waiting for one lock and 183 for another.
There are two backends (pids 291121 and 285927) which are trying to
obtain an exclusive lock while already holding another exclusive lock,
and they block all the other backends.
This is the single place in bufmgr (and in Postgres) where a process tries to
hold locks on two buffer mapping partitions at once:
    /*
     * To change the association of a valid buffer, we'll need to have
     * exclusive lock on both the old and new mapping partitions.
     */
    if (oldFlags & BM_TAG_VALID)
    {
        ...
        /*
         * Must lock the lower-numbered partition first to avoid
         * deadlocks.
         */
        if (oldPartitionLock < newPartitionLock)
        {
            LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
            LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
        }
        else if (oldPartitionLock > newPartitionLock)
        {
            LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
        }
These two backends are blocked on the second lock request.
I read all the comments in bufmgr.c and the README file but did not find an
explanation of why we need to lock both partitions.
Why is it not possible to first free the old buffer (as is done in
InvalidateBuffer) and then repeat the attempt to allocate the buffer?
Yes, it may require more effort than just "grabbing" the buffer,
but in this case there is no need to hold two locks.
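Very roughly, I mean something along these lines inside the victim-selection
loop of BufferAlloc() (just a sketch of the idea, not working code):

    if (buf_state & BM_TAG_VALID)
    {
        /*
         * Drop the old mapping while holding only the old partition lock
         * (as InvalidateBuffer does) and then go pick a victim buffer
         * again, instead of taking both partition locks at once.
         */
        InvalidateBuffer(buf);      /* releases the buffer header spinlock */
        continue;
    }

    /*
     * Here the buffer is not mapped anywhere, so only the new partition
     * lock is needed to insert the new buffer tag.
     */
    LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);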
I wonder whether somebody has faced similar symptoms in the past, and whether
this problem of holding locks on two partitions in bufmgr has already been
discussed.
P.S.
The customer is using Postgres 9.6, but I have checked that the same code
fragment is present on master.
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Hi,
w.r.t. the code in BufferAlloc(), the pointers are compared.
Should we instead compare the tranche IDs of the two LWLocks?
Cheers
On 19.12.2020 10:53, Zhihong Yu wrote:
Hi,
w.r.t. the code in BufferAlloc(), the pointers are compared. Should we instead compare the tranche IDs of the two LWLocks?
Cheers
Since the LWLocks are stored in an array, comparing indexes into this
array (tranche Id) is equivalent to comparing the elements' pointers.
So I do not see any problem here.
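For reference, both partition locks are just elements of MainLWLockArray
selected by the buffer tag's hash code, so the pointers themselves already
give a stable ordering (quoting buf_internals.h roughly from memory):

    #define BufTableHashPartition(hashcode) \
        ((hashcode) % NUM_BUFFER_PARTITIONS)
    #define BufMappingPartitionLock(hashcode) \
        (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
                          BufTableHashPartition(hashcode)].lock)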
Just as an experiment I tried a version of BufferAlloc without double
locking (patch is attached).
I am not absolutely sure that my patch is correct: my main intention was
to estimate the influence of this buffer reassignment on performance.
I just ran standard pgbench on a database with scale 100 and the default
shared buffers size (256MB), so there should be a lot of page
replacements.
I do not see any noticeable difference:
vanilla: 13087.596845
patch: 13184.442130
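For completeness, the runs were along these lines (the client count and
duration here are arbitrary; they are not what matters for this comparison):

    pgbench -i -s 100 postgres
    pgbench -c 32 -j 8 -T 300 postgres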
Attachments:
bufmgr.patch (text/x-patch; charset=UTF-8)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ad0d1a9..91eb93d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -70,6 +70,8 @@
#define RELS_BSEARCH_THRESHOLD 20
+#define NO_DOUBLE_BUFFER_LOCK 1
+
typedef struct PrivateRefCountEntry
{
Buffer buffer;
@@ -197,6 +199,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
static inline int32 GetPrivateRefCount(Buffer buffer);
static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
+static void InvalidateBuffer(BufferDesc *buf);
/*
* Ensure that the PrivateRefCountArray has sufficient space to store one more
@@ -1093,12 +1096,22 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
+#if NO_DOUBLE_BUFFER_LOCK
+ if (buf_state & BM_TAG_VALID)
+ {
+ InvalidateBuffer(buf); /* releases spinlock */
+ continue;
+ }
+#endif
/* Must copy buffer flags while we still hold the spinlock */
oldFlags = buf_state & BUF_FLAG_MASK;
/* Pin the buffer and then release the buffer spinlock */
PinBuffer_Locked(buf);
+#if NO_DOUBLE_BUFFER_LOCK
+ Assert(!(oldFlags & (BM_DIRTY|BM_TAG_VALID)));
+#else
/*
* If the buffer was dirty, try to write it out. There is a race
* condition here, in that someone might dirty it after we released it
@@ -1216,6 +1229,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
else
+#endif
{
/* if it wasn't valid, we need only the new partition */
LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
Hi Konstantin,
On Sat, Dec 19, 2020 at 9:50 PM Konstantin Knizhnik
<k.knizhnik@postgrespro.ru> wrote:
[...]
You sent in your patch, bufmgr.patch, to pgsql-hackers on Dec 19, but
you did not post it to the next CommitFest[1]. If this was
intentional, then you need to take no action. However, if you want
your patch to be reviewed as part of the upcoming CommitFest, then you
need to add it yourself before 2021-01-01 AoE[2]. Thanks for your
contributions.
Regards,
[1]: https://commitfest.postgresql.org/31/
[2]: https://en.wikipedia.org/wiki/Anywhere_on_Earth
--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/
On Fri, 18 Dec 2020 at 15:20 +0300, Konstantin Knizhnik wrote:
[...]
It looks like there is no real need for this double lock, and the change to
consecutive lock acquisition really provides a scalability gain:
https://bit.ly/3AytNoN
regards
Sokolov Yura
y.sokolov@postgrespro.ru
funny.falcon@gmail.com