Replace buffer I/O locks with condition variables (reviving an old patch)
Hello hackers,
Back in 2016, Robert Haas proposed to replace I/O locks with condition
variables[1]. Condition variables went in and have found lots of
uses, but this patch to replace a bunch of LWLocks and some busy
looping did not. Since then, it has been tested quite a lot as part
of the AIO project[2], which currently depends on it. That's why I'm
interested in following up now. I asked Robert if he planned to
re-propose it and he said I should go for it, so... here I go.
At the time, Tom Lane said:
Hmm. I fear the only reason you see an advantage there is that you don't
(yet) have any general-purpose mechanism for an aborting transaction to
satisfy its responsibilities vis-a-vis waiters on condition variables.
Instead, this wins specifically because you stuck some bespoke logic into
AbortBufferIO. OK ... but that sounds like we're going to end up with
every single condition variable that ever exists in the system needing to
be catered for separately and explicitly during transaction abort cleanup.
Which does not sound promising from a reliability standpoint. On the
other hand, I don't know what the equivalent rule to "release all LWLocks
during abort" might look like for condition variables, so I don't know
if it's even possible to avoid that.
It's true that cases like this one need bespoke logic, but that was
already the case: you have to make sure you call TerminateBufferIO()
as before, it's just that BM_IO_IN_PROGRESS-clearing is now a
CV-broadcastable event. That seems reasonable to me. As for the more
general point about the danger of waiting on CVs when potential
broadcasters might abort, and with the considerable benefit of a few
years of hindsight: I think the existing users of CVs mostly fall
into the category of waiters that will be shut down by a higher
authority if the expected broadcaster aborts. Examples: Parallel
query's interrupt-based error system will abort every backend waiting
at a parallel hash join barrier if any process involved in the query
aborts, and the whole cluster will be shut down if you're waiting for
a checkpoint when the checkpointer dies.
It looks like there may be a nearby opportunity to improve another
(rare?) busy loop, when InvalidateBuffer() encounters a pinned buffer,
based on this comment:
* ... Note that if the other guy has pinned the buffer but not
* yet done StartBufferIO, WaitIO will fall through and we'll effectively
* be busy-looping here.)
[1]: https://postgr.es/m/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com
[2]: https://postgr.es/m/20210223100344.llw5an2aklengrmn@alap3.anarazel.de
Attachments:
0001-Replace-buffer-I-O-locks-with-condition-variables.patch (text/x-patch)
From 64af4b141b8b96cd18b4c9afa700880cd4d01d87 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 26 Feb 2021 18:01:28 +1300
Subject: [PATCH] Replace buffer I/O locks with condition variables.
1. Backends waiting for buffer I/O are now interruptible.
2. If something goes wrong in a backend that is currently performing
I/O, other backends no longer wake up until that backend reaches
AbortBufferIO() and broadcasts on the CV. Previously, any waiters would
wake up (because the I/O lock was automatically released) and then
busy-loop until AbortBufferIO() cleared BM_IO_IN_PROGRESS.
Author: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com
---
src/backend/postmaster/pgstat.c | 3 ++
src/backend/storage/buffer/buf_init.c | 19 ++++----
src/backend/storage/buffer/bufmgr.c | 62 +++++++-----------------
src/backend/storage/lmgr/lwlock.c | 2 -
src/include/pgstat.h | 1 +
src/include/storage/buf_internals.h | 7 +--
src/include/storage/condition_variable.h | 11 +++++
src/include/storage/lwlock.h | 1 -
8 files changed, 45 insertions(+), 61 deletions(-)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..d399b9c3e5 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4190,6 +4190,9 @@ pgstat_get_wait_io(WaitEventIO w)
case WAIT_EVENT_BUFFILE_TRUNCATE:
event_name = "BufFileTruncate";
break;
+ case WAIT_EVENT_BUFFILE_WAITIO:
+ event_name = "BufFileWaitIO";
+ break;
case WAIT_EVENT_CONTROL_FILE_READ:
event_name = "ControlFileRead";
break;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e9e4f35bb5..850ef8af4f 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -19,7 +19,7 @@
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
-LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
+ConditionVariableMinimallyPadded *BufferIOCVArray = NULL;
WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
@@ -68,7 +68,7 @@ InitBufferPool(void)
{
bool foundBufs,
foundDescs,
- foundIOLocks,
+ foundIOCV,
foundBufCkpt;
/* Align descriptors to a cacheline boundary. */
@@ -82,10 +82,10 @@ InitBufferPool(void)
NBuffers * (Size) BLCKSZ, &foundBufs);
/* Align lwlocks to cacheline boundary */
- BufferIOLWLockArray = (LWLockMinimallyPadded *)
- ShmemInitStruct("Buffer IO Locks",
- NBuffers * (Size) sizeof(LWLockMinimallyPadded),
- &foundIOLocks);
+ BufferIOCVArray = (ConditionVariableMinimallyPadded *)
+ ShmemInitStruct("Buffer IO Condition Variables",
+ NBuffers * (Size) sizeof(ConditionVariableMinimallyPadded),
+ &foundIOCV);
/*
* The array used to sort to-be-checkpointed buffer ids is located in
@@ -98,10 +98,10 @@ InitBufferPool(void)
ShmemInitStruct("Checkpoint BufferIds",
NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
- if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt)
+ if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
{
/* should find all of these, or none of them */
- Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
+ Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -131,8 +131,7 @@ InitBufferPool(void)
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
- LWLockInitialize(BufferDescriptorGetIOLock(buf),
- LWTRANCHE_BUFFER_IO);
+ ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
/* Correct last entry of linked list */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..a3ec89707b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1352,8 +1352,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to get the io_in_progress
- * lock. If StartBufferIO returns false, then someone else managed to
+ * Buffer contents are currently invalid. Try to obtain the right to start
+ * I/O. If StartBufferIO returns false, then someone else managed to
* read it before we did, so there's nothing left for BufferAlloc() to do.
*/
if (StartBufferIO(buf, true))
@@ -1777,9 +1777,8 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
*/
VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);
- /* I'd better not still hold any locks on the buffer */
+ /* I'd better not still hold the buffer content lock */
Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
- Assert(!LWLockHeldByMe(BufferDescriptorGetIOLock(buf)));
/*
* Decrement the shared reference count.
@@ -2742,9 +2741,9 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
uint32 buf_state;
/*
- * Acquire the buffer's io_in_progress lock. If StartBufferIO returns
- * false, then someone else flushed the buffer before we could, so we need
- * not do anything.
+ * Try to start an I/O operation. If StartBufferIO returns false, then
+ * someone else flushed the buffer before we could, so we need not do
+ * anything.
*/
if (!StartBufferIO(buf, false))
return;
@@ -2800,7 +2799,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
- * we have the io_in_progress lock.
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
*/
bufBlock = BufHdrGetBlock(buf);
@@ -2835,7 +2834,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
- * end the io_in_progress state.
+ * end the BM_IO_IN_PROGRESS state.
*/
TerminateBufferIO(buf, true, 0);
@@ -4271,7 +4270,7 @@ IsBufferCleanupOK(Buffer buffer)
* Functions for buffer I/O handling
*
* Note: We assume that nested buffer I/O never occurs.
- * i.e at most one io_in_progress lock is held per proc.
+ * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
*
* Also note that these are used only for shared buffers, not local ones.
*/
@@ -4282,13 +4281,10 @@ IsBufferCleanupOK(Buffer buffer)
static void
WaitIO(BufferDesc *buf)
{
- /*
- * Changed to wait until there's no IO - Inoue 01/13/2000
- *
- * Note this is *necessary* because an error abort in the process doing
- * I/O could release the io_in_progress_lock prematurely. See
- * AbortBufferIO.
- */
+ ConditionVariable *cv = BufferDescriptorGetIOCV(buf);
+
+ ConditionVariablePrepareToSleep(cv);
+
for (;;)
{
uint32 buf_state;
@@ -4303,9 +4299,9 @@ WaitIO(BufferDesc *buf)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_SHARED);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
+ ConditionVariableSleep(cv, WAIT_EVENT_BUFFILE_WAITIO);
}
+ ConditionVariableCancelSleep();
}
/*
@@ -4317,7 +4313,7 @@ WaitIO(BufferDesc *buf)
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
* has already started I/O on this buffer then we will block on the
- * io_in_progress lock until he's done.
+ * I/O condition variable until he's done.
*
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
@@ -4335,25 +4331,11 @@ StartBufferIO(BufferDesc *buf, bool forInput)
for (;;)
{
- /*
- * Grab the io_in_progress lock so that other processes can wait for
- * me to finish the I/O.
- */
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
buf_state = LockBufHdr(buf);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
-
- /*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
- * lock isn't held is if the process doing the I/O is recovering from
- * an error (see AbortBufferIO). If that's the case, we must wait for
- * him to get unwedged.
- */
UnlockBufHdr(buf, buf_state);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
WaitIO(buf);
}
@@ -4363,7 +4345,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
/* someone else already did the I/O */
UnlockBufHdr(buf, buf_state);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
return false;
}
@@ -4381,7 +4362,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is true and BM_JUST_DIRTIED is not set, we clear the
@@ -4413,7 +4393,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
InProgressBuf = NULL;
- LWLockRelease(BufferDescriptorGetIOLock(buf));
+ ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
@@ -4434,14 +4414,6 @@ AbortBufferIO(void)
{
uint32 buf_state;
- /*
- * Since LWLockReleaseAll has already been called, we're not holding
- * the buffer's io_in_progress_lock. We have to re-acquire it so that
- * we can use TerminateBufferIO. Anyone who's executing WaitIO on the
- * buffer will be in a busy spin until we succeed in doing this.
- */
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
if (IsForInput)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..60600d7a0b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -146,8 +146,6 @@ static const char *const BuiltinTrancheNames[] = {
"WALInsert",
/* LWTRANCHE_BUFFER_CONTENT: */
"BufferContent",
- /* LWTRANCHE_BUFFER_IO: */
- "BufferIO",
/* LWTRANCHE_REPLICATION_ORIGIN_STATE: */
"ReplicationOriginState",
/* LWTRANCHE_REPLICATION_SLOT_IO: */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..6adf542e22 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1028,6 +1028,7 @@ typedef enum
WAIT_EVENT_BUFFILE_READ,
WAIT_EVENT_BUFFILE_WRITE,
WAIT_EVENT_BUFFILE_TRUNCATE,
+ WAIT_EVENT_BUFFILE_WAITIO,
WAIT_EVENT_CONTROL_FILE_READ,
WAIT_EVENT_CONTROL_FILE_SYNC,
WAIT_EVENT_CONTROL_FILE_SYNC_UPDATE,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f6b5782965..532711200a 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -18,6 +18,7 @@
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
@@ -221,12 +222,12 @@ typedef union BufferDescPadded
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
-#define BufferDescriptorGetIOLock(bdesc) \
- (&(BufferIOLWLockArray[(bdesc)->buf_id]).lock)
+#define BufferDescriptorGetIOCV(bdesc) \
+ (&(BufferIOCVArray[(bdesc)->buf_id]).cv)
#define BufferDescriptorGetContentLock(bdesc) \
((LWLock*) (&(bdesc)->content_lock))
-extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;
+extern PGDLLIMPORT ConditionVariableMinimallyPadded *BufferIOCVArray;
/*
* The freeNext field is either the index of the next freelist entry,
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index 0b7578f8c4..4cba0eccff 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -31,6 +31,17 @@ typedef struct
proclist_head wakeup; /* list of wake-able processes */
} ConditionVariable;
+/*
+ * Pad a condition variable to a power-of-two size so that an array of
+ * condition variables does not cross a cache line boundary.
+ */
+#define CV_MINIMAL_SIZE (sizeof(ConditionVariable) <= 16 ? 16 : 32)
+typedef union ConditionVariableMinimallyPadded
+{
+ ConditionVariable cv;
+ char pad[CV_MINIMAL_SIZE];
+} ConditionVariableMinimallyPadded;
+
/* Initialize a condition variable. */
extern void ConditionVariableInit(ConditionVariable *cv);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cbf2510fbf..0ed190f73b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -202,7 +202,6 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SERIAL_BUFFER,
LWTRANCHE_WAL_INSERT,
LWTRANCHE_BUFFER_CONTENT,
- LWTRANCHE_BUFFER_IO,
LWTRANCHE_REPLICATION_ORIGIN_STATE,
LWTRANCHE_REPLICATION_SLOT_IO,
LWTRANCHE_LOCK_FASTPATH,
--
2.30.0
On Fri, Feb 26, 2021 at 7:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Back in 2016, Robert Haas proposed to replace I/O locks with condition
variables[1]. Condition variables went in and have found lots of
uses, but this patch to replace a bunch of LWLocks and some busy
looping did not. Since then, it has been tested quite a lot as part
of the AIO project[2], which currently depends on it. That's why I'm
interested in following up now. I asked Robert if he planned to
re-propose it and he said I should go for it, so... here I go.
I removed a redundant (Size) cast, fixed the wait event name and
category (WAIT_EVENT_BUFFILE_XXX is for buffile.c stuff, not bufmgr.c
stuff, and this is really an IPC wait, not an IO wait despite the
name), updated documentation and pgindented.
Attachments:
v2-0001-Replace-buffer-I-O-locks-with-condition-variables.patch (text/x-patch)
From cb4f3c943c47bb09864723c22cc0504c54dc9a3a Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 26 Feb 2021 18:01:28 +1300
Subject: [PATCH v2] Replace buffer I/O locks with condition variables.
1. Backends waiting for buffer I/O are now interruptible.
2. If something goes wrong in a backend that is currently performing
I/O, other backends no longer wake up until that backend reaches
AbortBufferIO() and broadcasts on the CV. Previously, any waiters would
wake up (because the I/O lock was automatically released) and then
busy-loop until AbortBufferIO() cleared BM_IO_IN_PROGRESS.
Author: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (2016)
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Discussion: https://postgr.es/m/CA%2BhUKGJ8nBFrjLuCTuqKN0pd2PQOwj9b_jnsiGFFMDvUxahj_A%40mail.gmail.com
Discussion: https://postgr.es/m/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com
---
doc/src/sgml/monitoring.sgml | 8 +--
src/backend/postmaster/pgstat.c | 3 ++
src/backend/storage/buffer/buf_init.c | 19 ++++---
src/backend/storage/buffer/bufmgr.c | 64 +++++++-----------------
src/backend/storage/lmgr/lwlock.c | 2 -
src/include/pgstat.h | 1 +
src/include/storage/buf_internals.h | 7 +--
src/include/storage/condition_variable.h | 11 ++++
src/include/storage/lwlock.h | 1 -
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 51 insertions(+), 66 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..b37d6484a4 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1581,6 +1581,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for the page number needed to continue a parallel B-tree
scan to become available.</entry>
</row>
+ <row>
+ <entry><literal>BufferWaitIO</literal></entry>
+ <entry>Waiting for buffer I/O to complete.</entry>
+ </row>
<row>
<entry><literal>CheckpointDone</literal></entry>
<entry>Waiting for a checkpoint to complete.</entry>
@@ -1871,10 +1875,6 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry><literal>BufferContent</literal></entry>
<entry>Waiting to access a data page in memory.</entry>
</row>
- <row>
- <entry><literal>BufferIO</literal></entry>
- <entry>Waiting for I/O on a data page.</entry>
- </row>
<row>
<entry><literal>BufferMapping</literal></entry>
<entry>Waiting to associate a data block with a buffer in the buffer
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..26cd2ce196 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4011,6 +4011,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_BTREE_PAGE:
event_name = "BtreePage";
break;
+ case WAIT_EVENT_BUFFER_WAIT_IO:
+ event_name = "BufferWaitIO";
+ break;
case WAIT_EVENT_CHECKPOINT_DONE:
event_name = "CheckpointDone";
break;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e9e4f35bb5..51b250fe16 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -19,7 +19,7 @@
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
-LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
+ConditionVariableMinimallyPadded *BufferIOCVArray = NULL;
WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
@@ -68,7 +68,7 @@ InitBufferPool(void)
{
bool foundBufs,
foundDescs,
- foundIOLocks,
+ foundIOCV,
foundBufCkpt;
/* Align descriptors to a cacheline boundary. */
@@ -82,10 +82,10 @@ InitBufferPool(void)
NBuffers * (Size) BLCKSZ, &foundBufs);
/* Align lwlocks to cacheline boundary */
- BufferIOLWLockArray = (LWLockMinimallyPadded *)
- ShmemInitStruct("Buffer IO Locks",
- NBuffers * (Size) sizeof(LWLockMinimallyPadded),
- &foundIOLocks);
+ BufferIOCVArray = (ConditionVariableMinimallyPadded *)
+ ShmemInitStruct("Buffer IO Condition Variables",
+ NBuffers * sizeof(ConditionVariableMinimallyPadded),
+ &foundIOCV);
/*
* The array used to sort to-be-checkpointed buffer ids is located in
@@ -98,10 +98,10 @@ InitBufferPool(void)
ShmemInitStruct("Checkpoint BufferIds",
NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
- if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt)
+ if (foundDescs || foundBufs || foundIOCV || foundBufCkpt)
{
/* should find all of these, or none of them */
- Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
+ Assert(foundDescs && foundBufs && foundIOCV && foundBufCkpt);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -131,8 +131,7 @@ InitBufferPool(void)
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
- LWLockInitialize(BufferDescriptorGetIOLock(buf),
- LWTRANCHE_BUFFER_IO);
+ ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
/* Correct last entry of linked list */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..5ba6891c64 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1352,9 +1352,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to get the io_in_progress
- * lock. If StartBufferIO returns false, then someone else managed to
- * read it before we did, so there's nothing left for BufferAlloc() to do.
+ * Buffer contents are currently invalid. Try to obtain the right to
+ * start I/O. If StartBufferIO returns false, then someone else managed
+ * to read it before we did, so there's nothing left for BufferAlloc() to
+ * do.
*/
if (StartBufferIO(buf, true))
*foundPtr = false;
@@ -1777,9 +1778,8 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
*/
VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);
- /* I'd better not still hold any locks on the buffer */
+ /* I'd better not still hold the buffer content lock */
Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
- Assert(!LWLockHeldByMe(BufferDescriptorGetIOLock(buf)));
/*
* Decrement the shared reference count.
@@ -2742,9 +2742,9 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
uint32 buf_state;
/*
- * Acquire the buffer's io_in_progress lock. If StartBufferIO returns
- * false, then someone else flushed the buffer before we could, so we need
- * not do anything.
+ * Try to start an I/O operation. If StartBufferIO returns false, then
+ * someone else flushed the buffer before we could, so we need not do
+ * anything.
*/
if (!StartBufferIO(buf, false))
return;
@@ -2800,7 +2800,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
- * we have the io_in_progress lock.
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
*/
bufBlock = BufHdrGetBlock(buf);
@@ -2835,7 +2835,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
- * end the io_in_progress state.
+ * end the BM_IO_IN_PROGRESS state.
*/
TerminateBufferIO(buf, true, 0);
@@ -4271,7 +4271,7 @@ IsBufferCleanupOK(Buffer buffer)
* Functions for buffer I/O handling
*
* Note: We assume that nested buffer I/O never occurs.
- * i.e at most one io_in_progress lock is held per proc.
+ * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
*
* Also note that these are used only for shared buffers, not local ones.
*/
@@ -4282,13 +4282,9 @@ IsBufferCleanupOK(Buffer buffer)
static void
WaitIO(BufferDesc *buf)
{
- /*
- * Changed to wait until there's no IO - Inoue 01/13/2000
- *
- * Note this is *necessary* because an error abort in the process doing
- * I/O could release the io_in_progress_lock prematurely. See
- * AbortBufferIO.
- */
+ ConditionVariable *cv = BufferDescriptorGetIOCV(buf);
+
+ ConditionVariablePrepareToSleep(cv);
for (;;)
{
uint32 buf_state;
@@ -4303,9 +4299,9 @@ WaitIO(BufferDesc *buf)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_SHARED);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
+ ConditionVariableSleep(cv, WAIT_EVENT_BUFFER_WAIT_IO);
}
+ ConditionVariableCancelSleep();
}
/*
@@ -4317,7 +4313,7 @@ WaitIO(BufferDesc *buf)
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
* has already started I/O on this buffer then we will block on the
- * io_in_progress lock until he's done.
+ * I/O condition variable until he's done.
*
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
@@ -4335,25 +4331,11 @@ StartBufferIO(BufferDesc *buf, bool forInput)
for (;;)
{
- /*
- * Grab the io_in_progress lock so that other processes can wait for
- * me to finish the I/O.
- */
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
buf_state = LockBufHdr(buf);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
-
- /*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
- * lock isn't held is if the process doing the I/O is recovering from
- * an error (see AbortBufferIO). If that's the case, we must wait for
- * him to get unwedged.
- */
UnlockBufHdr(buf, buf_state);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
WaitIO(buf);
}
@@ -4363,7 +4345,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
/* someone else already did the I/O */
UnlockBufHdr(buf, buf_state);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
return false;
}
@@ -4381,7 +4362,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is true and BM_JUST_DIRTIED is not set, we clear the
@@ -4413,7 +4393,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
InProgressBuf = NULL;
- LWLockRelease(BufferDescriptorGetIOLock(buf));
+ ConditionVariableBroadcast(BufferDescriptorGetIOCV(buf));
}
/*
@@ -4434,14 +4414,6 @@ AbortBufferIO(void)
{
uint32 buf_state;
- /*
- * Since LWLockReleaseAll has already been called, we're not holding
- * the buffer's io_in_progress_lock. We have to re-acquire it so that
- * we can use TerminateBufferIO. Anyone who's executing WaitIO on the
- * buffer will be in a busy spin until we succeed in doing this.
- */
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
if (IsForInput)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..60600d7a0b 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -146,8 +146,6 @@ static const char *const BuiltinTrancheNames[] = {
"WALInsert",
/* LWTRANCHE_BUFFER_CONTENT: */
"BufferContent",
- /* LWTRANCHE_BUFFER_IO: */
- "BufferIO",
/* LWTRANCHE_REPLICATION_ORIGIN_STATE: */
"ReplicationOriginState",
/* LWTRANCHE_REPLICATION_SLOT_IO: */
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..e8efbab3ae 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -961,6 +961,7 @@ typedef enum
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
WAIT_EVENT_BTREE_PAGE,
+ WAIT_EVENT_BUFFER_WAIT_IO,
WAIT_EVENT_CHECKPOINT_DONE,
WAIT_EVENT_CHECKPOINT_START,
WAIT_EVENT_EXECUTE_GATHER,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f6b5782965..532711200a 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -18,6 +18,7 @@
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
@@ -221,12 +222,12 @@ typedef union BufferDescPadded
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
-#define BufferDescriptorGetIOLock(bdesc) \
- (&(BufferIOLWLockArray[(bdesc)->buf_id]).lock)
+#define BufferDescriptorGetIOCV(bdesc) \
+ (&(BufferIOCVArray[(bdesc)->buf_id]).cv)
#define BufferDescriptorGetContentLock(bdesc) \
((LWLock*) (&(bdesc)->content_lock))
-extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;
+extern PGDLLIMPORT ConditionVariableMinimallyPadded *BufferIOCVArray;
/*
* The freeNext field is either the index of the next freelist entry,
diff --git a/src/include/storage/condition_variable.h b/src/include/storage/condition_variable.h
index 0b7578f8c4..6310ca230a 100644
--- a/src/include/storage/condition_variable.h
+++ b/src/include/storage/condition_variable.h
@@ -31,6 +31,17 @@ typedef struct
proclist_head wakeup; /* list of wake-able processes */
} ConditionVariable;
+/*
+ * Pad a condition variable to a power-of-two size so that an array of
+ * condition variables does not cross a cache line boundary.
+ */
+#define CV_MINIMAL_SIZE (sizeof(ConditionVariable) <= 16 ? 16 : 32)
+typedef union ConditionVariableMinimallyPadded
+{
+ ConditionVariable cv;
+ char pad[CV_MINIMAL_SIZE];
+} ConditionVariableMinimallyPadded;
+
/* Initialize a condition variable. */
extern void ConditionVariableInit(ConditionVariable *cv);
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cbf2510fbf..0ed190f73b 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -202,7 +202,6 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SERIAL_BUFFER,
LWTRANCHE_WAL_INSERT,
LWTRANCHE_BUFFER_CONTENT,
- LWTRANCHE_BUFFER_IO,
LWTRANCHE_REPLICATION_ORIGIN_STATE,
LWTRANCHE_REPLICATION_SLOT_IO,
LWTRANCHE_LOCK_FASTPATH,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8bd95aefa1..9c3b978c54 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -398,6 +398,7 @@ CompressionAlgorithm
CompressorState
ComputeXidHorizonsResult
ConditionVariable
+ConditionVariableMinimallyPadded
ConditionalStack
ConfigData
ConfigVariable
--
2.30.0
On Fri, Mar 5, 2021 at 12:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Feb 26, 2021 at 7:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Back in 2016, Robert Haas proposed to replace I/O locks with condition
variables[1]. Condition variables went in and have found lots of
uses, but this patch to replace a bunch of LWLocks and some busy
looping did not. Since then, it has been tested quite a lot as part
of the AIO project[2], which currently depends on it. That's why I'm
interested in following up now. I asked Robert if he planned to
re-propose it and he said I should go for it, so... here I go.

I removed a redundant (Size) cast, fixed the wait event name and
category (WAIT_EVENT_BUFFILE_XXX is for buffile.c stuff, not bufmgr.c
stuff, and this is really an IPC wait, not an IO wait despite the
name), updated documentation and pgindented.
More review and some proposed changes:
The old I/O lock array was the only user of struct
LWLockMinimallyPadded, added in commit 6150a1b08a9, and it seems kinda
strange to leave it in the tree with no user. Of course it's remotely
possible there are extensions using it (know of any?). In the
attached, I've ripped that + associated commentary out, because it's
fun to delete dead code. Objections?
Since the whole reason for that out-of-line array in the first place
was to keep BufferDesc inside one cache line, and since it is in fact
possible to put a new condition variable into BufferDesc without
exceeding 64 bytes on a 64 bit x86 box, perhaps we should just do that
instead? I haven't yet considered other architectures or potential
member orders. It's also possible that some other project already had
designs on those BufferDesc bytes. This drops quite a few lines from
the tree, including the comment about how nice it'd be to be able to
put the lock in BufferDesc.
I wonder if we should try to preserve user experience a little harder,
for the benefit of people who have monitoring queries that look for
this condition. Instead of inventing a new wait_event value, let's
just keep showing "BufferIO" in that column. In other words, the
change is that wait_event_type changes from "LWLock" to "IPC", which
is a pretty good summary of this patch. Done in the attached. Does
this make sense?
Please see attached, which gets us to: 8 files changed, 30
insertions(+), 113 deletions(-)
PS: An idea I thought about while studying this patch is that we
should be able to make signaling an empty condition variable
free/cheap (no spinlock acquisition or other extra memory
barrier-containing operation); I'll write about that separately.
Attachments:
v3-0001-Replace-buffer-I-O-locks-with-condition-variables.patch
From b6c4e8919dcca0bbf97ccd9489d6876e9890e4b4 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 26 Feb 2021 18:01:28 +1300
Subject: [PATCH v3] Replace buffer I/O locks with condition variables.
1. Backends waiting for buffer I/O are now interruptible.
2. If something goes wrong in a backend that is currently performing
I/O, other backends no longer wake up until that backend reaches
AbortBufferIO() and broadcasts on the CV. Previously, any waiters would
wake up (because the I/O lock was automatically released) and then
busy-loop until AbortBufferIO() cleared BM_IO_IN_PROGRESS.
3. Remove LWLockMinimallyPadded, as it had no other user.
Author: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Thomas Munro <thomas.munro@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (earlier version, 2016)
Discussion: https://postgr.es/m/CA%2BhUKGJ8nBFrjLuCTuqKN0pd2PQOwj9b_jnsiGFFMDvUxahj_A%40mail.gmail.com
Discussion: https://postgr.es/m/CA+Tgmoaj2aPti0yho7FeEf2qt-JgQPRWb0gci_o1Hfr=C56Xng@mail.gmail.com
---
doc/src/sgml/monitoring.sgml | 8 ++--
src/backend/postmaster/pgstat.c | 3 ++
src/backend/storage/buffer/buf_init.c | 28 ++----------
src/backend/storage/buffer/bufmgr.c | 62 +++++++--------------------
src/backend/storage/lmgr/lwlock.c | 5 +--
src/include/pgstat.h | 1 +
src/include/storage/buf_internals.h | 7 +--
src/include/storage/lwlock.h | 29 -------------
8 files changed, 30 insertions(+), 113 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 3513e127b7..ef3177f0e0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1581,6 +1581,10 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry>Waiting for the page number needed to continue a parallel B-tree
scan to become available.</entry>
</row>
+ <row>
+ <entry><literal>BufferIO</literal></entry>
+ <entry>Waiting for buffer I/O to complete.</entry>
+ </row>
<row>
<entry><literal>CheckpointDone</literal></entry>
<entry>Waiting for a checkpoint to complete.</entry>
@@ -1871,10 +1875,6 @@ postgres 27093 0.0 0.0 30096 2752 ? Ss 11:34 0:00 postgres: ser
<entry><literal>BufferContent</literal></entry>
<entry>Waiting to access a data page in memory.</entry>
</row>
- <row>
- <entry><literal>BufferIO</literal></entry>
- <entry>Waiting for I/O on a data page.</entry>
- </row>
<row>
<entry><literal>BufferMapping</literal></entry>
<entry>Waiting to associate a data block with a buffer in the buffer
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index f75b52719d..f050a1ddb1 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -4011,6 +4011,9 @@ pgstat_get_wait_ipc(WaitEventIPC w)
case WAIT_EVENT_BTREE_PAGE:
event_name = "BtreePage";
break;
+ case WAIT_EVENT_BUFFER_IO:
+ event_name = "BufferIO";
+ break;
case WAIT_EVENT_CHECKPOINT_DONE:
event_name = "CheckpointDone";
break;
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index e9e4f35bb5..f815462ec0 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -19,7 +19,6 @@
BufferDescPadded *BufferDescriptors;
char *BufferBlocks;
-LWLockMinimallyPadded *BufferIOLWLockArray = NULL;
WritebackContext BackendWritebackContext;
CkptSortItem *CkptBufferIds;
@@ -68,7 +67,6 @@ InitBufferPool(void)
{
bool foundBufs,
foundDescs,
- foundIOLocks,
foundBufCkpt;
/* Align descriptors to a cacheline boundary. */
@@ -81,12 +79,6 @@ InitBufferPool(void)
ShmemInitStruct("Buffer Blocks",
NBuffers * (Size) BLCKSZ, &foundBufs);
- /* Align lwlocks to cacheline boundary */
- BufferIOLWLockArray = (LWLockMinimallyPadded *)
- ShmemInitStruct("Buffer IO Locks",
- NBuffers * (Size) sizeof(LWLockMinimallyPadded),
- &foundIOLocks);
-
/*
* The array used to sort to-be-checkpointed buffer ids is located in
* shared memory, to avoid having to allocate significant amounts of
@@ -98,10 +90,10 @@ InitBufferPool(void)
ShmemInitStruct("Checkpoint BufferIds",
NBuffers * sizeof(CkptSortItem), &foundBufCkpt);
- if (foundDescs || foundBufs || foundIOLocks || foundBufCkpt)
+ if (foundDescs || foundBufs || foundBufCkpt)
{
/* should find all of these, or none of them */
- Assert(foundDescs && foundBufs && foundIOLocks && foundBufCkpt);
+ Assert(foundDescs && foundBufs && foundBufCkpt);
/* note: this path is only taken in EXEC_BACKEND case */
}
else
@@ -131,8 +123,7 @@ InitBufferPool(void)
LWLockInitialize(BufferDescriptorGetContentLock(buf),
LWTRANCHE_BUFFER_CONTENT);
- LWLockInitialize(BufferDescriptorGetIOLock(buf),
- LWTRANCHE_BUFFER_IO);
+ ConditionVariableInit(&buf->io_cv);
}
/* Correct last entry of linked list */
@@ -169,19 +160,6 @@ BufferShmemSize(void)
/* size of stuff controlled by freelist.c */
size = add_size(size, StrategyShmemSize());
- /*
- * It would be nice to include the I/O locks in the BufferDesc, but that
- * would increase the size of a BufferDesc to more than one cache line,
- * and benchmarking has shown that keeping every BufferDesc aligned on a
- * cache line boundary is important for performance. So, instead, the
- * array of I/O locks is allocated in a separate tranche. Because those
- * locks are not highly contended, we lay out the array with minimal
- * padding.
- */
- size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
- /* to allow aligning the above */
- size = add_size(size, PG_CACHE_LINE_SIZE);
-
/* size of checkpoint sort array in bufmgr.c */
size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem)));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 561c212092..cb533f66cd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1352,9 +1352,10 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
LWLockRelease(newPartitionLock);
/*
- * Buffer contents are currently invalid. Try to get the io_in_progress
- * lock. If StartBufferIO returns false, then someone else managed to
- * read it before we did, so there's nothing left for BufferAlloc() to do.
+ * Buffer contents are currently invalid. Try to obtain the right to
+ * start I/O. If StartBufferIO returns false, then someone else managed
+ * to read it before we did, so there's nothing left for BufferAlloc() to
+ * do.
*/
if (StartBufferIO(buf, true))
*foundPtr = false;
@@ -1777,9 +1778,8 @@ UnpinBuffer(BufferDesc *buf, bool fixOwner)
*/
VALGRIND_MAKE_MEM_NOACCESS(BufHdrGetBlock(buf), BLCKSZ);
- /* I'd better not still hold any locks on the buffer */
+ /* I'd better not still hold the buffer content lock */
Assert(!LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
- Assert(!LWLockHeldByMe(BufferDescriptorGetIOLock(buf)));
/*
* Decrement the shared reference count.
@@ -2742,9 +2742,9 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
uint32 buf_state;
/*
- * Acquire the buffer's io_in_progress lock. If StartBufferIO returns
- * false, then someone else flushed the buffer before we could, so we need
- * not do anything.
+ * Try to start an I/O operation. If StartBufferIO returns false, then
+ * someone else flushed the buffer before we could, so we need not do
+ * anything.
*/
if (!StartBufferIO(buf, false))
return;
@@ -2800,7 +2800,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* Now it's safe to write buffer to disk. Note that no one else should
* have been able to write it while we were busy with log flushing because
- * we have the io_in_progress lock.
+ * only one process at a time can set the BM_IO_IN_PROGRESS bit.
*/
bufBlock = BufHdrGetBlock(buf);
@@ -2835,7 +2835,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
- * end the io_in_progress state.
+ * end the BM_IO_IN_PROGRESS state.
*/
TerminateBufferIO(buf, true, 0);
@@ -4271,7 +4271,7 @@ IsBufferCleanupOK(Buffer buffer)
* Functions for buffer I/O handling
*
* Note: We assume that nested buffer I/O never occurs.
- * i.e at most one io_in_progress lock is held per proc.
+ * i.e at most one BM_IO_IN_PROGRESS bit is set per proc.
*
* Also note that these are used only for shared buffers, not local ones.
*/
@@ -4282,13 +4282,7 @@ IsBufferCleanupOK(Buffer buffer)
static void
WaitIO(BufferDesc *buf)
{
- /*
- * Changed to wait until there's no IO - Inoue 01/13/2000
- *
- * Note this is *necessary* because an error abort in the process doing
- * I/O could release the io_in_progress_lock prematurely. See
- * AbortBufferIO.
- */
+ ConditionVariablePrepareToSleep(&buf->io_cv);
for (;;)
{
uint32 buf_state;
@@ -4303,9 +4297,9 @@ WaitIO(BufferDesc *buf)
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_SHARED);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
+ ConditionVariableSleep(&buf->io_cv, WAIT_EVENT_BUFFER_IO);
}
+ ConditionVariableCancelSleep();
}
/*
@@ -4317,7 +4311,7 @@ WaitIO(BufferDesc *buf)
* In some scenarios there are race conditions in which multiple backends
* could attempt the same I/O operation concurrently. If someone else
* has already started I/O on this buffer then we will block on the
- * io_in_progress lock until he's done.
+ * I/O condition variable until he's done.
*
* Input operations are only attempted on buffers that are not BM_VALID,
* and output operations only on buffers that are BM_VALID and BM_DIRTY,
@@ -4335,25 +4329,11 @@ StartBufferIO(BufferDesc *buf, bool forInput)
for (;;)
{
- /*
- * Grab the io_in_progress lock so that other processes can wait for
- * me to finish the I/O.
- */
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
buf_state = LockBufHdr(buf);
if (!(buf_state & BM_IO_IN_PROGRESS))
break;
-
- /*
- * The only way BM_IO_IN_PROGRESS could be set when the io_in_progress
- * lock isn't held is if the process doing the I/O is recovering from
- * an error (see AbortBufferIO). If that's the case, we must wait for
- * him to get unwedged.
- */
UnlockBufHdr(buf, buf_state);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
WaitIO(buf);
}
@@ -4363,7 +4343,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
{
/* someone else already did the I/O */
UnlockBufHdr(buf, buf_state);
- LWLockRelease(BufferDescriptorGetIOLock(buf));
return false;
}
@@ -4381,7 +4360,6 @@ StartBufferIO(BufferDesc *buf, bool forInput)
* (Assumptions)
* My process is executing IO for the buffer
* BM_IO_IN_PROGRESS bit is set for the buffer
- * We hold the buffer's io_in_progress lock
* The buffer is Pinned
*
* If clear_dirty is true and BM_JUST_DIRTIED is not set, we clear the
@@ -4413,7 +4391,7 @@ TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag_bits)
InProgressBuf = NULL;
- LWLockRelease(BufferDescriptorGetIOLock(buf));
+ ConditionVariableBroadcast(&buf->io_cv);
}
/*
@@ -4434,14 +4412,6 @@ AbortBufferIO(void)
{
uint32 buf_state;
- /*
- * Since LWLockReleaseAll has already been called, we're not holding
- * the buffer's io_in_progress_lock. We have to re-acquire it so that
- * we can use TerminateBufferIO. Anyone who's executing WaitIO on the
- * buffer will be in a busy spin until we succeed in doing this.
- */
- LWLockAcquire(BufferDescriptorGetIOLock(buf), LW_EXCLUSIVE);
-
buf_state = LockBufHdr(buf);
Assert(buf_state & BM_IO_IN_PROGRESS);
if (IsForInput)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8cb6a6f042..adf19aba75 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -146,8 +146,6 @@ static const char *const BuiltinTrancheNames[] = {
"WALInsert",
/* LWTRANCHE_BUFFER_CONTENT: */
"BufferContent",
- /* LWTRANCHE_BUFFER_IO: */
- "BufferIO",
/* LWTRANCHE_REPLICATION_ORIGIN_STATE: */
"ReplicationOriginState",
/* LWTRANCHE_REPLICATION_SLOT_IO: */
@@ -469,8 +467,7 @@ CreateLWLocks(void)
StaticAssertStmt(LW_VAL_EXCLUSIVE > (uint32) MAX_BACKENDS,
"MAX_BACKENDS too big for lwlock.c");
- StaticAssertStmt(sizeof(LWLock) <= LWLOCK_MINIMAL_SIZE &&
- sizeof(LWLock) <= LWLOCK_PADDED_SIZE,
+ StaticAssertStmt(sizeof(LWLock) <= LWLOCK_PADDED_SIZE,
"Miscalculated LWLock padding");
if (!IsUnderPostmaster)
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 724068cf87..1176ed60a5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -961,6 +961,7 @@ typedef enum
WAIT_EVENT_BGWORKER_SHUTDOWN,
WAIT_EVENT_BGWORKER_STARTUP,
WAIT_EVENT_BTREE_PAGE,
+ WAIT_EVENT_BUFFER_IO,
WAIT_EVENT_CHECKPOINT_DONE,
WAIT_EVENT_CHECKPOINT_START,
WAIT_EVENT_EXECUTE_GATHER,
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index f6b5782965..d8a5da5ad9 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -18,6 +18,7 @@
#include "port/atomics.h"
#include "storage/buf.h"
#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
#include "storage/latch.h"
#include "storage/lwlock.h"
#include "storage/shmem.h"
@@ -181,10 +182,10 @@ typedef struct BufferDesc
/* state of the tag, containing flags, refcount and usagecount */
pg_atomic_uint32 state;
+ ConditionVariable io_cv; /* fires when BM_IO_IN_PROGRESS is cleared */
int wait_backend_pid; /* backend PID of pin-count waiter */
int freeNext; /* link in freelist chain */
-
LWLock content_lock; /* to lock access to buffer contents */
} BufferDesc;
@@ -221,13 +222,9 @@ typedef union BufferDescPadded
#define BufferDescriptorGetBuffer(bdesc) ((bdesc)->buf_id + 1)
-#define BufferDescriptorGetIOLock(bdesc) \
- (&(BufferIOLWLockArray[(bdesc)->buf_id]).lock)
#define BufferDescriptorGetContentLock(bdesc) \
((LWLock*) (&(bdesc)->content_lock))
-extern PGDLLIMPORT LWLockMinimallyPadded *BufferIOLWLockArray;
-
/*
* The freeNext field is either the index of the next freelist entry,
* or one of these special values:
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index cbf2510fbf..a8f052e484 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -48,29 +48,8 @@ typedef struct LWLock
* even more padding so that each LWLock takes up an entire cache line; this is
* useful, for example, in the main LWLock array, where the overall number of
* locks is small but some are heavily contended.
- *
- * When allocating a tranche that contains data other than LWLocks, it is
- * probably best to include a bare LWLock and then pad the resulting structure
- * as necessary for performance. For an array that contains only LWLocks,
- * LWLockMinimallyPadded can be used for cases where we just want to ensure
- * that we don't cross cache line boundaries within a single lock, while
- * LWLockPadded can be used for cases where we want each lock to be an entire
- * cache line.
- *
- * An LWLockMinimallyPadded might contain more than the absolute minimum amount
- * of padding required to keep a lock from crossing a cache line boundary,
- * because an unpadded LWLock will normally fit into 16 bytes. We ignore that
- * possibility when determining the minimal amount of padding. Older releases
- * had larger LWLocks, so 32 really was the minimum, and packing them in
- * tighter might hurt performance.
- *
- * LWLOCK_MINIMAL_SIZE should be 32 on basically all common platforms, but
- * because pg_atomic_uint32 is more than 4 bytes on some obscure platforms, we
- * allow for the possibility that it might be 64. Even on those platforms,
- * we probably won't exceed 32 bytes unless LOCK_DEBUG is defined.
*/
#define LWLOCK_PADDED_SIZE PG_CACHE_LINE_SIZE
-#define LWLOCK_MINIMAL_SIZE (sizeof(LWLock) <= 32 ? 32 : 64)
/* LWLock, padded to a full cache line size */
typedef union LWLockPadded
@@ -79,13 +58,6 @@ typedef union LWLockPadded
char pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;
-/* LWLock, minimally padded */
-typedef union LWLockMinimallyPadded
-{
- LWLock lock;
- char pad[LWLOCK_MINIMAL_SIZE];
-} LWLockMinimallyPadded;
-
extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
/* struct for storing named tranche information */
@@ -202,7 +174,6 @@ typedef enum BuiltinTrancheIds
LWTRANCHE_SERIAL_BUFFER,
LWTRANCHE_WAL_INSERT,
LWTRANCHE_BUFFER_CONTENT,
- LWTRANCHE_BUFFER_IO,
LWTRANCHE_REPLICATION_ORIGIN_STATE,
LWTRANCHE_REPLICATION_SLOT_IO,
LWTRANCHE_LOCK_FASTPATH,
--
2.30.0
On Mon, Mar 08, 2021 at 06:10:36PM +1300, Thomas Munro wrote:
On Fri, Mar 5, 2021 at 12:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Fri, Feb 26, 2021 at 7:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Back in 2016, Robert Haas proposed to replace I/O locks with condition
variables[1]. Condition variables went in and have found lots of
uses, but this patch to replace a bunch of LWLocks and some busy
looping did not. Since then, it has been tested quite a lot as part
of the AIO project[2], which currently depends on it. That's why I'm
interested in following up now. I asked Robert if he planned to
re-propose it and he said I should go for it, so... here I go.

I removed a redundant (Size) cast, fixed the wait event name and
category (WAIT_EVENT_BUFFILE_XXX is for buffile.c stuff, not bufmgr.c
stuff, and this is really an IPC wait, not an IO wait despite the
name), updated documentation and pgindented.

More review and some proposed changes:
The old I/O lock array was the only user of struct
LWLockMinimallyPadded, added in commit 6150a1b08a9, and it seems kinda
strange to leave it in the tree with no user. Of course it's remotely
possible there are extensions using it (know of any?). In the
attached, I've ripped that + associated commentary out, because it's
fun to delete dead code. Objections?
None from me. I don't know of any extension relying on it, and neither does
codesearch.debian.net. I would be surprised to see any extension actually
relying on that anyway.
Since the whole reason for that out-of-line array in the first place
was to keep BufferDesc inside one cache line, and since it is in fact
possible to put a new condition variable into BufferDesc without
exceeding 64 bytes on a 64 bit x86 box, perhaps we should just do that
instead? I haven't yet considered other architectures or potential
member orders.
+1 for adding the cv into BufferDesc. That brings the struct size to exactly
64 bytes on the 64-bit x86 architecture. This won't add any extra overhead to
LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that even
was a concern.
I wonder if we should try to preserve user experience a little harder,
for the benefit of people who have monitoring queries that look for
this condition. Instead of inventing a new wait_event value, let's
just keep showing "BufferIO" in that column. In other words, the
change is that wait_event_type changes from "LWLock" to "IPC", which
is a pretty good summary of this patch. Done in the attached. Does
this make sense?
I think it does make sense, and it's good to preserve this value.
Looking at the patch itself, I don't have much to add: it all looks sensible and
I agree with the arguments in the first mail. All regression tests pass and
documentation builds.
I'm marking this patch as RFC.
On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
The old I/O lock array was the only user of struct
LWLockMinimallyPadded, added in commit 6150a1b08a9, and it seems kinda
strange to leave it in the tree with no user. Of course it's remotely
possible there are extensions using it (know of any?). In the
attached, I've ripped that + associated commentary out, because it's
fun to delete dead code. Objections?

None from me. I don't know of any extension relying on it, and neither does
codesearch.debian.net. I would be surprised to see any extension actually
relying on that anyway.
Thanks for checking!
Since the whole reason for that out-of-line array in the first place
was to keep BufferDesc inside one cache line, and since it is in fact
possible to put a new condition variable into BufferDesc without
exceeding 64 bytes on a 64 bit x86 box, perhaps we should just do that
instead? I haven't yet considered other architectures or potential
member orders.

+1 for adding the cv into BufferDesc. That brings the struct size to exactly
64 bytes on the 64-bit x86 architecture. This won't add any extra overhead to
LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that even
was a concern.
I also checked that it's 64B on an Arm box. Not sure about POWER.
But... despite the fact that it looks like a good change in isolation,
I decided to go back to the separate array in this initial commit,
because the AIO branch also wants to add a new BufferDesc member[1]https://github.com/anarazel/postgres/blob/aio/src/include/storage/buf_internals.h#L190.
I may come back to that change, if we can make some more space (seems
entirely doable, but I'd like to look into that separately).
I wonder if we should try to preserve user experience a little harder,
for the benefit of people who have monitoring queries that look for
this condition. Instead of inventing a new wait_event value, let's
just keep showing "BufferIO" in that column. In other words, the
change is that wait_event_type changes from "LWLock" to "IPC", which
is a pretty good summary of this patch. Done in the attached. Does
this make sense?

I think it does make sense, and it's good to preserve this value.
Looking at the patch itself, I don't have much to add: it all looks sensible and
I agree with the arguments in the first mail. All regression tests pass and
documentation builds.
I found one more thing to tweak: a reference in the README.
I'm marking this patch as RFC.
Thanks for the review. And of course to Robert for writing the patch. Pushed.
[1]: https://github.com/anarazel/postgres/blob/aio/src/include/storage/buf_internals.h#L190
On Thu, Mar 11, 2021 at 10:48:40AM +1300, Thomas Munro wrote:
On Tue, Mar 9, 2021 at 6:24 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
+1 for adding the cv into BufferDesc. That brings the struct size to exactly
64 bytes on the 64-bit x86 architecture. This won't add any extra overhead to
LOCK_DEBUG cases, as it was already exceeding the 64B threshold, if that even
was a concern.

I also checked that it's 64B on an Arm box. Not sure about POWER.
But... despite the fact that it looks like a good change in isolation,
I decided to go back to the separate array in this initial commit,
because the AIO branch also wants to add a new BufferDesc member[1].
Ok!
I may come back to that change, if we can make some more space (seems
entirely doable, but I'd like to look into that separately).
- /*
- * It would be nice to include the I/O locks in the BufferDesc, but that
- * would increase the size of a BufferDesc to more than one cache line,
- * and benchmarking has shown that keeping every BufferDesc aligned on a
- * cache line boundary is important for performance. So, instead, the
- * array of I/O locks is allocated in a separate tranche. Because those
- * locks are not highly contended, we lay out the array with minimal
- * padding.
- */
- size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
+ /* size of I/O condition variables */
+ size = add_size(size, mul_size(NBuffers,
+ sizeof(ConditionVariableMinimallyPadded)));
Should we keep for now some similar comment mentioning why we don't put the cv
in the BufferDesc even though it would currently fit the 64B target size?
Thanks for the review. And of course to Robert for writing the patch. Pushed.
Great!
On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
- /*
-  * It would be nice to include the I/O locks in the BufferDesc, but that
-  * would increase the size of a BufferDesc to more than one cache line,
-  * and benchmarking has shown that keeping every BufferDesc aligned on a
-  * cache line boundary is important for performance.  So, instead, the
-  * array of I/O locks is allocated in a separate tranche.  Because those
-  * locks are not highly contended, we lay out the array with minimal
-  * padding.
-  */
- size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
+ /* size of I/O condition variables */
+ size = add_size(size, mul_size(NBuffers,
+                                sizeof(ConditionVariableMinimallyPadded)));

Should we keep for now some similar comment mentioning why we don't put the cv
in the BufferDesc even though it would currently fit the 64B target size?
I tried to write some words along those lines, but it seemed hard to
come up with a replacement message about a thing we're not doing
because of other currently proposed patches. The situation could
change, and it seemed to be a strange place to put this comment
anyway, far away from the relevant struct. Ok, let me try that
again... what do you think of this, as a new comment for BufferDesc,
next to the existing discussion of the 64 byte rule?
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -174,6 +174,10 @@ typedef struct buftag
* Be careful to avoid increasing the size of the struct when adding or
* reordering members. Keeping it below 64 bytes (the most common CPU
* cache line size) is fairly important for performance.
+ *
+ * Per-buffer I/O condition variables are kept outside this struct in a
+ * separate array. They could be moved in here and still fit under that
+ * limit on common systems, but for now that is not done.
*/
typedef struct BufferDesc
{
On Thu, Mar 11, 2021 at 03:54:06PM +1300, Thomas Munro wrote:
On Thu, Mar 11, 2021 at 3:27 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
- /*
-  * It would be nice to include the I/O locks in the BufferDesc, but that
-  * would increase the size of a BufferDesc to more than one cache line,
-  * and benchmarking has shown that keeping every BufferDesc aligned on a
-  * cache line boundary is important for performance.  So, instead, the
-  * array of I/O locks is allocated in a separate tranche.  Because those
-  * locks are not highly contended, we lay out the array with minimal
-  * padding.
-  */
- size = add_size(size, mul_size(NBuffers, sizeof(LWLockMinimallyPadded)));
+ /* size of I/O condition variables */
+ size = add_size(size, mul_size(NBuffers,
+                                sizeof(ConditionVariableMinimallyPadded)));

Should we keep for now some similar comment mentioning why we don't put the cv
in the BufferDesc even though it would currently fit the 64B target size?

I tried to write some words along those lines, but it seemed hard to
come up with a replacement message about a thing we're not doing
because of other currently proposed patches. The situation could
change, and it seemed to be a strange place to put this comment
anyway, far away from the relevant struct.
Yeah, I agree that it's not the best place to document the size consideration.
Ok, let me try that
again... what do you think of this, as a new comment for BufferDesc,
next to the existing discussion of the 64 byte rule?

--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -174,6 +174,10 @@ typedef struct buftag
  * Be careful to avoid increasing the size of the struct when adding or
  * reordering members.  Keeping it below 64 bytes (the most common CPU
  * cache line size) is fairly important for performance.
+ *
+ * Per-buffer I/O condition variables are kept outside this struct in a
+ * separate array.  They could be moved in here and still fit under that
+ * limit on common systems, but for now that is not done.
  */
 typedef struct BufferDesc
 {
I was mostly thinking about something like "leave room for now, as other
features could make better use of that space", but I'm definitely fine with this
comment!