[PATCH] Improve performance of NOTIFY over many databases (issue blocking on AccessExclusiveLock on object 0 of class 1262 of database 0)
Hoi hackers,
We've been having issues with NOTIFYs blocking over multiple databases
(see [1] for more details). That was 9.4 but we've updated the
database to 11.3 and still have the same issue. Now however we could
use perf to do profiling and got the following profile (useless
details elided):
--32.83%--ProcessClientReadInterrupt
--32.68%--ProcessNotifyInterrupt
--32.16%--asyncQueueReadAllNotifications
--23.37%--asyncQueueAdvanceTail
--20.49%--LWLockAcquire
--18.93%--LWLockQueueSelf
--12.99%--LWLockWaitListLock
(from: perf record -F 99 -ag -- sleep 600)
That shows that more than 20% of the time is spent in that single
function, waiting for an exclusive lock on the AsyncQueueLock. This
will block any concurrent session doing a NOTIFY in any database on
the system. This would certainly explain the symptoms we're seeing
(process xxx still waiting for AccessExclusiveLock on object 0 of
class 1262 of database 0).
Analysis of the code leads me to the following hypothesis (and hence
to the attached patches):
We have ~150 databases, each of which has 2 active backends with an
active LISTEN. When a NOTIFY happens anywhere on any database it
(under an exclusive lock) makes a list of 300 backends to send a
signal to. It then wakes up all of those backends. Each backend then
examines the message and all but one discards it as being for the
wrong database. Each backend then calls asyncQueueAdvanceTail (because
the current position of each backend was the tail) which then
takes an exclusive lock and checks all the other backends to see if
the tail can be advanced. All of these will conclude 'no', except the
very last one which concludes the tail can be advanced by about 50
bytes or so.
So the inner loop of asyncQueueAdvanceTail will, while holding a
global exclusive lock, execute 2*150*4000 (max backends) = 1.2 million
times for basically no benefit. During this time, no other transaction
anywhere in the system that does a NOTIFY will be able to commit.
The attached patches attempt to reduce the overhead in two ways:
Patch 1: Changes asyncQueueAdvanceTail to do nothing unless the
QUEUE_HEAD is on a different page than the QUEUE_TAIL. The idea is
that there's no point trying to advance the tail unless we can
actually usefully truncate the SLRU. This does however mean that
asyncQueueReadAllNotifications always has to call
asyncQueueAdvanceTail since it can no longer be guaranteed that any
backend is still at the tail, which is one of the assumptions of the
current code. Not sure if this is a problem or if it can be improved
without tracking much more state.
Patch 2: Changes SignalBackends to only notify other backends when (a)
they're the same database as me or (b) the notify queue has advanced
to a new SLRU page. This avoids backends being woken up for messages
which they are not interested in.
As a consequence of these changes, we can reduce the number of
exclusive locks and backend wake-ups in our case by a factor of 300.
You still however get a thundering herd at the end of each SLRU page.
Note: these patches have not yet been extensively tested, and so
should only be used as a basis for discussion.
Comments? Suggestions?
[1]: /messages/by-id/CADWG95t0j9zF0uwdcMH81KMnDsiTAVHxmBvgYqrRJcD-iLwQhw@mail.gmail.com
--
Martijn van Oosterhout <kleptog@gmail.com> http://svana.org/kleptog/
Attachments:
0001-Only-try-advancing-tail-pointer-when-it-s-useful.patch (text/x-patch)
From 17c241a2e307e70465c235248c17ac70d34fd175 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterhout@fox-it.com>
Date: Mon, 3 Jun 2019 17:13:31 +0200
Subject: [PATCH 1/2] Only try advancing tail pointer when it's useful.
Advancing the tail pointer requires an exclusive lock which can block
backends from other databases, so it's worth keeping these attempts to a
minimum.
---
src/backend/commands/async.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 738e6ec7e2..c1b0705234 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -89,11 +89,9 @@
* Inbound-notify processing consists of reading all of the notifications
* that have arrived since scanning last time. We read every notification
* until we reach either a notification from an uncommitted transaction or
- * the head pointer's position. Then we check if we were the laziest
- * backend: if our pointer is set to the same position as the global tail
- * pointer is set, then we move the global tail pointer ahead to where the
- * second-laziest backend is (in general, we take the MIN of the current
- * head position and all active backends' new tail pointers). Whenever we
+ * the head pointer's position. Then we check if we can advance the global
+ * tail pointer to a new page, if so we take the MIN of the current
+ * head position and all active backends' new tail pointers. Whenever we
* move the global tail pointer we also truncate now-unused pages (i.e.,
* delete files in pg_notify/ that are no longer used).
*
@@ -1749,7 +1747,6 @@ asyncQueueReadAllNotifications(void)
QueuePosition oldpos;
QueuePosition head;
Snapshot snapshot;
- bool advanceTail;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -1871,12 +1868,10 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(AsyncQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyBackendId) = pos;
- advanceTail = QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL);
LWLockRelease(AsyncQueueLock);
- /* If we were the laziest backend, try to advance the tail pointer */
- if (advanceTail)
- asyncQueueAdvanceTail();
+ /* Check if tail can be advanced */
+ asyncQueueAdvanceTail();
PG_RE_THROW();
}
@@ -1885,12 +1880,10 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(AsyncQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyBackendId) = pos;
- advanceTail = QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL);
LWLockRelease(AsyncQueueLock);
- /* If we were the laziest backend, try to advance the tail pointer */
- if (advanceTail)
- asyncQueueAdvanceTail();
+ /* Check if tail can be advanced */
+ asyncQueueAdvanceTail();
/* Done with snapshot */
UnregisterSnapshot(snapshot);
@@ -2010,6 +2003,14 @@ asyncQueueAdvanceTail(void)
int newtailpage;
int boundary;
+ /*
+ * Advancing the tail is expensive (it takes an exclusive lock which
+ * can block committing backends) so don't bother if we can't advance
+ * at least a page.
+ */
+ if (QUEUE_POS_PAGE(QUEUE_TAIL) == QUEUE_POS_PAGE(QUEUE_HEAD))
+ return;
+
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
min = QUEUE_HEAD;
for (i = 1; i <= MaxBackends; i++)
--
2.11.0
0002-Don-t-notify-other-backends-about-notifications-with.patch (text/x-patch)
From bfefa25b210993450d89381ad26b5b77a37091bc Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterhout@fox-it.com>
Date: Mon, 3 Jun 2019 17:10:38 +0200
Subject: [PATCH 2/2] Don't notify other backends about notifications without
cause.
There may be other databases with active listeners, but unless they need
to know (because it may be useful to advance the queue tail), there's no
point waking them up.
---
src/backend/commands/async.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index c1b0705234..bb4f58d0b9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1519,11 +1519,13 @@ SignalBackends(void)
int count;
int i;
int32 pid;
+ int notify_all = false;
/*
* Identify all backends that are listening and not already up-to-date. We
* don't want to send signals while holding the AsyncQueueLock, so we just
- * build a list of target PIDs.
+ * build a list of target PIDs. If we haven't moved to a new page there is
+ * no point notifying backends of other databases.
*
* XXX in principle these pallocs could fail, which would be bad. Maybe
* preallocate the arrays? But in practice this is only run in trivial
@@ -1532,6 +1534,7 @@ SignalBackends(void)
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
ids = (BackendId *) palloc(MaxBackends * sizeof(BackendId));
count = 0;
+ notify_all = QUEUE_POS_PAGE(QUEUE_HEAD) != QUEUE_POS_PAGE(QUEUE_TAIL);
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
for (i = 1; i <= MaxBackends; i++)
@@ -1539,6 +1542,9 @@ SignalBackends(void)
pid = QUEUE_BACKEND_PID(i);
if (pid != InvalidPid && pid != MyProcPid)
{
+ if (!notify_all && QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
QueuePosition pos = QUEUE_BACKEND_POS(i);
if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
--
2.11.0
Hoi hackers,
Please find attached updated versions of the patches, I've now tested
them. Also attached is a reproduction script to verify that they
actually work.
To test you need to create 150 databases as described in the script,
then simply execute it. Before patching you get the following results
(last figure is the CPU usage of Postgres):
1559749330 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 269.07%
1559749335 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 268.07%
1559749340 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 270.94%
After patching you get the following:
1559749840 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.02, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.09%
1559749845 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 5.06%
1559749850 Sent: 500, Recv: 1000, Delays: Min: 0.01, Max: 0.01, Avg: 0.01 [0.01/0.01/0.01/0.01/0.01], 4.75%
The async queue functions in postgres also no longer appear in the
perf output (below measuring threshold).
As for the general method, it seems like the actual optimisation here is
that the async queue tail pointer is only updated once per SLRU page
instead of every message. This would require a significantly larger
patch, but shouldn't be too difficult. Thoughts?
Have a nice day,
Martijn
On Tue, 4 Jun 2019 at 09:08, Martijn van Oosterhout <kleptog@gmail.com> wrote:
--
Martijn van Oosterhout <kleptog@gmail.com> http://svana.org/kleptog/
Attachments:
0001-Only-try-advancing-tail-pointer-when-it-s-useful.patch (text/x-patch)
From c0db16ec3bb971fb41b9e7f568b5cb168aa462ba Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterhout@fox-it.com>
Date: Mon, 3 Jun 2019 17:13:31 +0200
Subject: [PATCH 1/3] Only try advancing tail pointer when it's useful.
Advancing the tail pointer requires an exclusive lock which can block
backends from other databases, so it's worth keeping these attempts to a
minimum.
---
src/backend/commands/async.c | 29 +++++++++++++++--------------
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 738e6ec7e2..53e28fe777 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -89,11 +89,9 @@
* Inbound-notify processing consists of reading all of the notifications
* that have arrived since scanning last time. We read every notification
* until we reach either a notification from an uncommitted transaction or
- * the head pointer's position. Then we check if we were the laziest
- * backend: if our pointer is set to the same position as the global tail
- * pointer is set, then we move the global tail pointer ahead to where the
- * second-laziest backend is (in general, we take the MIN of the current
- * head position and all active backends' new tail pointers). Whenever we
+ * the head pointer's position. Then we check if we can advance the global
+ * tail pointer to a new page, if so we take the MIN of the current
+ * head position and all active backends' new tail pointers. Whenever we
* move the global tail pointer we also truncate now-unused pages (i.e.,
* delete files in pg_notify/ that are no longer used).
*
@@ -1749,7 +1747,6 @@ asyncQueueReadAllNotifications(void)
QueuePosition oldpos;
QueuePosition head;
Snapshot snapshot;
- bool advanceTail;
/* page_buffer must be adequately aligned, so use a union */
union
@@ -1871,12 +1868,10 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(AsyncQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyBackendId) = pos;
- advanceTail = QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL);
LWLockRelease(AsyncQueueLock);
- /* If we were the laziest backend, try to advance the tail pointer */
- if (advanceTail)
- asyncQueueAdvanceTail();
+ /* Check if tail can be advanced */
+ asyncQueueAdvanceTail();
PG_RE_THROW();
}
@@ -1885,12 +1880,10 @@ asyncQueueReadAllNotifications(void)
/* Update shared state */
LWLockAcquire(AsyncQueueLock, LW_SHARED);
QUEUE_BACKEND_POS(MyBackendId) = pos;
- advanceTail = QUEUE_POS_EQUAL(oldpos, QUEUE_TAIL);
LWLockRelease(AsyncQueueLock);
- /* If we were the laziest backend, try to advance the tail pointer */
- if (advanceTail)
- asyncQueueAdvanceTail();
+ /* Check if tail can be advanced */
+ asyncQueueAdvanceTail();
/* Done with snapshot */
UnregisterSnapshot(snapshot);
@@ -2010,6 +2003,14 @@ asyncQueueAdvanceTail(void)
int newtailpage;
int boundary;
+ /*
+ * Advancing the tail is expensive (it takes an exclusive lock which
+ * can block committing backends) so don't bother if we can't advance
+ * at least a page.
+ */
+ if (QUEUE_POS_PAGE(QUEUE_TAIL) == QUEUE_POS_PAGE(QUEUE_HEAD))
+ return;
+
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
min = QUEUE_HEAD;
for (i = 1; i <= MaxBackends; i++)
--
2.11.0
0003-Quickly-bail-if-queue-tail-can-t-be-moved.patch (text/x-patch)
From 67cc65b0c98d8282602b414d0ffaff158bf887f7 Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterhout@fox-it.com>
Date: Tue, 4 Jun 2019 16:14:31 +0200
Subject: [PATCH 3/3] Quickly bail if queue tail can't be moved.
---
src/backend/commands/async.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 18bd3e975e..f0a5a2472f 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -2022,7 +2022,11 @@ asyncQueueAdvanceTail(void)
for (i = 1; i <= MaxBackends; i++)
{
if (QUEUE_BACKEND_PID(i) != InvalidPid)
+ {
min = QUEUE_POS_MIN(min, QUEUE_BACKEND_POS(i));
+ if (QUEUE_POS_EQUAL(min, QUEUE_TAIL))
+ break;
+ }
}
oldtailpage = QUEUE_POS_PAGE(QUEUE_TAIL);
QUEUE_TAIL = min;
--
2.11.0
0002-Don-t-notify-other-backends-about-notifications-with.patch (text/x-patch)
From 385f3729beb9e1e4bcf77e1428925a76e3300fdd Mon Sep 17 00:00:00 2001
From: Martijn van Oosterhout <oosterhout@fox-it.com>
Date: Mon, 3 Jun 2019 17:10:38 +0200
Subject: [PATCH 2/3] Don't notify other backends about notifications without
cause.
There may be other databases with active listeners, but unless they need
to know (because it may be useful to advance the queue tail), there's no
point waking them up.
---
src/backend/commands/async.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 53e28fe777..18bd3e975e 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -1519,11 +1519,13 @@ SignalBackends(void)
int count;
int i;
int32 pid;
+ int notify_all = false;
/*
* Identify all backends that are listening and not already up-to-date. We
* don't want to send signals while holding the AsyncQueueLock, so we just
- * build a list of target PIDs.
+ * build a list of target PIDs. If we haven't moved to a new page there is
+ * no point notifying backends of other databases.
*
* XXX in principle these pallocs could fail, which would be bad. Maybe
* preallocate the arrays? But in practice this is only run in trivial
@@ -1532,6 +1534,7 @@ SignalBackends(void)
pids = (int32 *) palloc(MaxBackends * sizeof(int32));
ids = (BackendId *) palloc(MaxBackends * sizeof(BackendId));
count = 0;
+ notify_all = QUEUE_POS_PAGE(QUEUE_HEAD) != QUEUE_POS_PAGE(QUEUE_TAIL);
LWLockAcquire(AsyncQueueLock, LW_EXCLUSIVE);
for (i = 1; i <= MaxBackends; i++)
@@ -1541,6 +1544,9 @@ SignalBackends(void)
{
QueuePosition pos = QUEUE_BACKEND_POS(i);
+ if (!notify_all && QUEUE_BACKEND_DBOID(i) != MyDatabaseId)
+ continue;
+
if (!QUEUE_POS_EQUAL(pos, QUEUE_HEAD))
{
pids[count] = pid;
--
2.11.0
Martijn van Oosterhout <kleptog@gmail.com> writes:
Please find attached updated versions of the patches, I've now tested
them. Also attached is a reproduction script to verify that they
actually work.
I looked through these (a bit cursorily).
I'm generally on board with the idea of 0001, but not with the patch
details. As coded, asyncQueueAdvanceTail is supposing that it can
examine the shared QUEUE_HEAD and QUEUE_TAIL pointers without any
lock whatsoever. That's probably unsafe, and if it is safe for some
reason, you haven't made the argument why. Moreover, it seems
unnecessary to make any such assumption. Why not put back the
advanceTail tests you removed, but adjust them so that advanceTail
isn't set true unless QUEUE_HEAD and QUEUE_TAIL point to different
pages? (Note that in the existing coding, those tests are made
while holding an appropriate lock, so it's safe to look at those
pointers there.)
It might be a good idea to make a macro encapsulating this new,
more complicated rule for setting advanceTail, instead of relying
on keeping the various call sites in sync.
More attention to comments is also needed. For instance, you've
made a lie out of the documentation of the tail pointer:
QueuePosition tail; /* the global tail is equivalent to the pos of
* the "slowest" backend */
It needs to say something like "is <= the pos of the slowest backend",
instead. I think the explanation of why this algorithm is good could
use more effort, too.
Comments for 0002 are about the same: for no explained reason, and
certainly no savings, you've put the notify_all test in an unsafe
place rather than a safe one (viz, two lines down, *after* taking
the relevant lock). And 0002 needs more commentary about why
its optimization is safe and useful, too. In particular it's
not obvious why QUEUE_HEAD being on a different page from QUEUE_TAIL
has anything to do with whether we should wake up other backends.
I'm not very persuaded by 0003, mainly because it seems likely to
me that 0001 and 0002 will greatly reduce the possibility that
the early-exit can happen. So it seems like it's adding cycles
(in a spot where we hold exclusive lock) without a good chance of
saving any cycles.
Taking a step back in hopes of seeing the bigger picture ...
as you already noted, these changes don't really fix the "thundering
herd of wakeups" problem, they just arrange for it to happen
once per SLRU page rather than once per message. I wonder if we
could improve matters by stealing an idea from the sinval code:
when we're trying to cause advance of the global QUEUE_TAIL, waken
only the slowest backend, and have it waken the next-slowest after
it's done. In sinval there are some additional provisions to prevent
a nonresponsive backend from delaying matters for other backends,
but I think maybe we don't need that here. async.c doesn't have
anything equivalent to sinval reset, so there's no chance of
overruling a slow backend's failure to advance its pos pointer,
so there's not much reason not to just wait till it does do so.
A related idea is to awaken only one backend at a time when we
send a new message (i.e., advance QUEUE_HEAD) but I think that
would likely be bad. The hazard with the chained-wakeups method
is that a slow backend blocks wakeup of everything else. We don't
care about that hugely for QUEUE_TAIL advance, because we're just
hoping to free some SLRU space. But for QUEUE_HEAD advance it'd
mean failing to deliver notifies in a timely way, which we do care
about. (Also, if I remember correctly, the processing on that side
only requires shared lock so it's less of a problem if many backends
do it at once.)
regards, tom lane
Hoi Tom,
Thank you for the detailed response. Sorry for the delay, I was on holiday.
You are absolutely correct when you point out that the queue pointers
were accessed without the lock and this is probably unsafe. For the
first two patches this can be remedied, though I find it makes the
logic a bit harder to follow. The comments will need to be updated to
reflect the new logic. I hope to post something soon.
As for your point about the third patch, you are right that it's
probably not saving many cycles. However I do think it's worthwhile
actually optimising this loop, because the number of backends that are
listening is likely to be much smaller than the total number of
backends, so there are a lot of cycles being wasted here already. Fixing
the thundering herd issue (like in sinval as you point out) doesn't
actually reduce the amount of work being done, just spreads it out.
Since readers and writers block each other, blocking a writer means
blocking commits across the whole cluster.
There are a number of possible improvements here:
1. Do what sinval does and separate the reader and writer locks so
they can't block each other. This is the ultimate solution, but it's a
significant refactor and it's not clear that's actually worthwhile
here. This would almost be adopting the sinvaladt structure wholesale.
2. Add a field to AsyncQueueEntry which points to the next listening
backend. This would allow the loops over all listening backends to
complete much faster, especially in the normal case where there are
not many listeners relative to the number of backends. The downside is
this requires an exclusive lock to remove listeners, but that doesn't
seem a big problem.
3. The other idea from sinval where you only wake up one worker at a
time is a good one as you point out. This seems quite doable, however,
it seems wasteful to try and wake everyone up the moment we switch to
a new page. The longer you delay, the lower the chance you need to wake
anyone at all, because they'll have caught up by themselves. A single
SLRU page can hold hundreds, or even thousands of
messages.
Do 2 & 3 seem like a good direction to go? I can probably work something up.
Thanks in advance,
Martijn
--
Martijn van Oosterhout <kleptog@gmail.com> http://svana.org/kleptog/
Martijn van Oosterhout <kleptog@gmail.com> writes:
There are a number of possible improvements here:
1. Do what sinval does and separate the reader and writer locks so
they can't block each other. This is the ultimate solution, but it's a
significant refactor and it's not clear that's actually worthwhile
here. This would almost be adopting the sinvaladt structure wholesale.
I agree that that's probably more ambitious than is warranted.
2. Add a field to AsyncQueueEntry which points to the next listening
backend. This would allow the loops over all listening backends to
complete much faster, especially in the normal case where there are
not many listeners relative to the number of backends. The downside is
this requires an exclusive lock to remove listeners, but that doesn't
seem a big problem.
I don't understand how that would work? The sending backend doesn't
know what the "next listening backend" is. Having to scan the whole
queue when a listener unlistens seems pretty awful too, especially
if you need exclusive lock while doing so.
3. The other idea from sinval where you only wake up one worker at a
time is a good one as you point out. This seems quite doable, however,
it seems wasteful to try and wake everyone up the moment we switch to
a new page. The longer you delay, the lower the chance you need to wake
anyone at all, because they'll have caught up by themselves. A single
SLRU page can hold hundreds, or even thousands of
messages.
Not entirely following your comment here either. The point of the change
is exactly that we'd wake up only one backend at a time (and only the
furthest-behind one, so that anyone who catches up of their own accord
stops being a factor). Also, "hundreds or thousands" seems
over-optimistic given that the minimum size of AsyncQueueEntry is 20
bytes --- in practice it'll be more because people don't use empty
strings as notify channel names. I think a few hundred messages per
page is the upper limit, and it could be a lot less.
Do 2 & 3 seem like a good direction to go? I can probably work something up.
I'm on board with 3, obviously. Not following what you have in mind
for 2.
regards, tom lane
On Tue, 23 Jul 2019 at 19:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Martijn van Oosterhout <kleptog@gmail.com> writes:
2. Add a field to AsyncQueueEntry which points to the next listening
backend. This would allow the loops over all listening backends to
complete much faster, especially in the normal case where there are
not many listeners relative to the number of backends. The downside is
this requires an exclusive lock to remove listeners, but that doesn't
seem a big problem.
I don't understand how that would work? The sending backend doesn't
know what the "next listening backend" is. Having to scan the whole
queue when a listener unlistens seems pretty awful too, especially
if you need exclusive lock while doing so.
I mean tracking the listening backends specifically, so you can
replace the loops:
for (i=0; i < MaxBackends; i++)
with
for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = QUEUE_NEXT_LISTENING_BACKEND(i))
Such loops occur often when trying to advance the tail, when adding a
new listener, when sending a notify, etc., all while holding an
(exclusive) lock. Seems like such an easy win to only loop over the
listening backends rather than all of them.
Do 2 & 3 seem like a good direction to go? I can probably work something up.
I'm on board with 3, obviously. Not following what you have in mind
for 2.
Hope this clears it up a bit. Only waking up one at a time is a good
idea, but needs some careful thinking to prove it actually works.
Have a nice day,
--
Martijn van Oosterhout <kleptog@gmail.com> http://svana.org/kleptog/
Martijn van Oosterhout <kleptog@gmail.com> writes:
On Tue, 23 Jul 2019 at 19:21, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Martijn van Oosterhout <kleptog@gmail.com> writes:
2. Add a field to AsyncQueueEntry which points to the next listening
backend. This would allow the loops over all listening backends to
complete much faster, especially in the normal case where there are
not many listeners relative to the number of backends. The downside is
this requires an exclusive lock to remove listeners, but that doesn't
seem a big problem.
I don't understand how that would work? The sending backend doesn't
know what the "next listening backend" is. Having to scan the whole
queue when a listener unlistens seems pretty awful too, especially
if you need exclusive lock while doing so.
I mean tracking the listening backends specifically, so you can
replace the loops:
for (i=0; i < MaxBackends; i++)
with
for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = QUEUE_NEXT_LISTENING_BACKEND(i))
Ah ... but surely you would not put such info in AsyncQueueEntry,
where there'd be a separate copy for each message. I think you
meant to add the info to AsyncQueueControl.
It might be better to redefine the backends[] array as being mostly
contiguous (ie, a new backend takes the first free slot not the one
indexed by its own BackendId), at the price of needing to store
BackendId in each slot explicitly instead of assuming it's equal to
the array index. I suspect the existing data structure is putting too
much of a premium on making sizeof(QueueBackendStatus) a power of 2.
regards, tom lane
On Tue, 23 Jul 2019 at 23:32, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Martijn van Oosterhout <kleptog@gmail.com> writes:
I mean tracking the listening backends specifically, so you can
replace the loops:
for (i=0; i < MaxBackends; i++)
with
for (i=QUEUE_FIRST_LISTENING_BACKEND; i; i = QUEUE_NEXT_LISTENING_BACKEND(i))
Ah ... but surely you would not put such info in AsyncQueueEntry,
where there'd be a separate copy for each message. I think you
meant to add the info to AsyncQueueControl.
Umm, yeah. Got that mixed up.
It might be better to redefine the backends[] array as being mostly
contiguous (ie, a new backend takes the first free slot not the one
indexed by its own BackendId), at the price of needing to store
BackendId in each slot explicitly instead of assuming it's equal to
the array index. I suspect the existing data structure is putting too
much of a premium on making sizeof(QueueBackendStatus) a power of 2.
This would require adding a "MyListenerId" to each backend, which I'm
not sure helps the readability. And there's a chance of mixing the id
up. The power-of-2-ness is, I think, indeed overrated.
I'll give it a shot and see how it looks.
Have a nice day,
--
Martijn van Oosterhout <kleptog@gmail.com> http://svana.org/kleptog/
On Wed, Jul 24, 2019 at 8:30 PM Martijn van Oosterhout
<kleptog@gmail.com> wrote:
I'll give it a shot and see how it looks.
Moved to September CF, "Waiting on Author".
--
Thomas Munro
https://enterprisedb.com