subscription worker signalling wal writer too much
If I publish a pgbench workload and subscribe to it, the subscription
worker is signalling the wal writer thousands of times a second, once for
every async commit. This has a noticeable performance cost.
I don't think it is ever necessary to signal the wal writer here, unless
the wal writer is taking the long sleep for power-saving purposes. If up to
wal_writer_delay's worth of "committed" transactions get lost on a crash, it
doesn't really matter, because they will be replayed again once replication
resumes. However, might delaying the advance of hot_standby_feedback by
up to wal_writer_delay be a problem? I would think any set-up which depends
on the standby never being a few dozen milliseconds behind is already doomed.
If I want to suppress signalling, what would be the right way to
communicate to XLogSetAsyncXactLSN that it is being called in a
subscription worker?
Another approach would be to make XLogSetAsyncXactLSN signalling less
aggressive for everyone, not just subscription workers. There is no point
in signalling it more than once for a given WriteRqstPtr page boundary. So
each backend could locally remember the last boundary for which it
signalled wal writer, and not signal again for the same boundary. This
would be especially effective for a subscription worker, as it should be
pretty common for almost all the async commits to be coming from the same
process. Or, the last boundary could be kept in shared memory (XLogCtl) so
all backends can share it, at the expense of some additional work done under
a spinlock.
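For illustration, a minimal sketch of what the shared-memory variant might
look like (the field name lastWakeupRqst is made up for this sketch, and the
real XLogCtl layout and locking would need more care than shown here):

    /* hypothetical field added to XLogCtlData; not actual PostgreSQL code */
    XLogRecPtr  lastWakeupRqst; /* last page boundary anyone signalled for */

    /* sketch of the extra check in XLogSetAsyncXactLSN (locking elided) */
    WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ; /* back off to page boundary */
    if (WriteRqstPtr <= XLogCtl->lastWakeupRqst)
        return;     /* some backend already signalled for this boundary */
    XLogCtl->lastWakeupRqst = WriteRqstPtr;
    /* ... otherwise fall through and wake the wal writer as before ... */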
Other ideas?
Thanks,
Jeff
On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
If I publish a pgbench workload and subscribe to it, the subscription
worker is signalling the wal writer thousands of times a second, once for
every async commit. This has a noticeable performance cost.
I've used a local variable to avoid waking up the wal writer more than once
for the same page boundary. This reduces the number of wake-ups by about
7/8.
I'm testing it by doing 1e6 transactions over 8 clients while replication
is in effect, then waiting for the logical replica to catch up. This cycle
takes 183.1 seconds in HEAD, and 162.4 seconds with the attached patch.
N=14, p-value for difference of the means 6e-17.
If I suppress all wake-ups just to see what would happen, it further
reduces the runtime to 153.7 seconds.
Cheers,
Jeff
Attachments:
wake_wal_writer_less_aggressively.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index 70d2570..f5c6f9a
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** XLogSetAsyncXactLSN(XLogRecPtr asyncXact
*** 2609,2614 ****
--- 2609,2615 ----
{
XLogRecPtr WriteRqstPtr = asyncXactLSN;
bool sleeping;
+ static XLogRecPtr last_woke = 0;
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
*************** XLogSetAsyncXactLSN(XLogRecPtr asyncXact
*** 2630,2635 ****
--- 2631,2642 ----
/* if we have already flushed that far, we're done */
if (WriteRqstPtr <= LogwrtResult.Flush)
return;
+
+ /* If we have already woken up the WALWriter for this boundary, we're done */
+ if (WriteRqstPtr <= last_woke)
+ return;
+
+ last_woke=WriteRqstPtr;
}
/*
On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
If I publish a pgbench workload and subscribe to it, the subscription
worker is signalling the wal writer thousands of times a second, once for
every async commit. This has a noticeable performance cost.
I've used a local variable to avoid waking up the wal writer more than once
for the same page boundary. This reduces the number of wake-ups by about
7/8.
Maybe I'm missing something here, but isn't that going to reduce our
guarantees about when asynchronously committed xacts are flushed out?
You can easily fit a number of commits into the same page... As this
isn't specific to logical-rep, I don't think that's ok.
Have you chased down why there's that many wakeups? Normally I'd have
expected that a number of the SetLatch() calls get consolidated
together, but I guess walwriter is "too quick" in waking up and
resetting the latch?
Greetings,
Andres Freund
On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com>
wrote:
If I publish a pgbench workload and subscribe to it, the subscription
worker is signalling the wal writer thousands of times a second, once for
every async commit. This has a noticeable performance cost.
I've used a local variable to avoid waking up the wal writer more than
once
for the same page boundary. This reduces the number of wake-ups by about
7/8.
Maybe I'm missing something here, but isn't that going to reduce our
guarantees about when asynchronously committed xacts are flushed out?
You can easily fit a number of commits into the same page... As this
isn't specific to logical-rep, I don't think that's ok.
The guarantee is based on wal_writer_delay, not on SIGUSR1, so I don't think
this changes that. (Also, it isn't really a guarantee; the fsync can take
many seconds to complete once we do initiate it, and there is absolutely
nothing we can do about that, other than do the fsync synchronously in the
first place.)
The reason for kicking the wal writer at page boundaries is so that hint
bits can get set earlier than they otherwise could. But I don't think
kicking it multiple times per page boundary can help in that effort.
Have you chased down why there's that many wakeups? Normally I'd have
expected that a number of the SetLatch() calls get consolidated
together, but I guess walwriter is "too quick" in waking up and
resetting the latch?
I'll have to dig into that some more. The 7/8 reduction I cited was just
in calls to SetLatch from that part of the code; I didn't measure whether
the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
determined that reduction, so it wasn't truly wake-ups I measured. Actual
wake-ups were measured only indirectly via the impact on performance. I'll
need to figure out how to instrument that without distorting the
performance too much in the process.
Cheers,
Jeff
On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com>
wrote:
If I publish a pgbench workload and subscribe to it, the subscription
worker is signalling the wal writer thousands of times a second, once for
every async commit. This has a noticeable performance cost.
I've used a local variable to avoid waking up the wal writer more than
once
for the same page boundary. This reduces the number of wake-ups by about
7/8.
Maybe I'm missing something here, but isn't that going to reduce our
guarantees about when asynchronously committed xacts are flushed out?
You can easily fit a number of commits into the same page... As this
isn't specific to logical-rep, I don't think that's ok.
The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't think
this changes that. (Also, it isn't really a guarantee, the fsync can take
many seconds to complete once we do initiate it, and there is absolutely
nothing we can do about that, other than do the fsync synchronously in the
first place).
Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
this afaics would allow wal writer to go into sleep mode with half a
page filled, and it'd not be woken up again until the page is filled.
No?
Have you chased down why there's that many wakeups? Normally I'd have
expected that a number of the SetLatch() calls get consolidated
together, but I guess walwriter is "too quick" in waking up and
resetting the latch?
I'll have to dig into that some more. The 7/8 reduction I cited was just
in calls to SetLatch from that part of the code, I didn't measure whether
the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
determined that reduction, so it wasn't truly wake ups I measured. Actual
wake ups were measured only indirectly via the impact on performance. I'll
need to figure out how to instrument that without distorting the
performance too much in the process..
I'd suspect that just measuring the number of kill() calls should be
doable, if measured via perf or something like that.
Greetings,
Andres Freund
On Wed, Jun 14, 2017 at 4:29 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-14 16:24:27 -0700, Jeff Janes wrote:
On Wed, Jun 14, 2017 at 3:20 PM, Andres Freund <andres@anarazel.de>
wrote:
On 2017-06-14 15:08:49 -0700, Jeff Janes wrote:
On Wed, Jun 14, 2017 at 11:55 AM, Jeff Janes <jeff.janes@gmail.com>
wrote:
If I publish a pgbench workload and subscribe to it, the subscription
worker is signalling the wal writer thousands of times a second, once for
every async commit. This has a noticeable performance cost.
I've used a local variable to avoid waking up the wal writer more than once
for the same page boundary. This reduces the number of wake-ups by about
7/8.
Maybe I'm missing something here, but isn't that going to reduce our
guarantees about when asynchronously committed xacts are flushed out?
You can easily fit a number of commits into the same page... As this
isn't specific to logical-rep, I don't think that's ok.
The guarantee is based on wal_writer_delay not on SIGUSR1, so I don't think
this changes that. (Also, it isn't really a guarantee, the fsync can take
many seconds to complete once we do initiate it, and there is absolutely
nothing we can do about that, other than do the fsync synchronously in the
first place).
Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
this afaics would allow wal writer to go into sleep mode with half a
page filled, and it'd not be woken up again until the page is filled.
No?
If it is taking the big sleep, then we always wake it up, since
acd4c7d58baf09f.
I didn't change that part. I only changed what we do if it is not hibernating.
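For reference, the tail end of XLogSetAsyncXactLSN looks roughly like this
(paraphrased from the 9.6-era xlog.c and the hunks quoted in this thread, not
verbatim); the "sleeping" flag is what separates the hibernation case from the
normal one:

    /* paraphrase of XLogSetAsyncXactLSN(), not verbatim */
    if (!sleeping)
    {
        /* back off to last completed page boundary */
        WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;

        /* if we have already flushed that far, we're done */
        if (WriteRqstPtr <= LogwrtResult.Flush)
            return;
    }

    /* nudge walwriter: full page to write, or bring it out of hibernation */
    if (ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);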
Have you chased down why there's that many wakeups? Normally I'd have
expected that a number of the SetLatch() calls get consolidated
together, but I guess walwriter is "too quick" in waking up and
resetting the latch?
I'll have to dig into that some more. The 7/8 reduction I cited was just
in calls to SetLatch from that part of the code, I didn't measure whether
the SetLatch actually called kill(owner_pid, SIGUSR1) or not when I
determined that reduction, so it wasn't truly wake ups I measured. Actual
wake ups were measured only indirectly via the impact on performance. I'll
need to figure out how to instrument that without distorting the
performance too much in the process.
I'd suspect that just measuring the number of kill() calls should be
doable, if measured via perf or something like that.
It looks like only limited consolidation was happening; the number of kills
invoked was more than half of the number of SetLatch calls. I think the
reason is that when kill(owner_pid, SIGUSR1) is called, the kernel
immediately suspends the calling process and gives the signalled process a
chance to run in that time slice. The WAL Writer gets woken up, sees that it
only has a partial page to write, and decides not to do anything, but resets
the latch. It does this fast enough that the subscription worker hasn't
migrated CPUs yet.
But, if the WAL Writer sees it has already flushed up to the highest
eligible page boundary, why doesn't the SetAsync code see the same thing?
Because it isn't flushed that high, but only written that high. It looks
like a story of three git commits:
commit 4de82f7d7c50a81ec8e70e2cb0ab413ab9134c0b
Author: Simon Riggs <simon@2ndQuadrant.com>
Date: Sun Nov 13 09:00:57 2011 +0000
Wakeup WALWriter as needed for asynchronous commit performance.
commit db76b1efbbab2441428a9ef21f7ac9ba43c52482
Author: Andres Freund <andres@anarazel.de>
Date: Mon Feb 15 22:15:35 2016 +0100
Allow SetHintBits() to succeed if the buffer's LSN is new enough.
commit 7975c5e0a992ae9a45e03d145e0d37e2b5a707f5
Author: Andres Freund <andres@anarazel.de>
Date: Mon Feb 15 23:52:38 2016 +0100
Allow the WAL writer to flush WAL at a reduced rate.
The first change made the WALWriter wake up and do a write and flush
whenever an async commit occurred and there was a completed WAL page to be
written. This way the hint bits could be set while the heap page was still
hot, because they couldn't get set until the WAL covering the hinted-at
transaction commit is flushed.
The second change said we can set hint bits before the WAL covering the
hinted-at transaction is flushed, as long as the page LSN is newer than that
(so the WAL covering the hinted-at transaction commit must be flushed
before the page containing the hint bit can be written).
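In rough pseudocode, and as a sketch rather than the actual source, the rule
from that second commit amounts to:

    /* sketch of the post-db76b1efbbab24 hint-bit rule; not the real code */
    if (XLogNeedsFlush(commitLSN) &&
        BufferGetLSNAtomic(buffer) < commitLSN)
        return;     /* commit not flushed and no LSN interlock: don't hint yet */
    /* otherwise set the hint bit and mark the buffer dirty for hints */

That is, the hint bit may be set either once the commit record is flushed, or
once the page's own LSN guarantees the commit record will be flushed first.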
Then the third change makes the wal writer only write the full pages most
of the times it is woken up, not flush them. It only flushes them once
every wal_writer_delay or wal_writer_flush_after, regardless of how often
it is woken up.
It seems like the third change rendered the first one useless. Wouldn't it
be better, and much simpler, just to have reverted the first change once the
second one was done? If that were done, would the third change still be
needed? (It did seem to add some other features as well, so I'm going to
assume we still want those...).
But now the first change is even worse than useless, it is positively
harmful. The only thing to stop it from waking the WALWriter for every
async commit is this line:
/* if we have already flushed that far, we're done */
if (WriteRqstPtr <= LogwrtResult.Flush)
return;
Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
become false due to us waking walwriter; it only becomes false once the
timeout expires (which it would have done anyway with no help from us), or
once wal_writer_flush_after is exceeded. So now every async commit is
waking the walwriter. Most of those wake-ups do nothing (there is not a
completely new patch to write), some write one completed page but don't
flush anything, and very few do a flush, and that one would have been done
anyway.
So for the sake of wal_writer_flush_after, I guess the simplest thing to do
would be to repeat the flushbytes calculation and wake the walwriter
only if it looks like that will be exceeded? The attached patch does that.
Maybe the computation of flushbytes should be put into a macro, rather than
repeated in the code? Or even done on the fly; I don't see any reason it
needs to be stored in a variable before testing.
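To make that concrete, a hypothetical helper (the name is invented here, and
it is not part of the attached patch) could look like:

    /* hypothetical helper; callers compare the result to WalWriterFlushAfter */
    #define WalPagesBetween(upto, from) \
        ((int) ((upto) / XLOG_BLCKSZ - (from) / XLOG_BLCKSZ))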
This new patch is simpler than the previous one, and more effective at
speeding up replication. I assume it would speed up pgbench with
synchronous_commit turned off (or against unlogged tables) as well, but I
don't think I have the hardware needed to test that.
Cheers,
Jeff
Attachments:
change_walwriter_wakeup.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index 70d2570..33d0123
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** XLogSetAsyncXactLSN(XLogRecPtr asyncXact
*** 2624,2634 ****
*/
if (!sleeping)
{
/* back off to last completed page boundary */
WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
! /* if we have already flushed that far, we're done */
! if (WriteRqstPtr <= LogwrtResult.Flush)
return;
}
--- 2624,2638 ----
*/
if (!sleeping)
{
+ int flushbytes;
/* back off to last completed page boundary */
WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
! flushbytes =
! WriteRqstPtr / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
!
! /* if we don't have enough to flush, we're done */
! if (flushbytes < WalWriterFlushAfter)
return;
}
On Thu, Jun 15, 2017 at 3:06 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
This new patch is simpler than the previous one, and more effective at
speeding up replication. I assume it would speed up pgbench with
synchronous_commit turned off (or against unlogged tables) as well, but I
don't think I have the hardware needed to test that.
If I use the 'tpcb-func' script embodied in the attached patch to pgbench,
then I can see the performance difference against unlogged tables using 8
clients on an 8-CPU virtual machine. The normal tpcb-like script has too
much communication overhead, bouncing from pgbench to the postgres backend
7 times per transaction, to see the difference. I also had to
set autovacuum_vacuum_cost_delay=0, otherwise auto-analyze holds a
snapshot long enough to bloat the HOT chains, which injects a great deal of
variability into the timings.
Commit 7975c5e0a992ae9 in the 9.6 branch causes a regression of about 10%,
and my patch from the previous email redeems that regression. It also
gives the same improvement against 10dev HEAD.
Cheers,
Jeff
Attachments:
pgbench_function.patch (application/octet-stream)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
new file mode 100644
index ae36247..d879f11
*** a/src/bin/pgbench/pgbench.c
--- b/src/bin/pgbench/pgbench.c
*************** static const BuiltinScript builtin_scrip
*** 425,430 ****
--- 425,439 ----
"END;\n"
},
{
+ "tpcb-func",
+ "<builtin: TPC-B (sort of) in PL/pgSQL>",
+ "\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
+ "\\set bid random(1, " CppAsString2(nbranches) " * :scale)\n"
+ "\\set tid random(1, " CppAsString2(ntellers) " * :scale)\n"
+ "\\set delta random(-5000, 5000)\n"
+ "select * from pgbench_transaction(:aid, :bid, :tid, :delta);\n"
+ },
+ {
"simple-update",
"<builtin: simple update>",
"\\set aid random(1, " CppAsString2(naccounts) " * :scale)\n"
*************** init(bool is_no_vacuum)
*** 2688,2693 ****
--- 2697,2716 ----
executeStatement(con, buffer);
}
+ executeStatement(con,
+ "create or replace function pgbench_transaction(arg_aid int, arg_bid int, arg_tid int, arg_delta int) returns int as $$"
+ "DECLARE\n"
+ "abal int;\n"
+ "BEGIN\n"
+ "UPDATE pgbench_accounts SET abalance = abalance + arg_delta WHERE aid = arg_aid;\n"
+ "SELECT abalance into abal FROM pgbench_accounts WHERE aid = arg_aid;\n"
+ "UPDATE pgbench_tellers SET tbalance = tbalance + arg_delta WHERE tid = arg_tid;\n"
+ "UPDATE pgbench_branches SET bbalance = bbalance + arg_delta WHERE bid = arg_bid;\n"
+ "INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (arg_tid, arg_bid, arg_aid, arg_delta, CURRENT_TIMESTAMP);\n"
+ "RETURN abal;\n"
+ "END;\n"
+ "$$ language plpgsql");
+
executeStatement(con, "begin");
for (i = 0; i < nbranches * scale; i++)
Hi,
On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
this afaics would allow wal writer to go into sleep mode with half a
page filled, and it'd not be woken up again until the page is filled.
No?
If it is taking the big sleep, then we always wake it up, since
acd4c7d58baf09f. I didn't change that part. I only changed what we do if it
is not hibernating.
Right, I hadn't read enough of the context.
It looks like only limited consolidation was happening, the number of kills
invoked was more than half of the number of SetLatch. I think the reason
is that when kill(owner_pid, SIGUSR1) is called, the kernel immediately
suspends the calling process and gives the signalled process a chance to
run in that time slice. The Wal Writer gets woken up, sees that it only
has a partial page to write and decides not to do anything, but resets the
latch. It does this fast enough that the subscription worker hasn't
migrated CPUs yet.
I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless. In cases where walwriter is busy, that
actually causes a lot of superfluous interrupted syscalls, because it
actually never ends up waiting on the latch. There's also a lot of
superfluous context switches. I think we can avoid doing so quite
easily, as e.g. with the attached patch. Could you check how much that
helps your benchmark?
The first change made the WALWriter wake up and do a write and flush
whenever an async commit occurred and there was a completed WAL page to be
written. This way the hint bits could be set while the heap page was still
hot, because they couldn't get set until the WAL covering the hinted-at
transaction commit is flushed.
The second change said we can set hint bits before the WAL covering the
hinted-at transaction is flushed, as long as the page LSN is newer than that
(so the wal covering the hinted-at transaction commit must be flushed
before the page containing the hint bit can be written).
Then the third change makes the wal writer only write the full pages most
of the times it is woken up, not flush them. It only flushes them once
every wal_writer_delay or wal_writer_flush_after, regardless of how often
it is woken up.
It seems like the third change rendered the first one useless.
I don't think so. Isn't the walwriter writing out the contents of the
WAL quite important, because it makes room in wal_buffers for new WAL?
I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing. Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work. I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
if (ProcGlobal->walwriterLatch)
SetLatch(ProcGlobal->walwriterLatch);
Wouldn't it be better, and much simpler, just to have reverted the first
change once the second one was done?
I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit. In fact
you'll always hit that case unless there are concurrent writes also
touching said page.
If that were done, would the third change still be
needed? (It did seem to add some other features as well, so I'm going to
assume we still want those...).
It's still useful, yes. It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.
But now the first change is even worse than useless, it is positively
harmful. The only thing to stop it from waking the WALWriter for every
async commit is this line:
/* if we have already flushed that far, we're done */
if (WriteRqstPtr <= LogwrtResult.Flush)
    return;
Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
become false due to us waking walwriter; it only becomes false once the
timeout expires (which it would have done anyway with no help from us), or
once wal_writer_flush_after is exceeded. So now every async commit is
waking the walwriter. Most of those wake-ups do nothing (there is not a
completely new patch to write), some write one completed page but don't
flush anything, and very few do a flush, and that one would have been done
anyway.
s/completely new patch/completely new page/?
In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic. Have to think about
this some more...
Greetings,
Andres Freund
Attachments:
0001-Don-t-signal-latch-owners-if-owner-not-waiting-on-la.patch (text/x-patch; charset=us-ascii)
From 50a3e62b16b752837f5a388e3259fee2bca6c2ab Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Sat, 24 Jun 2017 16:46:11 -0700
Subject: [PATCH 1/3] Don't signal latch owners if owner not waiting on latch.
Doing so can generate a lot of useless syscalls and context switches
in cases where the owner of the latch is busy doing actual work.
Discussion: https://postgr.es/m/CAMkU=1z6_K4mE12vNirgV5qnu58+cZQdNxD+Pb5bzxnvOVVWoA@mail.gmail.com
---
src/backend/storage/ipc/latch.c | 62 ++++++++++++++++++++++++++++++++++-------
src/include/storage/latch.h | 1 +
2 files changed, 53 insertions(+), 10 deletions(-)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364de8..c1a1a76d0a 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -221,6 +221,7 @@ InitLatch(volatile Latch *latch)
{
latch->is_set = false;
latch->owner_pid = MyProcPid;
+ latch->owner_waiting = false;
latch->is_shared = false;
#ifndef WIN32
@@ -268,6 +269,7 @@ InitSharedLatch(volatile Latch *latch)
latch->is_set = false;
latch->owner_pid = 0;
+ latch->owner_waiting = false;
latch->is_shared = true;
}
@@ -311,6 +313,7 @@ DisownLatch(volatile Latch *latch)
Assert(latch->owner_pid == MyProcPid);
latch->owner_pid = 0;
+ latch->owner_waiting = false; /* paranoia */
}
/*
@@ -466,7 +469,17 @@ SetLatch(volatile Latch *latch)
sendSelfPipeByte();
}
else
- kill(owner_pid, SIGUSR1);
+ {
+ /*
+ * If the owner of the latch is currently waiting on it, send signal
+ * to wake up that process. The read-barrier is necessary so we see
+ * an up-to-date value, and it pairs with the memory barrier in the
+ * path setting owner_waiting in WaitEventSetWait.
+ */
+ pg_read_barrier();
+ if (latch->owner_waiting)
+ kill(owner_pid, SIGUSR1);
+ }
#else
/*
@@ -954,6 +967,12 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
* immediately, avoid blocking again. We don't attempt to report any
* other events that might also be satisfied.
*
+ * To avoid other processes signalling us if we're not waiting,
+ * wasting context switches, we first check whether the latch is
+ * already set, and only enable signalling from other processes after
+ * that. To avoid a race, we've to recheck whether the latch is set
+ * after asking to be woken up.
+ *
* If someone sets the latch between this and the
* WaitEventSetWaitBlock() below, the setter will write a byte to the
* pipe (or signal us and the signal handler will do that), and the
@@ -976,17 +995,37 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
* ordering, so that we cannot miss seeing is_set if a notification
* has already been queued.
*/
- if (set->latch && set->latch->is_set)
+ if (set->latch)
{
- occurred_events->fd = PGINVALID_SOCKET;
- occurred_events->pos = set->latch_pos;
- occurred_events->user_data =
- set->events[set->latch_pos].user_data;
- occurred_events->events = WL_LATCH_SET;
- occurred_events++;
- returned_events++;
+ bool latch_set = false;
+
+ if (set->latch->is_set)
+ {
+ latch_set = true;
+ }
+ else
+ {
+ set->latch->owner_waiting = true;
+ pg_memory_barrier();
+ if (set->latch->is_set)
+ {
+ latch_set = true;
+ }
+ }
+
+ if (latch_set)
+ {
+ occurred_events->fd = PGINVALID_SOCKET;
+ occurred_events->pos = set->latch_pos;
+ occurred_events->user_data =
+ set->events[set->latch_pos].user_data;
+ occurred_events->events = WL_LATCH_SET;
+ occurred_events++;
+ returned_events++;
+
+ break;
+ }
- break;
}
/*
@@ -1016,6 +1055,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
waiting = false;
#endif
+ if (set->latch)
+ set->latch->owner_waiting = false;
+
pgstat_report_wait_end();
return returned_events;
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfafec5..442344d914 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -112,6 +112,7 @@ typedef struct Latch
sig_atomic_t is_set;
bool is_shared;
int owner_pid;
+ int owner_waiting; /* int, so all platforms can set atomically */
#ifdef WIN32
HANDLE event;
#endif
--
2.13.1.392.g8d1b10321b.dirty
On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
It looks like only limited consolidation was happening, the number of kills
invoked was more than half of the number of SetLatch. I think the reason
is that when kill(owner_pid, SIGUSR1) is called, the kernel immediately
suspends the calling process and gives the signalled process a chance to
run in that time slice. The Wal Writer gets woken up, sees that it only
has a partial page to write and decides not to do anything, but resets the
latch. It does this fast enough that the subscription worker hasn't
migrated CPUs yet.
I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless. In cases where walwriter is busy, that
actually causes a lot of superfluous interrupted syscalls, because it
actually never ends up waiting on the latch. There's also a lot of
superfluous context switches. I think we can avoid doing so quite
easily, as e.g. with the attached patch. Could you check how much that
helps your benchmark?
That patch doesn't make any improvement to the pgbench --unlogged-tables
benchmark. I expected that, because this benchmark exposes the opposite
problem. The wal writer isn't busy, it is constantly being woken up but
then immediately going back to sleep.
The first change made the WALWriter wake up and do a write and flush
whenever an async commit occurred and there was a completed WAL page to be
written. This way the hint bits could be set while the heap page was still
hot, because they couldn't get set until the WAL covering the hinted-at
transaction commit is flushed.
The second change said we can set hint bits before the WAL covering the
hinted-at transaction is flushed, as long as the page LSN is newer than that
(so the wal covering the hinted-at transaction commit must be flushed
before the page containing the hint bit can be written).
Then the third change makes the wal writer only write the full pages most
of the times it is woken up, not flush them. It only flushes them once
every wal_writer_delay or wal_writer_flush_after, regardless of how often
it is woken up.
It seems like the third change rendered the first one useless.
I don't think so. Isn't the walwriter writing out the contents of the
WAL quite important, because it makes room in wal_buffers for new WAL?
If we wanted to do that, I think the asynchronous committer should just
issue the write directly (unless wal_sync_method is open_datasync
or open_sync). Writes are usually extremely fast and it is not worth trying
to offload them to a background process, as your half of the signalling
takes longer than just doing the write yourself. When that is not true and
writes start blocking due to extreme IO congestion, then everything is
going to start blocking anyway (the wal_buffers head will run into the tail,
and the tail can't advance because the writes are blocking), so while
offloading writes to a background process is then good in theory, it isn't
very effective. If everyone is stuck, it doesn't matter which process is
directly stuck and which ones are indirectly stuck.
I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing. Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work. I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
if (ProcGlobal->walwriterLatch)
SetLatch(ProcGlobal->walwriterLatch);
My understanding is that you can't start on a new file until the old file
is completely synced, because the bookkeeping can currently only handle
one file at a time. So if you signal the wal writer to do the sync, you
would just end up immediately blocking on it to finish. Getting around
that would require substantially more changes; we would need to implement a
queue of full log files not yet synced so that we could move on without
waiting. If we had such a queue, then sure, this would be a good idea to
just wake the wal writer. But currently, it only matters for large
transactions (\copy, insert ... select, bulk updates). With small
transactions, you can't fill up log files quickly enough for it to make
much of a difference. So I think this is something to do, but it is
independent of what I am currently going on about.
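Purely as an illustration of the kind of bookkeeping that would be needed
(nothing like this exists today, and all the names below are invented), such
a queue might look something like:

    /* illustrative sketch only; not existing PostgreSQL code */
    #define MAX_PENDING_SYNC_SEGS 8

    typedef struct PendingWalSyncQueue
    {
        slock_t     mutex;      /* protects the fields below */
        XLogSegNo   segs[MAX_PENDING_SYNC_SEGS];    /* written but un-fsync'd */
        int         head;       /* next slot for the wal writer to fsync */
        int         count;      /* number of queued segments */
    } PendingWalSyncQueue;

A backend finishing a segment would enqueue it here and SetLatch the wal
writer, which would drain the queue, fsync each segment, and only then
advance the flushed-up-to position.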
Wouldn't it be better, and much simpler, just to have reverted the first
change once the second one was done?
I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit. In fact
you'll always hit that case unless there are concurrent writes also
touching said page.
That is true, but I don't see any residual regression when going from
pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied. So I
think the hint bit issue only shows up when hitting the same page
aggressively, in which case the LSN does get advanced quickly enough that
db76b1efbba solves it. Perhaps a different test case could show that
issue. I also didn't see any improvement from the original 4de82f7d7c50a81e,
so maybe 8 CPUs just isn't enough to detect the problem with setting
hint bits with permanent tables with synchronous_commit=false.
If that were done, would the third change still be
needed? (It did seem to add some other features as well, so I'm going to
assume we still want those...).
It's still useful, yes. It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.
But the only reason it would do that is because we keep waking it up. If
we stop waking it up, it would stop flushing each page, even if it still
had the logic to flush for each page there.
But now the first change is even worse than useless, it is positively
harmful. The only thing to stop it from waking the WALWriter for every
async commit is this line:
/* if we have already flushed that far, we're done */
if (WriteRqstPtr <= LogwrtResult.Flush)
    return;
Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
become false due to us waking walwriter; it only becomes false once the
timeout expires (which it would have done anyway with no help from us), or
once wal_writer_flush_after is exceeded. So now every async commit is
waking the walwriter. Most of those wake-ups do nothing (there is not a
completely new patch to write), some write one completed page but don't
flush anything, and very few do a flush, and that one would have been done
anyway.
s/completely new patch/completely new page/?
Yes. My mental tab completion sometimes goes awry.
In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic. Have to think about
this some more...
I think if we want to write each page, we should do it in the foreground.
Conceptually, it could just be another enum value in synchronous_commit,
but synchronous_commit is getting rather unwieldy already.
Cheers,
Jeff
On Mon, Jul 3, 2017 at 11:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
My understanding is that you can't start on a new file until the old file is
completely synced, because the book keeping can currently only handle one
file at a time. So if you signal the wal writer to do the sync, you would
just end up immediately blocking on it to finish. Getting around that would
require substantially more changes; we would need to implement a queue of
full log files not yet synced so that we could move on without waiting.
Yeah. I bet that effort would be reasonably worthwhile. I believe
that the forced end-of-segment syncs are costing us a noticeable
amount of performance. Once or twice I took a very half-hearted run
at doing what you describe here, but gave up pretty quickly; it seems
like a somewhat tricky problem.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 3, 2017 at 8:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
Wouldn't it be better, and much simpler, just to have reverted the first
change once the second one was done?
I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit. In fact
you'll always hit that case unless there are concurrent writes also
touching said page.
That is true, but I don't see any residual regression when going from
pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied. So I
think the hint bit issue only shows up when hitting the same page
aggressively, in which case the LSN does get advanced quickly enough that
db76b1efbba solves it. Perhaps a different test case could show that
issue. I also didn't see any improvement from the original 4de82f7d7c50a81e,
so maybe 8 CPUs just isn't enough to detect the problem with setting
hint bits with permanent tables with synchronous_commit=false.
I've refined my benchmarking, using a smaller scaling factor (8, same as my
CPU count) and running the tests for longer. The smaller size means there
are more intensive updates on individual pgbench_accounts pages, and the
longer run times allow unset hint bits to build up for longer (usually
I like short test runs with a large number of randomized repetitions, but
that doesn't work well here). Since the scale factor is the same as the
number of CPUs, I've bumped the thread count so that it is more likely
someone will choose a non-contended value of pgbench_branches.bid to update
at any given moment.
pgbench -c16 -j16 -T1200 -M prepared -b tpcb-func --startup='set
synchronous_commit=false'
This testing did show the importance of waking up the wal writer so that
hint bits can be set:
commit TPS
cfafd8beadcd6f 22200.48
cfafd8beadcd6f_nowake 30766.83
median of 14 runs reported, p-val on difference of means is 2e-22.
Where nowake is a hackish disabling of the wake-up introduced in
4de82f7d7c50a8, forward ported to the 9.6 branch. (I still wake it if it is
doing the big sleep; I just took out the part that wakes it up even from
small sleeps.)
So my revised benchmark is capable of detecting this effect, even with only
8 CPUs.
When I move to the next commit, where we set hint bits as long as the page
was re-dirtied after, I get these numbers:
db76b1efbbab24 30742.69
db76b1efbbab24_nowake 31072.97
The difference between these is not statistically different, nor is the
difference between them and cfafd8beadcd6f.
So I think the logic you introduced in db76b1efbbab24 captures all the gain
there is to be captured, and 4de82f7d7c50a8 can be reverted. The only
reason to wake up the wal writer is if it is taking the big sleep.
Of course, I can't prove a negative. There could always be some test case
which demonstrates that 4de82f7d7c50a8 still matters in spite of
db76b1efbbab24.
So to be ultra-conservative, attached is a patch which should preserve all
wake-ups other than the ones known to be useless.
7975c5e0a 27810.66
7975c5e0a_patched 30821.41
So 7975c5 causes a significant regression (10% regression, p-val 7e-16 on
difference of the means). It repeatedly wakes the wal-writer up to flush a
page which the wal-writer simply refuses to flush. This can't serve any
purpose. My patch still wakes it up to write the page if it has not been
written yet, and also still wakes it to flush a range of pages if
WalWriterFlushAfter is met. But it just doesn't wake it up to flush a page
which has already been written and will not get flushed.
I'd prefer the code-simplifying change of just not waking it up anymore
except for when it is doing the big sleep, but I can't reliably detect a
performance difference between that and the more conservative change posted
here.
Cheers,
Jeff
Attachments:
change_walwriter_wakeup_v2.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100644
index df4843f..554c720
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
*************** XLogSetAsyncXactLSN(XLogRecPtr asyncXact
*** 2623,2633 ****
*/
if (!sleeping)
{
/* back off to last completed page boundary */
WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
! /* if we have already flushed that far, we're done */
! if (WriteRqstPtr <= LogwrtResult.Flush)
return;
}
--- 2623,2637 ----
*/
if (!sleeping)
{
+ int flushbytes;
/* back off to last completed page boundary */
WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
! flushbytes =
! WriteRqstPtr / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
!
! /* if there is nothing to write and we don't have enough to flush, we're done */
! if (WriteRqstPtr <= LogwrtResult.Write && (WalWriterFlushAfter == 0 || flushbytes < WalWriterFlushAfter))
return;
}
At Sat, 26 Aug 2017 14:45:20 -0700, Jeff Janes <jeff.janes@gmail.com> wrote in <CAMkU=1zP931=v-ypFZMDxHAGGQwuVNUPONZE5_ggV+onkXEdvQ@mail.gmail.com>
On Mon, Jul 3, 2017 at 8:26 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
Wouldn't it be better, and much simpler, just to have reverted the first
change once the second one was done?
I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit. In fact
you'll always hit that case unless there are concurrent writes also
touching said page.
That is true, but I don't see any residual regression when going from
pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied. So I
think the hint bit issue only shows up when hitting the same page
aggressively, in which case the LSN does get advanced quickly enough that
db76b1efbba solves it. Perhaps a different test case could show that
issue. I also didn't see any improvement from the original 4de82f7d7c50a81e,
so maybe 8 CPUs just isn't enough to detect the problem with setting
hint bits with permanent tables with synchronous_commit=false.
I've refined my benchmarking, using a smaller scaling factor (8, same as my
CPU count) and running the tests for longer. The smaller size means there
are more intensive updates on individual pgbench_accounts pages, and the
longer run times allow unset hint bits to build up for longer (usually
I like short test runs with a large number of randomized repetitions, but
that doesn't work well here). Since the scale factor is the same as the
number of CPUs, I've bumped the thread count so that it is more likely
someone will choose a non-contended value of pgbench_branches.bid to update
at any given moment.
pgbench -c16 -j16 -T1200 -M prepared -b tpcb-func --startup='set
synchronous_commit=false'
This testing did show the importance of waking up the wal writer so that
hint bits can be set:
commit TPS
cfafd8beadcd6f 22200.48
cfafd8beadcd6f_nowake 30766.83
median of 14 runs reported, p-val on difference of means is 2e-22.
Where nowake is a hackish disabling of the wake-up introduced in
4de82f7d7c50a8, forward ported to the 9.6 branch. (I still wake it if it is
doing the big sleep; I just took out the part that wakes it up even from
small sleeps.)
So my revised benchmark is capable of detecting this effect, even with only
8 CPUs.
When I move to the next commit, where we set hint bits as long as the page
was re-dirtied after, I get these numbers:
db76b1efbbab24 30742.69
db76b1efbbab24_nowake 31072.97
The difference between these is not statistically different, nor is the
difference between them and cfafd8beadcd6f.
So I think the logic you introduced in db76b1efbbab24 captures all the gain
there is to be captured, and 4de82f7d7c50a8 can be reverted. The only
reason to wake up the wal writer is if it is taking the big sleep.
Of course, I can't prove a negative. There could always be some test case
which demonstrates that 4de82f7d7c50a8 still matters in spite of
db76b1efbbab24.
So to be ultra-conservative, attached is a patch which should preserve all
wake-ups other than the ones known to be useless.
7975c5e0a 27810.66
7975c5e0a_patched 30821.41
So 7975c5 causes a significant regression (10% regression, p-val 7e-16 on
difference of the means). It repeatedly wakes the wal-writer up to flush a
page which the wal-writer simply refuses to flush. This can't serve any
purpose. My patch still wakes it up to write the page if it has not been
written yet, and also still wakes it to flush a range of pages if
WalWriterFlushAfter is met. But it just doesn't wake it up to flush a page
which has already been written and will not get flushed.
I'd prefer the code-simplifying change of just not waking it up anymore
except for when it is doing the big sleep, but I can't reliably detect a
performance difference between that and the more conservative change posted
here.
FWIW, I tried this patch, taking the following numbers using
pgbench --unlogged-tables -T 60 -j 8 -c 8. All numbers are from
the second of two successive runs.
TA: # of times XLogSetAsyncXactLSN was called while the wal writer was awake.
TB: # of times XLogSetAsyncXactLSN skipped SetLatch in the TA case.
WA: # of times WalWriterMain called XLogBackgroundFlush().
WB: # of times XLogBackgroundFlush() skipped flushing in the WA case.
A: without this patch : 1745 tps (wal_writer_flush_after = 1MB (default))
WB/WA = 9/141100 : 0.9% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 141091 times)
TB/TA =27300/103000 : 26% of SetLatch is avoided. (12899 times called)
B: with this patch : 2041 tps (wal_writer_flush_after = 1MB (default))
WB/WA = 46/ 1600 : 2.8% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 1553 times)
TB/TA =115822/118000: 97% of SetLatch is avoided. (4178 times called)
C: with this patch : 2033 tps (But wal_writer_flush_after = 0)
WB/WA = 117/1700 : 6.9% of all XLogBackgroundFlush() did nothing.
(woke by SetLatch 1582 times)
TB/TA =110615/118000: 93% of SetLatch is avoided. (7785 times called)
The condition for skipping SetLatch in the latest patch (v2)
looks reasonable, and the effect is obvious, at least in this
test. The "needless" SetLatch calls are very effectively reduced,
actually improving performance. The increased number of bytes flushed
at once seems to be the main cause of the performance improvement.
I'm a bit uneasy about having similar code on both sides
(XLogBackgroundFlush and XLogSetAsyncXactLSN), but +1 to this from
me.
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Sep 19, 2017 at 8:48 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Though a bit uneasy to have similar code on both side
(XLogBackgroundFlush and XLogSetAsyncXactLSN) but +1 to this from
me.
This patch wasn't formatted very nicely; attached is a version that
pgindent likes better and doesn't bust past 80 columns.
I think this patch needs significantly better comments. There's no
explanation of the rather complex formula the patch proposed to use.
Maybe it's just copied from someplace else that does have a comment,
but in that case, the comments here should mention that. And the
comment in the other place should probably be updated too, like:
/* Make sure to keep this in sync with the similar logic in
XLogDoAwesomeStuff() */
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
change_walwriter_wakeup_v3.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a12a4..4f48903b8f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2639,11 +2639,20 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
*/
if (!sleeping)
{
+ int flushbytes;
+
/* back off to last completed page boundary */
WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
- /* if we have already flushed that far, we're done */
- if (WriteRqstPtr <= LogwrtResult.Flush)
+ flushbytes =
+ WriteRqstPtr / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
+
+ /*
+ * if there is nothing to write and we don't have enough to flush,
+ * we're done
+ */
+ if (WriteRqstPtr <= LogwrtResult.Write
+ && (WalWriterFlushAfter == 0 || flushbytes < WalWriterFlushAfter))
return;
}
On Thu, Oct 5, 2017 at 10:59 PM, Robert Haas <robertmhaas@gmail.com> wrote:
This patch wasn't formatted very nicely; attached is a version that
pgindent likes better and doesn't bust past 80 columns.
I think this patch needs significantly better comments. There's no
explanation of the rather complex formula the patch proposed to use.
Maybe it's just copied from someplace else that does have a comment,
but in that case, the comments here should mention that. And the
comment in the other place should probably be updated too, like:
/* Make sure to keep this in sync with the similar logic in
XLogDoAwesomeStuff() */
I have just gone through this thread, and reducing the number of times
the WAL writer is woken up is something worth doing. The last version
is v3, which Robert sent in
/messages/by-id/CA+TgmoZhw_nyrxYUqSuDh-XEJya-YzfwShki5NuDS=UfgvodrA@mail.gmail.com.
As far as I can see, the smarter formula presented here comes from
XLogBackgroundFlush(). Jeff, could you update the patch to reflect
that the new condition you are adding comes from there? I am switching
the patch to "waiting on author" for now.
--
Michael
On Tue, Nov 21, 2017 at 3:44 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
I have just gone through this thread, and reducing the number of times
the WAL writer is woken up is something worth doing. The last version
is v3, which Robert sent in
/messages/by-id/CA+TgmoZhw_nyrxYUqSuDh-XEJya-YzfwShki5NuDS=UfgvodrA@mail.gmail.com.
As far as I can see, the smarter formula presented here comes from
XLogBackgroundFlush(). Jeff, could you update the patch to reflect
that the new condition you are adding comes from there? I am switching
the patch to "waiting on author" for now.
It has been one week since this update; in consequence, I am marking
this patch as returned with feedback for now. Please make sure to
register a new CF entry if you can get a new version out.
--
Michael