walwriter interacts quite badly with synchronous_commit=off
Hi,
I recently mentioned to Heikki that I was seeing latch related wakeups being
frequent enough to prevent walwriter from doing a whole lot of work. He asked
me to write that set of concerns up, which seems quite fair...
Here's a profile of walwriter while the following pgbench run was ongoing:
c=1;psql -Xq -c 'drop table if exists testtable_logged; CREATE TABLE testtable_logged(v int not null default 0);' && PGOPTIONS='-c synchronous_commit=off' pgbench -n -c$c -j$c -Mprepared -T150 -f <(echo 'INSERT INTO testtable_logged DEFAULT VALUES;') -P1
Looking at top, walwriter is around 15-20% busy with this
workload. Unfortunately a profile quickly shows that little of that work is
useful:
perf record --call-graph dwarf -m16M -p $(pgrep -f 'walwriter') sleep 3
- 94.42% 0.00% postgres postgres [.] AuxiliaryProcessMain
AuxiliaryProcessMain
- WalWriterMain
+ 78.26% WaitLatch
+ 14.01% XLogBackgroundFlush
+ 0.51% pgstat_report_wal
0.29% ResetLatch
0.13% pgstat_flush_io
+ 0.02% asm_sysvec_apic_timer_interrupt
0.01% HandleWalWriterInterrupts (inlined)
Confirmed by the distribution of what syscalls are made:
perf trace -m128M --summary -p $(pgrep -f 'walwriter') sleep 5
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 216610 0 3744.984 0.000 0.017 0.113 0.03%
read 216602 0 333.905 0.001 0.002 0.029 0.03%
fdatasync 27 0 94.703 1.939 3.508 11.279 8.83%
pwrite64 2998 0 15.646 0.004 0.005 0.027 0.45%
openat 2 0 0.019 0.006 0.010 0.013 34.84%
close 2 0 0.004 0.002 0.002 0.003 25.76%
We're doing far more latch-related work than actual work.
Many, many times walwriter wakes up without having anything to do.
And if you increase the number of clients to e.g. c=8, it gets worse in some
ways:
perf trace:
epoll_wait 291512 0 2364.067 0.001 0.008 0.693 0.10%
read 290938 0 479.837 0.001 0.002 0.020 0.05%
fdatasync 146 0 410.043 2.508 2.809 7.006 1.90%
futex 56384 43982 183.896 0.001 0.003 2.791 1.65%
pwrite64 17058 0 105.625 0.004 0.006 4.015 4.61%
clock_nanosleep 1 0 1.063 1.063 1.063 1.063 0.00%
openat 9 0 0.072 0.006 0.008 0.014 14.35%
close 9 0 0.018 0.002 0.002 0.003 5.55%
Note that we have 5x more lock waits (the futex calls) than writes!
I think the problem is mainly that XLogSetAsyncXactLSN() wakes up walwriter
whenever it is sleeping, regardless of whether the modified asyncXactLSN will
lead to a write. We even wake up walwriter when we haven't changed
asyncXactLSN, because our LSN is older than some other backend's!
So often we'll just wake up walwriter, which finds no work, immediately goes
to sleep, just to be woken again.
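For reference, the pre-patch wakeup logic in XLogSetAsyncXactLSN() is roughly
the following (paraphrased from the hunks in the attached patches; see xlog.c
for the real thing):

    SpinLockAcquire(&XLogCtl->info_lck);
    LogwrtResult = XLogCtl->LogwrtResult;
    sleeping = XLogCtl->WalWriterSleeping;
    if (XLogCtl->asyncXactLSN < asyncXactLSN)
        XLogCtl->asyncXactLSN = asyncXactLSN;
    SpinLockRelease(&XLogCtl->info_lck);

    /* if walwriter is sleeping, always kick it out of low-power mode */
    if (!sleeping)
    {
        /* back off to last completed page boundary */
        WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;

        /* if we have already flushed that far, we're done */
        if (WriteRqstPtr <= LogwrtResult.Flush)
            return;
    }

    /* otherwise wake walwriter, even if it has very little to do */
    if (ProcGlobal->walwriterLatch)
        SetLatch(ProcGlobal->walwriterLatch);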
Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...
Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.
We also acquire WALWriteLock and call WaitXLogInsertionsToFinish(), even if
we could already know we're not going to flush! That's not cheap when you do
it this many times a second.
There is an absolutely basic optimization, helping a bit at higher client
counts: don't wake walwriter if the new asyncXactLSN is <= the old one. But it
doesn't help that much.
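(In the attached 0002 patch that is simply

    if (asyncXactLSN <= prevAsyncXactLSN)
        return;

after remembering prevAsyncXactLSN under the spinlock.)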
I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up walwriter if there is a certain amount of
unflushed WAL. Unless walwriter is hibernating, it will wake up on its own
after wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this.
E.g. a condition like
if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
return;
drastically cuts down on the amount of wakeups, without - I think - losing
guarantees around synchronous_commit=off.
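To put numbers on it: with the default wal_writer_flush_after of 1MB (128
blocks of 8kB), a backend only issues a wakeup once at least that much WAL has
accumulated beyond what walwriter has already written; smaller amounts are
picked up by walwriter's regular wal_writer_delay cycle (200ms by default).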
1 client:
before:
tps = 42926.288765 (without initial connection time)
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 209077 0 3746.918 0.000 0.018 0.143 0.03%
read 209073 0 310.532 0.001 0.001 0.021 0.02%
fdatasync 25 0 82.673 2.623 3.307 3.457 1.13%
pwrite64 2892 0 14.600 0.004 0.005 0.018 0.43%
after:
tps = 46244.394058 (without initial connection time)
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 25 0 4732.625 0.000 189.305 200.281 4.17%
fdatasync 25 0 90.264 2.814 3.611 3.835 1.02%
pwrite64 48 0 15.825 0.020 0.330 0.707 12.76%
read 21 0 0.117 0.003 0.006 0.007 3.69%
8 clients:
before:
tps = 279316.646315 (without initial connection time)
postgres (2861734), 1215159 events, 100.0%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 267517 0 2150.206 0.000 0.008 0.973 0.12%
read 266683 0 512.348 0.001 0.002 0.036 0.08%
fdatasync 149 0 413.658 2.583 2.776 3.395 0.29%
futex 56597 49588 183.174 0.001 0.003 1.047 0.69%
pwrite64 17516 0 126.208 0.004 0.007 2.927 3.93%
after:
tps = 290958.322594 (without initial connection time)
postgres (2861534), 1626 events, 100.0%
syscall calls errors total min avg max stddev
(msec) (msec) (msec) (msec) (%)
--------------- -------- ------ -------- --------- --------- --------- ------
epoll_wait 153 0 4383.285 0.000 28.649 32.699 0.92%
fdatasync 153 0 464.088 2.452 3.033 19.999 4.88%
pwrite64 306 0 80.361 0.049 0.263 0.590 4.38%
read 153 0 0.459 0.002 0.003 0.004 1.37%
futex 49 46 0.211 0.002 0.004 0.038 17.05%
More throughput for less CPU, seems neat :)
I'm not addressing that here, but I think we also have the opposite behaviour
- we're not waking up walwriter often enough. E.g. if you have lots of bulk
data loads, walwriter will just wake up once per wal_writer_delay, leading to
most of the work being done by backends. We should probably wake walwriter at
the end of XLogInsertRecord() if there is sufficient outstanding WAL.
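Something along these lines at the end of XLogInsertRecord() - purely an
untested sketch to illustrate the idea, not part of the attached patches:

    /* Sketch only: wake walwriter once enough WAL has accumulated beyond
     * the last known write position, even without async commits. */
    if (ProcGlobal->walwriterLatch &&
        EndPos / XLOG_BLCKSZ - LogwrtResult.Write / XLOG_BLCKSZ >= WalWriterFlushAfter)
        SetLatch(ProcGlobal->walwriterLatch);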
Greetings,
Andres Freund
Attachments:
v1-0001-xlog.c-Rename-misleadingly-named-variable.patch (text/x-diff)
From 953b38dc4e6f35e25abf805c531350f43e867d15 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 24 Oct 2023 16:00:18 -0700
Subject: [PATCH v1 1/2] xlog.c: Rename misleadingly named variable
---
src/backend/access/transam/xlog.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea3..6065fd63b8b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2784,7 +2784,7 @@ XLogBackgroundFlush(void)
bool flexible = true;
static TimestampTz lastflush;
TimestampTz now;
- int flushbytes;
+ int flushblocks;
TimeLineID insertTLI;
/* XLOG doesn't need flushing during recovery */
@@ -2838,7 +2838,7 @@ XLogBackgroundFlush(void)
* wal_writer_flush_after GUCs.
*/
now = GetCurrentTimestamp();
- flushbytes =
+ flushblocks =
WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
if (WalWriterFlushAfter == 0 || lastflush == 0)
@@ -2857,7 +2857,7 @@ XLogBackgroundFlush(void)
WriteRqst.Flush = WriteRqst.Write;
lastflush = now;
}
- else if (flushbytes >= WalWriterFlushAfter)
+ else if (flushblocks >= WalWriterFlushAfter)
{
/* exceeded wal_writer_flush_after blocks, flush */
WriteRqst.Flush = WriteRqst.Write;
--
2.38.0
v1-0002-Reduce-rate-of-wakeups-of-walsender-due-to-async-.patch (text/x-diff)
From 5e56bd0801962bd04fa2829d6f42579d85c47958 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 24 Oct 2023 16:01:11 -0700
Subject: [PATCH v1 2/2] Reduce rate of wakeups of walsender due to async
commits
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/backend/access/transam/xlog.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6065fd63b8b..867ab3bd742 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2437,14 +2437,23 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
{
XLogRecPtr WriteRqstPtr = asyncXactLSN;
bool sleeping;
+ XLogRecPtr prevAsyncXactLSN;
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
sleeping = XLogCtl->WalWriterSleeping;
+ prevAsyncXactLSN = XLogCtl->asyncXactLSN;
if (XLogCtl->asyncXactLSN < asyncXactLSN)
XLogCtl->asyncXactLSN = asyncXactLSN;
SpinLockRelease(&XLogCtl->info_lck);
+ /*
+ * If somebody else already called this function with a more aggressive
+ * LSN, they will have done what we needed (and perhaps more).
+ */
+ if (asyncXactLSN <= prevAsyncXactLSN)
+ return;
+
/*
* If the WALWriter is sleeping, we should kick it to make it come out of
* low-power mode. Otherwise, determine whether there's a full page of
@@ -2458,6 +2467,18 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
/* if we have already flushed that far, we're done */
if (WriteRqstPtr <= LogwrtResult.Flush)
return;
+
+ /*
+ * There is no benefit in waking walwriter if we only have a small
+ * amount of writes outstanding. We want to flush WAL at least once
+ * per wal_writer_flush_after and not defer writing for more than
+ * wal_writer_delay. As we wake walwriter below, if it's currently
+ * hibernating, its own loop will ensure that wal_writer_delay is
+ * taken care of. This condition takes care of flushing if there is
+ * more than wal_writer_flush_after unflushed data.
+ */
+ if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
+ return;
}
/*
--
2.38.0
On 25/10/2023 02:09, Andres Freund wrote:
Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...
That can only happen when walwriter has just come out of "hibernation",
ie. when the system has been idle for a while. So probably not a big
deal in practice.
I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
WAL. Unless walwriter is hibernating, it will wake up on its own after
wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this.
E.g. a condition like
if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
return;
drastically cuts down on the amount of wakeups, without - I think - losing
guarantees around synchronous_commit=off.
In the patch, you actually did:
+ if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
+     return;
It means that you never wake up the walwriter to merely *write* the WAL.
You only wake it up if it's going to also fsync() it. I think that's
correct and appropriate, but it took me a while to reach that conclusion:
It might be beneficial to wake up the walwriter just to perform a
write(), to offload that work from the backend. And walwriter will
actually also perform an fsync() after finishing the current segment, so
it would make sense to also wake it up when 'asyncXactLSN' crosses a
segment boundary. However, if those extra wakeups make sense, they would
also make sense when there are no asynchronous commits involved.
Therefore those extra wakeups should be done elsewhere, perhaps
somewhere around AdvanceXLInsertBuffer(). The condition you have in the
patch is appropriate for XLogSetAsyncXactLSN().
Another reason to write the WAL aggressively, even if you don't flush
it, would be to reduce the number of lost transactions on a segfault.
But we don't give any guarantees on that, and even with the current
aggressive logic, we only write when a page is full so you're anyway
going to lose the last partial page.
It also took me a while to convince myself that this calculation matches
the calculation that XLogBackgroundFlush() uses to determine whether it
needs to flush or not. XLogBackgroundFlush() first divides the request
and result with XLOG_BLCKSZ and then compares the number of blocks,
whereas here you perform the calculation in bytes. I think the result is
the same, but to make it more clear, let's do it the same way in both
places.
See attached. It's the same logic as in your patch, just formulated more
clearly IMHO.
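Spelled out, the two formulations are:

    /* v1 (bytes): */
    if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
        return;

    /* v2 (blocks, matching XLogBackgroundFlush()): */
    flushblocks = WriteRqstPtr / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
    if (WalWriterFlushAfter == 0 || flushblocks >= WalWriterFlushAfter)
        wakeup = true;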
Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.
Even with this patch, when I bumped up wal_writer_delay to 2 so that the
wal writer gets woken up by the async commits rather than the timeout,
the write pattern is a bit silly:
$ strace -p 1099926 # walwriter
strace: Process 1099926 attached
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 1991) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"..., 1007616, 49152) = 1007616
fdatasync(5) = 0
pwrite64(5, "\24\321\5\0\1\0\0\0\0 \20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768) = 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"..., 1040384, 1073152) = 1040384
fdatasync(5) = 0
pwrite64(5, "\24\321\4\0\1\0\0\0\0@ \373\5\0\0\0\0\0\0\0\0\0\0\0;\0\0\0\264'\246\3"..., 16384, 2113536) = 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232, u64=94261056289248}}], 1, 2000) = 1
read(3, "\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\200 \373\5\0\0\0003\0\0\0\0\0\0\0\320\177 \373\5\0\0\0"..., 1040384, 2129920) = 1040384
fdatasync(5) = 0
In each cycle, the wal writer writes a full 1 MB chunk
(wal_writer_flush_after = '1MB'), flushes it, and then performs a smaller
write before going to sleep.
Those smaller writes seem a bit silly. But I think it's fine.
More throughput for less CPU, seems neat :)
Indeed, impressive speedup from such a small patch!
I'm not addressing that here, but I think we also have the opposite behaviour
- we're not waking up walwriter often enough. E.g. if you have lots of bulk
data loads, walwriter will just wake up once per wal_writer_delay, leading to
most of the work being done by backends. We should probably wake walwriter at
the end of XLogInsertRecord() if there is sufficient outstanding WAL.
Right, that's basically the same issue that I reasoned through above. I
did some quick testing with a few different settings of wal_buffers,
wal_writer_flush_after and wal_writer_delay, to try to see that effect.
But I was not able to find a case where it makes a difference.
--
Heikki Linnakangas
Neon (https://neon.tech)
Attachments:
v2-Reduce-rate-of-walwriter-wakeups.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea3..e2c0f17fd61 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2437,35 +2437,44 @@ XLogSetAsyncXactLSN(XLogRecPtr asyncXactLSN)
{
XLogRecPtr WriteRqstPtr = asyncXactLSN;
bool sleeping;
+ bool wakeup = false;
+ XLogRecPtr prevAsyncXactLSN;
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
sleeping = XLogCtl->WalWriterSleeping;
+ prevAsyncXactLSN = XLogCtl->asyncXactLSN;
if (XLogCtl->asyncXactLSN < asyncXactLSN)
XLogCtl->asyncXactLSN = asyncXactLSN;
SpinLockRelease(&XLogCtl->info_lck);
/*
- * If the WALWriter is sleeping, we should kick it to make it come out of
- * low-power mode. Otherwise, determine whether there's a full page of
- * WAL available to write.
+ * If somebody else already called this function with a more aggressive
+ * LSN, they will have done what we needed (and perhaps more).
*/
- if (!sleeping)
+ if (asyncXactLSN <= prevAsyncXactLSN)
+ return;
+
+ /*
+ * If the WALWriter is sleeping, kick it to make it come out of low-power
+ * mode, so that this async commit will reach disk within the expected
+ * amount of time. Otherwise, determine whether it has enough WAL
+ * available to flush, the same way that XLogBackgroundFlush() does.
+ */
+ if (sleeping)
+ wakeup = true;
+ else
{
- /* back off to last completed page boundary */
- WriteRqstPtr -= WriteRqstPtr % XLOG_BLCKSZ;
+ int flushblocks;
- /* if we have already flushed that far, we're done */
- if (WriteRqstPtr <= LogwrtResult.Flush)
- return;
+ flushblocks =
+ WriteRqstPtr / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
+
+ if (WalWriterFlushAfter == 0 || flushblocks >= WalWriterFlushAfter)
+ wakeup = true;
}
- /*
- * Nudge the WALWriter: it has a full page of WAL to write, or we want it
- * to come out of low-power mode so that this async commit will reach disk
- * within the expected amount of time.
- */
- if (ProcGlobal->walwriterLatch)
+ if (wakeup && ProcGlobal->walwriterLatch)
SetLatch(ProcGlobal->walwriterLatch);
}
@@ -2784,7 +2793,7 @@ XLogBackgroundFlush(void)
bool flexible = true;
static TimestampTz lastflush;
TimestampTz now;
- int flushbytes;
+ int flushblocks;
TimeLineID insertTLI;
/* XLOG doesn't need flushing during recovery */
@@ -2836,9 +2845,13 @@ XLogBackgroundFlush(void)
/*
* Determine how far to flush WAL, based on the wal_writer_delay and
* wal_writer_flush_after GUCs.
+ *
+ * Note that XLogSetAsyncXactLSN() performs similar calculation based on
+ * wal_writer_flush_after, to decide when to wake us up. Make sure the
+ * logic is the same in both places if you change this.
*/
now = GetCurrentTimestamp();
- flushbytes =
+ flushblocks =
WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
if (WalWriterFlushAfter == 0 || lastflush == 0)
@@ -2857,7 +2870,7 @@ XLogBackgroundFlush(void)
WriteRqst.Flush = WriteRqst.Write;
lastflush = now;
}
- else if (flushbytes >= WalWriterFlushAfter)
+ else if (flushblocks >= WalWriterFlushAfter)
{
/* exceeded wal_writer_flush_after blocks, flush */
WriteRqst.Flush = WriteRqst.Write;
Hi,
On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote:
On 25/10/2023 02:09, Andres Freund wrote:
Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...
That can only happen when walwriter has just come out of "hibernation", ie.
when the system has been idle for a while. So probably not a big deal in
practice.
Maybe I am missing something here - why can this only happen when hibernating?
Even outside of that, two backends can decide that they need to wake up
walwriter?
We could prevent that, by updating state when requesting walwriter to be woken
up. But with the changes we're discussing below, that should be rare.
I think the most important optimization we need is to have
XLogSetAsyncXactLSN() only wake up if there is a certain amount of unflushed
WAL. Unless walwriter is hibernating, it will wake up on its own after
wal_writer_delay. I think we can just reuse WalWriterFlushAfter for this.
E.g. a condition like
if (WriteRqstPtr <= LogwrtResult.Write + WalWriterFlushAfter * XLOG_BLCKSZ)
return;
drastically cuts down on the amount of wakeups, without - I think - losing
guarantees around synchronous_commit=off.
In the patch, you actually did:
+ if (WriteRqstPtr <= LogwrtResult.Flush + WalWriterFlushAfter * XLOG_BLCKSZ)
+     return;
It means that you never wake up the walwriter to merely *write* the WAL. You
only wake it up if it's going to also fsync() it. I think that's correct and
appropriate, but it took me a while to reach that conclusion:
Yea, after writing the email I got worried that just looking at Write would
perhaps lead to not flushing data soon enough...
It might be beneficial to wake up the walwriter just to perform a write(),
to offload that work from the backend. And walwriter will actually also
perform an fsync() after finishing the current segment, so it would make
sense to also wake it up when 'asyncXactLSN' crosses a segment boundary.
However, if those extra wakeups make sense, they would also make sense when
there are no asynchronous commits involved. Therefore those extra wakeups
should be done elsewhere, perhaps somewhere around AdvanceXLInsertBuffer().
The condition you have in the patch is appropriate for
XLogSetAsyncXactLSN().
Yea. I agree we should wake up walwriter in other situations too...
Another reason to write the WAL aggressively, even if you don't flush it,
would be to reduce the number of lost transactions on a segfault. But we
don't give any guarantees on that, and even with the current aggressive
logic, we only write when a page is full so you're anyway going to lose the
last partial page.
Wal writer does end up writing the trailing partially filled page during the
next wal_writer_delay cycle.
It also took me a while to convince myself that this calculation matches the
calculation that XLogBackgroundFlush() uses to determine whether it needs to
flush or not. XLogBackgroundFlush() first divides the request and result
with XLOG_BLCKSZ and then compares the number of blocks, whereas here you
perform the calculation in bytes. I think the result is the same, but to
make it more clear, let's do it the same way in both places.
See attached. It's the same logic as in your patch, just formulated more
clearly IMHO.
Yep, makes sense!
Because of the frequent wakeups, we do something else that's not smart: We
write out 8k blocks individually, many times a second. Often thousands of
8k pwrites a second.
Even with this patch, when I bumped up wal_writer_delay to 2 so that the wal
writer gets woken up by the async commits rather than the timeout, the write
pattern is a bit silly:
$ strace -p 1099926 # walwriter
strace: Process 1099926 attached
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232,
u64=94261056289248}}], 1, 1991) = 1
read(3,
"\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\300\0\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0n\276\242\305"...,
1007616, 49152) = 1007616
fdatasync(5) = 0
pwrite64(5, "\24\321\5\0\1\0\0\0\0
\20\373\5\0\0\0003\0\0\0\0\0\0\0\320\37\20\373\5\0\0\0"..., 16384, 1056768)
= 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232,
u64=94261056289248}}], 1, 2000) = 1
read(3,
"\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1024) = 128
pwrite64(5,
"\24\321\5\0\1\0\0\0\0`\20\373\5\0\0\0+\0\0\0\0\0\0\0\0\n\0\0\5~\23\261"...,
1040384, 1073152) = 1040384
fdatasync(5) = 0
pwrite64(5, "\24\321\4\0\1\0\0\0\0@
\373\5\0\0\0\0\0\0\0\0\0\0\0;\0\0\0\264'\246\3"..., 16384, 2113536) = 16384
epoll_wait(10, [{events=EPOLLIN, data={u32=3704011232,
u64=94261056289248}}], 1, 2000) = 1
read(3,
"\27\0\0\0\0\0\0\0\0\0\0\0<\312\20\0\350\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
1024) = 128
pwrite64(5, "\24\321\5\0\1\0\0\0\0\200 \373\5\0\0\0003\0\0\0\0\0\0\0\320\177
\373\5\0\0\0"..., 1040384, 2129920) = 1040384
fdatasync(5) = 0
In each cycle, the wal writer writes a full 1 MB chunk
(wal_writer_flush_after = '1MB'), flushes it, and then performs a smaller
write before going to sleep.
I think that's actually somewhat sane - we write out the partial page in the
subsequent cycle. That won't happen if the page isn't partially filled or
doesn't have an async commit on it.
I think we end up with somewhat bogus write patterns in other cases still, but
that's really more an issue in XLogBackgroundFlush() and thus deserves a
separate patch/thread.
I'm not addressing that here, but I think we also have the opposite behaviour
- we're not waking up walwriter often enough. E.g. if you have lots of bulk
data loads, walwriter will just wake up once per wal_writer_delay, leading to
most of the work being done by backends. We should probably wake walwriter at
the end of XLogInsertRecord() if there is sufficient outstanding WAL.
Right, that's basically the same issue that I reasoned through above. I did
some quick testing with a few different settings of wal_buffers,
wal_writer_flush_after and wal_writer_delay, to try to see that effect. But
I was not able to find a case where it makes a difference.
I think in the right set of circumstances it can make quite a bit of
difference. E.g. I bulk load 3GB of data in a cluster with s_b 1GB. Then I
checkpoint and VACUUM FREEZE it. With wal_writer_delay=1ms that's
considerably faster (5.4s) than with wal_writer_delay=2s (8.3s) or even the
default 200ms (7.9s), because a fast walwriter makes it much more likely that
vacuum won't need to wait for an xlog flush before replacing a buffer in the
strategy ring.
I think improving this logic would be quite worthwhile!
Another benefit of triggering wakeups based on the amount of outstanding
writes would be that we could increase wal_writer_delay substantially (with
perhaps some adjustment for the partial-trailing-page-with-async-commit case),
reducing power usage. It's imo pretty silly that we have wal writer wake up
regularly, if it just writes once every few seconds.
Greetings,
Andres Freund
On 25/10/2023 21:59, Andres Freund wrote:
On 2023-10-25 12:17:03 +0300, Heikki Linnakangas wrote:
On 25/10/2023 02:09, Andres Freund wrote:
Because of the inherent delay between the checks of XLogCtl->WalWriterSleeping
and Latch->is_set, we also sometimes end up with multiple processes signalling
walwriter, which can be bad, because it increases the likelihood that some of
the signals may be received when we are already holding WALWriteLock, delaying
its release...
That can only happen when walwriter has just come out of "hibernation", ie.
when the system has been idle for a while. So probably not a big deal in
practice.
Maybe I am missing something here - why can this only happen when hibernating?
Even outside of that, two backends can decide that they need to wake up
walwriter?
Ah sure, multiple backends can decide to wake up walwriter at the same
time. I thought you meant that the window for that was somehow wider
when XLogCtl->WalWriterSleeping.
We could prevent that, by updating state when requesting walwriter to be woken
up. But with the changes we're discussing below, that should be rare.
One small easy thing we could do to reduce the redundant wakeups: only
wake up walwriter if asyncXactLSN points to a different page than
prevAsyncXactLSN.
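Something like this (sketch only, not a posted patch):

    /* only wake walwriter when the new LSN is on a later page */
    if (asyncXactLSN / XLOG_BLCKSZ <= prevAsyncXactLSN / XLOG_BLCKSZ)
        return;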
--
Heikki Linnakangas
Neon (https://neon.tech)
On 25/10/2023 21:59, Andres Freund wrote:
See attached. It's the same logic as in your patch, just formulated more
clearly IMHO.
Yep, makes sense!
Pushed this. Thanks for the investigation!
--
Heikki Linnakangas
Neon (https://neon.tech)
On 2023-11-27 17:55:34 +0200, Heikki Linnakangas wrote:
On 25/10/2023 21:59, Andres Freund wrote:
See attached. It's the same logic as in your patch, just formulated more
clearly IMHO.
Yep, makes sense!
Pushed this. Thanks for the investigation!
Thanks!