001_rep_changes.pl stalls
Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
important for promptly updating pg_stat_replication. When caught up, we
should impose that logic before every sleep. The one-line fix is to sleep in
WalSndLoop() only when pq_is_send_pending(), not when caught up.
On my regular development machine, src/test/subscription/t/001_rep_changes.pl
stalls for ~10s at this wait_for_catchup:
$node_publisher->safe_psql('postgres', "DELETE FROM tab_rep");
# Restart the publisher and check the state of the subscriber which
# should be in a streaming state after catching up.
$node_publisher->stop('fast');
$node_publisher->start;
$node_publisher->wait_for_catchup('tap_sub');
That snippet emits three notable physical WAL records. There's a
Transaction/COMMIT at the end of the DELETE, an XLOG/CHECKPOINT_SHUTDOWN, and
an XLOG/FPI_FOR_HINT.
The buildfarm has stalled there, but it happens probably less than half the
time. Examples[1] showing the stall:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mandrill&dt=2020-03-20%2017%3A09%3A53&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=thorntail&dt=2020-03-22%2019%3A51%3A38&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-19%2003%3A35%3A01&stg=subscription-check
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hoverfly&dt=2020-03-20%2015%3A15%3A01&stg=subscription-check
Here's the most-relevant walsender call tree:
WalSndLoop
  XLogSendLogical (caller invokes once per loop iteration, via send_data callback)
    XLogReadRecord (caller invokes once)
      ReadPageInternal (caller invokes twice in this test; more calls are possible)
        logical_read_xlog_page (caller skips when page is same as last call, else invokes 1-2 times via state->read_page() callback, registered in StartLogicalReplication)
          WalSndWaitForWal (caller invokes once; has fast path)
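For reference, the keepalive logic near the bottom of WalSndWaitForWal() that
makes the difference looks roughly like this (paraphrased; the
!waiting_for_ping_response guard is part of the surrounding walsender.c code,
not shown in the patch below):

    /*
     * If the receiver has not yet confirmed everything already sent, ping
     * it.  An otherwise-idle receiver replies promptly, and processing the
     * reply updates MyWalSnd->write/flush, hence pg_stat_replication.
     */
    if (MyWalSnd->flush < sentPtr &&
        MyWalSnd->write < sentPtr &&
        !waiting_for_ping_response)
        WalSndKeepalive(false);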
The cause is a race involving the flow of reply messages (send_feedback()
messages) from logical apply worker to walsender. Here are two sequencing
patterns; the more-indented parts are what differ. Stalling pattern:
sender reads Transaction/COMMIT and sends the changes
    receiver applies the changes
    receiver send_feedback() reports progress up to Transaction/COMMIT
    sender accepts the report
sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver wal_receiver_status_interval elapses; receiver reports progress up to Transaction/COMMIT
sender wakes up, accepts the report
sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT
Non-stalling pattern (more prevalent with lower machine performance):
sender reads Transaction/COMMIT and sends the changes
sender reads XLOG/CHECKPOINT_SHUTDOWN and/or XLOG/FPI_FOR_HINT, which are no-ops for logical rep
sender WalSndCaughtUp becomes true; sender sleeps in WalSndLoop()
    receiver applies the changes
    receiver send_feedback() reports progress up to Transaction/COMMIT
sender wakes up, accepts the report
sender calls WalSndWaitForWal(), which sends a keepalive due to "MyWalSnd->write < sentPtr"
receiver gets keepalive, send_feedback() reports progress up to XLOG/FPI_FOR_HINT
The fix is to test "MyWalSnd->write < sentPtr" before more sleeps. The test
is unnecessary when sleeping due to pq_is_send_pending(); in that case, the
receiver is not idle and will reply before idling. I changed WalSndLoop() to
sleep only for pq_is_send_pending(). For all other sleep reasons, the sleep
will happen in WalSndWaitForWal(). Attached. I don't know whether this is
important outside of testing scenarios. I lean against back-patching, but I
will back-patch if someone thinks this qualifies as a performance bug.
Thanks,
nm
[1]: I spot-checked only my animals, since I wanted to experiment on an affected animal.
Attachments:
WalSndLoop-stall-v1.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
When WalSndCaughtUp, sleep only in WalSndWaitForWal().
Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
< sentPtr. That is important in logical replication. When the latest
physical LSN yields no logical replication messages (a common case),
that keepalive elicits a reply, and processing the reply updates
pg_stat_replication.replay_lsn. WalSndLoop() lacks that; when
WalSndLoop() slept, replay_lsn advancement could stall until
wal_receiver_status_interval elapsed. This sometimes stalled
src/test/subscription/t/001_rep_changes.pl for up to 10s.
Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9e56115..d9c6359 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1428,8 +1428,10 @@ WalSndWaitForWal(XLogRecPtr loc)
/*
* We only send regular messages to the client for full decoded
* transactions, but a synchronous replication and walsender shutdown
- * possibly are waiting for a later location. So we send pings
- * containing the flush location every now and then.
+ * possibly are waiting for a later location. So, before sleeping, we
+ * send a ping containing the flush location. If the receiver is
+ * otherwise idle, this keepalive will trigger a reply. Processing the
+ * reply will update these MyWalSnd locations.
*/
if (MyWalSnd->flush < sentPtr &&
MyWalSnd->write < sentPtr &&
@@ -2314,20 +2316,16 @@ WalSndLoop(WalSndSendDataCallback send_data)
WalSndKeepaliveIfNecessary();
/*
- * We don't block if not caught up, unless there is unsent data
- * pending in which case we'd better block until the socket is
- * write-ready. This test is only needed for the case where the
- * send_data callback handled a subset of the available data but then
- * pq_flush_if_writable flushed it all --- we should immediately try
- * to send more.
+ * Block if we have unsent data. Let WalSndWaitForWal() handle any
+ * other blocking; idle receivers need its additional actions.
*/
- if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
+ if (pq_is_send_pending())
{
long sleeptime;
int wakeEvents;
wakeEvents = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT |
- WL_SOCKET_READABLE;
+ WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
/*
* Use fresh timestamp, not last_processing, to reduce the chance
@@ -2335,9 +2333,6 @@ WalSndLoop(WalSndSendDataCallback send_data)
*/
sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
- if (pq_is_send_pending())
- wakeEvents |= WL_SOCKET_WRITEABLE;
-
/* Sleep until something happens or we time out */
(void) WaitLatchOrSocket(MyLatch, wakeEvents,
MyProcPort->sock, sleeptime,
On Sun, Apr 05, 2020 at 11:36:49PM -0700, Noah Misch wrote:
> Executive summary: the "MyWalSnd->write < sentPtr" in WalSndWaitForWal() is
> important for promptly updating pg_stat_replication. When caught up, we
> should impose that logic before every sleep. The one-line fix is to sleep in
> WalSndLoop() only when pq_is_send_pending(), not when caught up.
This seems to have made the following race condition easier to hit:
/messages/by-id/20200206074552.GB3326097@rfd.leadboat.com
/messages/by-id/21519.1585272409@sss.pgh.pa.us
Now it happened eight times in three days, all on BSD machines:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2018%3A30%3A21
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2018%3A45%3A39
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2020-04-11%2020%3A30%3A26
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-11%2021%3A45%3A48
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2010%3A45%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2020-04-13%2016%3A00%3A18
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2018%3A45%3A34
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2020-04-13%2023%3A45%3A22
While I don't think that indicates anything wrong with the fix for $SUBJECT,
creating more buildfarm noise is itself bad. I am inclined to revert the fix
after a week. Not immediately, in case it uncovers lower-probability bugs.
I'd then re-commit it after one of those threads fixes the other bug. Would
anyone like to argue for a revert earlier, later, or not at all?
There was a novel buildfarm failure, in 010_logical_decoding_timelines.pl:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2008%3A35%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2020-04-13%2017%3A15%3A01
Most-relevant lines of the test script:
$node_master->safe_psql('postgres',
"INSERT INTO decoding(blah) VALUES ('afterbb');");
$node_master->safe_psql('postgres', 'CHECKPOINT');
$node_master->stop('immediate');
The failure suggested the INSERT was not replicated before the immediate stop.
I can reproduce that consistently, before or after the fix for $SUBJECT, by
modifying walsender to delay 0.2s before sending WAL:
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -65,2 +65,3 @@
#include "libpq/pqformat.h"
+#include "libpq/pqsignal.h"
#include "miscadmin.h"
@@ -2781,2 +2782,5 @@ retry:
+ PG_SETMASK(&BlockSig);
+ pg_usleep(200 * 1000);
+ PG_SETMASK(&UnBlockSig);
pq_putmessage_noblock('d', output_message.data, output_message.len);
I will shortly push a fix adding a wait_for_catchup to the test. I don't know
if/how fixing $SUBJECT made this 010_logical_decoding_timelines.pl race
condition easier to hit.
Noah Misch <noah@leadboat.com> writes:
> This seems to have made the following race condition easier to hit:
> /messages/by-id/20200206074552.GB3326097@rfd.leadboat.com
> /messages/by-id/21519.1585272409@sss.pgh.pa.us
Yeah, I just came to the same guess in the other thread.
> While I don't think that indicates anything wrong with the fix for $SUBJECT,
> creating more buildfarm noise is itself bad. I am inclined to revert the fix
> after a week. Not immediately, in case it uncovers lower-probability bugs.
> I'd then re-commit it after one of those threads fixes the other bug. Would
> anyone like to argue for a revert earlier, later, or not at all?
I don't think you should revert. Those failures are (just) often enough
to be annoying but I do not think that a proper fix is very far away.
regards, tom lane
On Mon, Apr 13, 2020 at 09:45:16PM -0400, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > This seems to have made the following race condition easier to hit:
> > /messages/by-id/20200206074552.GB3326097@rfd.leadboat.com
> > /messages/by-id/21519.1585272409@sss.pgh.pa.us
>
> Yeah, I just came to the same guess in the other thread.
>
> > While I don't think that indicates anything wrong with the fix for $SUBJECT,
> > creating more buildfarm noise is itself bad. I am inclined to revert the fix
> > after a week. Not immediately, in case it uncovers lower-probability bugs.
> > I'd then re-commit it after one of those threads fixes the other bug. Would
> > anyone like to argue for a revert earlier, later, or not at all?
>
> I don't think you should revert. Those failures are (just) often enough
> to be annoying but I do not think that a proper fix is very far away.
That works for me, but an actual defect may trigger a revert. Fujii Masao
reported high walsender CPU usage after this patch. The patch caused idle
physical walsenders to use 100% CPU. When caught up, the
WalSndSendDataCallback for logical replication, XLogSendLogical(), sleeps
until more WAL is available. XLogSendPhysical() just returns when caught up.
No amount of WAL is too small for physical replication to dispatch, but
logical replication needs the full xl_tot_len of a record. Some options:
1. Make XLogSendPhysical() more like XLogSendLogical(), calling
   WalSndWaitForWal() when no WAL is available. A quick version of this
   passes tests, but I'll need to audit WalSndWaitForWal() for things that are
   wrong for physical replication.

2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
   insufficient WAL is available. This complicates the xlogreader.h API to
   pass back "wait for this XLogRecPtr", and we'd then persist enough state to
   resume decoding. This doesn't have any advantages to make up for those.

3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
   WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
   further drift between the two wait sites; on the other hand, one could
   refactor later to help avoid that.

4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
   minimal fix, but it crudely punches through the abstraction between
   WalSndLoop() and its WalSndSendDataCallback.
I'm favoring (1). Other preferences?
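For concreteness, a minimal sketch of what (4) could look like in
WalSndLoop(), assuming the callback pointer can simply be compared against
XLogSendLogical (this is, in fact, the shape the v2 patch later in this
thread takes):

    /*
     * Sketch of option (4): keep the WalSndLoop() wait for physical
     * walsenders and for pending output, but let a caught-up logical
     * walsender fall through to XLogSendLogical(), which will sleep
     * inside WalSndWaitForWal().
     */
    if ((WalSndCaughtUp && send_data != XLogSendLogical &&
         !streamingDoneSending) ||
        pq_is_send_pending())
    {
        /* ... compute sleeptime and call WaitLatchOrSocket() as today ... */
    }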
At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in
> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>    WalSndWaitForWal() when no WAL is available. A quick version of this
>    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>    wrong for physical replication.
> [...]
> I'm favoring (1). Other preferences?
Starting from the current shape, I think 1 is preferable, since that
waiting logic is no longer shared between logical and physical
replication. But I'm not sure I like calling WalSndWaitForWal()
(maybe with the previous location + 1?), because the function seems to do
too much.
By the way, if the latch is consumed in WalSndLoop, a succeeding call to
WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
cause missed wakeups (in other words, overlooking of a set latch)?
Since the only source of walsender wakeup other than timeout is the
latch, we should avoid wasteful consumption of the latch. (It is the
same issue as [1].)
If the wakeup signal is not remembered on the walsender (like
InterruptPending), WalSndPhysical cannot enter a sleep with
confidence.
[1]: /messages/by-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Sorry, I wrote something wrong.
At Fri, 17 Apr 2020 17:00:15 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
> By the way, if the latch is consumed in WalSndLoop, a succeeding call to
> WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
> cause missed wakeups (in other words, overlooking of a set latch)?
- Since the only source of walsender wakeup other than timeout is the
- latch, we should avoid wasteful consumption of the latch. (It is the
- same issue as [1].)
+ Since walsender is woken up by LSN advancement via the latch, we should
+ avoid wasteful consumption of the latch. (It is the same issue as [1].)
> If the wakeup signal is not remembered on the walsender (like
> InterruptPending), WalSndPhysical cannot enter a sleep with
> confidence.

[1]: /messages/by-id/20200408.164605.1874250940847340108.horikyota.ntt@gmail.com
--
Kyotaro Horiguchi
NTT Open Source Software Center
Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Thu, 16 Apr 2020 22:41:46 -0700, Noah Misch <noah@leadboat.com> wrote in
> > I'm favoring (1). Other preferences?
>
> Starting from the current shape, I think 1 is preferable, since that
> waiting logic is no longer shared between logical and physical
> replication. But I'm not sure I like calling WalSndWaitForWal()
> (maybe with the previous location + 1?), because the function seems to do
> too much.
I'm far from an expert on this code, but it does look like there's
a lot of stuff in WalSndWaitForWal that is specific to logical rep,
so I'm not sure that (1) is workable. At the very least there'd
have to be a bunch more conditionals in that function than there are
now. In the end, a separate copy for physical rep might be better.
(BTW, I think this code is in desperate need of a renaming
campaign to make it clearer which functions are for logical rep,
physical rep, or both.)
regards, tom lane
On 2020/04/17 14:41, Noah Misch wrote:
> 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
>    WalSndWaitForWal() when no WAL is available. A quick version of this
>    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
>    wrong for physical replication.
(1) makes even the physical-replication walsender sleep in two places,
which seems to make the code for physical replication more complicated
than necessary. I'd like to avoid (1) if possible.
> 2. Make XLogSendLogical() more like XLogSendPhysical(), returning when
>    insufficient WAL is available. This complicates the xlogreader.h API to
>    pass back "wait for this XLogRecPtr", and we'd then persist enough state to
>    resume decoding. This doesn't have any advantages to make up for those.
>
> 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
>    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
>    further drift between the two wait sites; on the other hand, one could
>    refactor later to help avoid that.
Since the additional call of WalSndKeepalive() is necessary only for
logical replication, shouldn't it be copied to, e.g., XLogSendLogical()
instead of WalSndLoop()? For example, when XLogSendLogical() sets
WalSndCaughtUp to true, it could call WalSndKeepalive().
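A hypothetical sketch of that placement (simplified; the real catch-up test
in XLogSendLogical() differs in detail, and the MyWalSnd->write guard here is
only an illustration):

    /* In XLogSendLogical(), once decoding has caught up to the flush point */
    if (logical_decoding_ctx->reader->EndRecPtr >= GetFlushRecPtr())
    {
        WalSndCaughtUp = true;
        /* ping an otherwise-idle receiver so it reports progress promptly */
        if (MyWalSnd->write < sentPtr)
            WalSndKeepalive(false);
    }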
The root problem seems to be that when a WAL record that is a no-op for
logical rep is processed, a keepalive message is not sent immediately,
even though we want pg_stat_replication to be updated promptly.
(3) addresses this problem directly and looks better to me.
> 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
>    minimal fix, but it crudely punches through the abstraction between
>    WalSndLoop() and its WalSndSendDataCallback.
(4) also looks good because it's simple, if we can redesign those
functions into good shape.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Fri, Apr 17, 2020 at 05:06:29PM +0900, Kyotaro Horiguchi wrote:
> By the way, if the latch is consumed in WalSndLoop, a succeeding call to
> WalSndWaitForWal cannot be woken up by the latch-set. Doesn't that
> cause missed wakeups (in other words, overlooking of a set latch)?
> Since walsender is woken up by LSN advancement via the latch, we should
> avoid wasteful consumption of the latch. (It is the same issue as [1].)
> If the wakeup signal is not remembered on the walsender (like
> InterruptPending), WalSndPhysical cannot enter a sleep with
> confidence.
No; per latch.h, "What must be avoided is placing any checks for asynchronous
events after WaitLatch and before ResetLatch, as that creates a race
condition." In other words, the thing to avoid is calling ResetLatch()
without next examining all pending work that a latch would signal. Each
walsender.c WaitLatch call does follow the rules.
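That is, latch.h prescribes a shape like this (paraphrased;
work_to_do()/do_stuff() are placeholders):

    for (;;)
    {
        (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
                         -1L, WAIT_EVENT_WAL_SENDER_MAIN);
        ResetLatch(MyLatch);

        /*
         * Examine ALL work a SetLatch() could signal only after
         * ResetLatch(); checking between WaitLatch() and ResetLatch()
         * is the race to avoid.
         */
        if (work_to_do())
            do_stuff();
    }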
On Sat, Apr 18, 2020 at 12:29:58AM +0900, Fujii Masao wrote:
> On 2020/04/17 14:41, Noah Misch wrote:
> > 1. Make XLogSendPhysical() more like XLogSendLogical(), calling
> >    WalSndWaitForWal() when no WAL is available. A quick version of this
> >    passes tests, but I'll need to audit WalSndWaitForWal() for things that are
> >    wrong for physical replication.
>
> (1) makes even the physical-replication walsender sleep in two places,
> which seems to make the code for physical replication more complicated
> than necessary. I'd like to avoid (1) if possible.
Good point.
> > 3. Don't avoid waiting in WalSndLoop(); instead, fix the stall by copying the
> >    WalSndKeepalive() call from WalSndWaitForWal() to WalSndLoop(). This risks
> >    further drift between the two wait sites; on the other hand, one could
> >    refactor later to help avoid that.
>
> Since the additional call of WalSndKeepalive() is necessary only for
> logical replication, shouldn't it be copied to, e.g., XLogSendLogical()
> instead of WalSndLoop()? For example, when XLogSendLogical() sets
> WalSndCaughtUp to true, it could call WalSndKeepalive().
We'd send a keepalive even when pq_flush_if_writable() can't empty the output
buffer. That could be acceptable, but it's not ideal.
> The root problem seems to be that when a WAL record that is a no-op for
> logical rep is processed, a keepalive message is not sent immediately,
> even though we want pg_stat_replication to be updated promptly.
The degree of promptness should be predictable, at least. If we removed the
WalSndKeepalive() from WalSndWaitForWal(), pg_stat_replication updates would
not be prompt, but they would be predictable. I do, however, think prompt
updates are worthwhile.
> (3) addresses this problem directly and looks better to me.
>
> > 4. Keep the WalSndLoop() wait, but condition it on !logical. This is the
> >    minimal fix, but it crudely punches through the abstraction between
> >    WalSndLoop() and its WalSndSendDataCallback.
>
> (4) also looks good because it's simple, if we can redesign those
> functions into good shape.
Let's do that. I'm attaching the replacement implementation and the revert of
v1.
Attachments:
WalSndLoop-stall-v1revert.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
Revert "When WalSndCaughtUp, sleep only in WalSndWaitForWal()."
This reverts commit 421685812290406daea58b78dfab0346eb683bbb. It caused
idle physical walsenders to busy-wait, as reported by Fujii Masao.
Discussion: https://postgr.es/m/20200417054146.GA1061007@rfd.leadboat.com
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index fc475d1..122d884 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1428,10 +1428,8 @@ WalSndWaitForWal(XLogRecPtr loc)
/*
* We only send regular messages to the client for full decoded
* transactions, but a synchronous replication and walsender shutdown
- * possibly are waiting for a later location. So, before sleeping, we
- * send a ping containing the flush location. If the receiver is
- * otherwise idle, this keepalive will trigger a reply. Processing the
- * reply will update these MyWalSnd locations.
+ * possibly are waiting for a later location. So we send pings
+ * containing the flush location every now and then.
*/
if (MyWalSnd->flush < sentPtr &&
MyWalSnd->write < sentPtr &&
@@ -2316,16 +2314,20 @@ WalSndLoop(WalSndSendDataCallback send_data)
WalSndKeepaliveIfNecessary();
/*
- * Block if we have unsent data. Let WalSndWaitForWal() handle any
- * other blocking; idle receivers need its additional actions.
+ * We don't block if not caught up, unless there is unsent data
+ * pending in which case we'd better block until the socket is
+ * write-ready. This test is only needed for the case where the
+ * send_data callback handled a subset of the available data but then
+ * pq_flush_if_writable flushed it all --- we should immediately try
+ * to send more.
*/
- if (pq_is_send_pending())
+ if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
{
long sleeptime;
int wakeEvents;
wakeEvents = WL_LATCH_SET | WL_EXIT_ON_PM_DEATH | WL_TIMEOUT |
- WL_SOCKET_READABLE | WL_SOCKET_WRITEABLE;
+ WL_SOCKET_READABLE;
/*
* Use fresh timestamp, not last_processing, to reduce the chance
@@ -2333,6 +2335,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
*/
sleeptime = WalSndComputeSleeptime(GetCurrentTimestamp());
+ if (pq_is_send_pending())
+ wakeEvents |= WL_SOCKET_WRITEABLE;
+
/* Sleep until something happens or we time out */
(void) WaitLatchOrSocket(MyLatch, wakeEvents,
MyProcPort->sock, sleeptime,
WalSndLoop-stall-v2.patch
Author: Noah Misch <noah@leadboat.com>
Commit: Noah Misch <noah@leadboat.com>
In caught-up logical walsender, sleep only in WalSndWaitForWal().
Before sleeping, WalSndWaitForWal() sends a keepalive if MyWalSnd->write
< sentPtr. When the latest physical LSN yields no logical replication
messages (a common case), that keepalive elicits a reply. Processing
the reply updates pg_stat_replication.replay_lsn. WalSndLoop() lacks
that; when WalSndLoop() slept, replay_lsn advancement could stall until
wal_receiver_status_interval elapsed. This sometimes stalled
src/test/subscription/t/001_rep_changes.pl for up to 10s.
Reviewed by FIXME.
Discussion: https://postgr.es/m/FIXME
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 9e56115..a880195 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1428,8 +1428,10 @@ WalSndWaitForWal(XLogRecPtr loc)
/*
* We only send regular messages to the client for full decoded
* transactions, but a synchronous replication and walsender shutdown
- * possibly are waiting for a later location. So we send pings
- * containing the flush location every now and then.
+ * possibly are waiting for a later location. So, before sleeping, we
+ * send a ping containing the flush location. If the receiver is
+ * otherwise idle, this keepalive will trigger a reply. Processing the
+ * reply will update these MyWalSnd locations.
*/
if (MyWalSnd->flush < sentPtr &&
MyWalSnd->write < sentPtr &&
@@ -2314,14 +2316,14 @@ WalSndLoop(WalSndSendDataCallback send_data)
WalSndKeepaliveIfNecessary();
/*
- * We don't block if not caught up, unless there is unsent data
- * pending in which case we'd better block until the socket is
- * write-ready. This test is only needed for the case where the
- * send_data callback handled a subset of the available data but then
- * pq_flush_if_writable flushed it all --- we should immediately try
- * to send more.
+ * Block if we have unsent data. XXX For logical replication, let
+ * WalSndWaitForWal() handle any other blocking; idle receivers need
+ * its additional actions. For physical replication, also block if
+ * caught up; its send_data does not block.
*/
- if ((WalSndCaughtUp && !streamingDoneSending) || pq_is_send_pending())
+ if ((WalSndCaughtUp && send_data != XLogSendLogical &&
+ !streamingDoneSending) ||
+ pq_is_send_pending())
{
long sleeptime;
int wakeEvents;
On 2020/04/18 16:01, Noah Misch wrote:
> Let's do that. I'm attaching the replacement implementation and the revert of
> v1.
Thanks for the patch! Though referencing XLogSendLogical inside WalSndLoop()
might be a bit ugly, I'm fine with this change because it's simple and easier
to understand.
> + * Block if we have unsent data. XXX For logical replication, let
> + * WalSndWaitForWal() handle any other blocking; idle receivers need
> + * its additional actions. For physical replication, also block if
> + * caught up; its send_data does not block.
It might be better to s/WalSndWaitForWal()/send_data()/? Because not only
WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
WalSndWriteData() is called also under send_data, i.e., XLogSendLogical():
frame #2: 0x0000000106bcfa84 postgres`WalSndWriteData(ctx=0x00007fb2a4812d20, lsn=22608080, xid=488, last_write=false) at walsender.c:1247:2
frame #3: 0x0000000106b98295 postgres`OutputPluginWrite(ctx=0x00007fb2a4812d20, last_write=false) at logical.c:540:2
frame #4: 0x00000001073fe9b8 pgoutput.so`send_relation_and_attrs(relation=0x00000001073ba2c0, ctx=0x00007fb2a4812d20) at pgoutput.c:353:2
frame #5: 0x00000001073fe7a0 pgoutput.so`maybe_send_schema(ctx=0x00007fb2a4812d20, relation=0x00000001073ba2c0, relentry=0x00007fb2a483aa60) at pgoutput.c:315:2
frame #6: 0x00000001073fd4c0 pgoutput.so`pgoutput_change(ctx=0x00007fb2a4812d20, txn=0x00007fb2a502e428, relation=0x00000001073ba2c0, change=0x00007fb2a5030428) at pgoutput.c:394:2
frame #7: 0x0000000106b99094 postgres`change_cb_wrapper(cache=0x00007fb2a482ed20, txn=0x00007fb2a502e428, relation=0x00000001073ba2c0, change=0x00007fb2a5030428) at logical.c:753:2
frame #8: 0x0000000106ba2200 postgres`ReorderBufferCommit(rb=0x00007fb2a482ed20, xid=488, commit_lsn=22621088, end_lsn=22621136, commit_time=640675460323211, origin_id=0, origin_lsn=0) at reorderbuffer.c:1653:7
frame #9: 0x0000000106b93c10 postgres`DecodeCommit(ctx=0x00007fb2a4812d20, buf=0x00007ffee954c0f8, parsed=0x00007ffee954bf90, xid=488) at decode.c:637:2
frame #10: 0x0000000106b92fa9 postgres`DecodeXactOp(ctx=0x00007fb2a4812d20, buf=0x00007ffee954c0f8) at decode.c:245:5
frame #11: 0x0000000106b92aee postgres`LogicalDecodingProcessRecord(ctx=0x00007fb2a4812d20, record=0x00007fb2a4812fe0) at decode.c:114:4
frame #12: 0x0000000106bd2a16 postgres`XLogSendLogical at walsender.c:2863:3
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:
> > + * Block if we have unsent data. XXX For logical replication, let
> > + * WalSndWaitForWal() handle any other blocking; idle receivers need
> > + * its additional actions. For physical replication, also block if
> > + * caught up; its send_data does not block.
>
> It might be better to s/WalSndWaitForWal()/send_data()/? Because not only
> WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
> WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().
Thanks for reviewing. WalSndWriteData() blocks when we have unsent data,
which is the same cause for blocking in WalSndLoop(). Since the comment you
quote says we let WalSndWaitForWal() "handle any other blocking", I don't
think your proposed change makes it more correct. Also, if someone wants to
refactor this, the place to look is WalSndWaitForWal(), not any other part of
send_data().
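For context, the blocking in WalSndWriteData() has roughly this shape
(heavily abridged; only calls appearing elsewhere in this thread plus
WalSndShutdown() are assumed):

    /* Queue the 'd' message without blocking ... */
    pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);

    /*
     * ... then block only while the output buffer still holds unsent
     * data -- the same pq_is_send_pending() condition that WalSndLoop()
     * blocks on.
     */
    for (;;)
    {
        if (pq_flush_if_writable() != 0)
            WalSndShutdown();
        if (!pq_is_send_pending())
            break;
        /* ... wait for WL_SOCKET_WRITEABLE ... */
    }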
At Sat, 18 Apr 2020 00:01:42 -0700, Noah Misch <noah@leadboat.com> wrote in
> > If the wakeup signal is not remembered on the walsender (like
> > InterruptPending), WalSndPhysical cannot enter a sleep with
> > confidence.
>
> No; per latch.h, "What must be avoided is placing any checks for asynchronous
> events after WaitLatch and before ResetLatch, as that creates a race
> condition." In other words, the thing to avoid is calling ResetLatch()
> without next examining all pending work that a latch would signal. Each
> walsender.c WaitLatch call does follow the rules.
I didn't mean that, of course. I thought of more or less the same thing,
with moving the trigger from latch to signal, where the handler sets a
flag and calls SetLatch(). If we use the bare latch, we should avoid
falsely entering a sleep, which also makes things complex.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 20 Apr 2020 14:30:08 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
> Thanks for the patch! Though referencing XLogSendLogical inside WalSndLoop()
> might be a bit ugly, I'm fine with this change because it's simple and easier
> to understand.
I thought that if we do this, read_data could return a boolean that
indicates whether to wait for the latch or for an incoming packet, or
could return a wake-event mask.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Apr 20, 2020 at 04:15:40PM +0900, Kyotaro Horiguchi wrote:
> I didn't mean that, of course. I thought of more or less the same thing,
> with moving the trigger from latch to signal, where the handler sets a
> flag and calls SetLatch(). If we use the bare latch, we should avoid
> falsely entering a sleep, which also makes things complex.
I don't understand. If there's a defect, can you write a test case or
describe a sequence of events (e.g. at line X, variable Y has value Z)?
At Mon, 20 Apr 2020 00:59:54 -0700, Noah Misch <noah@leadboat.com> wrote in
> I don't understand. If there's a defect, can you write a test case or
> describe a sequence of events (e.g. at line X, variable Y has value Z)?
Indeed. Anyway, the current version cannot have such an issue.
Thanks.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2020/04/20 16:02, Noah Misch wrote:
> On Mon, Apr 20, 2020 at 02:30:08PM +0900, Fujii Masao wrote:
> > It might be better to s/WalSndWaitForWal()/send_data()/? Because not only
> > WalSndWaitForWal() but also WalSndWriteData() seems to handle the blocking.
> > WalSndWriteData() is called also under send_data, i.e., XLogSendLogical().
>
> Thanks for reviewing. WalSndWriteData() blocks when we have unsent data,
> which is the same cause for blocking in WalSndLoop(). Since the comment you
> quote says we let WalSndWaitForWal() "handle any other blocking", I don't
> think your proposed change makes it more correct.
I was misreading this as something like "any other blocking than
the blocking in WalSndLoop()". Ok, I have no more comments on
the patch.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
On Mon, Apr 20, 2020 at 07:24:28PM +0900, Fujii Masao wrote:
> I was misreading this as something like "any other blocking than
> the blocking in WalSndLoop()". Ok, I have no more comments on
> the patch.
Patch looks rather sane to me at a quick glance. I can see that WAL
senders are now not stuck at 100% CPU per process when sitting idle,
for both physical and logical replication. Thanks.
--
Michael