Logical replication wal sender timestamp bug
While monitoring pg_stat_subscription, I noticed that last_msg_send_time
was usually NULL, which doesn't make sense. Why would we have a message,
but not know when it was sent?
Looking in src/backend/replication/walsender.c, there is this:
/* output previously gathered data in a CopyData packet */
pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
/*
* Fill the send timestamp last, so that it is taken as late as
possible.
* This is somewhat ugly, but the protocol is set as it's already used
for
* several releases by streaming physical replication.
*/
resetStringInfo(&tmpbuf);
now = GetCurrentTimestamp();
pq_sendint64(&tmpbuf, now);
memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
tmpbuf.data, sizeof(int64));
Filling out the timestamp after the message has already been sent is taking
"as late as possible" a little too far. It results in every message having
a zero timestamp, other than keep-alives which go through a different path.
Re-ordering the code blocks as in the attached seems to fix it. But I have
to wonder, if this has been broken from the start and no one ever noticed,
is this even valuable information to relay in the first place? We could
just take the column out of the view, and not bother calling
GetCurrentTimestamp() for each message.
Cheers,
Jeff
Attachments:
walsender_timestamp_fix.patchapplication/octet-stream; name=walsender_timestamp_fix.patchDownload
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index b0ebe50..7f56715 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1180,9 +1180,6 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
{
TimestampTz now;
- /* output previously gathered data in a CopyData packet */
- pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
-
/*
* Fill the send timestamp last, so that it is taken as late as possible.
* This is somewhat ugly, but the protocol is set as it's already used for
@@ -1194,6 +1191,9 @@ WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid,
memcpy(&ctx->out->data[1 + sizeof(int64) + sizeof(int64)],
tmpbuf.data, sizeof(int64));
+ /* output previously gathered data in a CopyData packet */
+ pq_putmessage_noblock('d', ctx->out->data, ctx->out->len);
+
CHECK_FOR_INTERRUPTS();
/* Try to flush pending output to the client */
On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote:
While monitoring pg_stat_subscription, I noticed that last_msg_send_time
was usually NULL, which doesn't make sense. Why would we have a message,
but not know when it was sent?
So... The timestamp is received and stored in LogicalRepApplyLoop()
with send_time when a 'w' message is received in the subscription
cluster. And it gets computed with a two-phase process:
- WalSndPrepareWrite() reserves the space in the message for the
timestamp.
- WalSndWriteData() computes the timestamp in the reserved space once
the write message is computed and ready to go.
Filling out the timestamp after the message has already been sent is taking
"as late as possible" a little too far. It results in every message having
a zero timestamp, other than keep-alives which go through a different path.
It seems to me that you are right: the timestamp is computed too
late.
Re-ordering the code blocks as in the attached seems to fix it. But I have
to wonder, if this has been broken from the start and no one ever noticed,
is this even valuable information to relay in the first place? We could
just take the column out of the view, and not bother calling
GetCurrentTimestamp() for each message.
I think that there are use cases for such monitoring capabilities,
see for example 7fee252. So I would rather keep it.
--
Michael
On Tue, Nov 05, 2019 at 01:19:37PM +0900, Michael Paquier wrote:
On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote:
Filling out the timestamp after the message has already been sent is taking
"as late as possible" a little too far. It results in every message having
a zero timestamp, other than keep-alives which go through a different path.It seems to me that you are right: the timestamp is computed too
late.
It is easy enough to reproduce the problem by setting for example
logical replication between two nodes and pgbench to produce some
load and then monitor pg_stat_subscription periodically. However, it
is a problem since logical decoding has been introduced (5a991ef) so
committed your fix down to 9.4.
--
Michael
On Wed, Nov 6, 2019 at 2:15 AM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Nov 05, 2019 at 01:19:37PM +0900, Michael Paquier wrote:
On Sat, Nov 02, 2019 at 09:54:54PM -0400, Jeff Janes wrote:
Filling out the timestamp after the message has already been sent is
taking
"as late as possible" a little too far. It results in every message
having
a zero timestamp, other than keep-alives which go through a different
path.
It seems to me that you are right: the timestamp is computed too
late.It is easy enough to reproduce the problem by setting for example
logical replication between two nodes and pgbench to produce some
load and then monitor pg_stat_subscription periodically. However, it
is a problem since logical decoding has been introduced (5a991ef) so
committed your fix down to 9.4.
Thanks. This column looks much more reasonable now.
Cheers,
Jeff