walsender.c fileheader comment

Started by Peter Smithover 1 year ago4 messages

smithpb2250@gmail.com

over 1 year ago

1 attachment(s)

Hi,

I was reading the walsender.c fileheader comment while studying
another thread. I think if there is logical replication in progress
then the PROCSIG_WALSND_INIT_STOPPING handler will *always* switch to
a "stopping" state: e.g.,

/*
* Handle PROCSIG_WALSND_INIT_STOPPING signal.
*/
void
HandleWalSndInitStopping(void)
{
Assert(am_walsender);

/*
* If replication has not yet started, die like with SIGTERM. If
* replication is active, only set a flag and wake up the main loop. It
* will send any outstanding WAL, wait for it to be replicated to the
* standby, and then exit gracefully.
*/
if (!replication_active)
kill(MyProcPid, SIGTERM);
else
got_STOPPING = true;
}

~~~

But the walsender.c fileheader comment seems to be saying something
slightly different. IIUC, some minor rewording of the comment is
needed so it describes the code better.

HEAD
...
* shutdown, if logical replication is in progress all existing WAL records
* are processed followed by a shutdown. Otherwise this causes the walsender
* to switch to the "stopping" state. In this state, the walsender will reject
* any further replication commands. The checkpointer begins the shutdown
...

SUGGESTION
.. shutdown. If logical replication is in progress, the walsender
switches to a "stopping" state. In this state, the walsender will
reject any further replication commands - but all existing WAL records
are processed - followed by a shutdown.

~~~

I attached a patch for the above-suggested change.

Thoughts?

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachments:

v1-0001-walsender-fileheader-comment.patchapplication/octet-stream; name=v1-0001-walsender-fileheader-comment.patchDownload

From 7188680f0f3ce81df6bf8067940b51257fbf3b5a Mon Sep 17 00:00:00 2001
From: Peter Smith <peter.b.smith@fujitsu.com>
Date: Tue, 11 Jun 2024 12:31:05 +1000
Subject: [PATCH v1] walsender fileheader comment

---
 src/backend/replication/walsender.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c623b07..1d41aea 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -27,14 +27,14 @@
  * If the server is shut down, checkpointer sends us
  * PROCSIG_WALSND_INIT_STOPPING after all regular backends have exited.  If
  * the backend is idle or runs an SQL query this causes the backend to
- * shutdown, if logical replication is in progress all existing WAL records
- * are processed followed by a shutdown.  Otherwise this causes the walsender
- * to switch to the "stopping" state. In this state, the walsender will reject
- * any further replication commands. The checkpointer begins the shutdown
- * checkpoint once all walsenders are confirmed as stopping. When the shutdown
- * checkpoint finishes, the postmaster sends us SIGUSR2. This instructs
- * walsender to send any outstanding WAL, including the shutdown checkpoint
- * record, wait for it to be replicated to the standby, and then exit.
+ * shutdown. If logical replication is in progress, the walsender switches to a
+ * "stopping" state. In this state, the walsender will reject any further
+ * replication commands - but all existing WAL records are processed - followed
+ * by a shutdown. The checkpointer begins the shutdown checkpoint once all
+ * walsenders are confirmed as stopping. When the shutdown checkpoint finishes,
+ * the postmaster sends us SIGUSR2. This instructs walsender to send any
+ * outstanding WAL, including the shutdown checkpoint record, wait for it to be
+ * replicated to the standby, and then exit.
  *
  *
  * Portions Copyright (c) 2010-2024, PostgreSQL Global Development Group
-- 
1.8.3.1

Tomas Vondra

tomas.vondra@enterprisedb.com

over 1 year ago

In reply to: Peter Smith (#1)

Re: walsender.c fileheader comment

On 6/11/24 04:35, Peter Smith wrote:

Hi,

I was reading the walsender.c fileheader comment while studying
another thread. I think if there is logical replication in progress
then the PROCSIG_WALSND_INIT_STOPPING handler will *always* switch to
a "stopping" state: e.g.,

/*
* Handle PROCSIG_WALSND_INIT_STOPPING signal.
*/
void
HandleWalSndInitStopping(void)
{
Assert(am_walsender);

/*
* If replication has not yet started, die like with SIGTERM. If
* replication is active, only set a flag and wake up the main loop. It
* will send any outstanding WAL, wait for it to be replicated to the
* standby, and then exit gracefully.
*/
if (!replication_active)
kill(MyProcPid, SIGTERM);
else
got_STOPPING = true;
}

~~~

But the walsender.c fileheader comment seems to be saying something
slightly different. IIUC, some minor rewording of the comment is
needed so it describes the code better.

HEAD
...
* shutdown, if logical replication is in progress all existing WAL records
* are processed followed by a shutdown. Otherwise this causes the walsender
* to switch to the "stopping" state. In this state, the walsender will reject
* any further replication commands. The checkpointer begins the shutdown
...

SUGGESTION
.. shutdown. If logical replication is in progress, the walsender
switches to a "stopping" state. In this state, the walsender will
reject any further replication commands - but all existing WAL records
are processed - followed by a shutdown.

~~~

I attached a patch for the above-suggested change.

Thoughts?

I did look at this, and while the explanation in the current comment may
seem a bit confusing, I'm not sure the suggested changes improve the
situation very much.

This suggests the two comments somehow disagree, but it does not say in
what exactly, so perhaps I just missed it :-(

ISTM there's a bit of confusion what is meant by "stopping" state - you
seem to be interpreting it as a general concept, where the walsender is
requested to stop (through the signal), and starts doing stuff to exit.
But the comments actually talk about WalSnd->state, where "stopping"
means it needs to be set to WALSNDSTATE_STOPPING.

And we only ever switch to that state in two places - in WalSndPhysical
and exec_replication_command. And that does not happen in regular
logical replication (which is what "logical replication is in progress"
refers to) - if you have a walsender just replicating DML, it will never
see the WALSNDSTATE_STOPPING state. It will simply do the cleanup while
still in WALSNDSTATE_STREAMING state, and then just exit.

So from this point of view, the suggestion is actually wrong.

To conclude, I think this probably makes the comments more confusing. If
we want to make it clearer, I'd probably start by clarifying what the
"stopping" state means. Also, it's a bit surprising we may not actually
go through the "stopping" state during shutdown.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Peter Smith

smithpb2250@gmail.com

over 1 year ago

In reply to: Tomas Vondra (#2)

Re: walsender.c fileheader comment

Hi, Thankyou for taking the time to look at this and reply.

I did look at this, and while the explanation in the current comment may
seem a bit confusing, I'm not sure the suggested changes improve the
situation very much.

This suggests the two comments somehow disagree, but it does not say in
what exactly, so perhaps I just missed it :-(

ISTM there's a bit of confusion what is meant by "stopping" state - you
seem to be interpreting it as a general concept, where the walsender is
requested to stop (through the signal), and starts doing stuff to exit.
But the comments actually talk about WalSnd->state, where "stopping"
means it needs to be set to WALSNDSTATE_STOPPING.

Yes, I interpreted the "stopping" state meaning as when the boolean
flag 'got_STOPPING' is assigned true.

And we only ever switch to that state in two places - in WalSndPhysical
and exec_replication_command. And that does not happen in regular
logical replication (which is what "logical replication is in progress"
refers to) - if you have a walsender just replicating DML, it will never
see the WALSNDSTATE_STOPPING state. It will simply do the cleanup while
still in WALSNDSTATE_STREAMING state, and then just exit.

So from this point of view, the suggestion is actually wrong.

OK.

To conclude, I think this probably makes the comments more confusing. If
we want to make it clearer, I'd probably start by clarifying what the
"stopping" state means. Also, it's a bit surprising we may not actually
go through the "stopping" state during shutdown.

I agree. My interpretation of the (ambiguous) "stopping" state led me
to believe the comment was quite wrong. So, this thread was only
intended as a trivial comment fix in passing but clearly there is more
to this than I anticipated. I would be happy if someone with more
knowledge about the WALSNDSTATE_STOPPING versus got_STOPPING could
disambiguate the file header comment, but that's not me, so I have
withdrawn this from the Commitfest.

======
Kind Regards,
Peter Smith.
Fujitsu Australia

Tomas Vondra

tomas.vondra@enterprisedb.com

over 1 year ago

In reply to: Peter Smith (#3)

Re: walsender.c fileheader comment

On 7/19/24 07:02, Peter Smith wrote:

...

To conclude, I think this probably makes the comments more confusing. If
we want to make it clearer, I'd probably start by clarifying what the
"stopping" state means. Also, it's a bit surprising we may not actually
go through the "stopping" state during shutdown.

I agree. My interpretation of the (ambiguous) "stopping" state led me
to believe the comment was quite wrong. So, this thread was only
intended as a trivial comment fix in passing but clearly there is more
to this than I anticipated. I would be happy if someone with more
knowledge about the WALSNDSTATE_STOPPING versus got_STOPPING could
disambiguate the file header comment, but that's not me, so I have
withdrawn this from the Commitfest.

Understood. Thanks for the patch anyway, I appreciate you took the time
to try to improve the comments!

I agree the state transitions in walsender are not very clear, and the
fact that it may shutdown without ever going through STOPPING state is
quite confusing. That being said, I personally don't have ambition to
improve this.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company