Exit walsender before confirming remote flush in logical replication

Started by Hayato Kuroda (Fujitsu) · 109 messages · pgsql-hackers
#1 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com

Dear hackers,
(I added Amit as CC because we discussed this in another thread)

This is a fork of the time-delayed logical replication thread [1]. While discussing
that, we thought we could relax the condition for walsender shutdown [2][3].

Currently, a walsender delays a shutdown request until it has confirmed that all sent
data has been flushed on the remote side. This condition was added in commit 985bd7 [4]
to support clean switchover. Suppose there is a primary-secondary physical replication
setup and we perform the following steps. If any changes arrive during step 2 but the
walsender has not confirmed the remote flush, the restart in step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to be the new primary.
3. Restart the (old) primary as the new secondary.

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional transactions may be
committed during step 2. To catch up with such changes the subscriber must receive the
WAL for those transactions, but that is not possible because a subscriber cannot request
WAL from an arbitrary position. In that case, we must truncate all data on the new
subscriber and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

PSA a minimal patch for that. I'm not sure whether WalSndCaughtUp should also be
omitted; such a change may affect other parts like WalSndWaitForWal(), but we can
investigate that further.

[1]: https://commitfest.postgresql.org/41/3581/
[2]: /messages/by-id/TYAPR01MB58661BA3BF38E9798E59AE14F5E19@TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: /messages/by-id/CAA4eK1LyetktcphdRrufHac4t5DGyhsS2xG2DSOGb7OaOVcDVg@mail.gmail.com
[4]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Exit-walsender-before-confirming-remote-flush-in-log.patch (application/octet-stream, +14/-5)
#2 Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Hayato Kuroda (Fujitsu) (#1)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear hackers,
(I added Amit as CC because we discussed this in another thread)

This is a fork of the time-delayed logical replication thread [1]. While discussing
that, we thought we could relax the condition for walsender shutdown [2][3].

Currently, a walsender delays a shutdown request until it has confirmed that all sent
data has been flushed on the remote side. This condition was added in commit 985bd7 [4]
to support clean switchover. Suppose there is a primary-secondary physical replication
setup and we perform the following steps. If any changes arrive during step 2 but the
walsender has not confirmed the remote flush, the restart in step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to be the new primary.
3. Restart the (old) primary as the new secondary.

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional transactions may be
committed during step 2. To catch up with such changes the subscriber must receive the
WAL for those transactions, but that is not possible because a subscriber cannot request
WAL from an arbitrary position. In that case, we must truncate all data on the new
subscriber and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

I have not reviewed the patch though.

--
Best Wishes,
Ashutosh Bapat

#3 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ashutosh Bapat (#2)
Re: Exit walsender before confirming remote flush in logical replication

At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional

..

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

...

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

Yeah, I don't think we do that for the reason of switchover. On the
other hand, I think the behavior was intentionally carried over since it
is thought of as sensible on its own. And I'm afraid that many people
already rely on that behavior.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that
behavior altogether because a new feature stumbles on it. I'm
fine doing that only in the apply_delay case, though. However, I have
another concern that we would be introducing a second exception for
XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with
apply_delay, since the sending transaction is inevitably held up
by that delay.

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: Exit walsender before confirming remote flush in logical replication

On Fri, Dec 23, 2022 at 7:51 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional

..

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

...

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

Yeah, I don't think we do that for the reason of switchover. On the
other hand, I think the behavior was intentionally carried over since it
is thought of as sensible on its own.

Did you see it discussed somewhere? If so, can you please point to
that discussion?

And I'm afraid that many people
already rely on that behavior.

But OTOH, it can also be annoying for users to see a wait during
shutdown that is not actually required.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that
behavior altogether because a new feature stumbles on it. I'm
fine doing that only in the apply_delay case, though. However, I have
another concern that we would be introducing a second exception for
XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with
apply_delay, since the sending transaction is inevitably held up
by that delay.

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

How can we send commit_data.end_lsn before actually committing the
xact? I think this can lead to a problem: next time (say, after a
restart of the walsender) the server can skip sending the xact even
though it has not been committed by the client.

--
With Regards,
Amit Kapila.

#5 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#3)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

Thanks for replying! I tried what you suggested, but it did not work well...

I made a PoC based on the latest time-delayed patches [1] for the non-streaming case.
Apply workers that are delaying application send begin_data.final_lsn as recvpos and flushpos in send_feedback().

The following is the feedback message I got; we can see that recv and flush were overwritten.

```
DEBUG: sending feedback (force 1) to recv 0/1553638, write 0/1553550, flush 0/1553638
CONTEXT: processing remote data for replication origin "pg_16390" during message type "BEGIN" in transaction 730, finished at 0/1553638
```

In terms of the walsender, however, sentPtr was slightly larger than the flushed position reported by the subscriber.

```
(gdb) p MyWalSnd->sentPtr
$2 = 22361760
(gdb) p MyWalSnd->flush
$3 = 22361656
(gdb) p *MyWalSnd
$4 = {pid = 28807, state = WALSNDSTATE_STREAMING, sentPtr = 22361760, needreload = false, write = 22361656,
flush = 22361656, apply = 22361424, writeLag = 20020343, flushLag = 20020343, applyLag = 20020343,
sync_standby_priority = 0, mutex = 0 '\000', latch = 0x7ff0350cbb94, replyTime = 725113263592095}
```

Therefore I could not shut down the publisher node while application was being delayed.
Do you have any opinions about this?

```
$ pg_ctl stop -D data_pub/
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

[1]: /messages/by-id/TYCPR01MB83730A3E21E921335F6EFA38EDE89@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#6 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#5)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

Thanks for replying! I tried what you suggested, but it did not work well...

I made a PoC based on the latest time-delayed patches [1] for the non-streaming case.
Apply workers that are delaying application send begin_data.final_lsn as recvpos
and flushpos in send_feedback().

Maybe I misunderstood what you said... I have also found that sentPtr is not the actual sent
position, but the starting point of the next WAL record. See the comment below.

```
/*
* How far have we sent WAL already? This is also advertised in
* MyWalSnd->sentPtr. (Actually, this is the next WAL location to send.)
*/
static XLogRecPtr sentPtr = InvalidXLogRecPtr;
```

We must use end_lsn when crafting messages to satisfy the walsender, but that value
is carried in the COMMIT message, not in BEGIN, for the non-streaming case.
And if workers instead delay in apply_handle_commit(), they will hold locks on database
objects for a long time, which causes another issue.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#7 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#1)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional transactions may be
committed during step 2. To catch up with such changes the subscriber must receive the
WAL for those transactions, but that is not possible because a subscriber cannot request
WAL from an arbitrary position. In that case, we must truncate all data on the new
subscriber and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

+1 for the idea.

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Dilip Kumar (#7)
RE: Exit walsender before confirming remote flush in logical replication

Dear Dilip,

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v2-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream, +14/-5)
#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#8)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

--
With Regards,
Amit Kapila.

#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#9)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Dec 27, 2022 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

In logical replication, this can happen today as well, without
time-delayed replication. Basically, if the apply worker is waiting to
acquire a lock that is already held by some backend, it will show the
same behavior. I have not verified this, so you may want
to check it once.

--
With Regards,
Amit Kapila.

#11 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#9)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

I did not have a strong opinion here. Fixed.

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

Yes, it may happen even after this work is done. And your analysis is correct:
the receive buffer is rarely full in physical replication because the walreceiver
flushes WAL immediately.
I think this is an architectural problem. Maybe we have assumed that decoded
WAL is consumed in a short time. I do not have a good idea, but one approach is
to introduce a new process, a logical walreceiver, which would record the decoded
WAL to persistent storage for workers to consume and then remove. That may have a
huge impact on other features, though, and might not be acceptable...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v3-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream, +13/-5)
#12 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#10)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

In logical replication, this can happen today as well, without
time-delayed replication. Basically, if the apply worker is waiting to
acquire a lock that is already held by some backend, it will show the
same behavior. I have not verified this, so you may want
to check it once.

Right, I could reproduce the scenario with the following steps.

1. Construct a pub -> sub logical replication system with streaming = off.
2. Define a table on both nodes.

```
CREATE TABLE tbl (id int PRIMARY KEY);
```

3. Execute concurrent transactions.

Tx-1 (on subscriber)
BEGIN;
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

Tx-2 (on publisher)
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

4. Try to shut down the publisher; it will fail.

```
$ pg_ctl stop -D publisher
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#12)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Dec 28, 2022 at 8:19 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, this can happen today as well, without
time-delayed replication. Basically, if the apply worker is waiting to
acquire a lock that is already held by some backend, it will show the
same behavior. I have not verified this, so you may want
to check it once.

Right, I could reproduce the scenario with the following steps.

1. Construct a pub -> sub logical replication system with streaming = off.
2. Define a table on both nodes.

```
CREATE TABLE tbl (id int PRIMARY KEY);
```

3. Execute concurrent transactions.

Tx-1 (on subscriber)
BEGIN;
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

Tx-2 (on publisher)
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

4. Try to shut down the publisher; it will fail.

```
$ pg_ctl stop -D publisher
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Thanks for the verification. BTW, do you think we should document this,
either with time-delayed replication or elsewhere, unless it is
already documented?

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

--
With Regards,
Amit Kapila.

#14 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#13)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Thanks for the verification. BTW, do you think we should document this,
either with time-delayed replication or elsewhere, unless it is
already documented?

I think this should be documented in the "Shutting Down the Server" section in runtime.sgml
or in logical-replication.sgml, but I cannot find it there. It will be included in the next version.

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

I have not looked into this yet and will continue next year.
It seems that walsenders have sent all data before shutting down since commits ea5516,
e0b581 and 754baa.
There were many threads related to streaming replication, so I could not pin down
the specific message referenced in the commit message of ea5516.

I have also checked some wiki pages [1][2], but I could not find any design notes about it.

[1]: https://wiki.postgresql.org/wiki/Streaming_Replication
[2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#15 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#11)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Dec 28, 2022 at 8:18 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Amit,

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

I did not have a strong opinion here. Fixed.

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

Yes, it may happen even after this work is done. And your analysis is correct:
the receive buffer is rarely full in physical replication because the walreceiver
flushes WAL immediately.

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, shouldn't the time-delay feature
cause a similar problem for physical replication as well?

--
With Regards,
Amit Kapila.

#16 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: Exit walsender before confirming remote flush in logical replication

At Wed, 28 Dec 2022 09:15:41 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

I have not looked into this yet and will continue next year.
It seems that walsenders have sent all data before shutting down since commits ea5516,
e0b581 and 754baa.
There were many threads related to streaming replication, so I could not pin down
the specific message referenced in the commit message of ea5516.

I have also checked some wiki pages [1][2], but I could not find any design notes about it.

[1]: https://wiki.postgresql.org/wiki/Streaming_Replication
[2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

If I'm grasping the discussion here correctly: in my memory, it is
because physical replication needs all records that have been written on
the primary to be written on the standby for switchover to succeed. It is
annoying that a normal shutdown occasionally leads to switchover
failure. Thus WalSndDone explicitly waits for remote flush/write
regardless of the setting of synchronous_commit. Hence apply delay
doesn't affect shutdown (AFAICS), and that is sufficient since all the
records will be applied at the next startup.

In logical replication, apply precedes write and flush, so we have no
indication of whether a record has been "replicated" to the standby other
than the apply LSN. On the other hand, logical replication has no
business with switchover, so that assurance is useless. Thus I think
we can (practically) ignore apply_lsn at shutdown. It seems subtly
irregular, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#15)
Re: Exit walsender before confirming remote flush in logical replication

At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, shouldn't the time-delay feature
cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor
synchronous_commit. In other words, AFAICS the walsender finishes
without waiting for remote_apply. The unapplied records will be applied
at the next startup.

I haven't confirmed that behavior myself, though...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#17)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san, Amit,

At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com>
wrote in

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, shouldn't the time-delay feature
cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor
synchronous_commit. In other words, AFAICS the walsender finishes
without waiting for remote_apply. The unapplied records will be applied
at the next startup.

I haven't confirmed that behavior myself, though...

If Amit meant the case where data to be sent is pending in physical
replication, then the walsender cannot stop. But this is not related to
synchronous_commit: it happens because the walsender must drain all pending
data before shutting down. We can reproduce the situation with:

1. Build a streaming replication system.
2. kill -STOP $walreceiver
3. Insert data into the primary server.
4. Try to stop the primary server.

If what you said was not related to pending data, the walsender can be stopped
even if synchronous_commit = remote_apply. As Horiguchi-san said, no such
condition is checked in WalSndDone() [1]. I think the parameter synchronous_commit
does not really affect the walsender process; it just defines when the backend
returns the result to the client.

I checked this with the following steps:

1. Built a streaming replication system. PSA a script for that.

Primary config.

```
synchronous_commit = 'remote_apply'
synchronous_standby_names = 'secondary'
```

Secondary config.

```
recovery_min_apply_delay = 1d
primary_conninfo = 'user=postgres port=$port_N1 application_name=secondary'
hot_standby = on
```

2. Inserted data into the primary. This waited for the remote apply:

psql -U postgres -p $port_primary -c "INSERT INTO tbl SELECT generate_series(1, 5000)"

3. Stopped the primary server from another terminal. This succeeded;
the terminal from step 2 said:

```
WARNING: canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
```

[1]: https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c#L3121

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

test_phy.sh (application/octet-stream)
#19 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#16)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

> If I'm grabbing the discussion here correctly, in my memory, it is
> because: physical replication needs all records that have written on
> primary are written on standby for switchover to succeed. It is
> annoying that normal shutdown occasionally leads to switchover
> failure. Thus WalSndDone explicitly waits for remote flush/write
> regardless of the setting of synchronous_commit.

AFAIK the condition (sentPtr == replicatedPtr) seemed to be introduced for that purpose [1].
You meant to say that the condition (!pq_is_send_pending()) has the same motivation, right?

> Thus apply delay
> doesn't affect shutdown (AFAICS), and that is sufficient since all the
> records will be applied at the next startup.

I was not clear about the phrase "next startup", but I agree that we can shut down the
walsender when recovery_min_apply_delay > 0 and synchronous_commit = remote_apply.
The startup process will not be terminated even if the primary crashes, so I
think it will apply the transactions sooner or later.

> In logical replication apply precedes write and flush so we have no
> indication whether a record is "replicated" to standby by other than
> apply LSN. On the other hand, logical replication doesn't have a
> business with switchover so that assurance is useless. Thus I think
> we can (practically) ignore apply_lsn at shutdown. It seems subtly
> irregular, though.

Another consideration is that the condition (!pq_is_send_pending()) ensures that
there are no pending messages of any kind, not only WAL data. Currently we force
walsenders to clean up all messages before shutting down, even if the remaining
message is just a keepalive. I cannot think of any problem caused by this, so I
kept the condition for logical replication.

I updated the patch accordingly. Also, I found that the previous version
did not work well in the case of streamed transactions. When a streamed transaction
is committed on the publisher but its application is delayed on the subscriber, the
walsender sometimes waits in ProcessPendingWrites() until there is no pending
write. I added another termination path to that function.

[1]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v4-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream, +39 -17)
#20 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#19)
Re: Exit walsender before confirming remote flush in logical replication

On Mon, Jan 16, 2023 at 4:39 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

> In logical replication apply precedes write and flush so we have no
> indication whether a record is "replicated" to standby by other than
> apply LSN. On the other hand, logical replication doesn't have a
> business with switchover so that assurance is useless. Thus I think
> we can (practically) ignore apply_lsn at shutdown. It seems subtly
> irregular, though.
>
> Another consideration is that the condition (!pq_is_send_pending()) ensures that
> there are no pending messages, including other packets. Currently we force walsenders
> to clean up all messages before shutting down, even if it is a keepalive one.
> I cannot have any problems caused by this, but I can keep the condition in case of
> logical replication.

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for the same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying, it can get stuck for a long time waiting for
some heavy-weight lock to be released by some other long-running
transaction on the backend. Similarly, if the user has configured a
large value of time-delayed apply, it can lead to a full network
buffer between the walsender and the receiver/apply process.

The condition to allow the shutdown to wait for all WAL to be sent has
two parts: (a) it confirms that there is no pending WAL to be sent;
(b) it confirms all the WAL sent has been flushed by the client. As
per our understanding, both these conditions are to allow clean
switchover/failover which seems to be useful only for physical
replication. The logical replication doesn't provide such
functionality.

The proposed patch tries to eliminate condition (b) for logical
replication in the hopes that the same will allow the shutdown to be
complete in most cases. There is no specific reason discussed to not
do (a) for logical replication.

Now, to proceed here we have the following options: (1) Fix (b) as
proposed by the patch and document the risks related to (a); (2) Fix
both (a) and (b); (3) Do nothing and document that users need to
unblock the subscribers to complete the shutdown.

Thoughts?

--
With Regards,
Amit Kapila.
