Exit walsender before confirming remote flush in logical replication

Started by Hayato Kuroda (Fujitsu) · 109 messages · pgsql-hackers
#1 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com

Dear hackers,
(I added Amit as CC because we discussed this in another thread)

This is a fork of the time-delayed logical replication thread [1]. While discussing
that, we thought we could relax the condition for walsender shutdown [2][3].

Currently, a walsender delays a shutdown request until it has confirmed that all sent
data has been flushed on the remote side. This condition was added in commit 985bd7 [4]
to support clean switchover. Suppose there is a primary-secondary physical replication
setup and we perform the following steps. If any changes arrive during step 2 but the
walsender has not confirmed the remote flush, the restart in step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to be the new primary.
3. Restart the (old) primary as the new secondary.

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional transactions may be
committed during step 2. To catch up with such changes the subscriber must receive the
WAL for those transactions, but that is not possible because a subscriber cannot request
WAL from an arbitrary position. In that case, we must truncate all data on the new
subscriber and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

PSA a minimal patch for that. I'm not sure whether WalSndCaughtUp should also be
omitted; such a change may affect other parts like WalSndWaitForWal(), but we can
investigate that further.

[1]: https://commitfest.postgresql.org/41/3581/
[2]: /messages/by-id/TYAPR01MB58661BA3BF38E9798E59AE14F5E19@TYAPR01MB5866.jpnprd01.prod.outlook.com
[3]: /messages/by-id/CAA4eK1LyetktcphdRrufHac4t5DGyhsS2xG2DSOGb7OaOVcDVg@mail.gmail.com
[4]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

0001-Exit-walsender-before-confirming-remote-flush-in-log.patch (application/octet-stream, +14/-5)
#2 Ashutosh Bapat
ashutosh.bapat@enterprisedb.com
In reply to: Hayato Kuroda (Fujitsu) (#1)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear hackers,
(I added Amit as CC because we discussed this in another thread)

This is a fork of the time-delayed logical replication thread [1]. While discussing
that, we thought we could relax the condition for walsender shutdown [2][3].

Currently, a walsender delays a shutdown request until it has confirmed that all sent
data has been flushed on the remote side. This condition was added in commit 985bd7 [4]
to support clean switchover. Suppose there is a primary-secondary physical replication
setup and we perform the following steps. If any changes arrive during step 2 but the
walsender has not confirmed the remote flush, the restart in step 3 may fail.

1. Stop the primary server.
2. Promote the secondary to be the new primary.
3. Restart the (old) primary as the new secondary.

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional transactions may be
committed during step 2. To catch up with such changes the subscriber must receive the
WAL for those transactions, but that is not possible because a subscriber cannot request
WAL from an arbitrary position. In that case, we must truncate all data on the new
subscriber and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

I have not reviewed the patch though.

--
Best Wishes,
Ashutosh Bapat

#3 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Ashutosh Bapat (#2)
Re: Exit walsender before confirming remote flush in logical replication

At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional

..

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

...

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

Yeah, I don't think we do that for the reason of switchover. On the
other hand, I think the behavior was intentionally carried over since it
is thought of as sensible on its own. And I'm afraid that many people
already rely on that behavior.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that
behavior altogether because a new feature stumbles on it. I'm
fine doing that only in the apply_delay case, though. However, I have
another concern that we would be introducing a second exception for
XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with
apply_delay, since the sending transaction is inevitably held up
by that delay.

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4 Amit Kapila
amit.kapila16@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: Exit walsender before confirming remote flush in logical replication

On Fri, Dec 23, 2022 at 7:51 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 22 Dec 2022 17:29:34 +0530, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote in

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional

..

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

...

This change would also be useful for time-delayed logical replication. The walsender
delays its shutdown until all changes are applied on the subscriber, even if application
is intentionally delayed, which means the publisher cannot be stopped if a large
delay is specified.

I think the current behaviour is an artifact of using the same WAL
sender code for both logical and physical replication.

Yeah, I don't think we do that for the reason of switchover. On the
other hand, I think the behavior was intentionally carried over since it
is thought of as sensible on its own.

Did you see it discussed somewhere? If so, can you please point to
that discussion?

And I'm afraid that many people
already rely on that behavior.

But OTOH, it can also be annoying for users to see a wait during
shutdown that is not actually required.

I agree with you that the logical WAL sender need not wait for all the
WAL to be replayed downstream.

Thus I feel that it might be a bit outrageous to get rid of that
behavior altogether because a new feature stumbles on it. I'm
fine doing that only in the apply_delay case, though. However, I have
another concern that we would be introducing a second exception for
XLogSendLogical in the common path.

I doubt that anyone wants to use synchronous logical replication with
apply_delay, since the sending transaction is inevitably held up
by that delay.

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

How can we send commit_data.end_lsn before actually committing the
xact? I think this can lead to a problem: next time (say, after a
restart of the walsender) the server can skip sending the xact even
though it has not been committed by the client.

--
With Regards,
Amit Kapila.

#5 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#3)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

Thanks for replying! I tried what you suggested, but it did not work well...

I made a PoC based on the latest time-delayed patches [1] for the non-streaming case.
Apply workers that are delaying application send begin_data.final_lsn as recvpos and flushpos in send_feedback().

The following is the feedback message I got; we can see that recv and flush were overwritten.

```
DEBUG: sending feedback (force 1) to recv 0/1553638, write 0/1553550, flush 0/1553638
CONTEXT: processing remote data for replication origin "pg_16390" during message type "BEGIN" in transaction 730, finished at 0/1553638
```

In terms of the walsender, however, sentPtr was slightly larger than the flushed position reported by the subscriber.

```
(gdb) p MyWalSnd->sentPtr
$2 = 22361760
(gdb) p MyWalSnd->flush
$3 = 22361656
(gdb) p *MyWalSnd
$4 = {pid = 28807, state = WALSNDSTATE_STREAMING, sentPtr = 22361760, needreload = false, write = 22361656,
flush = 22361656, apply = 22361424, writeLag = 20020343, flushLag = 20020343, applyLag = 20020343,
sync_standby_priority = 0, mutex = 0 '\000', latch = 0x7ff0350cbb94, replyTime = 725113263592095}
```

Therefore I could not shut down the publisher node while application was being delayed.
Do you have any opinions about this?

```
$ pg_ctl stop -D data_pub/
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

[1]: /messages/by-id/TYCPR01MB83730A3E21E921335F6EFA38EDE89@TYCPR01MB8373.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#6 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Hayato Kuroda (Fujitsu) (#5)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

Thus how about, before entering an apply_delay, the logrep worker sending a
kind of crafted feedback which reports commit_data.end_lsn as
flushpos? A little tweak is needed in send_feedback(), but it seems to
work.

Thanks for replying! I tried what you suggested, but it did not work well...

I made a PoC based on the latest time-delayed patches [1] for the non-streaming case.
Apply workers that are delaying application send begin_data.final_lsn as recvpos
and flushpos in send_feedback().

Maybe I misunderstood what you said... I have also found that sentPtr is not the actual sent
position, but the starting point of the next WAL record. See the comment below.

```
/*
* How far have we sent WAL already? This is also advertised in
* MyWalSnd->sentPtr. (Actually, this is the next WAL location to send.)
*/
static XLogRecPtr sentPtr = InvalidXLogRecPtr;
```

We must use end_lsn when crafting messages to satisfy the walsender, but that value
is carried in the COMMIT message, not in BEGIN, for the non-streaming case.
And if workers instead delay in apply_handle_commit(), they will hold locks on database
objects for a long time, which causes another issue.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#7 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#1)
Re: Exit walsender before confirming remote flush in logical replication

On Thu, Dec 22, 2022 at 11:16 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, however, we cannot support the use case of switching the roles
publisher <-> subscriber. In the same scenario as above, additional transactions may be
committed during step 2. To catch up with such changes the subscriber must receive the
WAL for those transactions, but that is not possible because a subscriber cannot request
WAL from an arbitrary position. In that case, we must truncate all data on the new
subscriber and then create a new subscription with copy_data = true.

Therefore, I think that we can ignore the condition for shutting down the
walsender in logical replication.

+1 for the idea.

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#8 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Dilip Kumar (#7)
RE: Exit walsender before confirming remote flush in logical replication

Dear Dilip,

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v2-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream, +14/-5)
#9 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#8)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

--
With Regards,
Amit Kapila.

#10 Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#9)
Re: Exit walsender before confirming remote flush in logical replication

On Tue, Dec 27, 2022 at 2:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 27, 2022 at 1:44 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Thanks for checking my proposal!

- * Note that if we determine that there's still more data to send, this
- * function will return control to the caller.
+ * Note that if we determine that there's still more data to send or we are in
+ * the physical replication more, this function will return control to the
+ * caller.

I think in this comment you meant to say

1. "or we are in physical replication mode and all WALs are not yet replicated"
2. Typo /replication more/replication mode

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

In logical replication, this can happen today as well, without
time-delayed replication. Basically, if the apply worker is waiting to
acquire a lock that is already held by some backend, it will show the
same behavior. I have not verified this, so you may want
to check it once.

--
With Regards,
Amit Kapila.

#11 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#9)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

I did not have a strong opinion here. Fixed.

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

Yes, it may happen even after this work is done. And your analysis is correct:
the receive buffer is rarely full in physical replication because the walreceiver
flushes WAL immediately.
I think this is an architectural problem. Maybe we have assumed that decoded
WAL is consumed in a short time. I do not have a good idea, but one approach is
to introduce a new process, a logical walreceiver, which would record the decoded
WAL to persistent storage for workers to consume and then remove. That may have a
huge impact on other features, though, and might not be acceptable...

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v3-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream, +13/-5)
#12 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#10)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

In logical replication, this can happen today as well, without
time-delayed replication. Basically, if the apply worker is waiting to
acquire a lock that is already held by some backend, it will show the
same behavior. I have not verified this, so you may want
to check it once.

Right, I could reproduce the scenario with the following steps.

1. Construct a pub -> sub logical replication system with streaming = off.
2. Define a table on both nodes.

```
CREATE TABLE tbl (id int PRIMARY KEY);
```

3. Execute concurrent transactions.

Tx-1 (on subscriber)
BEGIN;
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

Tx-2 (on publisher)
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

4. Try to shut down the publisher; it will fail.

```
$ pg_ctl stop -D publisher
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#13 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#12)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Dec 28, 2022 at 8:19 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

In logical replication, this can happen today as well, without
time-delayed replication. Basically, if the apply worker is waiting to
acquire a lock that is already held by some backend, it will show the
same behavior. I have not verified this, so you may want
to check it once.

Right, I could reproduce the scenario with the following steps.

1. Construct a pub -> sub logical replication system with streaming = off.
2. Define a table on both nodes.

```
CREATE TABLE tbl (id int PRIMARY KEY);
```

3. Execute concurrent transactions.

Tx-1 (on subscriber)
BEGIN;
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

Tx-2 (on publisher)
INSERT INTO tbl SELECT i FROM generate_series(1, 5000) s(i);

4. Try to shut down the publisher; it will fail.

```
$ pg_ctl stop -D publisher
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down
```

Thanks for the verification. BTW, do you think we should document this,
either with time-delayed replication or elsewhere, unless it is
already documented?

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

--
With Regards,
Amit Kapila.

#14 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Amit Kapila (#13)
RE: Exit walsender before confirming remote flush in logical replication

Dear Amit,

Thanks for the verification. BTW, do you think we should document this,
either with time-delayed replication or elsewhere, unless it is
already documented?

I think this should be documented in the "Shutting Down the Server" section in runtime.sgml
or in logical-replication.sgml, but I cannot find it there. It will be included in the next version.

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

I have not looked into this yet and will continue next year.
It seems that walsenders have sent all data before shutting down since commits ea5516,
e0b581 and 754baa.
There were many threads related to streaming replication, so I could not pin down
the specific message referenced in the commit message of ea5516.

I have also checked some wiki pages [1][2], but I could not find any design notes about it.

[1]: https://wiki.postgresql.org/wiki/Streaming_Replication
[2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

#15 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#11)
Re: Exit walsender before confirming remote flush in logical replication

On Wed, Dec 28, 2022 at 8:18 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

Dear Amit,

Firstly I considered 2, but I thought 1 seemed to be better.
PSA the updated patch.

I think even for logical replication we should check whether there is
any pending WAL (via pq_is_send_pending()) to be sent. Otherwise, what
is the point of sending the done message? Also, the caller of
WalSndDone() already has that check, which is another reason I don't
see why you didn't add the same check in WalSndDone().

I did not have a strong opinion here. Fixed.

BTW, even after fixing this, I think logical replication will behave
differently when, for some reason (like time-delayed replication), the
send buffer gets full and the walsender is not able to send data. I think
this will be less of an issue with physical replication because there
is a separate walreceiver process that flushes the WAL without waiting,
but the same is not true for logical replication. Do you have any
thoughts on this matter?

Yes, it may happen even after this work is done. And your analysis is correct:
the receive buffer is rarely full in physical replication because the walreceiver
flushes WAL immediately.

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, shouldn't the time-delay feature
cause a similar problem for physical replication as well?

--
With Regards,
Amit Kapila.

#16 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#14)
Re: Exit walsender before confirming remote flush in logical replication

At Wed, 28 Dec 2022 09:15:41 +0000, "Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com> wrote in

Another thing we can investigate here is why we need to ensure that
there is no pending send before shutdown.

I have not looked into this yet and will continue next year.
It seems that walsenders have sent all data before shutting down since commits ea5516,
e0b581 and 754baa.
There were many threads related to streaming replication, so I could not pin down
the specific message referenced in the commit message of ea5516.

I have also checked some wiki pages [1][2], but I could not find any design notes about it.

[1]: https://wiki.postgresql.org/wiki/Streaming_Replication
[2]: https://wiki.postgresql.org/wiki/Synchronous_Replication_9/2010_Proposal

If I'm grasping the discussion here correctly: in my memory, it is
because physical replication needs all records that have been written on
the primary to be written on the standby for switchover to succeed. It is
annoying that a normal shutdown occasionally leads to switchover
failure. Thus WalSndDone explicitly waits for remote flush/write
regardless of the setting of synchronous_commit. Hence apply delay
doesn't affect shutdown (AFAICS), and that is sufficient since all the
records will be applied at the next startup.

In logical replication, apply precedes write and flush, so we have no
indication of whether a record has been "replicated" to the standby other
than the apply LSN. On the other hand, logical replication has no
business with switchover, so that assurance is useless. Thus I think
we can (practically) ignore apply_lsn at shutdown. It seems subtly
irregular, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#17 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Amit Kapila (#15)
Re: Exit walsender before confirming remote flush in logical replication

At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, shouldn't the time-delay feature
cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor
synchronous_commit. In other words, AFAICS the walsender finishes
without waiting for remote_apply. The unapplied records will be applied
at the next startup.

I haven't confirmed that behavior myself, though...

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#18 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#17)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san, Amit,

At Fri, 13 Jan 2023 16:41:08 +0530, Amit Kapila <amit.kapila16@gmail.com>
wrote in

Okay, but what happens in the case of physical replication when
synchronous_commit = remote_apply? In that case, won't it ensure that
apply has also happened? If so, shouldn't the time-delay feature
cause a similar problem for physical replication as well?

As written in another mail, WalSndDone doesn't honor
synchronous_commit. In other words, AFAICS the walsender finishes
without waiting for remote_apply. The unapplied records will be applied
at the next startup.

I haven't confirmed that behavior myself, though...

If Amit meant the case where data to be sent is pending in physical
replication, then the walsender cannot stop. But this is not related to
synchronous_commit: it happens because the walsender must drain all pending
data before shutting down. We can reproduce the situation with:

1. Build a streaming replication system.
2. kill -STOP $walreceiver
3. Insert data into the primary server.
4. Try to stop the primary server.

If what you said was not related to pending data, the walsender can be stopped
even if synchronous_commit = remote_apply. As Horiguchi-san said, no such
condition is checked in WalSndDone() [1]. I think the parameter synchronous_commit
does not really affect the walsender process; it just defines when the backend
returns the result to the client.

I checked this with the following steps:

1. Built a streaming replication system. PSA a script for that.

Primary config.

```
synchronous_commit = 'remote_apply'
synchronous_standby_names = 'secondary'
```

Secondary config.

```
recovery_min_apply_delay = 1d
primary_conninfo = 'user=postgres port=$port_N1 application_name=secondary'
hot_standby = on
```

2. Inserted data into the primary. This waited for the remote apply:

psql -U postgres -p $port_primary -c "INSERT INTO tbl SELECT generate_series(1, 5000)"

3. Stopped the primary server from another terminal. This succeeded;
the terminal from step 2 said:

```
WARNING: canceling the wait for synchronous replication and terminating connection due to administrator command
DETAIL: The transaction has already committed locally, but might not have been replicated to the standby.
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
connection to server was lost
```

[1]: https://github.com/postgres/postgres/blob/master/src/backend/replication/walsender.c#L3121

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

test_phy.sh (application/octet-stream)
#19 Hayato Kuroda (Fujitsu)
kuroda.hayato@fujitsu.com
In reply to: Kyotaro Horiguchi (#16)
RE: Exit walsender before confirming remote flush in logical replication

Dear Horiguchi-san,

> If I'm grabbing the discussion here correctly, in my memory, it is
> because: physical replication needs all records that have written on
> primary are written on standby for switchover to succeed. It is
> annoying that normal shutdown occasionally leads to switchover
> failure. Thus WalSndDone explicitly waits for remote flush/write
> regardless of the setting of synchronous_commit.

AFAIK the condition (sentPtr == replicatedPtr) seemed to be introduced for that purpose [1].
You meant to say that the condition (!pq_is_send_pending()) has the same motivation, right?

> Thus apply delay
> doesn't affect shutdown (AFAICS), and that is sufficient since all the
> records will be applied at the next startup.

I was not clear about the phrase "next startup", but I agree that we can shut down the
walsender when recovery_min_apply_delay > 0 and synchronous_commit = remote_apply.
The startup process will not be terminated even if the primary crashes, so I
think it will apply the transactions sooner or later.

> In logical replication apply precedes write and flush so we have no
> indication whether a record is "replicated" to standby by other than
> apply LSN. On the other hand, logical replication doesn't have a
> business with switchover so that assurance is useless. Thus I think
> we can (practically) ignore apply_lsn at shutdown. It seems subtly
> irregular, though.

Another consideration is that the condition (!pq_is_send_pending()) ensures that
there are no pending messages of any kind, not only WAL data. Currently we force
walsenders to clean up all messages before shutting down, even if the remaining
message is just a keepalive. I cannot think of any problem caused by this, so I
kept the condition for logical replication.

I updated the patch accordingly. Also, I found that the previous version
did not work well in the case of streamed transactions. When a streamed transaction
is committed on the publisher but its application is delayed on the subscriber, the
walsender sometimes waits in ProcessPendingWrites() until there is no pending
write. I added another termination path to that function.

[1]: https://github.com/postgres/postgres/commit/985bd7d49726c9f178558491d31a570d47340459

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:

v4-0001-Exit-walsender-before-confirming-remote-flush-in-.patch (application/octet-stream, +39 -17)
#20 Amit Kapila
amit.kapila16@gmail.com
In reply to: Hayato Kuroda (Fujitsu) (#19)
Re: Exit walsender before confirming remote flush in logical replication

On Mon, Jan 16, 2023 at 4:39 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:

> In logical replication apply precedes write and flush so we have no
> indication whether a record is "replicated" to standby by other than
> apply LSN. On the other hand, logical replication doesn't have a
> business with switchover so that assurance is useless. Thus I think
> we can (practically) ignore apply_lsn at shutdown. It seems subtly
> irregular, though.
>
> Another consideration is that the condition (!pq_is_send_pending()) ensures that
> there are no pending messages, including other packets. Currently we force walsenders
> to clean up all messages before shutting down, even if it is a keepalive one.
> I cannot have any problems caused by this, but I can keep the condition in case of
> logical replication.

Let me try to summarize the discussion till now. The problem we are
trying to solve here is to allow a shutdown to complete when walsender
is not able to send the entire WAL. Currently, in such cases, the
shutdown fails. As per our current understanding, this can happen when
(a) walreceiver/walapply process is stuck (not able to receive more
WAL) due to locks or some other reason; (b) a long time delay has been
configured to apply the WAL (we don't yet have such a feature for
logical replication but the discussion for the same is in progress).

Both reasons mostly apply to logical replication because there is no
separate walreceiver process whose job is to just flush the WAL. In
logical replication, the process that receives the WAL also applies
it. So, while applying, it can get stuck for a long time waiting for
some heavy-weight lock to be released by some other long-running
transaction on the backend. Similarly, if the user has configured a
large value of time-delayed apply, it can lead to a full network
buffer between the walsender and the receiver/apply process.

The condition to allow the shutdown to wait for all WAL to be sent has
two parts: (a) it confirms that there is no pending WAL to be sent;
(b) it confirms all the WAL sent has been flushed by the client. As
per our understanding, both these conditions are to allow clean
switchover/failover which seems to be useful only for physical
replication. The logical replication doesn't provide such
functionality.

The proposed patch tries to eliminate condition (b) for logical
replication in the hopes that the same will allow the shutdown to be
complete in most cases. There is no specific reason discussed to not
do (a) for logical replication.

Now, to proceed here we have the following options: (1) Fix (b) as
proposed by the patch and document the risks related to (a); (2) Fix
both (a) and (b); (3) Do nothing and document that users need to
unblock the subscribers to complete the shutdown.

Thoughts?

--
With Regards,
Amit Kapila.
